LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

Abstract

The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.