Abstract
Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with consistent parameters across modalities. A well-trained image encoder further enhances learning by aligning visual representations with class-label text information from CLIP. By integrating these approaches, DiffCLIP significantly boosts CLIP performance using a minimal number of image-text pairs. We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs. The code has been released at https://github.com/icey-zhang/DiffCLIP.