Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Abstract

In multimodal sentiment analysis, collecting text data is often more challenging than collecting video or audio because of higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, we develop a robust model that effectively integrates multimodal sentiment information even when the text modality is absent. Specifically, we propose a Double-Flow Self-Distillation Framework, including Unified Modality Cross-Attention (UMCA) and a Modality Imagination Autoencoder (MIA), which handles both the complete-modality scenario and the missing-text scenario. When the text modality is missing, our framework uses an LLM-based model to simulate the text representation from the audio modality, while the MIA module supplements information from the other two modalities to make the simulated text representation similar to the real text representation. To further align the simulated and real representations, and to enable the model to capture the continuous ordering of samples in sentiment valence regression tasks, we also introduce the Rank-N Contrast (RNC) loss function. When tested on CMU-MOSEI, our model achieved outstanding performance on MAE and significantly outperformed other models when the text modality is missing. The code is available at: https://github.com/WarmCongee/SDUMC
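
To illustrate how a rank-based contrastive objective can encode the continuous ordering of valence labels, the following is a minimal pure-Python sketch of a Rank-N Contrast style loss. It is not taken from the released code: the function name `rank_n_contrast_loss`, the use of negative Euclidean distance as the similarity, the absolute label difference as the label distance, and the default temperature are all assumptions made for illustration.

```python
import math


def rank_n_contrast_loss(features, labels, temperature=2.0):
    """Illustrative Rank-N Contrast style loss over one batch.

    features: list of equal-length float vectors, one per sample
    labels:   list of float regression targets (e.g. sentiment valence)
    temperature: softmax temperature (assumed default, not from the paper)
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two samples")

    def sim(a, b):
        # Assumption: similarity is the negative Euclidean distance.
        return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    total = 0.0
    for i in range(n):          # anchor sample
        for j in range(n):      # sample treated as the positive for anchor i
            if i == j:
                continue
            d_ij = abs(labels[i] - labels[j])
            # Samples whose labels are at least as far from the anchor as
            # sample j form the comparison set (j itself is included).
            denom = sum(
                math.exp(sim(features[i], features[k]) / temperature)
                for k in range(n)
                if k != i and abs(labels[i] - labels[k]) >= d_ij
            )
            numer = math.exp(sim(features[i], features[j]) / temperature)
            total += -math.log(numer / denom)
    # Average over all ordered (anchor, positive) pairs.
    return total / (n * (n - 1))


# Representations ordered like their labels vs. out of order.
labels = [0.0, 1.0, 2.0]
ordered = rank_n_contrast_loss([[0.0], [1.0], [2.0]], labels)
shuffled = rank_n_contrast_loss([[0.0], [2.0], [1.0]], labels)
```

In this toy batch the loss is lower when distances between representations follow the ordering of the labels, which is the property such a loss is meant to encourage. A practical implementation would vectorize the pairwise computations in a tensor library rather than use nested loops.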