Abstract
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities. Despite the remarkable performance exhibited by previous MSA approaches, the presence of inherent multimodal heterogeneities poses a challenge, with the contribution of different modalities varying considerably. Past research predominantly focused on improving representation learning techniques and feature fusion strategies. However, many of these efforts overlooked the variation in semantic richness among different modalities, treating each modality uniformly. This approach may lead to underestimating the significance of strong modalities while overemphasizing the importance of weak ones. Motivated by these insights, we introduce a Text-oriented Cross-Attention Network (TCAN), emphasizing the predominant role of the text modality in MSA. Specifically, for each multimodal sample, by taking unaligned sequences of the three modalities as inputs, we initially allocate the extracted unimodal features into a visual-text and an acoustic-text pair. Subsequently, we implement self-attention on the text modality and apply text-queried cross-attention to the visual and acoustic modalities. To mitigate the influence of noise signals and redundant features, we incorporate a gated control mechanism into the framework. Additionally, we introduce unimodal joint learning to gain a deeper understanding of homogeneous emotional tendencies across diverse modalities through backpropagation. Experimental results demonstrate that TCAN consistently outperforms state-of-the-art MSA methods on two datasets (CMU-MOSI and CMU-MOSEI).
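The text-queried cross-attention described above can be illustrated with a minimal sketch. It is not the TCAN implementation: it assumes plain scaled dot-product attention with no learned projections, a shared feature dimension across modalities, and a hypothetical sigmoid gate whose exact form the abstract does not specify. All function names and the toy feature values are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated(text, attended):
    """Hypothetical gate (an assumption, not the paper's formula):
    scale each attended feature by sigmoid(text + attended)."""
    return [
        [sigmoid(t + a) * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text, attended)
    ]


# Toy unaligned sequences: 2 text tokens, 3 visual frames, 4 acoustic frames,
# all with feature dimension 2 (values are made up for illustration).
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
acoustic = [[0.1, 0.4], [0.6, 0.2], [0.3, 0.8], [0.9, 0.5]]

# Self-attention on text: queries, keys, and values all come from text.
text_self = attention(text, text, text)

# Text-queried cross-attention: text supplies the queries,
# the other modality supplies keys and values.
visual_text = gated(text, attention(text, visual, visual))
acoustic_text = gated(text, attention(text, acoustic, acoustic))

print(len(visual_text), len(acoustic_text))
```

Because the queries come from the text sequence, both cross-attended outputs have one row per text token regardless of how many visual or acoustic frames there are, which is how this kind of attention accommodates unaligned inputs.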