Abstract
Multimodal Aspect-based Sentiment Analysis (MABSA) enhances sentiment detection by integrating textual data with complementary modalities, such as images, to provide a more refined and comprehensive understanding of sentiment. However, conventional attention mechanisms, despite strong benchmark results, are hindered by quadratic complexity, which limits their ability to fully capture global contextual dependencies and rich semantic information in both modalities. To address this limitation, we introduce DualKanbaFormer, a novel framework that leverages parallel Textual and Visual KanbaFormer modules for robust multimodal analysis. Our approach incorporates Aspect-Driven Sparse Attention (ADSA) to dynamically balance coarse-grained aggregation and fine-grained selection for aspect-focused precision, preserving both global context awareness and local precision in textual and visual representations. Additionally, we utilize the Selective State Space Model (Mamba) to capture extensive global semantic information across both modalities. Furthermore, we replace traditional feed-forward networks and normalization with Kolmogorov-Arnold Networks (KANs) and Dynamic Tanh (DyT) to enhance non-linear expressivity and inference stability. To facilitate the effective integration of textual and visual features, we design a multimodal gated fusion layer that dynamically optimizes inter-modality interactions, significantly enhancing the model's efficacy in MABSA tasks. Comprehensive experiments on two publicly available datasets show that DualKanbaFormer consistently outperforms several state-of-the-art (SOTA) models.
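
To make two of the components named above more concrete, the following is a minimal sketch, not the DualKanbaFormer implementation. It shows Dynamic Tanh in its commonly published form, DyT(x) = gamma * tanh(alpha * x) + beta, and a simplified element-wise gated fusion of a text vector and an image vector. The function names (`dyt`, `gated_fusion`), the per-dimension (diagonal) gate weights, and the convex-combination form of the fusion are assumptions chosen for illustration; the abstract does not specify the actual gate design.

```python
import math


def dyt(x, alpha=0.5, gamma=None, beta=None):
    """Dynamic Tanh applied element-wise: gamma * tanh(alpha * x) + beta.

    alpha is a scalar; gamma and beta are per-dimension lists
    (defaulting to ones and zeros). In a trained model all three are learned.
    """
    n = len(x)
    gamma = [1.0] * n if gamma is None else gamma
    beta = [0.0] * n if beta is None else beta
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(text, image, w_text, w_image, bias):
    """Hypothetical element-wise gate (assumed form, for illustration only).

    For each dimension: g = sigmoid(w_t * t + w_v * v + b),
    and the fused value is g * t + (1 - g) * v.
    """
    fused = []
    for t, v, wt, wv, b in zip(text, image, w_text, w_image, bias):
        g = sigmoid(wt * t + wv * v + b)
        fused.append(g * t + (1.0 - g) * v)
    return fused


# Usage: with zero gate weights the gate is 0.5 everywhere,
# so the fusion reduces to the average of the two modalities.
text_feat = dyt([1.0, 2.0])
image_feat = dyt([3.0, 4.0])
fused = gated_fusion(text_feat, image_feat, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

In this sketch the gate decides, per dimension, how much of the textual versus the visual feature to keep; a learned gate would replace the fixed weights, and a full layer would typically use dense projections rather than per-dimension scalars.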