Abstract
The objective of multimodal intent recognition (MIR) is to leverage various modalities, such as text, video, and audio, to detect user intentions, which is crucial for understanding human language and context in dialogue systems. Despite advances in this field, two main challenges persist: (1) effectively extracting and utilizing semantic information from robust textual features; (2) aligning and fusing non-verbal modalities with verbal ones effectively. This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. Subsequently, we align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation. Our experimental results show substantial improvements over existing baseline methods.