Abstract
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/