VALLR: Visual ASR Language Model for Lip Reading

Abstract

Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. The task is especially challenging because auditory information is absent and because many phonemes share overlapping visemes, so that different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently suffers from high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER), reaching a state-of-the-art WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
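
To make the two quantities named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows the standard CTC best-path collapse rule (merge repeated frame labels, then drop blanks) that turns per-frame predictions into a compact phoneme sequence, and the usual word-level edit-distance definition of WER. The function names, the blank symbol `"_"`, and the ARPAbet-style phoneme labels are all assumptions chosen for illustration.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "_"  # hypothetical blank symbol; real CTC heads use an integer class id


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Apply the CTC best-path rule to per-frame labels.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank symbols.
    """
    return [label for label, _ in groupby(frame_labels) if label != blank]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical per-frame argmax labels for the word "hello"
frames = ["_", "HH", "HH", "_", "AH", "AH", "L", "L", "_", "OW", "OW"]
phonemes = ctc_greedy_collapse(frames)
print(phonemes)

# A blank between two identical labels keeps both of them
print(ctc_greedy_collapse(["L", "_", "L"]))

print(word_error_rate("the cat sat", "the bat sat"))
```

In the framework described above, a sequence such as the collapsed `phonemes` list would be the kind of intermediate output passed to the fine-tuned LLM for word and sentence reconstruction; how that input is formatted is not specified in the abstract and is not shown here.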