Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Abstract

Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to improve their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results obtained with transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
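
To make the idea of combining an ASR model with a language model concrete, the following minimal sketch rescores a list of candidate transcriptions with a weighted sum of the ASR score and a language model score. It is an illustration under stated assumptions, not the implementation from the repository above: the function names, the toy unigram model, the example scores, and the weights `alpha` (language model weight) and `beta` (word-count bonus) are all hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(corpus):
    """Toy add-one-smoothed unigram LM (hypothetical stand-in for a real
    n-gram model or large language model)."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def logprob(sentence):
        # Sum of smoothed log-probabilities of each word in the sentence.
        return sum(
            math.log((counts[word] + 1) / (total + vocab_size))
            for word in sentence.split()
        )

    return logprob


def rescore(nbest, lm_logprob, alpha=0.5, beta=1.0):
    """Return the hypothesis text with the highest combined score.

    nbest: list of (text, asr_logprob) pairs.
    alpha, beta: assumed tunable weights, not values from the paper.
    """

    def score(hypothesis):
        text, asr_logprob = hypothesis
        return asr_logprob + alpha * lm_logprob(text) + beta * len(text.split())

    return max(nbest, key=score)[0]


# Made-up example: the ASR model slightly prefers a misrecognized word.
lm = train_unigram_lm(["the cat sat", "the cat ran"])
nbest = [("the cat sat", -1.2), ("the cat sad", -1.0)]
best = rescore(nbest, lm, alpha=0.5, beta=1.0)
```

With `alpha=0` the ASR score alone decides and the second hypothesis wins; with a positive `alpha` the language model can overturn that choice, which is why the weights need to be tuned for each model and language.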