Abstract
The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR). Prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art in ASR. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. LLM-based ASR can naturally incorporate additional text corpora by fine-tuning the LLM decoder, but fine-tuning such an ASR system on text-only data without paired prompts may diminish the effectiveness of the domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieves up to a 9% relative Word Error Rate (WER) reduction and up to an 18% Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion can further improve the EER by a relative 2-5%.