Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

Abstract

Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, and hence suffer from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task that addresses ASR and grammar errors and transfers the informal text into a formal style while preserving content, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on CoS2W performance and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, and that our methods enable LLMs to understand contexts and auxiliary information effectively. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.
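
To make the evaluator comparison concrete, the following is a minimal sketch of how agreement between an LLM evaluator and human judges on system rankings could be measured. The abstract does not name the correlation statistic used, so Spearman's rank correlation is an assumption here, and the function names and the toy scores in the usage note are hypothetical illustrations, not data from SWAB.

```python
from typing import Sequence


def ranks(scores: Sequence[float]) -> list[float]:
    """Return 1-based ranks (highest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of items tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical per-system formality scores (made-up values for illustration only).
human_scores = [4.1, 3.2, 4.6, 2.8]
llm_scores = [4.0, 3.5, 4.4, 3.0]
rho = spearman(human_scores, llm_scores)
```

With these toy values both evaluators order the four systems identically, so `rho` comes out at the maximum of 1; in practice the statistic would be computed separately for each criterion, such as faithfulness and formality.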