LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Abstract

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show that our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even when working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
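
As a rough illustration of the dense interleaving described above, the following minimal sketch places each ASR word after the video frame whose time interval contains the word's start timestamp. The names `Word` and `interleave_stream`, the 2 FPS sampling rate, the frame-before-words ordering, and the tuple representation are all assumptions made for this example; they are not taken from the released implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One ASR word with its start time in seconds (hypothetical structure)."""
    text: str
    start: float


def interleave_stream(words, duration, fps=2.0):
    """Interleave frames and ASR words by timestamp.

    Returns a list of ("frame", time_in_seconds) and ("word", text) items,
    where each frame is followed by the words that start within its interval.
    The fps default is an assumption for illustration only.
    """
    n_frames = int(duration * fps)
    if n_frames <= 0:
        return []
    step = 1.0 / fps

    # Bucket words by the index of the frame interval they start in.
    buckets = [[] for _ in range(n_frames)]
    for w in sorted(words, key=lambda w: w.start):
        idx = int(w.start * fps)
        # Clamp words at or past the clip end into the last interval.
        idx = max(0, min(idx, n_frames - 1))
        buckets[idx].append(w.text)

    # Emit each frame followed by its words, in temporal order.
    sequence = []
    for i, bucket in enumerate(buckets):
        sequence.append(("frame", i * step))
        sequence.extend(("word", text) for text in bucket)
    return sequence


if __name__ == "__main__":
    demo = [Word("the", 0.1), Word("ball", 0.6), Word("drops", 0.7)]
    for item in interleave_stream(demo, duration=1.0, fps=2.0):
        print(item)
```

In a real training setup, each frame entry would presumably be replaced by visual tokens and each word by text tokens, with the language modeling loss applied to the word positions; that detail is likewise an assumption here.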