SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation

Abstract

Simultaneous speech translation (SST) outputs translations in parallel with streaming speech input, balancing translation quality and latency. While large language models (LLMs) have been extended to handle the speech modality, streaming remains challenging because speech is prepended as a prompt for the entire generation process. To unlock LLM streaming capability, this paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM alleviates the mismatch between training and inference by extracting boundary-aware speech prompts that can be better matched with text input data. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder. An incremental beam search is designed to expand the search space of speech token prediction without increasing latency. Experiments on the CVSS speech data show that SimulS2S-LLM offers a better translation quality-latency trade-off than existing methods that use the same training data, for example improving ASR-BLEU scores by 3 points at similar latency.
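
To make the idea of a test-time policy concrete, the following is a minimal sketch of a read/write loop that drives an offline-trained model at inference time. The abstract does not specify which policy SimulS2S-LLM uses, so a simple wait-k rule is assumed here purely for illustration. The names `simultaneous_decode`, `next_token`, and `toy_next_token` are hypothetical, and the toy model merely upper-cases source chunks in place of a real speech LLM.

```python
from typing import Callable, List, Sequence, Tuple


def simultaneous_decode(
    source_chunks: Sequence[str],
    next_token: Callable[[List[str], List[str], bool], str],
    k: int = 2,
    eos: str = "</s>",
    max_len: int = 50,
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Run a wait-k read/write loop (an assumed policy, not the paper's).

    Returns the emitted tokens and a trace of
    (number of chunks read when the token was written, token).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    seen: List[str] = []      # source chunks read so far
    output: List[str] = []    # target tokens written so far
    trace: List[Tuple[int, str]] = []
    total = len(source_chunks)
    while len(output) < max_len:
        source_done = len(seen) == total
        # READ: stay k chunks ahead of the output while input remains.
        if not source_done and len(seen) < len(output) + k:
            seen.append(source_chunks[len(seen)])
            continue
        # WRITE: ask the (offline-trained) model for one more token.
        token = next_token(seen, output, source_done)
        if token == eos:
            break
        output.append(token)
        trace.append((len(seen), token))
    return output, trace


def toy_next_token(seen: List[str], output: List[str], source_done: bool) -> str:
    """Hypothetical stand-in model: 'translates' chunk i as its upper case."""
    if len(output) < len(seen):
        return seen[len(output)].upper()
    return "</s>"


tokens, trace = simultaneous_decode(["a", "b", "c", "d"], toy_next_token, k=2)
print(tokens)
print(trace)
```

In this sketch, a smaller `k` lowers latency because tokens are written after fewer chunks have been read, at the cost of giving the model less context per decision, which is the quality-latency trade-off the abstract refers to.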