Abstract
Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress in the development of LLM-based voice assistants. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics as well as environmental and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.