Abstract
We introduce the Speak \& Improve Corpus 2025, a dataset of L2 learner English data with holistic scores and language error annotation, collected from open (spontaneous) speaking tests on the Speak \& Improve learning platform (https://speakandimprove.com). The aim of the corpus release is to address a major challenge to developing L2 spoken language processing systems: the lack of publicly available data with high-quality annotations. It is being made available for non-commercial use on the ELiT website. In designing this corpus we have sought to cover a wide range of speaker attributes, from L1 to speaking ability, and to provide manual annotations. This enables a range of language-learning tasks to be examined, such as assessing speaking proficiency or providing feedback on grammatical errors in a learner's speech. Additionally, the data supports research into the underlying technology required for these tasks, including automatic speech recognition (ASR) of low-resource L2 learner English, disfluency detection, and spoken grammatical error correction (GEC). The corpus consists of around 340 hours of L2 English learner audio with holistic scores, and a subset of the audio annotated with transcriptions and error labels.