Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Abstract

There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has so far been underappreciated for MCL-ASR, although conceptually it is more advantageous for information sharing between different languages. This paper explores pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, an approach we call Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. The experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. We find that when training data is more limited, phoneme supervision can achieve better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research in this direction, we release the code, models and data for the entire pipeline of Whistle at https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10.