Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Abstract

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora along dimensions such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.