The https://github.com/AI4Bharat/IndicTrans2 project used a larger corpus to train a machine translation model for Indian languages. From a quick reading of the code, it uses an architecture similar to NLLB's. In my testing on the demo site https://models.ai4bharat.org/#/nmt/v2, I found the results to be better than NLLB's; in particular, the grammar of the translated sentences looks better.
Since MinT supports multiple backend models and IndicTrans2 looks like a compatible model, we should explore this opportunity.
The following languages are supported:
- Assamese (as/asm_Beng)
- Bangla (bn/ben_Beng)
- Bodo (brx/brx_Deva), no wiki yet
- Dogri (doi/doi_Deva), no wiki yet
- English (en/eng_Latn)
- Goan Konkani (gom/gom_Deva)
- Gujarati (gu/guj_Gujr)
- Hindi (hi/hin_Deva)
- Kannada (kn/kan_Knda)
- Kashmiri (ks/kas_Arab & kas_Deva)
- Maithili (mai/mai_Deva)
- Malayalam (ml/mal_Mlym)
- Manipuri (mni/mni_Beng & mni_Mtei)
- Marathi (mr/mar_Deva)
- Nepali (ne/npi_Deva)
- Oriya (or/ory_Orya)
- Panjabi (pa/pan_Guru)
- Sanskrit (sa/san_Deva)
- Santali (sat/sat_Olck)
- Sindhi (sd/snd_Arab & snd_Deva)
- Tamil (ta/tam_Taml)
- Telugu (te/tel_Telu)
- Urdu (ur/urd_Arab)
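As a rough sketch of how the list above could be used inside a translation backend, the wiki language codes can be mapped to the IndicTrans2 language tags with a simple lookup table. The names `INDICTRANS2_CODES`, `to_indictrans2_code` and `is_english_centric_pair` are hypothetical and are not taken from MinT or IndicTrans2. Picking the first listed script as the default for Kashmiri, Manipuri and Sindhi is an assumption, as is treating the model covered here as English-centric (one side of every pair is English).

```python
# Wiki language code -> IndicTrans2 language tag(s), copied from the list above.
# Languages with two scripts list both tags; the order follows the list.
INDICTRANS2_CODES = {
    "as": ["asm_Beng"],
    "bn": ["ben_Beng"],
    "brx": ["brx_Deva"],
    "doi": ["doi_Deva"],
    "en": ["eng_Latn"],
    "gom": ["gom_Deva"],
    "gu": ["guj_Gujr"],
    "hi": ["hin_Deva"],
    "kn": ["kan_Knda"],
    "ks": ["kas_Arab", "kas_Deva"],
    "mai": ["mai_Deva"],
    "ml": ["mal_Mlym"],
    "mni": ["mni_Beng", "mni_Mtei"],
    "mr": ["mar_Deva"],
    "ne": ["npi_Deva"],
    "or": ["ory_Orya"],
    "pa": ["pan_Guru"],
    "sa": ["san_Deva"],
    "sat": ["sat_Olck"],
    "sd": ["snd_Arab", "snd_Deva"],
    "ta": ["tam_Taml"],
    "te": ["tel_Telu"],
    "ur": ["urd_Arab"],
}


def to_indictrans2_code(wiki_code, script=None):
    """Return the IndicTrans2 tag for a wiki language code.

    If `script` (e.g. "Deva") is given, return the tag for that script.
    Otherwise return the first listed tag (an assumed default).
    """
    tags = INDICTRANS2_CODES.get(wiki_code)
    if tags is None:
        raise ValueError(f"Unsupported language: {wiki_code}")
    if script is None:
        return tags[0]
    for tag in tags:
        if tag.endswith("_" + script):
            return tag
    raise ValueError(f"No {script} script variant for {wiki_code}")


def is_english_centric_pair(source, target):
    """True if both codes are in the table and exactly one of them is English.

    Assumption: the model discussed here only translates from/to English.
    """
    both_known = source in INDICTRANS2_CODES and target in INDICTRANS2_CODES
    return both_known and (source == "en") != (target == "en")
```

For example, `to_indictrans2_code("ks", "Deva")` selects the Devanagari variant of Kashmiri, and `is_english_centric_pair("hi", "ml")` is false because neither side is English.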
Support for translation between all combinations of Indic languages (not just from/to English) is covered in T352690: Evaluate the integration of the new IndicTrans model (IndicTrans2-M2M) into MinT