Explore using IndicTrans2 - better model supporting 22 Indic languages
Closed, Resolved, Public
Assigned To

Authored By

	santhosh
	May 29 2023, 5:16 AM

Description

The https://github.com/AI4Bharat/IndicTrans2 project used a larger corpus to train machine translation models for Indian languages. From a quick reading of the code, its architecture is similar to that of NLLB. In my testing on the demo site https://models.ai4bharat.org/#/nmt/v2, the results were better than NLLB's; in particular, the grammar of the translated sentences looks better.

Since MinT supports multiple backend models and IndicTrans2 looks like a compatible model, we should explore this opportunity.

The following languages are supported:

Assamese (as/asm_Beng)
Bangla (bn/ben_Beng)
Bodo (brx/brx_Deva) No wiki yet
Dogri (doi/doi_Deva) No wiki yet
English (en/eng_Latn)
Goan (gom/gom_Deva)
Gujarati (gu/guj_Gujr)
Hindi (hi/hin_Deva)
Kannada (kn/kan_Knda)
Kashmiri (ks/kas_Arab & kas_Deva)
Maithili (mai/mai_Deva)
Malayalam (ml/mal_Mlym)
Manipuri (mni/mni_Beng & mni_Mtei)
Marathi (mr/mar_Deva)
Nepali (ne/npi_Deva)
Oriya (or/ory_Orya)
Panjabi (pa/pan_Guru)
Sanskrit (sa/san_Deva)
Santali (sat/sat_Olck)
Sindhi (sd/snd_Arab & snd_Deva)
Tamil (ta/tam_Taml)
Telugu (te/tel_Telu)
Urdu (ur/urd_Arab)
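
The pairs above (wiki language code / IndicTrans2 tag) can be kept in a small lookup table. The following is a minimal Python sketch, not MinT's actual code: the names INDICTRANS2_TAGS and to_indictrans2_tag are hypothetical, and using the first listed script as the default for Kashmiri, Manipuri, and Sindhi is an assumption made only for illustration.

```python
from __future__ import annotations

# Wiki language code -> IndicTrans2 tags, copied from the list above.
# Languages written in more than one script list every supported tag.
INDICTRANS2_TAGS: dict[str, list[str]] = {
    "as": ["asm_Beng"],
    "bn": ["ben_Beng"],
    "brx": ["brx_Deva"],
    "doi": ["doi_Deva"],
    "en": ["eng_Latn"],
    "gom": ["gom_Deva"],
    "gu": ["guj_Gujr"],
    "hi": ["hin_Deva"],
    "kn": ["kan_Knda"],
    "ks": ["kas_Arab", "kas_Deva"],
    "mai": ["mai_Deva"],
    "ml": ["mal_Mlym"],
    "mni": ["mni_Beng", "mni_Mtei"],
    "mr": ["mar_Deva"],
    "ne": ["npi_Deva"],
    "or": ["ory_Orya"],
    "pa": ["pan_Guru"],
    "sa": ["san_Deva"],
    "sat": ["sat_Olck"],
    "sd": ["snd_Arab", "snd_Deva"],
    "ta": ["tam_Taml"],
    "te": ["tel_Telu"],
    "ur": ["urd_Arab"],
}


def to_indictrans2_tag(wiki_code: str, script: str | None = None) -> str:
    """Return the IndicTrans2 tag for a wiki language code.

    If `script` (for example "Deva") is given, the tag with that script is
    returned; otherwise the first listed tag is used (an arbitrary default,
    assumed here for illustration).
    """
    try:
        tags = INDICTRANS2_TAGS[wiki_code]
    except KeyError:
        raise ValueError(f"{wiki_code!r} is not in the IndicTrans2 list") from None
    if script is None:
        return tags[0]
    for tag in tags:
        if tag.endswith("_" + script):
            return tag
    raise ValueError(f"{wiki_code!r} has no tag for script {script!r}")


if __name__ == "__main__":
    print(to_indictrans2_tag("ml"))          # mal_Mlym
    print(to_indictrans2_tag("ks", "Deva"))  # kas_Deva
```

Taking the part before the underscore of each tag and leaving out eng gives the 22 three-letter codes listed in P49467.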

P49467 Languages supported by IndicTrans2 (3-letter ISO codes)

asm
ben
hin
kas
sat
gom
guj
kan
mai
mal
mni
mar
npi
ory
pan
san
snd
tam
tel
urd
brx
doi

Expanded capabilities to support translation between all combinations of Indic languages (not just from/to English) are covered in T352690: Evaluate the integration of the new IndicTrans model (IndicTrans2-M2M) into MinT

Details

Subject	Repo	Branch	Lines +/-
Update MinT to 2023-06-13-061519-production	operations/deployment-charts	master	+1 -1
indictrans2 performance improvements	mediawiki/services/machinetranslation	master	+13 -21
Add IndicTrans2 support	mediawiki/services/machinetranslation	master	+294 -15

Related Objects

Mentioned In
T352690: Evaluate the integration of the new IndicTrans model (IndicTrans2-M2M) into MinT
T341050: Analyze activity levels for communities supported only by MinT
T339896: Enable MinT for all languages supported by IndicTrans2
T336683: Enable MinT support for languages with no Wikipedia yet
T334465: MinT: Detect language of source content automatically

Mentioned Here
T352690: Evaluate the integration of the new IndicTrans model (IndicTrans2-M2M) into MinT
T339896: Enable MinT for all languages supported by IndicTrans2
P49467 Languages supported by IndicTrans2 (3-letter ISO codes)
T334465: MinT: Detect language of source content automatically

Event Timeline

santhosh created this task. May 29 2023, 5:16 AM

Restricted Application added a subscriber: Aklapper. May 29 2023, 5:16 AM

Pginer-WMF moved this task from Backlog to Adding languages on the MinT board. Jun 1 2023, 8:05 AM

Pginer-WMF updated the task description. Jun 7 2023, 9:58 AM

Pginer-WMF triaged this task as Medium priority. Jun 7 2023, 10:04 AM

Pginer-WMF updated the task description.

Pginer-WMF added a project: Language-Team (Language-2023-April-June).

Pginer-WMF moved this task from Quarter Backlog to Priority: Unified Content & Section Translation on the Language-Team (Language-2023-April-June) board.

KartikMistry subscribed. Jun 7 2023, 11:55 AM

Change 928008 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Add IndicTrans2 support

https://gerrit.wikimedia.org/r/928008

gerritbot added a project: Patch-For-Review. Jun 12 2023, 10:35 AM

As per https://github.com/AI4Bharat/IndicTrans2/issues/6, these models do not support Indic-to-Indic translation. The upstream demo achieves indic->indic translation with a multi-step indic->en->indic translation.
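
The multi-step approach described above (pivoting through English) could be sketched as follows. This is an illustration only, not the upstream demo's or MinT's implementation: translate_pair stands in for whatever function calls the indic->en or en->indic model, and the name pivot_translate and the toy translator in the example are hypothetical.

```python
from typing import Callable

# A translator takes (text, source_language, target_language) and returns text.
Translator = Callable[[str, str, str], str]

PIVOT = "en"


def pivot_translate(
    text: str, source: str, target: str, translate_pair: Translator
) -> str:
    """Translate `text`, going through English when neither side is English."""
    if source == target:
        return text
    if PIVOT in (source, target):
        # Directly supported direction: indic->en or en->indic.
        return translate_pair(text, source, target)
    # indic->indic: two hops, indic->en followed by en->indic.
    english = translate_pair(text, source, PIVOT)
    return translate_pair(english, PIVOT, target)


if __name__ == "__main__":
    # Toy stand-in for a real model: it only records the hop it was asked for.
    def fake_translate(text: str, source: str, target: str) -> str:
        return f"{text} [{source}->{target}]"

    print(pivot_translate("hello", "en", "ml", fake_translate))
    # hello [en->ml]
    print(pivot_translate("namaste", "hi", "ta", fake_translate))
    # namaste [hi->en] [en->ta]
```

With this shape, an indic->indic request costs two model calls, and any error in the first hop is carried into the second.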

Change 928008 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Add IndicTrans2 support

https://gerrit.wikimedia.org/r/928008

Maintenance_bot removed a project: Patch-For-Review. Jun 12 2023, 1:10 PM

Change 929438 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] indictrans2 performance improvements

https://gerrit.wikimedia.org/r/929438

gerritbot added a project: Patch-For-Review. Jun 13 2023, 5:22 AM

Change 929439 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-06-12-125157-production

https://gerrit.wikimedia.org/r/929439

Change 929438 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] indictrans2 performance improvements

https://gerrit.wikimedia.org/r/929438

Change 929439 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-06-13-061519-production

https://gerrit.wikimedia.org/r/929439

KartikMistry moved this task from Priority: Unified Content & Section Translation to Check after deployment on the Language-Team (Language-2023-April-June) board. Jun 13 2023, 7:01 AM

Mentioned in SAL (#wikimedia-operations) [2023-06-13T07:09:24Z] <kart_> Updated MinT to 2023-06-13-061519-production (T337656, T334465)

Stashbot mentioned this in T334465: MinT: Detect language of source content automatically. Jun 13 2023, 7:09 AM

Maintenance_bot removed a project: Patch-For-Review. Jun 13 2023, 7:10 AM

KartikMistry assigned this task to santhosh. Jun 13 2023, 7:58 AM

Pginer-WMF mentioned this in T336683: Enable MinT support for languages with no Wikipedia yet. Jun 13 2023, 8:06 AM

Pginer-WMF mentioned this in T339896: Enable MinT for all languages supported by IndicTrans2. Jun 20 2023, 9:15 AM

KartikMistry moved this task from Check after deployment to Needs QA on the Language-Team (Language-2023-April-June) board. Jun 22 2023, 4:00 AM

Pginer-WMF updated the task description. Jun 22 2023, 8:21 AM

Pginer-WMF edited projects, added Language-Team (Language-2023-July-September); removed Language-Team (Language-2023-April-June). Jun 30 2023, 11:43 AM

Pginer-WMF moved this task from Quarter Backlog to Needs QA on the Language-Team (Language-2023-July-September) board.

Pginer-WMF mentioned this in T341050: Analyze activity levels for communities supported only by MinT. Jul 4 2023, 10:37 AM

Once we have verified in T339896 that the languages supported by IndicTrans2 work in the context of Content Translation, we can resolve this.

Restricted Application added a subscriber: Anoop. Jul 21 2023, 9:55 AM

Pginer-WMF mentioned this in T352690: Evaluate the integration of the new IndicTrans model (IndicTrans2-M2M) into MinT. Dec 11 2023, 10:42 AM