LexMatcher: Dictionary-centric Data Curation for LLM-based Machine Translation

Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang

Abstract

The fine-tuning of open-source large language models (LLMs) for machine translation has recently received considerable attention, marking a shift towards data-centric research from traditional neural machine translation. However, the area of data collection for instruction fine-tuning in machine translation remains relatively underexplored. In this paper, we present LexMatcher, a simple yet effective method for data curation, the design of which is driven by the coverage of senses found in bilingual dictionaries. The construction process comprises data retrieval from an existing corpus and data augmentation that supplements the infrequent senses of polysemous words. Utilizing LLaMA2 as our base model, our method outperforms the established baselines on the WMT2022 test sets and also exhibits remarkable performance in tasks related to word sense disambiguation and specialized terminology translation. Our method is also applicable to other pre-trained models, and complements the method of continual pre-training using monolingual data, demonstrating the effectiveness of LexMatcher in enhancing LLM-based machine translation.
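The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the exact-match test, and the `max_per_sense` cap are all assumptions made for illustration. A "sense" is approximated here as a (source word, target translation) pair from a toy bilingual dictionary, and a sentence pair is kept only if it covers at least one sense that has not yet reached the cap.

```python
from collections import Counter


def select_by_sense_coverage(corpus, dictionary, max_per_sense=1):
    """Keep sentence pairs that cover a dictionary sense seen fewer than
    max_per_sense times so far (hypothetical selection rule)."""
    counts = Counter()
    selected = []
    for src, tgt in corpus:
        src_tokens = src.lower().split()
        tgt_tokens = set(tgt.lower().split())
        # A sense matches when the source word and one of its dictionary
        # translations both appear in the sentence pair.
        matched = {
            (word, trans)
            for word in src_tokens
            for trans in dictionary.get(word, ())
            if trans in tgt_tokens
        }
        if any(counts[m] < max_per_sense for m in matched):
            selected.append((src, tgt))
            for m in matched:
                counts[m] += 1
    return selected, counts


def uncovered_senses(dictionary, counts):
    """List dictionary senses with no matching sentence pair; these would be
    the candidates for data augmentation."""
    return sorted(
        (word, trans)
        for word, translations in dictionary.items()
        for trans in translations
        if counts[(word, trans)] == 0
    )


# Toy English-French data, invented for this example.
toy_dictionary = {
    "bank": {"banque", "rive"},
    "spring": {"printemps", "ressort"},
}
toy_corpus = [
    ("the bank is closed", "la banque est fermée"),
    ("the bank opens today", "la banque ouvre aujourd'hui"),
    ("we sat on the river bank", "nous étions assis sur la rive"),
    ("hello world", "bonjour le monde"),
]

selected_pairs, sense_counts = select_by_sense_coverage(toy_corpus, toy_dictionary)
missing = uncovered_senses(toy_dictionary, sense_counts)
```

In this toy run the second "bank"/"banque" pair is skipped because that sense has already reached the cap, while the "bank"/"rive" pair is kept. The two "spring" senses remain uncovered, which is the kind of gap the abstract says data augmentation is meant to fill.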

Anthology ID: 2024.findings-emnlp.866
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14767–14779
URL: https://aclanthology.org/2024.findings-emnlp.866
Cite (ACL): Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, and Yue Zhang. 2024. LexMatcher: Dictionary-centric Data Curation for LLM-based Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14767–14779, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): LexMatcher: Dictionary-centric Data Curation for LLM-based Machine Translation (Yin et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.866.pdf
