Biomedical data and benchmarks are highly valuable yet scarce in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to produce both pretraining and supervised data in the biomedical domain. We also release ViMedNLI, a new NLP benchmark in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts.
📝 Paper
We translate 20M PubMed abstracts from English to Vietnamese at large scale and pretrain a biomedical Encoder-Decoder model on the translated corpus.
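Translating 20M abstracts requires streaming them through the translation model in fixed-size batches rather than loading the whole corpus into memory. A minimal sketch of that batching logic (the `translate_batch` callable is a hypothetical stand-in for the actual En-Vi model, not part of this repository):

```python
# Illustrative sketch only: translate_batch() is a hypothetical stand-in
# for the actual En-Vi translation model used in the paper.
def batched(items, batch_size):
    """Yield fixed-size batches so millions of abstracts can be streamed
    through the translation model without loading everything into memory."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial batch
        yield batch


def translate_corpus(abstracts, translate_batch, batch_size=64):
    """Translate an iterable of English abstracts, yielding Vietnamese text."""
    for batch in batched(abstracts, batch_size):
        yield from translate_batch(batch)
```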
Vocabulary: ViT5_vocab
Model | Gin File Location | Checkpoint Location | Domain | Pretraining Corpus
---|---|---|---|---
ViPubmedT5 Base | ViT5_base.gin | gs://vietai_public/vipubmedt5_base/checkpoint_1500000 | Biomedical | Translated ViPubmed |
Finetuning example with T5X and Flaxformer: finetunning_vipubmedt5_example.ipynb
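For readers who prefer a plain gin file over the notebook, a minimal T5X finetuning config might look like the sketch below. This is an assumption-laden illustration: the task name, feature lengths, and step count are placeholders you must replace with your own SeqIO task settings; only the checkpoint path comes from the table above.

```gin
include 't5x/configs/runs/finetune.gin'

# Placeholders -- substitute your own registered SeqIO task/mixture,
# sequence lengths, step budget, and output bucket.
MIXTURE_OR_TASK_NAME = 'my_vietnamese_biomedical_task'
TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 64}
TRAIN_STEPS = 1_520_000  # pretrained steps (1.5M) + finetuning steps
INITIAL_CHECKPOINT_PATH = 'gs://vietai_public/vipubmedt5_base/checkpoint_1500000'
MODEL_DIR = 'gs://my-bucket/vipubmedt5_finetune'
```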
- ViMedNLI: A Natural Language Inference Dataset For The Vietnamese Clinical Domain
- ViPubmed: 20M Vietnamese biomedical abstracts generated by large-scale translation
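Since ViMedNLI is translated from MedNLI, each example pairs a clinical premise with a hypothesis under a three-way label (entailment, neutral, or contradiction). A small sketch of how such an example might be cast into text-to-text form for a T5-style model; the field names, prompt template, and Vietnamese sample sentences are illustrative assumptions, not the repository's official format:

```python
# Illustrative example structure; the field names and sample sentences are
# assumptions for demonstration, not taken from the released dataset files.
LABELS = ("entailment", "neutral", "contradiction")


def to_text2text(example):
    """Cast an NLI example into an (input, target) pair for a T5-style model."""
    assert example["label"] in LABELS
    source = (
        f"vimednli premise: {example['premise']} "
        f"hypothesis: {example['hypothesis']}"
    )
    return source, example["label"]


example = {
    "premise": "Bệnh nhân được chẩn đoán viêm phổi.",       # "The patient was diagnosed with pneumonia."
    "hypothesis": "Bệnh nhân có vấn đề về hô hấp.",          # "The patient has a respiratory problem."
    "label": "entailment",
}
source, target = to_text2text(example)
```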
If you find our work helpful, please cite the following:
@misc{vipubmed,
doi = {10.48550/ARXIV.2210.05598},
url = {https://arxiv.org/abs/2210.05598},
author = {Phan, Long and Dang, Tai and Tran, Hieu and Trinh, Trieu H. and Phan, Vy and Chau, Lam D. and Luong, Minh-Thang},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
title = {Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
We would like to thank the Google TPU Research Cloud (TRC) program and Soonson Kwon (Google ML Ecosystem programs Lead) for their support.