Biomedical data and benchmarks are highly valuable yet scarce in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to produce both pretraining and supervised data in the biomedical domain. We also release ViMedNLI, a new NLP benchmark in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts.
📝 Paper
We translate 20M PubMed abstracts from English to Vietnamese at large scale and pretrain a biomedical Encoder-Decoder model on the translated corpus.
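Translating 20M abstracts requires streaming them through the translation model in fixed-size batches rather than loading the whole corpus into memory. A minimal sketch of that batching logic (the `translate_batch` callable is a hypothetical stand-in for the actual En-Vi model, not part of this repository):

```python
# Illustrative sketch only: translate_batch() is a hypothetical stand-in
# for the actual En-Vi translation model used in the paper.
def batched(items, batch_size):
    """Yield fixed-size batches so millions of abstracts can be streamed
    through the translation model without loading everything into memory."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial batch
        yield batch


def translate_corpus(abstracts, translate_batch, batch_size=64):
    """Translate an iterable of English abstracts, yielding Vietnamese text."""
    for batch in batched(abstracts, batch_size):
        yield from translate_batch(batch)
```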
Vocabulary: ViT5_vocab
Model | Gin File Location | Checkpoint Location | Domain | Pretraining Corpus
---|---|---|---|---
ViPubmedT5 Base | ViT5_base.gin | gs://vietai_public/vipubmedt5_base/checkpoint_1500000 | Biomedical | Translated ViPubmed |
Finetuning example with T5X and Flaxformer: finetunning_vipubmedt5_example.ipynb
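For readers who prefer a plain gin file over the notebook, a minimal T5X finetuning config might look like the sketch below. This is an assumption-laden illustration: the task name, feature lengths, and step count are placeholders you must replace with your own SeqIO task settings; only the checkpoint path comes from the table above.

```gin
include 't5x/configs/runs/finetune.gin'

# Placeholders -- substitute your own registered SeqIO task/mixture,
# sequence lengths, step budget, and output bucket.
MIXTURE_OR_TASK_NAME = 'my_vietnamese_biomedical_task'
TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 64}
TRAIN_STEPS = 1_520_000  # pretrained steps (1.5M) + finetuning steps
INITIAL_CHECKPOINT_PATH = 'gs://vietai_public/vipubmedt5_base/checkpoint_1500000'
MODEL_DIR = 'gs://my-bucket/vipubmedt5_finetune'
```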
- ViMedNLI: A Natural Language Inference Dataset For The Vietnamese Clinical Domain
- ViPubmed: 20M Vietnamese biomedical abstracts generated by large-scale translation
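Since ViMedNLI is translated from MedNLI, each example pairs a clinical premise with a hypothesis under a three-way label (entailment, neutral, or contradiction). A small sketch of how such an example might be cast into text-to-text form for a T5-style model; the field names, prompt template, and Vietnamese sample sentences are illustrative assumptions, not the repository's official format:

```python
# Illustrative example structure; the field names and sample sentences are
# assumptions for demonstration, not taken from the released dataset files.
LABELS = ("entailment", "neutral", "contradiction")


def to_text2text(example):
    """Cast an NLI example into an (input, target) pair for a T5-style model."""
    assert example["label"] in LABELS
    source = (
        f"vimednli premise: {example['premise']} "
        f"hypothesis: {example['hypothesis']}"
    )
    return source, example["label"]


example = {
    "premise": "Bệnh nhân được chẩn đoán viêm phổi.",       # "The patient was diagnosed with pneumonia."
    "hypothesis": "Bệnh nhân có vấn đề về hô hấp.",          # "The patient has a respiratory problem."
    "label": "entailment",
}
source, target = to_text2text(example)
```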
If you find our work helpful, please cite the following:
@misc{vipubmed,
doi = {10.48550/ARXIV.2210.05598},
url = {https://arxiv.org/abs/2210.05598},
author = {Phan, Long and Dang, Tai and Tran, Hieu and Trinh, Trieu H. and Phan, Vy and Chau, Lam D. and Luong, Minh-Thang},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
title = {Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
We would like to thank the Google TPU Research Cloud (TRC) program and Soonson Kwon (Google ML Ecosystem programs Lead) for their support.