ViT5


ViT5 is a pretrained Transformer-based encoder-decoder model for the Vietnamese language. Using T5-style self-supervised pretraining, ViT5 is trained on a large corpus of high-quality, diverse Vietnamese text. We benchmark ViT5 on two downstream text generation tasks, Abstractive Text Summarization and Named Entity Recognition. All experiments are described in our paper ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation.

Pretrained Models

Vocabulary: ViT5_vocab
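
ViT5 uses a T5-style SentencePiece vocabulary. Below is a minimal sketch for inspecting it, assuming you have downloaded the file locally as spiece.model (the local filename is just an illustration):

# Inspect the ViT5 SentencePiece vocabulary.
# Assumes the vocab has been downloaded locally as "spiece.model" (hypothetical path).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spiece.model")
print(sp.get_piece_size())                            # vocabulary size
print(sp.encode("Xin chào Việt Nam", out_type=str))   # subword pieces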

| Model      | Gin File Location | Checkpoint Location                                   | 🤗 HuggingFace Model   |
|------------|-------------------|-------------------------------------------------------|------------------------|
| ViT5-Base  | ViT5_base.gin     | gs://vietai_public/viT5/ViT5_base/checkpoint_1000000  | ViT5-Base-1024 (1M)    |
| ViT5-Large | ViT5_large.gin    | gs://vietai_public/viT5/ViT5_large/checkpoint_1500000 | ViT5-Large-1024 (1.5M) |
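
The pretraining checkpoints above are hosted in a public Google Cloud Storage bucket. As a minimal sketch, you can list a checkpoint's contents with TensorFlow's gfile API, assuming the checkpoint path is a directory and your TensorFlow build includes GCS support (gsutil also works for downloading):

# List the contents of the public ViT5-Base pretraining checkpoint on GCS.
# Assumes TensorFlow with GCS filesystem support; adapt the path for ViT5-Large.
import tensorflow as tf

ckpt_dir = "gs://vietai_public/viT5/ViT5_base/checkpoint_1000000"
for entry in tf.io.gfile.listdir(ckpt_dir):
    print(entry)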

Finetuning

📄 Example with Flaxformer: finetune_vit5x_example.ipynb

📄 Example with Hugging Face: finetune_huggingface_example.ipynb
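
For reference, here is a condensed finetuning sketch using the Hugging Face Trainer API; it is not a substitute for the notebooks above, and the toy dataset and all hyperparameters are illustrative assumptions only.

# Minimal Hugging Face finetuning sketch for ViT5 on a summarization-style task.
# The dataset is a toy placeholder; replace it with your own corpus (e.g. vietnews).
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-base")

# Toy document/summary pairs, using the "vietnews: " prefix as in the inference example below.
raw = Dataset.from_dict({
    "document": ["vietnews: Bài báo dài cần được tóm tắt."],
    "summary": ["Tóm tắt ngắn gọn."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="vit5-base-finetuned",   # hypothetical output directory
    per_device_train_batch_size=4,      # illustrative hyperparameters
    learning_rate=1e-4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()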

Results

(Results figure: see the paper for full benchmark numbers on Abstractive Text Summarization and Named Entity Recognition.)

Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-large-vietnews-summarization")  
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-large-vietnews-summarization")
model.to("cuda")

sentence = "VietAI là tổ chức phi lợi nhuận với sứ mệnh ươm mầm tài năng về trí tuệ nhân tạo và xây dựng một cộng đồng các chuyên gia trong lĩnh vực trí tuệ nhân tạo đẳng cấp quốc tế tại Việt Nam."
# Prepend the "vietnews: " task prefix expected by the summarization checkpoint.
text = "vietnews: " + sentence + " </s>"
encoding = tokenizer(text, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Load our pretrained models from Hugging Face

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Base
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")  
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-base")

# Large
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-large")  
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-large")

Datasets

Finetuning

Abstractive Text Summarization

To make our results easy to reproduce, we also provide the ViT5 checkpoint finetuned on vietnews. You can use the model directly from Hugging Face 🤗.

Citation

@inproceedings{phan-etal-2022-vit5,
    title = "{V}i{T}5: Pretrained Text-to-Text Transformer for {V}ietnamese Language Generation",
    author = "Phan, Long and Tran, Hieu and Nguyen, Hieu and Trinh, Trieu H.",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop",
    year = "2022",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-srw.18",
    pages = "136--142",
}

Acknowledgements

We would like to thank Google for supporting this work with Cloud credits and TPU quota!