The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models

Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, Nizar Habash

Abstract

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.

Anthology ID: 2021.wanlp-1.10
Original: 2021.wanlp-1.10v1
Version 2: 2021.wanlp-1.10v2
Volume: Proceedings of the Sixth Arabic Natural Language Processing Workshop
Month: April
Year: 2021
Address: Kyiv, Ukraine (Virtual)
Editors: Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Venue: WANLP
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 92–104
URL: https://aclanthology.org/2021.wanlp-1.10
Cite (ACL): Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, and Nizar Habash. 2021. The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 92–104, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Cite (Informal): The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models (Inoue et al., WANLP 2021)
PDF: https://aclanthology.org/2021.wanlp-1.10.pdf
Code: CAMeL-Lab/CAMeLBERT
Data: ASTD, OSCAR
