A Survey of Large Language Models for Arabic Language and its Dialects

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, which span Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as the availability of source code, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

Aspects of the survey

Geographic Distribution and Development of Arabic LLMs. Model names with the same color indicate collaborative development efforts between different countries.

Datasets for Arabic LLMs Pretraining

Classical Arabic (CA)

OpenITI corpus (v1.2) Link.

Modern Standard Arabic (MSA)

1.5 billion Words Arabic Corpus Link.
OSIAN Corpus Link.
Gigaword Corpus Link.
Oscar Corpus Link.
Arabic Wikipedia Dump Link.
ArabicText 2022 Link.
AraC4 Link.
Maktabah Link.
TyDi dataset Link.
ARCD Link.
CC100-Arabic Link.
OpenSubtitles2016 corpus Link.
AraNews Link.
Hindawi Link.

Dialectal Arabic (DA)

CALLHOME Egyptian Arabic Transcripts Link.
Babylon Levantine Arabic Transcripts Link.
Levantine Arabic QT Training Data Set 4 Transcripts Link.
Levantine Arabic QT Training Data Set 5 Transcripts Link.
Gulf Arabic Conversational Telephone Transcripts Link.
Iraqi Arabic Conversational Telephone Transcripts Link.
Levantine Arabic Conversational Telephone Transcripts Link.
Fisher Levantine Arabic Conversational Telephone Transcripts Link.
AOC Dataset Link.
Arabic-Dialect/English Parallel Text Link.
PADIC Corpus Link.
Curras Corpus Link.
BOLT Egyptian Arabic SMS/Chat and Transliteration Link.
SDC (Shami Dialect Corpus) Link.
Gumar Corpus Link.
MADAR Corpus Link.
Habibi Corpus Link.
NADI 2020 Corpus Link.
QADI Corpus Link.
Darija-SFT-Mixture dataset Link.

Monolingual Arabic LLMs

AraBERT Link.
MARBERT Link.
ARBERT Link.
QARiB Link.
SudaBERT Link.
AraELECTRA Link.
AraGPT2 Link.
CAMeLBERT Link.
JABER Link.
SABER Link.
AraBART Link.
AraLegal-BERT Link.
AraRoBERTa Link.
DziriBERT Link.
TunBERT Link.
DarijaBERT Link.
AraMUS Link.
MorRoBERTa Link.
MorrBERT Link.
JASMINE Link.
AraQA Link.
ArabianGPT Link.
AraPOEMBERT Link.
SaudiBERT Link.
AlcLaM Link.
AraStories Link.
EgyBERT Link.
Atlas-Chat Link.

Bilingual Arabic LLMs

GigaBERT Link.
JAIS Link.
AceGPT Link.
ALLaM Link.

Multilingual Arabic LLMs

ArabicBERT Link.
AraT5 Link.

Citation

Please cite our paper if you use it in your work:

BibTeX

@misc{mashaabi2024survey,
      title={A Survey of Large Language Models for Arabic Language and its Dialects}, 
      author={Malak Mashaabi and Shahad Al-Khalifa and Hend Al-Khalifa},
      year={2024},
      institution={iWAN Research Group, College of Computer and Information Sciences, King Saud University},
      url={https://arxiv.org/abs/2410.20238}, 
}
