This repository contains resources from the paper A Survey of Large Language Models for Arabic Language and its Dialects

iwan-rg/ArabicLLMs

A Survey of Large Language Models for Arabic Language and its Dialects

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, which span Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and their performance on downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as the availability of source code, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

Aspects of the survey

Arabic LLM Survey

Geographic Distribution and Development of Arabic LLMs. Model names with the same color indicate collaborative development efforts between different countries.

Datasets for Arabic LLM Pre-training

Classical Arabic (CA)

  • OpenITI corpus (v1.2) Link.

Modern Standard Arabic (MSA)

  • 1.5 billion Words Arabic Corpus Link.
  • OSIAN Corpus Link.
  • Gigaword Corpus Link.
  • OSCAR Corpus Link.
  • Arabic Wikipedia Dump Link.
  • ArabicText 2022 Link.
  • AraC4 Link.
  • Maktabah Link.
  • TyDi dataset Link.
  • ARCD Link.
  • CC100-Arabic Link.
  • OpenSubtitles2016 corpus Link.
  • AraNews Link.
  • Hindawi Link.

Dialectal Arabic (DA)

  • CALLHOME Egyptian Arabic Transcripts Link.
  • Babylon Levantine Arabic Transcripts Link.
  • Levantine Arabic QT Training Data Set 4 Transcripts Link.
  • Levantine Arabic QT Training Data Set 5 Transcripts Link.
  • Gulf Arabic Conversational Telephone Transcripts Link.
  • Iraqi Arabic Conversational Telephone Transcripts Link.
  • Levantine Arabic Conversational Telephone Transcripts Link.
  • Fisher Levantine Arabic Conversational Telephone Transcripts Link.
  • AOC Dataset Link.
  • Arabic-Dialect/English Parallel Text Link.
  • PADIC Corpus Link.
  • Curras Corpus Link.
  • BOLT Egyptian Arabic SMS/Chat and Transliteration Link.
  • SDC (Shami Dialect Corpus) Link.
  • Gumar Corpus Link.
  • MADAR Corpus Link.
  • Habibi Corpus Link.
  • NADI 2020 Corpus Link.
  • QADI Corpus Link.
  • Darija-SFT-Mixture dataset Link.

Monolingual Arabic LLMs

Bilingual Arabic LLMs

Multilingual Arabic LLMs

Citation

Please cite our paper if you use it in your work:

BibTeX

@misc{mashaabi2024survey,
      title={A Survey of Large Language Models for Arabic Language and its Dialects}, 
      author={Malak Mashaabi and Shahad Al-Khalifa and Hend Al-Khalifa},
      year={2024},
      institution={iWAN Research Group, College of Computer and Information Sciences, King Saud University},
      url={https://arxiv.org/abs/2410.20238}, 
}
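
As a minimal sketch of how the entry above might be used, the LaTeX document below cites the survey by its key `mashaabi2024survey`. The file name `references.bib` and the `plain` bibliography style are assumptions made for illustration; neither is prescribed by this repository.

```latex
% Assumption: the BibTeX entry above is saved as references.bib
% (hypothetical file name) next to this .tex file.
\documentclass{article}
\begin{document}

% Cite the survey using the key defined in the BibTeX entry.
Arabic LLMs and their pre-training datasets are surveyed in~\cite{mashaabi2024survey}.

% The "plain" style is an illustrative choice; it ignores fields it
% does not recognize, such as institution and url in this entry.
\bibliographystyle{plain}
\bibliography{references}

\end{document}
```

To have the arXiv URL appear in the reference list, a style that prints the `url` field would be needed in place of `plain`.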
