Evaluation of Small Language Models for Arabic Language Processing

arXiv · June 19, 2026 · Significant research

Summary

A new paper evaluated twelve Small Language Models (SLMs) on Arabic natural language processing tasks, using a benchmark of 240 Arabic test items spanning eight domains and ten language skills. The models were assessed in a zero-shot setting, and their responses were scored by a multi-model LLM-as-a-judge framework made up of GPT-4.1 Mini, Claude Haiku 4.5, and DeepSeek-Chat. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic; the results suggest that strong Arabic alignment and instruction-following are crucial for performance. Why it matters: This benchmark offers a standardized method for evaluating compact Arabic language models, guiding future development towards more efficient, reliable, and culturally relevant Arabic AI systems.
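
To make the scoring setup concrete, here is a minimal sketch of how scores from several judge models might be combined into one overall score per model. The summary does not say how the paper aggregates its judges, so the simple mean used here is an assumption, and the model names, judge names, and scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

# Hypothetical 1-5 judge scores: one dict per test item, keyed by judge.
# None of these numbers come from the paper.
scores = {
    "model_a": [
        {"judge_1": 5, "judge_2": 4, "judge_3": 5},
        {"judge_1": 4, "judge_2": 4, "judge_3": 3},
    ],
    "model_b": [
        {"judge_1": 3, "judge_2": 3, "judge_3": 4},
        {"judge_1": 2, "judge_2": 3, "judge_3": 3},
    ],
}


def overall_score(items):
    """Average the judges for each item, then average across items.

    Assumption: an unweighted mean at both levels.
    """
    per_item = [mean(judges.values()) for judges in items]
    return round(mean(per_item), 3)


# Rank models from highest to lowest overall score.
ranking = sorted(scores, key=lambda m: overall_score(scores[m]), reverse=True)

for model in ranking:
    print(model, overall_score(scores[model]))
```

Averaging across judges first is one way to limit the influence of any single judge's bias on an item; a real evaluation might also report agreement between judges.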

Keywords

Small Language Models · Arabic NLP · benchmark · evaluation · Gemma

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

arXiv · Mar 17

This paper explores the impact of tokenization strategies and vocabulary sizes on Arabic language model performance across NLP tasks such as news classification and sentiment analysis. It compares four tokenizers and finds that Byte Pair Encoding (BPE) combined with Farasa performs best overall, owing to Farasa's morphological analysis capabilities. Surprisingly, vocabulary size had limited impact on performance when model size was held fixed, challenging common assumptions about the link between the two. Why it matters: The findings provide insights for developing more effective and nuanced Arabic language models, particularly for handling dialectal variations and promoting responsible AI development in the region.

On the importance of Data Scale in Pretraining Arabic Language Models

arXiv · Jan 15

This paper studies the impact of data scale on Arabic Pretrained Language Models (PLMs). Researchers retrained BERT-base and T5-base models on large Arabic corpora, achieving state-of-the-art results on the ALUE and ORCA benchmarks. The analysis indicates that pretraining data volume is the most important factor for performance. Why it matters: This work provides valuable insights into building effective Arabic language models, emphasizing the importance of large, high-quality datasets for advancing Arabic NLP.

LAraBench: Benchmarking Arabic AI with Large Language Models

arXiv · May 24

LAraBench introduces a benchmark for Arabic NLP and speech processing, evaluating models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM. The benchmark covers 33 tasks across 61 datasets, using zero-shot and few-shot learning techniques. Results show that state-of-the-art (SOTA) models generally outperform LLMs in zero-shot settings, though larger LLMs with few-shot learning narrow the gap. Why it matters: This benchmark helps assess and improve the performance of LLMs on Arabic language tasks, highlighting areas where specialized models still excel.

N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition

arXiv · Jun 5

This paper benchmarks OpenAI's Whisper model on diverse Arabic speech recognition tasks, using publicly available data and novel dialect evaluation sets. The study explores zero-shot, few-shot, and full fine-tuning scenarios. Results indicate that while Whisper outperforms XLS-R models in zero-shot settings on standard datasets, its performance drops significantly on unseen Arabic dialects.