How dialectal pretraining improves Arabic automatic speech recognition

MBZUAI · Notable

Summary

MBZUAI researchers presented a study at ACL 2024 on improving Arabic ASR by pretraining on dialectal Arabic. They trained three versions of the ArTST model: one on Modern Standard Arabic (MSA) only, one on MSA and dialectal data, and one on MSA, dialectal, and multilingual data. Results showed that pretraining on dialectal Arabic improves ASR performance on both MSA and a range of dialects. Why it matters: Arabic dialects are diverse and lack standardized forms, which is a key challenge for Arabic NLP; addressing it during pretraining could lead to more accurate speech recognition systems.

Keywords

Arabic · speech recognition · dialects · MBZUAI · ArTST

N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition

arXiv · Jun 5

This paper benchmarks the performance of OpenAI's Whisper model on diverse Arabic speech recognition tasks, using publicly available data and novel dialect evaluation sets. The study explores zero-shot, few-shot, and full finetuning scenarios. Results indicate that while Whisper outperforms XLS-R models in zero-shot settings on standard datasets, its performance drops significantly when applied to unseen Arabic dialects.

AlcLaM: Arabic Dialectal Language Model

arXiv · Jul 18

The paper introduces AlcLaM, an Arabic dialectal language model trained on 3.4M sentences from social media. AlcLaM expands the vocabulary and retrains a BERT-based model, using only 13 GB of dialectal text. Despite this comparatively small training corpus, AlcLaM outperforms models like CAMeL, MARBERT, and ArBERT on various Arabic NLP tasks. Why it matters: AlcLaM offers a more efficient and accurate approach to Arabic NLP by focusing on dialectal Arabic, which is often underrepresented in existing models.

On the importance of Data Scale in Pretraining Arabic Language Models

arXiv · Jan 15

This paper studies the impact of data scale on Arabic Pretrained Language Models (PLMs). Researchers retrained BERT-base and T5-base models on large Arabic corpora, achieving state-of-the-art results on the ALUE and ORCA benchmarks. The analysis indicates that pretraining data volume is the most important factor for performance. Why it matters: This work provides valuable insights into building effective Arabic language models, emphasizing the importance of large, high-quality datasets for advancing Arabic NLP.