GCC AI Research

Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks

arXiv · Significant research

Summary

Researchers introduce Swan, a family of Arabic-centric embedding models comprising Swan-Small (based on ARBERTv2) and Swan-Large (based on ArMistral). They also propose ArabicMTEB, a benchmark suite that evaluates cross-lingual and multi-dialectal Arabic text embeddings across 8 tasks and 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large on most Arabic tasks.

Why it matters: The new models and benchmark address a critical need for high-quality Arabic embedding models that are both dialect-aware and culturally aware, enabling more effective NLP applications in the region.
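For readers unfamiliar with how embedding models are compared on a retrieval-style task, the following is a minimal sketch under stated assumptions, not the ArabicMTEB evaluation code. The toy vectors, the `top1_accuracy` helper, and the choice of top-1 accuracy as the metric are all hypothetical. In practice the vectors would come from a model such as Swan-Large, and the paper's own metrics may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_accuracy(query_vecs, doc_vecs, gold):
    """Fraction of queries whose most similar document is the gold one.

    Hypothetical helper for illustration only.
    """
    hits = 0
    for q, gold_idx in zip(query_vecs, gold):
        # Rank documents by cosine similarity and keep the best index.
        best = max(range(len(doc_vecs)), key=lambda i: cosine(q, doc_vecs[i]))
        hits += int(best == gold_idx)
    return hits / len(gold)

# Toy 3-dimensional "embeddings" (assumed values, not model output).
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.6, 0.5, 0.0]]
gold = [0, 1, 1]  # index of the relevant document for each query

score = top1_accuracy(queries, docs, gold)
print(f"top-1 accuracy: {score:.3f}")
```

A benchmark suite would typically repeat this kind of scoring over many datasets and report one aggregate figure per task.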
