Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks
arXiv · Significant research
Summary
Researchers introduce Swan, a family of Arabic-centric embedding models comprising Swan-Small (based on ARBERTv2) and Swan-Large (based on ArMistral). They also propose ArabicMTEB, a benchmark suite that evaluates cross-lingual and multi-dialectal Arabic text embedding performance across 8 tasks and 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large on most Arabic tasks. Why it matters: The new models and benchmark address a critical need for high-quality Arabic embedding models that are both dialectally and culturally aware, enabling more effective NLP applications in the region.
Keywords
Arabic NLP · embedding models · cross-lingual · multi-dialectal · ArabicMTEB