Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Lexical Markup Framework and TEI Lex-0

arXiv · June 16, 2026 · Significant research

NLP Arabic AI Research Infrastructure Digital Humanities

Summary

This paper presents a methodology for digitizing and encoding the Al-Mawrid Arabic-English dictionary using the ISO Lexical Markup Framework (LMF) and TEI Lex-0 guidelines. The research resolves structural ambiguities and inconsistencies, achieving a structural parsing accuracy of 91% and high precision/recall for information extraction, such as 85% precision for synonyms. It also discusses limitations of TEI Lex-0 for Arabic phenomena and explores Linguistic Linked Open Data (LLOD) integration. Why it matters: This work provides a crucial, standardized computational lexicon for Arabic, addressing a significant gap in Arabic lexical infrastructure and offering a reproducible workflow for retro-digitization efforts in Arabic NLP and Digital Humanities.
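To make the encoding target concrete, here is a minimal sketch of what a single bilingual entry might look like when serialized in a TEI Lex-0 style. It is not the paper's schema or pipeline: the `build_entry` helper, the `mawrid.kitab` identifier, the `noun` label, and the sample word are illustrative assumptions, and only a handful of common TEI elements (`entry`, `form`, `orth`, `gramGrp`, `gram`, `sense`, `cit`, `quote`) are used.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_entry(entry_id, lemma, pos, translations):
    """Build one Arabic-English entry (hypothetical helper, not from the paper)."""
    entry = ET.Element(
        tei("entry"),
        {f"{{{XML_NS}}}id": entry_id, f"{{{XML_NS}}}lang": "ar"},
    )

    # Headword: the lemma form and its orthography.
    form = ET.SubElement(entry, tei("form"), {"type": "lemma"})
    ET.SubElement(form, tei("orth")).text = lemma

    # Grammatical information; the "noun" value below is an illustrative label.
    gram_grp = ET.SubElement(entry, tei("gramGrp"))
    ET.SubElement(gram_grp, tei("gram"), {"type": "pos"}).text = pos

    # A single sense holding the English equivalents.
    sense = ET.SubElement(entry, tei("sense"), {f"{{{XML_NS}}}id": f"{entry_id}.s1"})
    for equivalent in translations:
        cit = ET.SubElement(
            sense,
            tei("cit"),
            {"type": "translationEquivalent", f"{{{XML_NS}}}lang": "en"},
        )
        ET.SubElement(cit, tei("quote")).text = equivalent
    return entry


if __name__ == "__main__":
    # Sample data chosen for illustration, not quoted from Al-Mawrid.
    sample = build_entry("mawrid.kitab", "كتاب", "noun", ["book"])
    print(ET.tostring(sample, encoding="unicode"))
```

A real retro-digitization workflow would also have to handle the Arabic-specific phenomena the paper flags as hard to express in TEI Lex-0; this sketch leaves those out.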

Keywords

Al-Mawrid · Arabic-English dictionary · ISO LMF · TEI Lex-0 · Arabic NLP

An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes

arXiv · Mar 15

This paper introduces a non-statistical Arabic lemmatizer designed for information retrieval systems. The lemmatizer draws on Arabic language knowledge resources to generate accurate lemma forms and relevant features. It reaches a maximum accuracy of 94.8%, and 89.15% on first-seen documents, outperforming the Stanford Arabic model's 76.7% on the same dataset. Why it matters: Accurate Arabic lemmatization is crucial for improving the performance of Arabic information retrieval systems, which can enhance access to Arabic-language content.

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

arXiv · May 26

KAUST researchers introduced MOLE, a framework that uses LLMs to automate metadata extraction from scientific papers. The system processes documents in multiple formats and validates its outputs, targeting datasets in languages other than Arabic. A new benchmark dataset has been released to evaluate progress in metadata extraction.

An Empirical Study of Pre-trained Transformers for Arabic Information Extraction

arXiv · Apr 30

This paper introduces GigaBERT, a customized bilingual BERT model pre-trained for Arabic NLP and English-to-Arabic zero-shot transfer learning. The study evaluates GigaBERT's performance on four information extraction tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Results show that GigaBERT outperforms mBERT, XLM-RoBERTa, and AraBERT in both supervised and zero-shot transfer settings. Why it matters: GigaBERT advances Arabic NLP by providing a high-performing, publicly available model tailored for the complexities of the Arabic language and cross-lingual applications.