Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Lexical Markup Framework and TEI Lex-0

arXiv · June 16, 2026 · Significant research

Summary

This paper presents a methodology for digitizing and encoding the Al-Mawrid Arabic-English dictionary, transforming it into a standardized computational lexicon using the ISO Lexical Markup Framework (LMF) and the TEI Lex-0 guidelines. The study is based on an empirical analysis of the entries under the letter Ayn (4.6% of the dictionary), on which it achieved a structural parsing accuracy of 91%. In quantitative evaluation, the information extraction rules performed well, reaching 85% precision and 98% recall for synonyms. Why it matters: This work addresses a significant gap in Arabic lexical infrastructure, providing an interoperable, machine-tractable resource and a reproducible workflow for retro-digitizing complex legacy bilingual lexicons for Arabic NLP and Digital Humanities.
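To give a rough sense of what encoding a bilingual entry in the TEI Lex-0 style can look like, here is a minimal Python sketch. It is not the paper's pipeline. The function name `build_entry`, the identifier `ilm`, and the sample word (علم, "knowledge; science") are illustrative assumptions and are not taken from Al-Mawrid or from the paper. The TEI namespace is omitted for brevity, and the element layout is a simplified reading of the TEI Lex-0 guidelines.

```python
import xml.etree.ElementTree as ET

# Namespace that carries the xml:id and xml:lang attributes.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_entry(entry_id, lemma, pos, translations):
    """Build a simplified TEI Lex-0 style <entry> (hypothetical helper).

    A production document would place these elements in the TEI
    namespace; it is left out here to keep the sketch short.
    """
    entry = ET.Element(
        "entry",
        {f"{{{XML_NS}}}id": entry_id, f"{{{XML_NS}}}lang": "ar"},
    )

    # Headword: <form type="lemma"><orth>...</orth></form>
    form = ET.SubElement(entry, "form", {"type": "lemma"})
    ET.SubElement(form, "orth").text = lemma

    # Part of speech: <gramGrp><gram type="pos">...</gram></gramGrp>
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "gram", {"type": "pos"}).text = pos

    # One sense holding the English translation equivalents.
    sense = ET.SubElement(entry, "sense", {f"{{{XML_NS}}}id": f"{entry_id}.s1"})
    for equivalent in translations:
        cit = ET.SubElement(
            sense,
            "cit",
            {"type": "translationEquivalent", f"{{{XML_NS}}}lang": "en"},
        )
        ET.SubElement(cit, "quote").text = equivalent
    return entry


# Illustrative data only, not an actual Al-Mawrid entry.
entry = build_entry("ilm", "علم", "noun", ["knowledge", "science"])
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping the headword, grammatical information, and translation equivalents in separate elements is what makes such a resource machine-tractable: each field can be queried or converted to another format without re-parsing the printed entry text.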

Keywords

Al-Mawrid dictionary · Arabic NLP · Computational Lexicon · TEI Lex-0 · Linguistic Linked Open Data


An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes

arXiv · Mar 15

This paper introduces a new non-statistical Arabic lemmatizer algorithm designed for information retrieval systems. The lemmatizer leverages Arabic language knowledge resources to generate accurate lemma forms and relevant features. The algorithm achieves a maximum accuracy of 94.8%. On first-seen documents it reaches 89.15%, compared with 76.7% for the Stanford Arabic model on the same dataset. Why it matters: Accurate Arabic lemmatization is crucial for improving the performance of Arabic information retrieval systems, which can enhance access to Arabic language content.

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

arXiv · May 26

KAUST researchers introduced MOLE, a framework that leverages LLMs to automatically extract metadata from scientific papers. The system processes documents in multiple formats and validates its outputs, targeting papers that describe datasets in languages other than Arabic. A new benchmark dataset has been released to evaluate progress in metadata extraction.

An Empirical Study of Pre-trained Transformers for Arabic Information Extraction

arXiv · Apr 30

This paper introduces GigaBERT, a customized bilingual BERT model pre-trained for Arabic NLP and English-to-Arabic zero-shot transfer learning. The study evaluates GigaBERT's performance on four information extraction tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Results show that GigaBERT outperforms mBERT, XLM-RoBERTa, and AraBERT in both supervised and zero-shot transfer settings. Why it matters: GigaBERT advances Arabic NLP by providing a high-performing, publicly available model tailored for the complexities of the Arabic language and cross-lingual applications.