This paper presents a methodology for digitizing and encoding the Al-Mawrid Arabic-English dictionary, transforming it into a standardized computational lexicon using the ISO Lexical Markup Framework (LMF) and TEI Lex-0 guidelines. The research, based on an empirical analysis of the letter Ayn (4.6% of the dictionary), achieved a structural parsing accuracy of 91%. Quantitative evaluation showed high performance for information extraction rules, including 85% precision and 98% recall for synonyms. Why it matters: This work addresses a significant gap in Arabic lexical infrastructure, providing an interoperable, machine-tractable resource and a reproducible workflow for retro-digitizing complex legacy bilingual lexicons for Arabic NLP and Digital Humanities.
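For readers less familiar with the two synonym metrics quoted above, the following minimal sketch shows how precision and recall are conventionally computed for an extraction rule. The function name `precision_recall` and the toy entries are hypothetical illustrations, not the paper's evaluation code or data.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items scored against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data (headword, synonym) pairs; not taken from the paper.
gold_synonyms = {("entry1", "syn_a"), ("entry1", "syn_b"), ("entry2", "syn_c")}
extracted_synonyms = {
    ("entry1", "syn_a"),
    ("entry1", "syn_b"),
    ("entry2", "syn_c"),
    ("entry2", "syn_x"),  # a spurious extraction lowers precision only
}

precision, recall = precision_recall(extracted_synonyms, gold_synonyms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the rule finds every gold synonym but also emits one spurious pair, which is the same pattern as a rule with high recall and lower precision.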
This paper introduces a new non-statistical Arabic lemmatizer algorithm designed for information retrieval systems. The lemmatizer leverages Arabic language knowledge resources to generate accurate lemma forms and their relevant features. The algorithm reaches a maximum accuracy of 94.8%, and 89.15% on documents it sees for the first time, compared with 76.7% for the Stanford Arabic model on the same dataset. Why it matters: Accurate Arabic lemmatization is crucial for improving the performance of Arabic information retrieval systems, which in turn can improve access to Arabic-language content.
KAUST researchers introduced MOLE, a framework that uses LLMs to automatically extract metadata from scientific papers. The system processes documents in multiple input formats and validates its outputs, and it targets papers describing datasets in languages other than Arabic. The researchers also released a new benchmark dataset to evaluate progress in metadata extraction.
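The summary does not say how MOLE validates its outputs. As a rough illustration of what output validation can look like in general, the sketch below checks an LLM-produced metadata record against a small schema. The schema fields, the allowed license values, and the function name `validate_metadata` are all assumptions made for this example, not part of MOLE.

```python
# Hypothetical schema: field name -> (expected type, allowed values or None).
SCHEMA = {
    "name": (str, None),
    "year": (int, None),
    "license": (str, {"CC BY 4.0", "MIT", "unknown"}),
}


def validate_metadata(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, (expected_type, allowed) in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif allowed is not None and value not in allowed:
            problems.append(f"{field}: value {value!r} not in allowed set")
    # Flag fields the model invented that the schema does not define.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical model output in which the year came back as a string.
llm_output = {"name": "ExampleCorpus", "year": "2021", "license": "MIT"}
print(validate_metadata(llm_output))
```

A check of this kind catches type and vocabulary errors before extracted metadata is stored; a pipeline could then retry the extraction or coerce the value.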
This paper introduces GigaBERT, a customized bilingual BERT model pre-trained for Arabic NLP and English-to-Arabic zero-shot transfer learning. The study evaluates GigaBERT's performance on four information extraction tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Results show that GigaBERT outperforms mBERT, XLM-RoBERTa, and AraBERT in both supervised and zero-shot transfer settings. Why it matters: GigaBERT advances Arabic NLP by providing a high-performing, publicly available model tailored for the complexities of the Arabic language and cross-lingual applications.