GCC AI Research

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

arXiv · Significant research

Summary

ArabDiscrim is a new corpus of 293,000 public Arabic Facebook posts from 2014 to 2024, curated to capture posts that discuss racism and discrimination. Unlike prior Twitter-centric datasets, it incorporates platform-native engagement signals, 200 curated terms with morphological regex families, and 20 discrimination axes. The resource also provides explicit attribution patterns and is released under a restricted research-use license for ethical compliance. Why it matters: The dataset offers an ecologically valid foundation for fairness-oriented, platform-aware Arabic Natural Language Processing on a platform that earlier resources largely left uncovered.
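
To make the idea of a "morphological regex family" concrete, here is a minimal Python sketch of how one curated term might be expanded to match its common inflected and cliticized forms. The stem, the prefix and suffix lists, and the function names (`normalize`, `build_family`, `find_matches`) are illustrative assumptions; they are not the patterns or code released with ArabDiscrim.

```python
import re

# Arabic short-vowel marks (U+064B..U+0652) and tatweel (U+0640).
# Stripping them first lets one pattern cover vocalized and plain spellings.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text: str) -> str:
    """Remove diacritics and tatweel so matching is spelling-insensitive."""
    return DIACRITICS.sub("", text)


def build_family(stem: str) -> re.Pattern:
    """Build one regex covering a stem plus common clitics and endings.

    The affix lists below are hypothetical and deliberately small.
    """
    # Optional conjunction (wa-/fa-), then an optional article or preposition.
    prefix = r"[وف]?(?:ال|لل|بال|[بلك])?"
    # Optional feminine or plural ending.
    suffix = r"(?:ة|ين|ون|ات)?"
    # Lookarounds keep the match from starting or ending inside a longer word.
    return re.compile(rf"(?<!\w){prefix}{re.escape(stem)}{suffix}(?!\w)")


def find_matches(text: str, pattern: re.Pattern) -> list[str]:
    """Return every surface form in the normalized text that the family matches."""
    return pattern.findall(normalize(text))


if __name__ == "__main__":
    family = build_family("عنصري")  # assumed example stem: "racist"
    print(find_matches("العنصرية مرفوضة", family))
    print(find_matches("عنصر كيميائي", family))
```

A family built this way matches forms such as the definite noun or a conjunction-prefixed plural while skipping the unrelated word for "element", which shares most of the stem. A production term list would need broader affix coverage and dialectal spelling variants.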

Related

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

arXiv

Researchers have introduced JobArabi, a new large-scale corpus consisting of 20,528 Arabic job announcements collected from X between January 2024 and October 2025. The dataset was compiled using a linguistically informed query framework covering various Arabic recruitment expressions, offering metadata like timestamps and geolocation for detailed analysis. Quantitative analysis of JobArabi reveals sociolinguistic patterns, including persistent gendered hiring language, regional occupational demand variations, and emotional framing in recruitment messages. Why it matters: This corpus provides a valuable resource for research in Arabic NLP, computational social science, and digital labor studies, offering unique insights into labor market communication and linguistic change in the Arab world.