
This report, conducted in February 2025, presents the first comprehensive evaluation of large language models for Arabic Retrieval-Augmented Generation (RAG) tasks, providing critical insights for technology leaders and enterprises across the Gulf Cooperation Council (GCC) region. The evaluation reveals significant performance variations among leading language models when processing Arabic content, offering data-driven guidance for regional AI implementation strategies.

The timing of this evaluation reflects broader regional trends in artificial intelligence adoption. Across the GCC, 70% of C-suite executives recognize that significant digital transformation is necessary to remain competitive, yet only 32% have fully implemented AI solutions.
This gap represents both a challenge and an opportunity.
The United Arab Emirates has launched a $100 billion AI-focused investment fund, while Saudi Arabia's Public Investment Fund has partnered with Google Cloud to establish advanced AI processing centers. Qatar has committed $550 million to AI infrastructure development, positioning the region as a global AI hub.
However, despite these substantial investments, a critical challenge remains: most AI systems and language models are optimized for English-language tasks. For organizations operating in Arabic-speaking markets, this limitation impacts the effectiveness of AI implementations. Arabic presents unique linguistic complexities, including morphological richness, dialectal variation, and right-to-left text processing requirements.
Retrieval-Augmented Generation (RAG) is a particularly important technology for regional enterprises. RAG enhances large language models by incorporating external knowledge sources, reducing hallucinations and enabling access to current, domain-specific information without requiring complete model retraining.
For GCC organizations dealing with regulatory compliance, technical documentation, and customer service in Arabic, RAG systems offer practical solutions for knowledge management and automated response generation.
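The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not the evaluated systems' implementation: it uses simple token overlap in place of the dense-embedding retrieval and LLM generation a production RAG pipeline would use, and the document snippets and query are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenization -- a stand-in for the Arabic
    morphological analysis a real system would need."""
    return set(text.lower().split())

def retrieve(query, documents, top_k=2):
    """Rank documents by token overlap with the query (illustrative scoring)."""
    q_tokens = tokenize(query)
    scored = [(len(q_tokens & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(query, context_docs):
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")

# Hypothetical knowledge-base snippets.
docs = [
    "VAT registration in the UAE is required above a revenue threshold.",
    "The GCC common market allows free movement of goods.",
    "RAG systems ground model answers in retrieved documents.",
]
query = "What is the VAT registration threshold in the UAE?"
prompt = build_prompt(query, retrieve(query, docs))
```

The key property RAG relies on is visible even in this toy version: the model's answer is constrained to retrieved source material, which is what reduces hallucination without retraining.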

RTG's evaluation framework employed a rigorous methodology to assess 12 leading language models across four distinct Arabic language contexts. The evaluation utilized Claude 3.5 Sonnet as an impartial judge, implementing a multi-criteria scoring system that reflects real-world business requirements.
The scoring rubric comprised five weighted criteria:
| Criterion | Weight | Description |
|---|---|---|
| Correctness | 30% | Factual accuracy & source alignment |
| Completeness | 25% | Comprehensive answer scope |
| Conciseness | 20% | Efficient information delivery |
| Helpfulness | 15% | Practical implementation value |
| Technical Accuracy | 10% | Domain-specific terminology handling |
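A model's overall score under this rubric is a weighted sum of its per-criterion judge scores. The sketch below uses the weights from the table; the per-criterion scores and the 0-10 scale are invented for illustration and are not taken from the report's results.

```python
# Weights from the evaluation rubric (must sum to 1.0).
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "conciseness": 0.20,
    "helpfulness": 0.15,
    "technical_accuracy": 0.10,
}

def weighted_score(scores):
    """Combine per-criterion judge scores into a single weighted total."""
    assert set(scores) == set(WEIGHTS), "every criterion must be scored"
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

# Hypothetical judge scores for one model on one task (0-10 scale).
example = {
    "correctness": 8.0,
    "completeness": 7.0,
    "conciseness": 9.0,
    "helpfulness": 6.0,
    "technical_accuracy": 8.0,
}
# 0.30*8 + 0.25*7 + 0.20*9 + 0.15*6 + 0.10*8 = 7.65
```

Weighting correctness and completeness at 55% combined reflects the rubric's emphasis on factual grounding over stylistic qualities.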