Technical vs Cultural: Evaluating LLMs in Arabic
Abstract
We present a pilot evaluation framework for language models in Arabic, revealing distinct performance patterns across technical and cultural dimensions. We evaluate five prominent models, spanning Arabic-specialized systems (Fanar, Falcon 3) and frontier models (Claude Opus, GPT-5, Llama), on a pilot set of 45 prompts covering general knowledge, trust and safety, and mathematical reasoning. Scoring each response along four dimensions, we find that frontier models, led by Claude, excel in technical accuracy, while Arabic-specialized models are competitive in cultural context and language quality, with Fanar showing particularly strong linguistic competency. Mathematical reasoning emerges as the primary technical differentiator, whereas cultural competency varies less between specialized and frontier models than initially hypothesized. These findings underscore the need for assessment approaches that keep pace with emerging models and for balancing technical accuracy with cultural and linguistic fluency, suggesting that domain-specific optimization may be more effective than broad specialization.