EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Abstract
Existing code generation benchmarks primarily evaluate functional correctness, paying limited attention to code efficiency, and are often restricted to a single language such as Python. To address this gap, we introduce EffiBench-X, the first large-scale multi-language benchmark specifically designed for robust efficiency evaluation of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Go, and comprises competitive programming tasks paired with human-expert solutions that serve as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models frequently generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient model (Qwen3-32B) produces solutions that achieve only around 62% of human efficiency on average, with significant language-specific variation: models tend to perform better in Python, Ruby, and JavaScript than in Java, C++, and Go (e.g., DeepSeek-R1's Python code is markedly more efficient than its Java code). These findings highlight the need for research into optimization-oriented methods that improve the efficiency of LLM-generated code across diverse languages. The dataset and evaluation infrastructure are publicly available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
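
For readers who want to explore the data, the sketch below shows one plausible way to load EffiBench-X from the Hugging Face Hub with the datasets library. The dataset ID comes from the link above; the split name is an assumption about the release, so the code inspects the actual fields rather than presuming a schema.

    from datasets import load_dataset

    # Load EffiBench-X from the Hugging Face Hub; the dataset ID matches the
    # link above, but the split name "train" is an assumption about the release.
    ds = load_dataset("EffiBench/effibench-x", split="train")

    print(len(ds))       # number of benchmark tasks in this split
    print(ds[0].keys())  # inspect the actual task fields rather than assuming them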