Auto-SWE-Bench: Scalable, Real-World Benchmarks for LLM Coding Evaluation
Abstract
In this talk, Turing presents Auto-SWE-Bench, a framework for automatically generating large-scale, high-fidelity SWE-Bench-style datasets and execution environments from open-source GitHub repositories. Unlike static, manually curated datasets, Auto-SWE-Bench continuously sources real-world GitHub issues and pull requests across multiple programming languages, producing diverse tasks that reflect the true challenges of software engineering. Our framework ensures reproducibility, granular test feedback, and rigorous quality alignment, enabling dynamic, multi-language benchmarks at scale. For AI researchers working on code generation and reasoning, Auto-SWE-Bench provides a powerful platform to evaluate today’s systems and accelerate progress toward more capable coding models.
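
To make the sourcing step concrete, the sketch below shows one plausible way to harvest merged pull requests from a GitHub repository, pair each with the issue it closes, and emit SWE-Bench-style task records. This is an illustrative assumption, not the Auto-SWE-Bench implementation: the endpoint paths and PR/issue fields come from the public GitHub REST API, the record fields (repo, instance_id, base_commit, problem_statement, patch) mirror the published SWE-Bench schema, and the issue-linking regex and filtering heuristics are placeholders.

```python
"""Hypothetical sketch: mine merged PRs + linked issues into SWE-Bench-style records."""
import re
import requests

API = "https://api.github.com"
# Heuristic (assumed) pattern for "Fixes #123"-style issue links in PR bodies.
CLOSES_RE = re.compile(r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.I)


def merged_prs(owner: str, repo: str, token: str, per_page: int = 50):
    """Yield recently closed pull requests that were actually merged."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "sort": "updated", "direction": "desc", "per_page": per_page},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    for pr in resp.json():
        if pr.get("merged_at"):  # list endpoint marks unmerged PRs with merged_at = null
            yield pr


def issue_text(owner: str, repo: str, number: int, token: str) -> str:
    """Fetch the linked issue's title and body to serve as the problem statement."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/issues/{number}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    issue = resp.json()
    return f"{issue['title']}\n\n{issue.get('body') or ''}"


def build_tasks(owner: str, repo: str, token: str):
    """Pair each merged PR with the issue it closes and emit a task record."""
    for pr in merged_prs(owner, repo, token):
        match = CLOSES_RE.search(pr.get("body") or "")
        if not match:
            continue  # skip PRs with no machine-detectable issue link
        patch = requests.get(pr["diff_url"], timeout=30).text  # unified diff of the PR
        yield {
            "repo": f"{owner}/{repo}",
            "instance_id": f"{owner}__{repo}-{pr['number']}",
            "base_commit": pr["base"]["sha"],  # commit the gold patch applies to
            "problem_statement": issue_text(owner, repo, int(match.group(1)), token),
            "patch": patch,
        }
```

In a full pipeline of the kind the abstract describes, records like these would still need to be split into code and test patches, replayed inside a reproducible container, and filtered by test outcomes before they could serve as benchmark instances; the snippet covers only the harvesting step.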