Auto-SWE-Bench: Scalable, Real-World Benchmarks for LLM Coding Evaluation
Abstract
In this talk, Turing presents Auto-SWE-Bench, a framework for automatically generating large-scale, high-fidelity SWE-Bench-style datasets and execution environments from open-source GitHub repositories. Unlike static, manually curated datasets, Auto-SWE-Bench continuously sources real-world GitHub issues and pull requests across multiple programming languages, producing diverse tasks that reflect the true challenges of software engineering. Our framework ensures reproducibility, granular test feedback, and rigorous quality alignment, enabling dynamic, multi-language benchmarks at scale. For AI researchers working on code generation and reasoning, Auto-SWE-Bench provides a powerful platform to evaluate today’s systems and accelerate progress toward more capable coding models.
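
To make the sourcing step concrete, the sketch below shows one plausible way to harvest merged pull requests from a GitHub repository, pair each with the issue it closes, and emit SWE-Bench-style task records. This is an illustrative assumption, not the Auto-SWE-Bench implementation: the endpoint paths and PR/issue fields come from the public GitHub REST API, the record fields (repo, instance_id, base_commit, problem_statement, patch) mirror the published SWE-Bench schema, and the issue-linking regex and filtering heuristics are placeholders.

```python
"""Hypothetical sketch: mine merged PRs + linked issues into SWE-Bench-style records."""
import re
import requests

API = "https://api.github.com"
# Heuristic (assumed) pattern for "Fixes #123"-style issue links in PR bodies.
CLOSES_RE = re.compile(r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.I)


def merged_prs(owner: str, repo: str, token: str, per_page: int = 50):
    """Yield recently closed pull requests that were actually merged."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "sort": "updated", "direction": "desc", "per_page": per_page},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    for pr in resp.json():
        if pr.get("merged_at"):  # list endpoint marks unmerged PRs with merged_at = null
            yield pr


def issue_text(owner: str, repo: str, number: int, token: str) -> str:
    """Fetch the linked issue's title and body to serve as the problem statement."""
    resp = requests.get(
        f"{API}/repos/{owner}/{repo}/issues/{number}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    issue = resp.json()
    return f"{issue['title']}\n\n{issue.get('body') or ''}"


def build_tasks(owner: str, repo: str, token: str):
    """Pair each merged PR with the issue it closes and emit a task record."""
    for pr in merged_prs(owner, repo, token):
        match = CLOSES_RE.search(pr.get("body") or "")
        if not match:
            continue  # skip PRs with no machine-detectable issue link
        patch = requests.get(pr["diff_url"], timeout=30).text  # unified diff of the PR
        yield {
            "repo": f"{owner}/{repo}",
            "instance_id": f"{owner}__{repo}-{pr['number']}",
            "base_commit": pr["base"]["sha"],  # commit the gold patch applies to
            "problem_statement": issue_text(owner, repo, int(match.group(1)), token),
            "patch": patch,
        }
```

In a full pipeline of the kind the abstract describes, records like these would still need to be split into code and test patches, replayed inside a reproducible container, and filtered by test outcomes before they could serve as benchmark instances; the snippet covers only the harvesting step.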