SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code
Natalia Tarasova · Enrique Balp-Straffon · Aleksei Iancheruk · Yevhenii Sielskyi · Nikita Kozodoi · Liam Byrne · Jack Butler · Dayuan Jiang · Marcin Czelej · Andrew Ang · Yash Shah · Roi Blanco · Sergey Ivanov
Abstract
Infrastructure-as-code (IaC) is critical for cloud reliability and scalability, yet the capabilities of LLMs in this domain remain underexplored. Existing benchmarks focus on declarative tools such as Terraform and on full-code generation. We introduce SWE-InfraBench, a dataset of realistic incremental edits to AWS CDK repositories drawn from real-world codebases. Each task requires modifying existing IaC according to a natural language instruction, with correctness verified by passing tests. Results show that current LLMs struggle: the best model (Sonnet 3.7) solves 34% of tasks, while reasoning models such as DeepSeek R1 reach only 24%.
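To make the task format concrete, below is a minimal sketch of the kind of incremental edit and test-based check the abstract describes. It is an illustration, not an example from the dataset: the stack name, bucket ID, and instruction text are hypothetical, though the aws-cdk-lib and assertions APIs shown are the standard CDK v2 ones.

```typescript
import { App, Stack, StackProps, Duration } from "aws-cdk-lib";
import { Template } from "aws-cdk-lib/assertions";
import { Construct } from "constructs";
import * as s3 from "aws-cdk-lib/aws-s3";

// Hypothetical repository code. An instruction might read: "Enable
// versioning on the data bucket and expire noncurrent object versions
// after 30 days." The model must make the incremental edit below.
export class DataStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, "DataBucket", {
      versioned: true, // edit 1: turn on versioning
      lifecycleRules: [
        // edit 2: expire noncurrent versions after 30 days
        { noncurrentVersionExpiration: Duration.days(30) },
      ],
    });
  }
}

// Hypothetical verification test: synthesize the stack to a
// CloudFormation template and assert the edited property is present.
const template = Template.fromStack(new DataStack(new App(), "Test"));
template.hasResourceProperties("AWS::S3::Bucket", {
  VersioningConfiguration: { Status: "Enabled" },
});
```

A task in this style passes only if the model's edit produces a template satisfying the assertions, which is how "correctness verified by passing tests" would be operationalized.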