Predicting Emergent Software Engineering Capabilities by Fine-Tuning
Abstract
Large Language Models exhibit unpredictable performance jumps on downstream tasks, and predicting when these emergent abilities will arise remains challenging. While emergence has been observed across a variety of tasks, the extent to which it poses a problem depends on the task at hand. This work extends emergence prediction to SWE-bench by fine-tuning LLaMA-3.1-8B and Qwen3-14B, demonstrating that task-specific fine-tuning of smaller models accurately predicts when higher capabilities will emerge, and thus how larger models will behave. We fit an empirical emergence law by varying the amount of fine-tuning data, showing that tracking the performance of smaller models can predict the performance of larger models on SWE-bench using only a fraction of the computational resources. Validation on SWE-bench shows that fine-tuned models achieve substantially improved success rates (up to 44% versus a 5% untuned baseline), with the fitted emergence law accurately anticipating performance thresholds (LLaMA: RMSE = 2.22, R² = 0.95; Qwen: RMSE = 1.02, R² = 0.99).
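To illustrate the kind of fit the abstract describes, the following is a minimal sketch of fitting an empirical emergence law with SciPy. The logistic functional form, the parameter names, and all data values are illustrative assumptions, not the paper's actual law or measurements.

```python
# A minimal sketch of fitting an empirical "emergence law" with SciPy.
# The logistic form, parameter names, and data values below are
# illustrative assumptions, not the paper's actual law or results.
import numpy as np
from scipy.optimize import curve_fit

def emergence_law(log_data, ceiling, midpoint, steepness):
    """Logistic curve: success rate (%) vs. log10(fine-tuning examples)."""
    return ceiling / (1.0 + np.exp(-steepness * (log_data - midpoint)))

# Hypothetical SWE-bench success rates of a fine-tuned small model,
# measured at increasing fine-tuning set sizes.
n_examples = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
success = np.array([5.0, 7.0, 14.0, 28.0, 39.0, 44.0])

# Fit the law, then extrapolate to a larger data budget.
params, _ = curve_fit(emergence_law, np.log10(n_examples), success,
                      p0=[45.0, 3.0, 2.0])
pred = emergence_law(np.log10(3e5), *params)
print(f"fitted params: {params}, extrapolated success: {pred:.1f}%")
```

Under this framing, the fitted curve for small, cheaply fine-tuned models is extrapolated to anticipate where larger models cross a capability threshold, which is what the reported RMSE and R² values would quantify.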