SubtaskEval: Benchmarking LLMs on Competitive Programming Subtasks
Abstract
Existing code generation benchmarks such as HumanEval, MBPP, and LiveCodeBench evaluate only full solutions, overlooking meaningful partial progress on competitive programming tasks. We introduce SubtaskEval, a benchmark of 287 olympiad problems (2017–2025) that preserves official subtask structures, metadata, and online-judge links. Evaluating six recent LLMs, including a tool-augmented variant, we find that even the best model achieves only 18.47\% accuracy (pass@1), although tool use improves subtask performance. Models exhibit bottom-heavy score distributions, in contrast to the more balanced distributions of human contestants. Subtask-based evaluation thus provides a finer-grained view of model problem-solving and highlights directions for advancing LLMs in code generation.