Poster
StackEval: Benchmarking LLMs in Coding Assistance
Zulkuf Genc · Nidhish Shah · Dogu Araci
West Ballroom A-D #5109
We present a comprehensive set of benchmarks to evaluate the performance of Large Language Models (LLMs) in coding assistance tasks, covering code writing, debugging, code review, and answering conceptual questions. Our main contributions are three curated benchmarks: StackEval, a coding-assistance benchmark of 925 Stack Overflow questions; StackEval-Recent, a benchmark of 300 questions drawn from the most recent Stack Overflow content; and an LLM-as-a-Judge benchmark of 136 LLM-generated answers validated by domain experts. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. To ensure reproducibility and ongoing relevance, we publicly share our datasets and evaluation code, with plans to update the recent dataset biannually. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.
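To make the LLM-as-a-Judge setup concrete, the sketch below shows one way such an evaluation loop could be wired up. The sample fields, the judge prompt, and the call_llm stub are assumptions for illustration only; they are not the authors' released evaluation code.

# Minimal LLM-as-a-Judge sketch in the spirit of the benchmark described above.
# All field names and the prompt wording are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class JudgedSample:
    question: str          # Stack Overflow question body
    reference_answer: str  # accepted human answer
    candidate_answer: str  # LLM-generated answer to be judged

JUDGE_PROMPT = """You are grading a coding-assistance answer.

Question:
{question}

Reference answer:
{reference}

Candidate answer:
{candidate}

Is the candidate answer acceptable as a solution to the question?
Reply with exactly one word: ACCEPTABLE or UNACCEPTABLE."""

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the judge model (plug in your own API client)."""
    raise NotImplementedError("connect this to a model client")

def judge(sample: JudgedSample) -> bool:
    """Ask the judge model whether the candidate answer is acceptable."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=sample.question,
        reference=sample.reference_answer,
        candidate=sample.candidate_answer,
    ))
    return verdict.strip().upper().startswith("ACCEPTABLE")

def acceptance_rate(samples: list[JudgedSample]) -> float:
    """Fraction of candidate answers the judge marks acceptable."""
    verdicts = [judge(s) for s in samples]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

In practice the judge's binary verdicts would be compared against the expert-validated labels in the LLM-as-a-Judge benchmark to measure how well the automated judge agrees with human assessment.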