How to Build Agents to Generate Kernels for Faster LLMs (and Other Models!)
Abstract
The compute demanded by modern AI has been exploding since 2016: the FLOPs used to train frontier models have grown at roughly 2.4x per year [0], and inference is growing even faster, already accounting for an estimated 80% of total AI electricity use [1]. Large language models and other deep networks rely on highly tuned GPU kernels to reach state-of-the-art performance, and efficient kernels translate directly into cost and energy savings. In this 2.5-hour in-person tutorial, we demonstrate how LLM-powered agents can generate and optimize GPU kernels for CUDA, HIP/ROCm, and Triton. We begin with a unified primer on GPU-programming fundamentals and common tooling (memory hierarchy, occupancy, profilers), then introduce an agentic loop: prompt engineering, compiler/profiler feedback as tools, iterative kernel refinement, correctness validation, and automated benchmarking. We provide additional benchmarking examples for HIP and Triton on top of Stanford's KernelBench, which covers CUDA [2], and KernelBot, a reliable source of human-curated, heterogeneous GPU code [3], and show how to turn runtime and profiler metrics into reward signals that drive kernel optimization. On top of this loop, we build an inference-scaling framework in which the LLM proposes candidate kernels, compiles them, measures latency, throughput, and energy, and feeds those signals back as rewards. By combining test-time scaling techniques, the agent iteratively discovers increasingly accurate and efficient kernels. Attendees will compare generated code against expert-written kernels and inspect both wins and losses. By the end, participants will walk away with a reproducible pipeline for LLM-driven GPU-kernel optimization.
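To make the agentic loop concrete, the sketch below shows one way the propose-compile-measure-reward cycle could be wired together. It is illustrative only, not the tutorial's actual pipeline: `query_llm` is a hypothetical placeholder for any LLM API call, and each candidate is assumed to be a self-contained CUDA program that checks its own output against a reference and prints its latency in milliseconds, compiled here with plain `nvcc`.

```python
"""Minimal sketch of an agentic kernel-optimization loop (illustrative only)."""
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Candidate:
    source: str        # CUDA C++ source proposed by the LLM
    latency_ms: float  # measured latency; float("inf") if the attempt failed


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call that returns CUDA source code."""
    raise NotImplementedError("wire up your LLM provider here")


def compile_and_run(source: str) -> float:
    """Compile a candidate with nvcc and run it; return latency in ms.

    The benchmark harness is assumed to validate correctness against a
    reference output and print a single latency number on success.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "kernel.cu"
        binary = Path(tmp) / "kernel"
        src.write_text(source)
        build = subprocess.run(
            ["nvcc", "-O3", str(src), "-o", str(binary)],
            capture_output=True, text=True,
        )
        if build.returncode != 0:
            return float("inf")  # compile error: worst possible reward
        run = subprocess.run([str(binary)], capture_output=True, text=True)
        if run.returncode != 0:
            return float("inf")  # wrong results or runtime failure
        try:
            return float(run.stdout.strip())
        except ValueError:
            return float("inf")  # unexpected harness output


def optimize(task_description: str, rounds: int = 8) -> Candidate:
    """Iteratively propose, measure, and refine kernels; keep the best one."""
    best = Candidate(source="", latency_ms=float("inf"))
    feedback = "No previous attempt."
    for _ in range(rounds):
        prompt = (
            f"Task: {task_description}\n"
            f"Feedback from the last attempt: {feedback}\n"
            "Write a faster, correct CUDA kernel and benchmark harness."
        )
        source = query_llm(prompt)
        latency = compile_and_run(source)
        # Compiler and profiler output could also be appended here as feedback.
        feedback = f"Measured latency: {latency} ms."
        if latency < best.latency_ms:
            best = Candidate(source=source, latency_ms=latency)
    return best
```

Treating compile or runtime failures as infinite latency lets the loop discard broken candidates without special-casing them, and the measured latency doubles as both the feedback text for the next prompt and the reward used to keep the best kernel so far.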
Schedule
| Time | Topic |
| --- | --- |
| 1:40 PM | |
| 3:30 PM | |
| 3:45 PM | |