Efficient Transformers: State of the Art in Pruning, Sparse Attention, and Transformer Funneling
Transformer architectures consume the lion's share of the computational budgets behind today’s most powerful language and vision models, making computational efficiency a pressing and essential research direction. Our proposed tutorial surveys the state of the art in three complementary research threads that together comprise much of the current industrial toolkit for efficient Transformers: (1) pruning, the structured or unstructured removal of weights, layers, and heads; (2) sparse attention and routing, including block-sparse, sliding-window, and locality-sensitive-hashing attention; and (3) funneling, which pools intermediate representations to shorten sequences through depth. We will then feature an expert panel of industrial and academic speakers from Google DeepMind, MIT, UC Berkeley, and Columbia, who will discuss the latest trends in top industrial labs. Attendees will leave with actionable recipes for building sub-10B-parameter models that match or exceed dense baselines on language, vision, and multimodal benchmarks.
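To make the sparsity thread concrete, here is a minimal NumPy sketch (our illustration, not the tutorial's materials) of the causal sliding-window pattern named above: each query attends only to the `window` most recent positions, so per-head attention cost drops from O(n²) to O(n·w).

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention mask: position i may attend to
    position j only if j <= i and i - j < window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Each query attends to at most `window` keys instead of all seq_len keys.
print(sliding_window_mask(seq_len=8, window=3).astype(int))
```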
The tutorial targets researchers and practitioners who build or deploy Transformer models and assumes familiarity with basic deep-learning concepts but not with any specific efficiency method. All slides and publication materials will be released under a permissive license.
From Tuning to Guarantees: Statistically Valid Hyperparameter Selection
"The performance and reliability of modern machine learning systems depend critically on hyperparameter selection. Whether tuning a large language model, configuring a vision pipeline, or deploying AI in safety-critical environments, the choice of hyperparameters is decisive. Current tuning strategies such as grid or random search and Bayesian optimization are powerful for empirical optimization but they do not provide statistical guarantees on the reliability of the selected configuration after deployment. This gap becomes critical when models must satisfy strict performance, safety, or fairness requirements.
This tutorial introduces a rigorous and practical framework that treats hyperparameter selection as a statistical testing problem. By constructing valid p- or e-values for candidate configurations and applying multiple hypothesis testing (MHT) procedures, practitioners can control deployment risk with finite-sample guarantees. We begin with the Learn-Then-Test (LTT) methodology for average-risk control and build up to several key extensions: controlling quantile risk with quantile LTT (QLTT), handling multi-objective optimization through Pareto Testing (PT), incorporating prior information via reliability graphs, and achieving data-efficient selection with adaptive LTT (aLTT). Throughout the tutorial, we emphasize conceptual clarity, plain-language explanations of assumptions, and hands-on demonstrations with minimal, reproducible notebooks.
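As a preview of the hands-on notebooks, the following is a minimal sketch of Bonferroni-corrected LTT for a loss bounded in [0, 1], using the Hoeffding p-value; the function names and toy numbers below are our own assumptions, not the tutorial's code.

```python
import numpy as np

def hoeffding_p_value(emp_risk, n, alpha):
    """One-sided p-value for H0: true risk > alpha, loss bounded in [0, 1].
    By Hoeffding's inequality, under H0 the chance of observing an
    empirical risk this low is at most exp(-2 n (alpha - emp_risk)^2)."""
    gap = max(alpha - emp_risk, 0.0)
    return float(np.exp(-2.0 * n * gap ** 2))

def learn_then_test(config_risks, n, alpha=0.1, delta=0.05):
    """Bonferroni-corrected LTT: return indices of configurations certified
    to have risk <= alpha, with family-wise error rate at most delta."""
    m = len(config_risks)
    return [i for i, r in enumerate(config_risks)
            if hoeffding_p_value(r, n, alpha) <= delta / m]

# Toy usage: 5 candidate configurations, each evaluated on 2000 held-out points.
risks = [0.04, 0.07, 0.12, 0.02, 0.09]
print(learn_then_test(risks, n=2000, alpha=0.1, delta=0.05))  # -> [0, 3]
```

Bonferroni is the simplest FWER control; the extensions above swap in more powerful selection procedures while keeping the same evidence interface.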
Attendees will gain a drop-in toolkit for augmenting existing tuning workflows with statistically valid selection. They will learn how to formalize relevant risk functions, generate valid evidence, choose appropriate error-rate controls (FWER/FDR), and navigate the trade-offs between statistical conservatism and power under limited data. No prior expertise in multiple hypothesis testing is required.
Geospatial foundation models (GeoFMs) are a class of large-scale deep learning models, typically based on the transformer architecture, that are pre-trained on vast, diverse datasets of Earth Observation data to learn a general, transferable understanding of the Earth’s surface. These models help address long-standing challenges in Earth Observation by dramatically reducing the need for manually labeled data, handling vast and diverse data streams (e.g., optical, SAR, multispectral, LiDAR), and enabling robust performance across time, space, and sensor types. In this tutorial, we will give an overview of recent advancements in GeoFMs, highlighting the main challenges in developing these models and the differences from foundation models developed for other domains. We will also show practical examples of fine-tuning GeoFMs and running inference for different downstream tasks using the TerraTorch open-source framework, which facilitates the use of publicly available GeoFMs such as SatMAE, Prithvi-EO, DOFA, Galileo, and TerraMind. Finally, we will introduce best practices for systematic and reproducible benchmarking of GeoFMs using the TerraTorch Iterate plug-in and its integration with GEO-Bench.
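To illustrate the fine-tuning pattern without depending on TerraTorch's actual API, here is a schematic PyTorch sketch: freeze a pre-trained GeoFM backbone and train only a lightweight segmentation head. The backbone interface, the embedding size, and the 16x upsampling factor are placeholder assumptions, not TerraTorch code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Lightweight decoder: 1x1 conv over backbone features, then
    upsampling back to input resolution (16x patch size assumed)."""
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, feats):                  # feats: (B, embed_dim, H/16, W/16)
        logits = self.proj(feats)              # (B, num_classes, H/16, W/16)
        return F.interpolate(logits, scale_factor=16,
                             mode="bilinear", align_corners=False)

def build_finetune_model(backbone, embed_dim, num_classes, freeze=True):
    """Attach a task head to a pre-trained GeoFM backbone, optionally
    freezing the backbone so only the head is trained."""
    if freeze:
        for p in backbone.parameters():
            p.requires_grad = False
    return nn.Sequential(backbone, SegmentationHead(embed_dim, num_classes))

# Hypothetical usage, assuming a backbone that returns (B, 768, H/16, W/16) maps:
# model = build_finetune_model(my_pretrained_geofm, embed_dim=768, num_classes=10)
```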
We are living through a moment that once belonged to science fiction: generative foundation models can write, reason, design, diagnose, and increasingly, decide. They are no longer just predicting the next word — they are shaping knowledge, influencing choices, and becoming collaborators in science, medicine, education, and daily life. But here's the tension: as their capabilities accelerate, our ability to trust them has not kept pace.
Trustworthiness can't remain a "patch after the failure" or a moral hope layered on top of engineering. It must evolve into a science—a discipline as rigorous as the one that created these models in the first place. In this tutorial, we explore what that science looks like: how we understand model behaviors, measure and stress-test trust, and design systems that earn it. We'll build the foundations together, then step into the frontier—where models begin to exhibit human-like cognitive behaviors that inspire wonder, but also demand responsibility and new forms of alignment.
This session is an invitation: to move beyond building models that impress us, toward building models we can trust with what matters.
How to Build Agents to Generate Kernels for Faster LLMs (and Other Models!)
The compute demanded by modern AI has been exploding since 2016; the FLOPs used to train frontier models have grown at a rate of 2.4x per year [0], and the inference side is growing even faster, already accounting for an estimated 80% of total AI electricity use [1]. Large language models and other deep networks rely on highly tuned GPU kernels to achieve state-of-the-art performance, and these efficient kernels translate directly into cost and energy savings. In this 2.5-hour in-person tutorial, we demonstrate how LLM-powered agents can generate and optimize GPU kernels for CUDA, HIP/ROCm, and Triton. We begin with a unified primer on GPU-programming fundamentals and common tooling (memory hierarchy, occupancy, profilers), then introduce an agentic loop: prompt engineering, compiler/profiler feedback as tools, iterative kernel refinement, correctness validation, and automated benchmarking. On top of Stanford’s KernelBench, which covers CUDA [2], and KernelBot, a reliable source of human-curated datasets for heterogeneous GPU code [3], we will provide additional benchmarking examples on HIP and Triton, and show how to turn runtime and profiler metrics into reward signals that drive kernel optimization. Building on this loop, we construct an inference-scaling framework in which the LLM proposes candidate kernels, compiles them, measures latency/throughput/energy, and feeds those signals back as rewards. By combining test-time scaling techniques, the agent iteratively discovers increasingly accurate and efficient kernels. Attendees will compare generated code against expert kernels and inspect wins and losses. By the end, participants will walk away with a reproducible pipeline for LLM-driven GPU-kernel optimization.
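As a sketch of the agentic loop described above (not the tutorial's actual pipeline), the following Python skeleton wires compiler and benchmark feedback back into the prompt; `llm` and `is_correct` are hypothetical callables standing in for an LLM client and a correctness check against a reference kernel.

```python
import subprocess
import tempfile
import time

def compile_kernel(cuda_src):
    """Compile a candidate CUDA kernel with nvcc; return (binary_path, log).
    binary_path is None on failure; the log is fed back to the agent."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".cu", delete=False) as f:
        f.write(cuda_src)
        src_path = f.name
    bin_path = src_path + ".out"
    proc = subprocess.run(["nvcc", "-O3", "-o", bin_path, src_path],
                          capture_output=True, text=True)
    return (bin_path if proc.returncode == 0 else None), proc.stderr

def median_latency(bin_path, runs=20):
    """Median wall-clock latency of a kernel test binary, in seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([bin_path], check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def agent_loop(llm, task_prompt, is_correct, steps=8):
    """Propose -> compile -> validate -> benchmark, keeping the fastest
    correct kernel and feeding each outcome into the next prompt."""
    best_src, best_lat, feedback = None, float("inf"), ""
    for _ in range(steps):
        src = llm(task_prompt + feedback)            # agent proposes a kernel
        bin_path, log = compile_kernel(src)
        if bin_path is None:
            feedback = f"\nYour last kernel failed to compile:\n{log}"
            continue
        if not is_correct(bin_path):                 # e.g. diff vs. a reference
            feedback = "\nYour last kernel compiled but gave wrong outputs."
            continue
        lat = median_latency(bin_path)
        if lat < best_lat:
            best_src, best_lat = src, lat
        feedback = f"\nYour last kernel ran in {lat * 1e3:.2f} ms; make it faster."
    return best_src, best_lat
```

The same scaffold generalizes to HIP (swap nvcc for hipcc) and Triton (compile and time from Python directly), and the latency signal can be replaced by any profiler-derived reward.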
Positional Encoding: Past, Present, and Future
Positional Encoding is a foundational yet often opaque component of Transformer architectures, underpinning how self-attention mechanisms capture sequence order in language, vision, and multimodal models. Despite its centrality to the success of modern LLMs and other attention-reliant architectures, the mathematical intuition behind positional encoding remains challenging and inaccessible to many researchers and practitioners. This workshop aims to demystify positional encoding by bridging formal theory with intuitive understanding and practical experimentation. Through a series of guided lectures, participants will explore the operational principles behind effective positional representations, the evolution of key methods (from sinusoidal and learned embeddings to rotary and relative encodings), and open challenges that motivate current research directions. We will also provide open-source code implementations, mathematical visualizations, and collaborative ideation sessions for fostering new positional encoding concepts. By lowering the barrier to entry for this mathematically intensive yet crucial topic, the workshop seeks to foster deeper understanding, interdisciplinary exchange, and novel contributions to the future of positional encoding and Transformer design.
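As a taste of the hands-on material, here is a minimal NumPy implementation of the classic sinusoidal encoding from "Attention Is All You Need"; the rotary and relative variants covered in the workshop build on the same geometric frequency ladder.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (assumes even d_model):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

print(sinusoidal_encoding(seq_len=128, d_model=64).shape)  # (128, 64)
```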