NeurIPS 2025 Tuesday 12/2

Skip to yearly menu bar Skip to main content

Timezone: America/Los_Angeles

Full Schedule Sun 11/30 Mon 12/1 Tue 12/2 Wed 12/3 Thu 12/4 Fri 12/5 Sat 12/6 Sun 12/7

Registration Desk

Registration Desk

7:30 AM - 6:00 PM

Affinity Workshop

Women in Machine Learning

Tiffany Vlaar · Nikita Saxena · Tatjana Chavdarova · Ana María Quintero-Ossa · Kairan ZHAO · Kristina Ulicna · Trisha Mittal · Aishwarya Jadhav · Akshita Ramya Kamsali · Man Luo · Zeinab Abboud · Jean Amukwatse

8:00 AM - 5:00 PM

Expo Talk Panel

Toward General Full Autonomy: Open Research Challenges in Scalable Self-Driving

Ben Snyder · John Anderson

8:30 AM - 9:30 AM

Speaker: Ben Snyder, Director of AI Research for Autonomous Vehicles, General Motorsx000D
x000D
GM’s driver-assistance systems now power millions of vehicles across North America, logging over half a billion hands-free miles with zero reported crashes—demonstrating the safety and scalability of autonomy at unprecedented consumer scale. Following GM’s acquisition of Cruise earlier this year, the combined teams now bring together a decade of experience to accelerate progress toward full, generalized autonomy. This talk will dive into the open research challenges ahead—from model architecture and the balance between imitation and reinforcement learning, to leveraging vision-language models for long-tail, common-sense reasoning, and advancing the training, deployment, and simulation infrastructure needed to scale truly generalized autonomy.

Expo Talk Panel

Foundational Generative Recommendations for E-Commerce

Ali Khanafer · Yang Liu · Gennady Pekhimenko · Jacob Marks

8:30 AM - 9:30 AM

Modern commerce platforms face the challenge of delivering personalized recommendations across billions of items to users with diverse intents, temporal dynamics, and cold-start scenarios. We present a generative foundation model for commerce built on Hierarchical Sequential Transduction Units (HSTU) that integrates Liquid Foundation Models (LFM) and custom CUDA kernels developed in collaboration with Nvidia for efficient training and online serving. Our approach demonstrates that generative methods unlock substantial gains through three key innovations: (1) large-scale contrastive learning with hard negative sampling; (2) temporal mechanisms that fuse multi-scale time signals (session, day, season) with commerce-specific features; and (3) optimized training and inference kernels. While results are promising, significant challenges remain in handling non-stationary preferences, growing product catalogue, and multi-objective optimization—we discuss our roadmap toward truly foundational commerce models that generalize across domains and market conditions.

Expo Talk Panel

GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection

Melissa Kazemi Rad

8:30 AM - 9:30 AM

The rapid integration of LLMs for synthetic data generation offers a powerful solution for data scarcity, a critical issue for training machine learning models. However, the resulting data quality is often questionable, requiring costly and time-consuming manual review. This oversight is especially challenging in the crucial domain of trustworthy and safe AI.

A significant need here is the automated identification of adversarial and harmful inputs, a process known as red-teaming, to improve guardrailing systems before deployment. Developing an effective, fully automated red-teaming approach capable of generating diverse, out-of-domain harmful content has been a long-standing challenge.

To address data scarcity in harmful text detection & the challenge of automated red-teaming, we introduce GRAID (Geometric & Reflective AI-Driven Data Augmentation), a novel, versatile, & dynamic multi-agent framework.

GRAID operates in two stages:
1) Geometric Generation: A constrained, fine-tuned LLM generates geometrically controlled examples to diversify content within the original embedding space.x000D
2) Reflective Augmentation: A multi-agentic reflective process promotes stylistic diversity & uncovers difficult edge cases.

This combination ensures both reliable coverage of the input space & nuanced exploration of harmful content. We demonstrate that augmenting a harmful text classification dataset with GRAID significantly improves the performance of downstream real-world guardrail models. Furthermore, GRAID captures data variability in new geometric domains while preserving data relationships.

While initially focused on harmful text detection, GRAID’s modular design makes it an inherently domain-agnostic framework adaptable to various applications beyond classification. In this talk, we will detail how GRAID distinguishes itself from existing solutions, discuss its building blocks, & share insights on its easy adaptation for diverse synthetic data generation needs.

Expo Talk Panel

Multimodal Data Foundation at Industry-Scale

Hu Xu · Shang-Wen Li · Veloso · Aedamar Drummond

8:30 AM - 9:30 AM

Pre-training is fundamental to foundation models, enabling them to acquire broad knowledge that gives rise to emerging capabilities at later training stages, and scaling is the key for pre-training. In this talk, we present a recipe for building and curating pre-training, multimodal image-text paired data from scratch on a global scale, enabling mutual benefits between English and non-English data. We would like to share our key observations and insights with the community on: (1) why scaling matters, including the foundational role of data and key principles to hold for scaling; (2) how to design simple yet scalable data algorithms that enable industry-scale data collection and training without data filters, serving both research and production needs; (3) how the scaling improves Meta’s products at conventional and frontier machine learning areas. Submission is facilitated by Cogs & Marvel but is entirely organized, executed, and implemented by Meta.

Expo Talk Panel

Beyond Benchmarks: Rethinking Reasoning in Language Models

Iman Mirzadeh · Mehrdad Farajtabar

8:30 AM - 9:30 AM

Reasoning is often described as the next frontier for AI, but what does it really mean for a model to “reason”, and how should we measure it? Popular benchmarks like GSM8K suggest steady progress, yet controlled studies reveal that models can fail dramatically under small changes—such as swapping numbers or adding irrelevant details. Large Reasoning Models (LRMs), which generate explicit chains of thought, raise new optimism but also expose clear limits: they often underperform standard models on simple tasks, improve briefly at medium complexity, and then collapse on harder ones despite having unused compute. Crucially, reasoning is not the same as knowledge recall, tool use, or agent-like behavior. True reasoning involves solving novel problems, decomposing them into steps, generalizing to new contexts, recombining partial results, and finally generating novel hypotheses—capabilities current systems largely lack. Today’s evaluations, focused on final answers and contaminated benchmarks, risk giving a misleading sense of progress. This talk will provide a critical review of reasoning in language models, highlight why current evaluations can be deceptive, and emphasize that reasoning is not just about “what" models answer, but “how" they solve problems.

Expo Talk Panel

Telling Stories at Scale: Multimodal ML in the Global Media Landscape

Katelyn McDaniel · Ritwik Kumar

8:30 AM - 9:30 AM

Netflix has long been recognized as a pioneer in personalization, leveraging member preference data to recommend the most engaging shows, movies, and games. In recent years, we have expanded our use of machine learning and data-driven approaches to support a wide array of upstream creative and operational workflows. In this talk, we will discuss how modern AI methods – contrastive learning, transformers, cross-modal retrieval, graph neural networks, etc. – are transforming the curation, localization, promotion, and launch of stories on a global scale. A distinctive aspect of our work is the integration of highly creative assets—text, images, video, and speech—alongside traditional tabular datasets. We will highlight unique challenges that arise at the intersection of multimedia, personalization, and web-scale products, and share how advanced ML/AI techniques are addressing these challenges to connect great stories with our worldwide audience.

Expo Talk Panel

Data Scout: “From Prompt to Corpus: Accelerating Domain-Specific Data Collection with LLMGuided Discovery”

Chirag Garg

8:30 AM - 9:30 AM

Training large language models for specialized disciplines such as advanced mathematics, molecular biology, or legal reasoning is limited by the scarcity of large, high quality, domain specific corpora. Most publicly available datasets are dominated by general purpose web text. When available, specialized data are fragmented across diverse sources such as preprints, conference papers, forums, lecture notes, and digitized books. No single source offers comprehensive real-world coverage across scientific domains. Consequently, scaling up authentic domain data remains a bottleneck: collecting a subset of relevant tokens often requires downloading and filtering hundreds of terabytes of raw web material, a process that is both time consuming and costly. x000D x000D We introduce Data Scout, a modular, LLM powered pipeline that turns a high-level user intent (e.g., “I need data for advanced mathematics”) into a vetted, list of seed URLs in minutes. The system first expands the original intent using an LLM that generates a hierarchical subgraph of related concepts; this taxonomy drives a diversified set of search queries that systematically cover the target domain while respecting known licensing signals. Candidate URLs are then filtered by the same LLM using chain of thought prompting based on topical relevance, licensing clarity, and crawlability. Our results show that that the list of selected candidate URLs when crawled can yield a high percentage of relevant pages (40%+) related to the user’s intended topic or query, compared to less than 1 percent in general web-scale corpora. Data Scout is available with both CLI and GUI front ends. By democratizing domain specific data acquisition, Data Scout enables researchers without dedicated crawling infrastructure to bootstrap large, high-fidelity corpora, accelerating the development of specialized LLMs across various niche domains fields.

Affinity Workshop

LatinX in AI

Laura Montoya · Abraham Ramos · Ana María Quintero-Ossa · Felipe Leno da Silva · Tania Lorido · Daniela Cortes Bermudez · Melissa Montes

9:00 AM - 5:00 PM

Tutorial

Model Merging: Theory, Practice and Applications

Marco Ciccone · Malikeh Ehghaghi · Colin Raffel

9:30 AM - 12:00 PM

Tutorial

Foundations of Tensor/Low-Rank Computations for AI

Grigorios Chrysos · Evrim Acar · Antonio Vergari

9:30 AM - 12:00 PM

Tutorial

Energy and Power as First-Class ML Design Metrics

Jae-Won Chung · Ahmet Inci · Ruofan Wu

9:30 AM - 12:00 PM

Tutorial

Explain AI Models: Methods and Opportunities in Explainable AI, Data-Centric AI, and Mechanistic Interpretability

Shichang (Ray) Zhang · Himabindu Lakkaraju · Julius Adebayo

9:30 AM - 12:00 PM

Understanding AI system behavior has become critical for safety, trust, and effective deployment across diverse applications. Three major research communities have emerged to address this challenge through interpretability methods: Explainable AI focuses on feature attribution to understand which input features drive model decisions; Data-Centric AI emphasizes data attribution to analyze how training examples shape model behavior; and Mechanistic Interpretability examines component attribution to understand how internal model components contribute to outputs. These three branches share the goal of better understanding AI systems across different aspects and differ primarily in their perspectives rather than techniques. This tutorial begins with foundational concepts and historical context, providing essential background on why explainability matters and how the field has evolved since its early days. The first technical deep dive covers post hoc explanation methods, data-centric explanation techniques, mechanistic interpretability approaches, and presents a unified framework demonstrating that these methods share fundamental techniques such as perturbations, gradients, and local linear approximations. The second technical deep dive explores inherently interpretable models, clarifying concepts like reasoning (chain-of-thought) LLMs and self-explanatory LLMs in the context of explainability, and techniques for building inherently interpretable LLMs. We also showcase open source tools that make these methods accessible to practitioners. Furthermore, we highlight promising future research directions in interpretability research and the induced future directions in AI more broadly, with applications in model editing, steering, and regulation. Through comprehensive coverage of algorithms, real-world case studies, and practical guidance, attendees will gain both a deep technical understanding of state-of-the-art methods and practical skills to apply interpretability techniques effectively in AI applications.

Tutorial

New Frontiers of Hyperparameter Optimization: Recent advances and open challenges in theory and practice

Dravyansh Sharma · Colin White · Maria-Florina Balcan

9:30 AM - 12:00 PM

Machine learning algorithms operate on data, and for any task the most effective method depends
on the data at hand. Hyperparameter optimization and algorithm selection are therefore crucial to ensure the best performance in terms of accuracy, efficiency, reliability, interpretability, etc. We first survey common techniques used in practice for hyperparameter optimization in machine learning including Bayesian optimization and bandit-based approaches. We next discuss new approaches developed in the context of Large Language Models, including neural scaling laws and parameterization-aware methods. We will discuss chief advantages and shortcomings of these approaches, in particular their limited theoretical guarantees. We will then discuss exciting new developments on hyperparameter tuning with strong theoretical guarantees. A growing line of work over the past decade from the learning theory community has successfully analysed how the algorithmic performance actually varies with the hyperparameter for several fundamental algorithms in machine learning including decision trees, linear regression, unsupervised and semi-supervised learning, and very recently even deep learning. This has allowed the development of techniques that take this structure into account, apply naturally to both hyperparameter tuning and algorithm selection, work well in dynamic or online learning environments, and are equipped with provable PAC (probably approximately correct) guarantees for the generalization error of the learned hyperparameter. Future research areas include integration of these structure-aware principled approaches with the currently used techniques, better optimization in high-dimensional and discrete spaces, and improving scalability in distributed settings.

Tutorial

Human-AI Alignment: Foundations, Methods, Practice, and Challenges

Hua Shen · Mitchell Gordon · Adam Tauman Kalai · Yoshua Bengio

9:30 AM - 12:00 PM

Tutorial

Planning in the Era of Language Models

Michael Katz · Harsha Kokel · Christian Muise

9:30 AM - 12:00 PM

For over six decades, the field of automated planning has been at the heart of AI, empowering intelligent systems to reason, act, and achieve goals in complex, dynamic environments. From robotics and logistics to space exploration, planning research has fueled autonomous decision-making in real-world applications.

Today, as large language models redefine what’s possible in AI, the principles and methodologies of planning are more vital than ever. The planning community brings decades of experience in designing, benchmarking, and interpreting intelligent behavior; expertise that can accelerate the development of powerful, trustworthy, and general-purpose LLM-based agents.

Participants will gain a clear understanding of what planning truly entails, what has been learned (and sometimes forgotten) in the shift toward LLM-based approaches, and how foundational insights from the planning community can inform the creation of stronger, more reliable, and more scalable LLM-powered planners.

Education Track

Fei Fang · Aditya Grover · Naveen Raman

10:00 AM - 5:00 PM

19 Events in this session

Education Material Showcase

Tutorial led by Julien Besset

Coffee Break

Contributed Talk #1 - Application of Multi-Agent Systems for Essay Scoring

Contributed Talk #2 - Understanding How Neural Networks See

Contributed Talk #3 - MCP Explorer: Interactive Learning Experience

Contributed Talk #4 - The Invisible Fingerprints: Protecting Your Digital Images

Contributed Talk #5 - How to Train a Model: A hands-on, interactive guide to understanding how calculus is used to train AI models.

Contributed Talk #6 - IoT-MCP: Design and Control IoT Systems via LLMs

Lunch Break

Invited Talk #1 - Serena Booth

Invited Talk #2 - Eunice Jun

Coffee Break

Contributed Talk #7 - Understanding Bias and Ethics in AI

Contributed Talk #8 - Benchmarks 101

Contributed Talk #9 - AI-Powered Agritourism Toolkit: Revolutionizing Agritourism with AI and Machine Learning

Contributed Talk #10 - Interpreting Large Language Models (LLMs) with Geometry

Contributed Talk #11 - AI & Human Co-Creativity: LLM Hallucinations are Features, Prompts are Experiments, AI is not a being

Panel Discussion

Go to Event Page

Expo Demonstration

ContextForge

Frederico Araujo

12:00 PM - 3:00 PM

The rapid rise of autonomous AI agents across enterprises is creating a new class of security and governance challenges that are not adequately addressed with today’s technology. Context Forge MCP Gateway is an open-source, security-focused middleware that provides fine-grained control and extensibility for agent operations. With over 2.6k GitHub stars and a rapidly growing user community, Context Forge addresses emerging threat classes including prompt injection, data leakage, and misuse of sensitive resources. At its core, Context Forge introduces a plugin architecture modeled after Linux Security Modules, embedding reusable security hooks at critical points in agent execution (e.g., prompt handling, tool invocation, data transformation). This modular foundation enables organizations to enforce contextual policies at scale—ranging from PII redaction and provenance tagging to prompt injection detection and policy-based access control. With 39 plugins already available, Context Forge is establishing a standards-aligned ecosystem for securing agent workflows in real-world enterprise deployments. By blending research-driven design with open-source adoption it creates a practical path for organizations to advance agent trustworthiness, safety, and compliance.

Expo Demonstration

LLM-Powered Intelligent Data Engineering: From Workflow Design to Ingestion andQuality Assurance

Shashank Mujumdar

12:00 PM - 3:00 PM

Modern enterprises depend on efficient data engineering pipelines to unlock value from diverse and large-scale datasets. Yet, current processes for workflow design, schema ingestion, and data quality validation remain complex, error-prone, and dependent on technical expertise. This creates barriers for non-expert users, slows down development, and introduces risks of data inconsistency.x000D x000D We present a suite of LLM-powered frameworks that reimagine enterprise data engineering across three critical dimensions: (i) From Natural Language to Executable ETL Flows, enabling intuitive pipeline creation with natural language specifications and automatic operator/property inference, (ii) All You Can Ingest, an end-to-end schema mapping and transformation framework that unifies semantic alignment, code synthesis, and robust validation, and (iii) Quality Assessment of Tabular Data, a scalable approach for auto-generating interpretable quality rules and executable validators tailored to specific datasets.x000D x000D Together, these innovations demonstrate how Large Language Models (LLMs), augmented with retrieval, code synthesis, reasoning, and guardrails, can transform the data engineering lifecycle into a more accessible, adaptive, and trustworthy process, reducing manual effort, accelerating time-to-value, and ensuring data fidelity at enterprise scale.

Expo Demonstration

Learning to Steer LLMs with AI Steerability 360 and In-Context Explainability 360

Erik Miehling · Dennis Wei

12:00 PM - 3:00 PM

Current algorithms for aligning LLM behavior are often implemented for narrow settings, making it difficult for researchers and developers to understand their effectiveness across model architectures, datasets, and tasks. To help provide a more informed and principled approach to steering model behavior, we present the AI Steerability 360 (AISteer360) and In-Context Explainability 360 (ICX360) toolkits. Participants will first be guided through a conceptual overview for how model behavior can be influenced across four model control surfaces: input (prompting), structural (weights/architecture), state (activations/attentions), and output (decoding). After the conceptual overview, we will guide attendees through how to apply some recently developed explainability tools (from ICX360) for understanding why models produce given, potentially undesirable, outputs and how this information is used to design targeted steering inventions (via AISteer360). Closing the loop, we will evaluate if the baseline behavior (of the original, unsteered model) was successfully mitigated by the selected steering inventions and investigate if steering introduced any unintended behavioral side-effects. All of the experiments throughout the demonstration will be facilitated solely by the tools in the two toolkits, illustrating their power to design end-to-end steering workflows. Attendees will come away with a practical understanding of how to apply these toolkits to their own alignment challenges.

Talk

Exhibitor Spot Talks - Session 1

12:00 PM - 6:00 PM

31 Events in this session

Exhibitor Talk - Simular

Recycling the World Computer: Fault-Tolerant LLM Training on Idle GPU Capacity

Jason Mancuso

Half Is Heroic: Rewarding Non-Answers for Responsible AI Decision-Making

Sergio Bruccoleri

Auto-SWE-Bench: Scalable, Real-World Benchmarks for LLM Coding Evaluation

Lilin Wang

Exhibitor Talk - Optiver

Scott McKenzie

Tabular and Causal Foundation Modelling: TabDPT and CausalPFN

Anthony L Caterini

Exhibitor Talk

Rights Management Capabilities and Enforcement for AI Music

Exploring Pathways into Quantitative Research

Jamie Watson · Oliva Bateman

Semantic Parsing at Bloomberg

Sachith Sri Ram Kothur

Entropy by Design: Synthetic Data at Scale

Marah Abdin

ATLAS: AdapTive-LeArning Speculator System for Real-Time LLM Inference Acceleration with Together AI

Junxiong Wang · Ben Athiwaratkun

Exhibitor Talk - Cohere

Drug Discovery with Large-Scale, Physics-Based Synthetic Data at D. E. Shaw Research

Peter Skopp

Agentic AI: Exploring Evolution and Evaluation

Lu Lu

Measuring Enterprise Agents - Production Readiness Index and Evaluation Environment

Measuring Emergent Behavior in AI Agents - Weights & Biases

New Frontier of AI: Eval, RL, and What's Next

Bing Liu

Latent Thought Models with Variational Bayes Inference-Time Computation

Jianwen Xie

Exhibitor Talk - MathWorks: Would you trust your AI model with your life?

Lucas Garcia

I for Code: What We’ve Learned (and Whats Next)

From Agent Soup to Proper Software Design: Putting the Developer Back in Control of Generative AI with Mellea

David Cox

Realizing Personal and Enterprise AI Twins - Lenovo

Oguz Elibol

From RLHF to RL Environments - Welcoming Agents to the Real World

John Cutter

Exhibitor Talk - WRITER

Daniel M. Bikel

Exhibitor Talk - Rennaissance Philanthropy

Algorithmic Trading with Large-Scale Deep Learning

Automated Curation of Foundation-Scale Pretraining Datasets

Matthew Leavitt

Databricks Presents a New IDP Benchmark

Decentralized Diffusion Models

Bidhan Roy

The State of Open Source AI and a Look Into the Future

Nader Khalil

Go to Event Page

Expo Workshop

Exploring Trust and Reliability in LLM Evaluation

Shixiong Zhang · Sambit Sahu · MILIND NAPHADE · Jordan Lacey

12:00 PM - 1:30 PM

The current paradigm of Large Language Model (LLM) evaluation faces a crisis of reliability. Traditional leaderboards—built on static benchmarks and surface-level metrics—have become increasingly distorted by benchmark contamination, prompt overfitting, and evaluation methodologies that fail to reflect model behavior in real-world use. As reasoning models emerge that generate detailed internal thought processes (e.g., traces) before producing answers, existing evaluation practices—especially for multiple-choice and generation tasks—have become fundamentally inadequate.x000D
x000D
This lack of rigor not only undermines scientific progress and cross-model comparability, but also poses significant enterprise and societal risks, as evaluation results inform model selection, deployment safety, and governance in high-stakes environments.x000D
x000D
This workshop aims to reassert rigor in LLM evaluation by convening researchers and practitioners to address three intertwined challenges: (1) developing fair and consistent evaluation methods for reasoning and non-reasoning models, (2) confronting widespread contamination across public benchmarks and open-weight models, and (3) defining robust data curation and validation practices to prevent future contamination in both pretraining and post-training pipelines.x000D
x000D
By combining empirical findings, methodological advances, and practical case studies, this session—led by Capital One in collaboration with leading AI labs—seeks to chart a concrete path toward trustworthy, contamination-proof, and utility-aligned LLM evaluation frameworks.x000D
x000D
This 1.5-hour workshop will be structured around three highly focused, 25-minute talks, followed by a moderated discussion aimed at forging actionable paths forward for the community:x000D
x000D
Talk 1: Robust Evaluation for Reasoning & Non-Reasoning Modelsx000D
x000D
Talk 2: Benchmark Contamination — Detection, Measurement, & Findingsx000D
x000D
Talk 3: Preventing Contamination — Building Clean & Reliable Data Pipelines

Expo Workshop

Introduction to Generative Computing

Nathan Fulton · Hendrik Strobelt

12:00 PM - 1:30 PM

This hands-on workshop introduces a proposal that treats LLMs as computing elements governed by established software development principles—particularly task decomposition and modularization—at both the programming model (Mellea) and model level (LLM intrinsics).x000D x000D LLM outputs are often unpredictable and incorrect. Agentic frameworks and prompt optimization libraries attempt to manage this by giving control to the LLM, but this leads to systems that are hard to debug, maintain, and scale. Mellea offers an alternative: a programming model that restores developer control through modular design, information hiding, and compositional contracts. This enables predictable fault models, better portability, and lower inference costs. Attendees will gain hands-on experience building applications using the Melleaic approach.x000D x000D Extending these principles to the model level, the workshop introduces a modularization framework for LLMs using activated LoRAs. These produce components—LLM intrinsics—that match fine-tuned model accuracy for specific tasks but with significantly lower inference costs and latency, thanks to KV cache reuse. Participants will build applications using a pre-built library of RAG LLM intrinsics and learn how to train their own.x000D x000D Presented by the creators of Mellea and the inventors of LLM intrinsics and aLoRA, this workshop equips attendees with foundational skills for scalable model/application co-design.

Expo Workshop

Large-Scale Real-World Physical AI Systems

Ron Tindall

12:00 PM - 1:30 PM

Motivation and Scope x000D
x000D
Physical AI systems comprise of four things: namely sensors like cameras and lidar, mechanical and electronic control unit, AI models to reason about the environment, and actuators to convert decisions to physical actions. It marries multiple domains like sensor design, perception, low-power real-time hardware design, and control loop action design. Autonomous driving is the most mature physical AI domain deployed for over 10 years, but it still has many open challenges. Humanoid robots are an emerging physical AI domain with potential for near term commercial deployment. One of the major challenges in physical AI is to scale to all real-world scenarios including corner cases in a safe manner. A scalable AI data flywheel is the most critical module to achieve this. Traditional physical AI models have a modular decomposition of perception and action tasks, but the community is increasingly moving towards a single end-to-end AI model. Furthermore, recent advancements in LLMs and VLMs are leading to VLA (Vision-Language-Action) based end-to-end models. In the future, there will likely be a convergence of physical AI models across different domains like driving and robotics. The proposed workshop covers the latest research and best practices in industrial research of physical AI by leaders in the domain. It also covers emerging technologies like VLA based foundation models, AI data flywheel, and cross-embodiment learning focused on Physical AI.

Expo Workshop

Multimodal Superintelligence Workshop

Amir Zadeh · Chuan Li · Jason Zhang · Jessica Nicholson

12:00 PM - 1:30 PM

Multimodal machine learning is among the most promising directions of artificial intelligence. With remarkable progress in academia and industry on this topic, we are at the cusp of building next-generation multimodal models, i.e. multimodal superintelligence. These models can be defined as being able to observe, think, and act across several modalities. At this important junction, our workshop provides a forum for researchers to align and cross-polinate ideas. The Workshop on Multimodal Superintelligence will provide a venue where the community can gather to discuss the current state of multimodal machine learning science. We will also focus on topics such as cross-modal reasoning, alignment, fusion and co-learning.

Expo Workshop

Checkmate: Fine-tune your own small language model for real-time chess reasoning and gameplay on AWS Trainium

Emily Webber · Rocky Yu · Sharada Mohanty · Henry Yin

12:00 PM - 1:30 PM

In this hands-on workshop, participants will leverage AWS Trainium to fine-tune and deploy their own chess-playing language models. Building on recent research showing language models' effectiveness in reasoning, attendees will work with various chess datasets to create AI models that not only play chess but explain their strategic thinking through natural language. The 90-minute session will cover model fine-tuning techniques, optimization strategies specific to Trainium's architecture, and real-time deployment to a chess engine. The workshop culminates in a live tournament where participants' models compete against each other, providing immediate feedback on their implementations. Participants will leave with a working chess reasoning model, practical experience in fine-tuning language models on Trainium, and transferable skills for similar tasks. Python programming experience and familiarity with LLM concepts are required, in addition to a basic understanding of the rules of chess. Workshop materials and AWS credits will be provided.

Expo Demonstration

ALICE: Agentic Logic for Incident and Codebug Elimination

Ramesh Kumar Kottapalli

12:00 PM - 3:00 PM

Modern incident root-cause analysis (RCA) is constrained by partial observability, symptom-centric signals, and the overwhelming noise present in logs, traces, and metrics. Diagnosing production failures often depends on instrumentation quality and human expertise, while latent software defects, configuration errors, and zero-day failure modes remain difficult to pinpoint. To address these challenges, we demonstrate a multi-agent system for incident diagnostics that augments observability data with application source code and static analysis signals.x000D x000D Our system introduces two cooperating agents: the Code Context Agent (COCOA), which builds a knowledge graph of program dependencies, control/data flows, and caller–callee relationships; and the Incident Diagnostics Agent (IDA), which performs agentic reasoning over an entity topology graph enriched with observability streams. Together, these agents extend topology-aware planning (TAP) to simultaneously operate on program dependency graphs and infrastructure entity graphs, thereby linking runtime symptoms with underlying code-level causes.x000D x000D This demo showcases how multi-agent collaboration enables deeper, context-sensitive RCA. We walk through real-world inspired scenarios—including incidents where critical log lines are hidden in noisy observability streams or where latent defects emerge only after system updates—illustrating how the system surfaces root causes that would otherwise remain invisible. By bridging program analysis with runtime observability, our approach moves beyond symptom-driven diagnostics toward a more reliable, automated framework for incident management.

Expo Demonstration

Building Safe, Compliant, and Observable Agentic Systems for Generative AI Applications

Yiran Ivy Si

12:00 PM - 3:00 PM

SCOPE: Enterprise Agent Governance Framework

As Large Language Model (LLM) agents are increasingly deployed in mission-critical applications, ensuring their safety, compliance, and observability becomes paramount. We present SCOPE, a comprehensive governance framework designed for regulated environments like banking and healthcare.

The SCOPE acronym represents our five core pillars:

S – Safety (Multi-layer Safety Guardrails)

C – Compliance (Policy-as-Code)

O – Observability (Measurable Observability & Audit Trails)

P – Permissions (Identity-Aware Permissions / IAM)

E – Escalation (Human-in-the-Loop Escalation)

Built on Google's Agent Development Kit (ADK), SCOPE implements a "Defense in Depth" architecture. It combines fast ML-based classification (~50ms) with LLM-based contextual analysis for robust protection. It features Role-Based Access Control (RBAC) baked into the agent's core and enables hot-patching of business rules without code changes.

We demonstrate SCOPE's effectiveness through a live Banking Customer Service Agent that handles account inquiries, transactions, and fraud reports while maintaining compliance with PCI-DSS and SOC2 requirements. The framework is open-source and production-ready, offering a practical blueprint for trustworthy agent deployment.

Expo Workshop

CausalFairness: An Open-Source Python Library for Causal Fairness Analysis

Kriti Mahajan · Yun Wang

12:00 PM - 1:30 PM

As machine learning (ML) systems are increasingly deployed in high-stakes domains, the need for robust methods to assess fairness has become more critical. While statistical fairness metrics are widely used due to their simplicity, they are limited in their ability to explain why disparities occur, as they rely on associative relationships in the data. In contrast, causal fairness metrics aim to uncover the underlying data-generating mechanisms that lead to observed disparities, enabling a deeper understanding of the influence of sensitive attributes and their proxies. Despite their promise, causal fairness metrics have seen limited adoption due to their technical and computational complexity. To address this gap, we present CausalFairness, the first open-source Python package designed to compute a diverse set of causal fairness metrics at both the group and individual levels. The metrics implemented are broadly applicable across classification and regression tasks (with easy extensions for intersectional analysis) and were selected for their significance in the fairness literature. We also demonstrate how standard statistical fairness metrics can be decomposed into their causal components, providing a complementary view of fairness grounded in causal reasoning. In this active learning talk participants will learn how to quantify bias using CausalFairness at the group (Counterfactual Equalized Odds , Counterfactual Effects) and individual (Counterfactual Fairness) levels by applying each method to three datasets - 1) the Adult Income dataset, 2) the COMPAS dataset, 3) Law School Admission Council (LSAC) Dataset. The session will elucidate on the intuition for computing and interpreting each metric, and conclude with a discussion of their limitations.

Expo Demonstration

Who Needs Attention Anyway? Elastic State Models for Real-Time Streaming Tasks

Dario Fumarola

12:00 PM - 3:00 PM

Large attention models are great for offline reasoning, but their cost grows with context and their behavior is hard to bound. For systems that must react under tight latency and safety constraints - robots, simulators, industrial and chemical processes - that compute model is a poor fit.

We explore an Elastic State Model (ESM): a streaming state-space backbone with a small geometric correction block that only “wakes up” when the dynamics get fragile. At each step, a fast SSM predicts the next state, estimates how sensitive it is to small perturbations, and - when needed - takes a few extra preconditioned steps in latent space to correct the trajectory. Compute stays cheap on easy stretches and increases only around junctions, shocks, and high-stakes events, keeping latency and compute per step tightly bounded and enabling online adaptation at inference time, without retraining.

We illustrate this with two contrasting demos: a maze exploration agent that automatically spends more compute at new junctions and tight passages, and a protein “repair” scenario where extra effort is focused only on damaged or unstable regions of a molecule. Together they show how the same ESM block can power responsive, budget-aware decision-making in both robotics-style navigation and molecular simulation.

Expo Demonstration

Build verifiable apps using Generative AI and Automated Reasoning

Jinet Jose · Federico Mora

12:00 PM - 3:00 PM

Recent advancements in Generative AI have enabled customers to use LLMs to generate infrastructure code using AWS CLI commands. Because humans can make mistakes, when deployed such LLM-generated infrastructure code can have negative impacts, including on security.x000D
Motivated by this challenge, this demonstration introduces participants to automated reasoning tooling that enhanced security in production for Amazon Q chat. x000D
AWS Q Chat enables natural language interaction with AWS resources while employing automated reasoning to verify every generated API call against comprehensive semantic logic models. This prevents potentially harmful operations before execution and suggests corrections, creating a feedback loop that iterates until verifiably correct code is produced. Through this work, we demonstrate how organizations can leverage GenAI's efficiency while maintaining the rigorous verification standards required for production environments and participants will learn how to integrate these tools into their workflows to prevent security regressions and ensure reliable infrastructure management. This tutorial Scientists, Engineers, Security professionals and anyone interested in applying formal verification to their infrastructure.

Expo Demonstration

Interpretable AI for Risk-Based Assessment in Global Supply Chains

Prasanth Meiyappan · Neha Anna John · Salvatore D’Acunto · Erica Van Deren · Aggrey Muhebwa · Karl Wehden · Kommy Weldemariam

12:00 PM - 3:00 PM

Managing operational and compliance risks across a large, diverse supplier base is increasingly complex. Traditional audit-based approaches are resource-intensive and limited in scope, making a risk-based strategy essential to focus attention where potential issues are most likely to arise. To address this challenge, Amazon developed PRISM AI (Predictive Risk Intelligence for Supplier Management), an interpretable machine learning system that predicts and explains supplier-level risk across global supply chains. Trained on tens of thousands of audit and assessment records, PRISM integrates multiple data sources including self-assessment questionnaires, incident reports, external media signals, and geo-sector indicators. These inputs enable near–real-time identification of elevated risk patterns and emerging concerns across supplier networks. The model supports suppliers with varying data availability—those with extensive records, limited information, or none—by combining transfer learning, rule-based heuristics, and domain-specific indicators. Each prediction is accompanied by transparent attribution, showing which factors, such as certification gaps or regional exposure, most influenced the score. Built with monotonic constraints, the system ensures logically consistent and explainable outputs suitable for regulatory and operational contexts.

This demo provides NeurIPS participants with a hands-on view of how AI research can be operationalized for large-scale, real-world impact. PRISM helps compliance teams prioritize reviews, streamline supplier onboarding, and enhance oversight efficiency. For researchers, it illustrates techniques for building interpretable models under data imbalance and for integrating structured and unstructured signals. For practitioners, it demonstrates how AI can advance responsible sourcing and sustainability objectives across complex global ecosystems.

Expo Demonstration

BeeAI

Sandi Besen

12:00 PM - 3:00 PM

The BeeAI Framework is an open-source project for building reliable AI agents that combine autonomy with control. Current agent frameworks focus primarily on prompting and orchestration, leaving critical questions of predictability and safety unaddressed. BeeAI fills this gap with a lightweight framework that enables developers to build agents whose reasoning abilities are preserved while execution is constrained by declarative, rule-based requirements. At the core of the framework is the RequirementAgent, a novel agent design that enforces deterministic, controlled behaviors across heterogeneous language models. With RequirementAgent, developers can ensure consistent and reliable execution patterns regardless of differences in model reasoning, tool-calling abilities, or stochastic variation. This approach provides practitioners with a unified abstraction layer that simplifies the deployment of complex AI systems into production settings. As an incubating Linux Foundation AI project, BeeAI is gaining adoption in open source and enterprise contexts as organizations seek robust ways to operationalize AI agents at scale. At NeurIPS EXPO, we will showcase BeeAI’s architecture, real-world use cases, and lessons learned from applying declarative control to agent autonomy.

Expo Demonstration

Pushing the boundaries of chemical synthesis with RetroChimera

Krzysztof Maziarz

12:00 PM - 3:00 PM

Retrosynthesis - the task of planning chemical reaction recipes to synthesize complex molecules - remains a bottleneck in the discovery of novel pharmaceuticals. We recently released RetroChimera - a model for predicting chemical reactions - which demonstrated robustness well outside of training distribution by transferring zero-shot to internal reaction data at a major pharmaceutical company. We also found that industrial organic chemists prefer predictions from RetroChimera over real patented reactions in terms of quality, revealing a high degree of alignment. In this demo, we will showcase the model, let attendees query it live, and show them how to interpret the results.

Expo Demonstration

Multimodal AI Forensic Search for Video Surveillance

Ron Tindall

12:00 PM - 3:00 PM

Video surveillance often requires searching for specific targets from long-duration videos using multiple cameras. Traditional tracking‑and‑detection pipelines demand heavy manual filtering, and even recent multimodal approaches such as using CLIP remain limited to shallow visual attributes (e.g., clothing color) and weak temporal reasoning. This makes forensic search labor‑intensive. x000D
x000D
We present ForeSea, a novel AI forensic search system that supports rich multimodal queries (text + image) and returns timestamped evidence of key events. ForeSea is organized as a multi‑stage pipeline that couples tracking and retrieval with time‑aware VideoLLM reasoning: (1) uses tracking model to filter out irrelevant segments (e.g., frames without people) and produces person‑centric clips; (2) retrieval constructs an index over tracked clips to form a searchable database; and (3) during inference, the multimodal query is embedded to retrieve the top N candidate clips, which are then fed into a time-aware VideoLMM that performs temporal grounding and generates precise answers from concise input. Through ForeSea's multi-stage pipeline, we can search for targets using both image and text queries (e.g., asking 'When does this person get involved in a fight?' with an image of the person). This approach eliminates the need for detailed textual descriptions and enables effective temporal understanding across long videos. x000D
x000D
To evaluate LMM based forensic search, we introduce AI Forensic‑QA, a benchmark for multimodal video question answering with temporal grounding. On this benchmark, ForeSea achieves an 8.6 % accuracy improvement and a 6.9 (IoU) gain over strong baselines. To the best of our knowledge, this is the first benchmark in this domain to support multimodal queries evaluation. Our live demo showcases multimodal search, timestamped evidence visualization, and side‑by‑side comparisons with SOTA models.

Expo Demonstration

Efficient LiDAR Processing with AI Models Leveraging Heterogeneous Compute

Ron Tindall

12:00 PM - 3:00 PM

This demo showcases heterogeneous compute execution of a LiDAR model running in real time on an edge device. The LiDAR processing, specifically 3D sparse convolution (spconv3d) network, runs on the Qualcomm Adreno GPU, while the Region Proposal Network (RPN) executes on the Qualcomm Hexagon NPU. This division of labor across specialized processors reduces on-device inference latency and maximizes overall efficiency. Additionally, a lightweight, learnable voxel removal layer that hierarchically prunes redundant voxels further reduces inference time without compromising detection accuracy. x000D
x000D
"This Proposal is provided for review and evaluation purposes only. Do not redistribute to any third party without the express prior written consent of Qualcomm Technologies, Inc." x000D
x000D
Implementation challenge that we tackle x000D
x000D
LiDAR models often combine different types of operations: irregular, sparse computations (e.g., SpConv3D) and dense convolutional layers (e.g., CNNs). These operations have distinct hardware affinities—SpConv3D is better suited for SIMT-style GPUs, while CNNs benefit from SIMD-style NPUs. Efficient execution requires mapping each part of the model to the most appropriate compute unit. x000D
x000D
Another challenge is the variability in voxel density across LiDAR frames. Not all voxels contribute meaningfully to object detection, many represent ground planes or distant background and can be safely discarded. However, identifying and removing these in a lightweight, learnable way is non-trivial.

Expo Demonstration

Parallel generation with verification on device

Ron Tindall

12:00 PM - 3:00 PM

In this work, we address the challenges of efficiently generating and verifying multiple responses from large language models (LLMs) directly on device. While sampling with non-zero temperature often yields improved responses compared to greedy approaches, selecting the best response requires generating several candidates and evaluating them without incurring significant latency or resource overhead. Cloud-based solutions often rely on separate verification models, which are impractical for on-device deployment due to resource constraints. Our proposed solution leverages multi-stream execution graphs and parallel LLM generation, enabling joint generation and verification within a unified framework. Combined with post-processing techniques such as majority voting, this approach minimizes latency and optimizes the selection of high-quality responses, paving the way for more effective on-device LLM inference. x000D
x000D
Specific challenge that we tackle (research/implementation-wise)  x000D
x000D
Using non-zero temperature sampling with language models can result in higher-quality responses compared to greedy sampling, although this is not always assured. Achieving optimal output often requires generating multiple candidate responses and selecting the most suitable one for the user. This technique is widely adopted to enhance inference-time performance. When implemented on device, however, it presents two primary challenges: minimizing the latency associated with generating several responses and determining a resource-efficient method for selecting the best response from the generated set.

Expo Demonstration

Soft Prompts for On-Device Content Moderation

Ron Tindall

12:00 PM - 3:00 PM

We demonstrate the first on-device integration of a safety-aligned large language model (LLM) using soft prompt distillation, powered by our proposed TV-DiSP framework. Our system showcases how a mobile device can run a quantized LLM equipped with learned soft prompts to moderate harmful or toxic content in real-time. The demo highlights the difference in LLM outputs with and without our soft prompts when subjected to adversarial or unsafe inputs, enabling efficient and safe deployment of LLMs on edge devices. x000D
x000D
LLMs are known to produce unsafe or toxic outputs when prompted harmfully. Traditional safety mechanisms rely on dual-model architectures—pairing a base LLM with a separate guard model—which are memory and computationally expensive and unsuitable for deployment on resource-constrained devices like smartphones. The challenge is to achieve robust safety alignment without compromising latency, memory, or model utility in edge environments.

Expo Demonstration

Generating group photos of multiple people from text and reference images

Ron Tindall

12:00 PM - 3:00 PM

Reference-based multi-human image generation is emerging as a critical capability for personalization, synthetic data creation, and benchmarking generative models. Unlike single-subject generation, this task requires compositional reasoning to place multiple individuals—each with distinct identities—into a coherent scene guided by a text prompt. Existing models often fail to preserve identities or maintain spatial fidelity, which limits their applicability for real-world scenarios such as social content creation or training vision systems. x000D
x000D
Our demo addresses these challenges by showcasing a state-of-the-art system for reference-based multi-human generation. The system takes reference images of multiple individuals and a text description of the desired scene, then produces a high-quality image featuring all participants in context. Built on the Flux-Kontext backbone and trained using synthetic data from DisCo (arXiv:2510.01399), our RL-based approach optimizes multiple rewards including Human Preference Score (HPS3) and Average ID Similarity. Evaluation on MultiHuman-Testbench (arXiv:2506.20879) confirms state-of-the-art performance. x000D
x000D
This demo showcases fast generation on a laptop powered by a Snapdragon processor, highlighting the efficiency and scalability of our solution.

Expo Demonstration

Reasoning through Multimodal End-to-End Decision Transformer Networks and Vision Language Action (VLA) models

Ron Tindall

12:00 PM - 3:00 PM

This demonstration showcases the live output and visualization capabilities of an edge-integrated VLA model for path planning in automated driving scenarios. By harnessing raw multimodal sensor inputs, including visual and voice data, the VLA model processes information in real time to generate safe, explainable, and repeatable driving trajectories. The system operates on a Snapdragon Ride Elite SoC platform and incorporates safety guardrails, enabling robust decision-making and transparent reasoning. Attendees will observe how end-to-end AI networks interpret complex environmental cues to deliver actionable driving paths, with a special focus on complex use cases involving vulnerable road users and other actors on the road. This demonstration highlights advances in multimodal reasoning and edge deployment for next-generation intelligent mobility solutions.

Expo Demonstration

Disaggregated LLM Serving on AI Accelerators

Ron Tindall

12:00 PM - 3:00 PM

This demo showcases disaggregated serving on Qualcomm Cloud AI 100 Ultra Card, a power-efficient AI inference accelerator purpose-built for large language models (LLMs) serving. The accelerator has been deployed across multiple cloud service providers (CSPs) globally and is actively serving state-of-the-art LLMs and other generative AI workloads. x000D
x000D
LLM inference typically involves two distinct stages: prefill and decode. The prefill stage is compute bound, while the decode stage is memory bound. Applying uniform parallelism strategies across both stages often results in suboptimal performance, particularly in key metrics such as Time to First Token (TTFT) and Requests Per Minute (RPM) at the cluster level. x000D
x000D
This demo highlights the performance benefits of disaggregated parallelism strategies tailored to the unique characteristics of each stage. By optimizing the execution of prefill and decode independently, we demonstrate significant improvements in TTFT and overall throughput. x000D
x000D
Key benefits: x000D
x000D
Improved TTFT: Faster initial response times for LLM queries. x000D
x000D
Higher throughput: Increased number of requests served per minute at the cluster level. x000D
x000D
Optimized resource utilization: Efficient mapping of compute and memory resources to match workload characteristics. x000D
x000D
SLA-adherent performance: Maintains service quality and responsiveness within strict latency and throughput requirements.

Expo Demonstration

SwiftEdit: Fast Text-guided Image Editing via One-step Diffusion on a Mobile Device

Ron Tindall

12:00 PM - 3:00 PM

In this demo, we show an on-device inference of our one-step diffusion image editing model (SwiftEdit) [1] that performs interactive image editing based on the user’s source image and text prompt, running on an Android smartphone powered by Qualcomm Technologies’ latest Snapdragon Mobile Platform. On A100 GPUs, this technique can run in real-time with 0.23s per single edit operation. We expect SwiftEdit to perform each edit operation in seconds on the smartphone, demonstrating efficient and responsive on-device diffusion inference. x000D
x000D
Scientific Challenge that we tackle x000D
x000D
Existing text-guided image editing methods fell short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we developed SwiftEdit that performed image editing using just one-step inversion and one-step image reconstruction. x000D
x000D
Efficiently running SwiftEdit requires concurrently on-boarding multiple deep models, including IP-Adapter (Vision Encoder and Image Projection), SwiftBrush (U-Net, VAE, Text Encoder), and SwiftBrush-based Inversion Network. This poses significant challenges for efficient execution and inter-module communication, while enabling an interactive image editing experience for the user — with all computation performed entirely on the edge device.

Expo Demonstration

Mobile Video Diffusion Transformers

Ron Tindall

12:00 PM - 3:00 PM

We demonstrate Neogradon, the first video diffusion transformer (DiT) designed to run on low-power NPUs in mobile devices, such as phones and laptops. Despite DiTs huge memory and computation cost due to the quadratic attention over thousands of video tokens, we show that mobile devices can run these models when being designed for efficiency. To achieve this level of efficiency: x000D
x000D
We replace the original large text encoder with a much smaller one with minimal quality loss through our novel distillation framework, which doesn’t require any image or video data. x000D
x000D
We propose an asymmetric decoder distillation approach, which allows us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the video generation pipeline. x000D
x000D
With our block pruning strategy, we remove entire blocks from the MMDiT denoiser based on their relative importance and recover the original performance through a two-stage distillation process. x000D
x000D
We reduce the diffusion sampling cost using our novel extended version of DMD (distribution matching distillation) for the pyramidal flow-matching objective. x000D
x000D
Neodragon generates 49 frames of 640x1024 resolution within 7.6 seconds on the Qualcomm Hexagon NPU with the VBench total score of 81.61, setting a new state of the art for mobile video generation. x000D
x000D
"This Proposal is provided for review and evaluation purposes only. Do not redistribute to any third party without the express prior written consent of Qualcomm Technologies, Inc."

Expo Workshop

Creative and Protective AI for Music and Entertainment

Chieh-Hsin Lai · Yuki Mitsufuji · Kazumi Fukuda

12:00 PM - 1:30 PM

Generative AI is reshaping how we create, experience, and safeguard music and entertainment. This workshop presents technologies that expand creative expression while honoring responsibility. On the creative side, we share collaborative artworks with leading sound artists, neural engines for sound design and performance, and automatic mixing that adapts to musical intent. We also present a large multimodal dataset for multishot speech video that supports research on coherent and controllable speech, together with specialized language models that orchestrate camera transitions, gestures, vocal cues, and sound effects. On the protective side, we advance AI methods for data attribution, traceability, and responsible model behavior that safeguard creative data and prevent unintended memorization, ensuring fairness, transparency, and respect for creators’ rights. Together, these threads outline an ecosystem in which AI amplifies artistic practice while preserving the integrity of human contribution.

Affinity Workshop

New in ML

1:00 PM - 5:00 PM

The New In ML Workshop provides a platform for newcomers to machine learning research to present their work and connect with the broader ML community. Our goal is to create an inclusive environment where early-career researchers can share their ideas, receive constructive feedback, and build lasting collaborations.

Join us for an inspiring day of talks, discussions, and networking opportunities with leading researchers and peers who share your passion for machine learning!

Please refer to the event website for more details (https://newinml.github.io/)

Affinity Workshop

Muslims in ML

Ehsaneddin Asgari · Gasser Elbanna · Ahmed Youssef · Shaokai Yang · Muhammad Irfan khan · Kamran Soomro · Yousra Farhani · Azmine Toushik Wasi · Rian Adam Rajagede · Nasik Muhammad Nafi · Maryam Anwer

1:00 PM - 5:30 PM

The 5th Muslims in Machine Learning (MusIML) Workshop will be co-located with NeurIPS 2025 in San Diego, CA, USA.

The MusIML Workshop 2025 will be held in person and will feature invited speakers, oral presentations, and poster sessions. The event brings together Muslim researchers, engineers, and practitioners across academia and industry to connect, exchange ideas, and learn from one another. The workshop will also include mentoring opportunities and panel discussions on research directions, career development, and community engagement. Attendees of all backgrounds are welcome.

Tutorial

Autoregressive Models Beyond Language

Tianhong Li · Huiwen Chang · Kaiming He

1:30 PM - 4:00 PM

Autoregressive modeling is no longer confined to language. Recent work shows that the same next-element prediction principle can achieve state-of-the-art performance in generative modeling, representation learning, and multi-modal tasks across images, video, audio, robotics, and scientific data. Yet, extending autoregressive methods to these data is far from straightforward. Many inductive biases used in autoregressive language models no longer hold for other data modalities, and thus, many new techniques have been proposed in recent years to adapt autoregressive models to data beyond language.

This tutorial will review the core theory of autoregressive models, present practical design choices for generative modeling, representation learning, and multi-modal learning, and spotlight open challenges in this area. We hope our tutorial can provide the attendees with a clear conceptual roadmap and hands-on resources to apply and extend autoregressive techniques across diverse data domains.

Tutorial

Recent Developments in Geometric Machine Learning: Foundations, Models, and More

Behrooz Tahmasebi · Stefanie Jegelka

1:30 PM - 4:00 PM

Tutorial

Foundations of Imitation Learning: From Language Modeling to Continuous Control

Adam Block · Dylan Foster · Max Simchowitz

1:30 PM - 4:00 PM

Tutorial

The Science of Benchmarking: What’s Measured, What’s Missed, and What’s Next

Ziqiao Ma · Michael Saxon · Xiang Yue

1:30 PM - 4:00 PM

Tutorial

Theoretical Insights on Training Instability in Deep Learning

Jingfeng Wu · Yu-Xiang Wang · Maryam Fazel

1:30 PM - 4:00 PM

The advances in deep learning build on the dark arts of gradient-based optimization. In deep learning, the optimization process is oscillatory, spiky, and unstable. This makes little sense in classical optimization theory, which primarily operates in a well-behaved, stable regime. Yet, the best training configuration in practice constantly operates in an unstable regime. This tutorial introduces recent theoretical progress in understanding the benign nature of training instabilities, providing new insights from both optimization and statistical learning perspectives.

Tutorial

Scale Test-Time Compute on Modern Hardware

Zhuoming Chen · Beidi Chen · Azalia Mirhoseini

1:30 PM - 4:00 PM

Large language models have achieved significant breakthroughs in reasoning tasks, relying on the effective use of test-time compute. Techniques such as chain-of-thought and sampling-based strategies have shown that increasing test-time computation can dramatically enhance model performance. Our recent scaling law analyses highlight the critical role of test-time compute in enabling advanced reasoning, beyond what pretraining can offer. We also provide a practical analysis of hardware efficiency, revealing where bottlenecks arise and how they differ fundamentally from those in pretraining. Scaling test-time compute on modern hardware presents unique challenges. Compared to training workloads, test-time compute often exhibits low parallelism, irregular workload, frequent memory I/O, and dynamic execution paths, all of which make efficient deployment difficult. Therefore, practical scalability is often bottlenecked by system constraints, such as attention-related memory overheads and limited compute utilization. To address these challenges, the community has explored solutions across both systems and algorithms. On the system side, advancements include memory-efficient key-value cache management, optimized attention kernels, and scheduling mechanisms for adaptive resource allocation. On the algorithm side, emerging work has proposed model architectures and parallel generation paradigms that better align with hardware. This tutorial aims to provide a comprehensive overview of the landscape of scalable test-time compute. We will cover foundational challenges, review recent progress from both system and algorithm perspectives, and discuss principles for building solutions that are truly compatible with modern hardware. By bridging theory with deployment realities, we hope this tutorial will inspire and accelerate the development of practical, scalable LLM agent systems.

Tutorial

Data Privacy, Memorization, & Legal Implications in Generative AI: A Practical Guide

Pratyush Maini · Joseph C. Gratz · A. Feder Cooper

1:30 PM - 4:00 PM

Expo Talk Panel

Distributed Orthonormal Updates for Large-Scale Training

4:00 PM - 5:00 PM

We propose a 50-minute technical talk on recent advances in orthonormal update methods for large-scale AI model training. This topic is rapidly gaining attention in the community, emerging as a strong successor to AdamW following the success of orthonormal optimizers in training production-scale models such as Kimi-K2 and GLM-4.5.x000D
The talk will center on the design and practice of orthonormal updates, focusing on optimizers such as Muon and Dion. While we will briefly discuss their theoretical foundations, the emphasis will be on practical usage: how to integrate these optimizers into modern training pipelines, interpret their algorithmic components, and leverage the implementation guidelines provided in our open-source codebase at github.com/microsoft/dion.x000D
The talk is designed to engage both researchers and practitioners in the NeurIPS community:x000D
Academic perspective: presents a new class of optimizers grounded in theory along with how they interact with distributed training.x000D
Industrial perspective: highlights how orthonormal updates are implemented in practice and what best practices are.x000D
This topic lies at the intersection of optimization theory, scalable systems, and large-model training—an area of growing importance for both the research and applied machine learning communities.

Expo Talk Panel

Building an AI Ecosystem for Multiscale Biological Discovery

Dari Kimanius · Jonathan Schwartz

4:00 PM - 5:00 PM

Understanding biological systems requires resolving their structure and organization across scales, from tissues to individual molecules. Advances in imaging and molecular profiling now generate vast multimodal datasets that capture biological architecture and dynamics with unprecedented fidelity. Unlocking insights from this data demands computational approaches capable of linking observations across spatial, temporal, and molecular dimensions.

At the Chan Zuckerberg Imaging Institute (CZII), we are building the infrastructure, datasets, and community connections to enable this transformation. Our cryo-electron tomography (cryoET) processing pipeline supports high-throughput reconstruction and standardized metadata integration, forming the foundation for reproducible, machine-learning–ready datasets. The CryoET Data Portal (cryoetdataportal.czscience.com) provides open access to raw data, reconstructions, and curated annotations contributed by leading structural biology labs worldwide. Its programmatic API tools support segmentation, particle picking, and model benchmarking, creating a foundation for AI-driven structural discovery.

To catalyze progress in automated molecular identification, the CZ Imaging Institute recently organized a Kaggle challenge inviting participants to develop models for detecting and labeling macromolecular complexes in real-world cryoET data. Building on this success, upcoming challenges organized by the CZI & CZ Biohub Network will extend this approach to datasets spanning different biological scales, from tissue architecture and cellular organization to subcellular and molecular structure.

Together, these efforts form an open, interoperable ecosystem for machine learning in biological imaging. By combining standardized data infrastructure, scalable computation, and community-driven innovation, we aim to bridge the worlds of imaging and AI and accelerate the discovery of life’s organization across all scales.

Expo Talk Panel

Recent developments in embodied AI

Roland Memisevic

4:00 PM - 5:00 PM

Embodied AI is the study of systems that can perceive and interact with the physical world in real time. Real-world interactions pose unique challenges for AI systems since they naturally require a deep understanding of the physical world and/or its inhabitants. This understanding is often taken for granted in humans, where it is typically labelled as “intuitive physics” or “common sense”. It is widely agreed that solving this challenge would be as rewarding as it is hard, since it would be equivalent to creating truly capable “world models”, with countless applications in robotics, human-computer interaction, and even in advancing language modeling through concept grounding. Like other areas in AI, embodied AI has seen dramatic advances in recent years, fueled by the success of using pre-trained large language models as a central ingredient to allow for end-to-end training. While this development stands as one of many examples of the power of pre-trained language models, recently the converse has come true as well: embodied AI is increasingly being drawn on to understand real-world common sense and concept grounding in language models themselves, bringing back its early vision as a way to understand human-like cognition and world models.

This talk will provide an in-depth discussion of embodied AI, with a focus on recent advances based on multi-modal large language models. It will discuss how end-to-end training has made it possible to instill key aspects of real-world common sense in a model and how this had enabled highly ambitious use-cases, such as generalist (“common sense”) robot control and real-world visual interaction (“chatbots that can see and hear you”). The talk will also discuss practical considerations, such as streaming inference at the edge, end-to-end training data generation and the role of reinforcement learning, as well as open challenges in state tracking and long-term memory.

Expo Talk Panel

Agentic AI/RL

Daniel Han · Davide Testuggine · Joe Spisak · Sanyam Bhutani

4:00 PM - 5:00 PM

The transition from static language models to agentic AI systems driven by reinforcement learning (RL) places environments at the center of research and deployment. Environments provide the substrate where agents act, learn, and are evaluated—ranging from lightweight simulators and synthetic tasks to rich multi-agent ecosystems and real-world interfaces. Building and scaling these environments requires specialized tools and systems: standardized hubs for discovery and sharing, interfaces for reproducibility, and infrastructure that connects environments seamlessly to trainers, inference engines, and evaluation pipelines.

This workshop will highlight the tools, environments, and system innovations enabling the next generation of agentic AI. Topics will include scalable RL environment frameworks, benchmarks for safety and robustness, high-performance simulators optimized for heterogeneous hardware, and environment–trainer integration at scale. We will also explore how environments interface with large-model post-training workflows, providing the data and feedback loops necessary for reward shaping, alignment, and deployment in production systems.

By convening researchers, environment developers, and systems engineers, the workshop will create a venue to examine how environments, tools, and infrastructure together shape the future of agentic AI.

Expo Talk Panel

Building Foundational Models for Robotics at Tesla

Daniel Kurek

4:00 PM - 5:00 PM

Tesla's robots, both wheeled and legged, are developed with the goal of achieving general-purpose capability, analogous to the versatility observed in humans and animals. These systems rely primarily on scalable sensing modalities such as vision, audio etc, enabling robust performance within stringent power and cost constraints.

This talk will describe the principles and methodology behind constructing foundation models for robotics at Tesla. We will discuss the architecture, data and training of large-scale multimodal models that control these robots in an end-to-end pixels-to-actuation fashion. We will also examine evaluation protocols, safety considerations, and strategies for reliable real-world deployment. Finally, we project the transformational benefits to society that widespread deployment of such advanced robotic systems can deliver.

Expo Talk Panel

Ring-1T, Ring-linear and Ming-Flash-Omni: Scaling Knowledge-Enhanced Large Language Models for Reasoning and Efficiency

Han Peng · Yankun Ren · Liang Jiang · Richard Sikang Bian · JUN ZHOU

4:00 PM - 5:00 PM

The Ling 2.0 series represents a new generation of large language models designed around knowledge enhancement, reasoning efficiency, and scalable architecture innovation. Built upon trillion-scale sparse MoE foundations, Ling-1T achieves ~50B active parameters per token with FP8 mixed-precision pipelines and 1F1B interleaved scheduling, realizing over 40% training-throughput gains with negligible accuracy loss (<0.1%).x000D
This talk presents the technical journey behind Ling-mini, Ling-flash, and Ling-1T, focusing on (1) efficient large-scale training systems for trillion-parameter models; (2) the Ling Scaling Law and its implications for cross-domain reasoning; (3) hybrid attention and RL-based alignment strategies that enable both concise reasoning and long-context understanding; and (4) how these architectural and algorithmic advances empower industrial applications such as financial risk modeling and knowledge-grounded agents.x000D
We will conclude with open-sourced implementations (inclusionAI on Hugging Face and ModelScope) and future research directions toward trustworthy, efficient, and domain-enhanced LLMs.

Session 1 : Ring-1T: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Session 2: Ring-linear: An Efficient Hybrid Architecture for Long-Context Reasoning
Session 3: Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Remarks

Welcome Remarks

5:30 PM - 6:00 PM

Reception

Welcome Reception

6:00 PM - 8:00 PM

Affinity Poster Session

Affinity Joint Poster Session

6:00 PM - 9:00 PM