Poster Session
San Diego Poster Session 3
Exhibit Hall C, D, E
Channel Matters: Estimating Channel Influence for Multivariate Time Series
Muyao Wang · Zeke Xie · Bo Chen · Hongwei Liu · James Kwok
The influence function serves as an efficient post-hoc interpretability tool that quantifies the impact of training data modifications on model parameters, enabling enhanced model performance, improved generalization, and interpretability insights without the need for expensive retraining. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. While channels matter greatly in MTS tasks, channel-centric methods remain largely under-explored. In particular, no previous work has studied the influence of channel information in MTS or the counterfactual effects of individual channels on model performance. To fill this gap, we propose a novel Channel-wise Influence (ChInf) method, the first to estimate the influence of different channels in MTS. Based on ChInf, we derive two channel-wise algorithms by incorporating ChInf into classic MTS tasks. Extensive experiments demonstrate the effectiveness of ChInf and ChInf-based methods in critical MTS analysis tasks, such as MTS anomaly detection and MTS data pruning. Specifically, our ChInf-based methods rank first among all compared methods, while previous influence functions perform poorly on MTS anomaly detection and data pruning. This fully supports the superiority and necessity of ChInf.
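The abstract does not spell out ChInf's estimator, but what a channel-level influence score measures can be illustrated with a simple leave-one-channel-out ablation on a toy ridge forecaster: zero out one channel, refit cheaply, and record the change in validation loss. The model, data, and ablation-by-zeroing below are illustrative assumptions, not the paper's ChInf method.

```python
import numpy as np

def fit(X, y, lam=1e-3):
    """Ridge regression in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def channel_influence(X_tr, y_tr, X_va, y_va):
    """Score each channel by the validation-loss increase when it is ablated."""
    w = fit(X_tr, y_tr)
    base = np.mean((X_va @ w - y_va) ** 2)
    scores = []
    for c in range(X_tr.shape[1]):
        Xc = X_tr.copy()
        Xc[:, c] = 0.0                      # ablate channel c, then refit
        wc = fit(Xc, y_tr)
        scores.append(np.mean((X_va @ wc - y_va) ** 2) - base)
    return np.array(scores)                 # larger => channel matters more

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # 4 channels
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # only channel 0 is informative
scores = channel_influence(X[:100], y[:100], X[100:], y[100:])
print(scores.argmax())  # 0: only channel 0 drives y
```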
A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees
Yuhao Zhou · Jintao Xu · Bingrui Li · Chenglong Bao · Chao Ding · Jun Zhu
Finding an $\epsilon$-stationary point of a nonconvex function with a Lipschitz continuous Hessian is a central problem in optimization. Regularized Newton methods are a classical tool and have been studied extensively, yet they still face a trade‑off between global and local convergence. Whether a parameter-free algorithm of this type can simultaneously achieve optimal global complexity and quadratic local convergence remains an open question. To bridge this long-standing gap, we propose a new class of regularizers constructed from the current and previous gradients, and leverage the conjugate gradient approach with a negative curvature monitor to solve the regularized Newton equation. The proposed algorithm is adaptive, requiring no prior knowledge of the Hessian Lipschitz constant, and achieves a global complexity of $O(\epsilon^{-\frac{3}{2}})$ in terms of second-order oracle calls and $\tilde O(\epsilon^{-\frac{7}{4}})$ in terms of Hessian-vector products. When the iterates converge to a point where the Hessian is positive definite, the method exhibits quadratic local convergence. Preliminary numerical results, including training physics-informed neural networks, illustrate the competitiveness of our algorithm.
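A minimal sketch of the gradient-regularized Newton idea the abstract builds on: shift the Hessian by $\sigma I$ with $\sigma \propto \sqrt{\|g\|}$, a standard choice for Hessian-Lipschitz objectives, and solve the shifted system. The paper's regularizer (built from current and previous gradients) and its conjugate-gradient solver with a negative-curvature monitor are more elaborate and are not reproduced here.

```python
import numpy as np

def reg_newton_step(grad, hess, x, c=1.0):
    """One regularized Newton step: solve (H + sigma*I) p = -g, sigma ~ sqrt(||g||)."""
    g = grad(x)
    sigma = c * np.sqrt(np.linalg.norm(g))   # vanishes as g -> 0, restoring fast local rate
    H = hess(x)
    p = np.linalg.solve(H + sigma * np.eye(len(x)), -g)
    return x + p

# Demo on the nonconvex function f(x) = x0^4/4 + x1^2/2 - x0*x1.
grad = lambda x: np.array([x[0] ** 3 - x[1], x[1] - x[0]])
hess = lambda x: np.array([[3 * x[0] ** 2, -1.0], [-1.0, 1.0]])

x = np.array([2.0, 0.0])
for _ in range(50):
    x = reg_newton_step(grad, hess, x)
print(np.round(x, 3))  # reaches the stationary point (1, 1)
```

Since the Hessian at (1, 1) is positive definite and $\sigma \to 0$ with the gradient, the final iterations behave like pure Newton steps, matching the fast local convergence the abstract describes.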
Per-Architecture Training-Free Metric Optimization for Neural Architecture Search
Mingzhuo Lin · Jianping Luo
Neural Architecture Search (NAS) aims to identify high-performance networks within a defined search space. Training-free metrics have been proposed to estimate network performance without actual training, reducing NAS deployment costs. However, individual training-free metrics often capture only partial architectural features, and their estimation capabilities differ across tasks. Combining multiple training-free metrics has been explored to enhance scalability across tasks. Yet, these methods typically optimize global metric combinations over the entire search space, overlooking the varying sensitivities of different architectures to specific metrics, which may limit the final architectures' performance. To address these challenges, we propose the Per-Architecture Training-Free Metric Optimization NAS (PO-NAS) algorithm. This algorithm: (a) Integrates multiple training-free metrics as auxiliary scores, dynamically optimizing their combinations using limited real-time training data, without relying on benchmarks; (b) Individually optimizes metric combinations for each architecture; (c) Integrates an evolutionary algorithm that leverages efficient predictions from surrogate models, enhancing search efficiency in large search spaces. Notably, PO-NAS combines the efficiency of training-free search with the robust performance of training-based evaluations. Extensive experiments demonstrate the effectiveness of our approach. Our code has been made publicly available at https://anonymous.4open.science/r/PO-NAS-2953.
SDPGO: Efficient Self-Distillation Training Meets Proximal Gradient Optimization
Tongtong Su · Yun Liao · Fengbo Zheng
Self-knowledge distillation (SKD) enables single-model training by distilling knowledge from the model's own output, eliminating the need for a separate teacher network required in conventional distillation methods. However, current SKD methods focus mainly on replicating common features in the student model, neglecting the extraction of key features that significantly enhance student learning. Motivated by this, we devise a self-knowledge distillation framework entitled Self-Distillation training via Proximal Gradient Optimization, or SDPGO, which utilizes gradient information to identify and assign greater weight to features that significantly impact classification performance, enabling the network to learn the most relevant features during training. Specifically, the proposed framework refines the gradient information into a dynamically changing weighting factor to evaluate the distillation knowledge via a dynamic weight adjustment scheme. Meanwhile, we devise a sequential iterative learning module to dynamically optimize knowledge transfer by leveraging historical predictions and real-time gradients, stabilizing training through mini-batch-based KL divergence refinement while adaptively prioritizing task-critical features for efficient self-distillation. Comprehensive experiments on image classification, object detection, and semantic segmentation demonstrate that our method consistently surpasses recent state-of-the-art knowledge distillation techniques. Code is available at: https://github.com/nanxiaotong/SDGPO.
Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling
Jiaqi Wang · Zhiguang Cao · Peng Zhao · Rui Cao · YuBin Xiao · YUAN JIANG · You Zhou
The rise of smart manufacturing under Industry 4.0 introduces mass customization and dynamic production, demanding more advanced and flexible scheduling techniques. The flexible job-shop scheduling problem (FJSP) has attracted significant attention due to its complex constraints and strong alignment with real-world production scenarios. Current deep reinforcement learning (DRL)-based approaches to FJSP predominantly employ constructive methods. While effective, they often fall short of reaching (near-)optimal solutions. In contrast, improvement-based methods iteratively explore the neighborhood of initial solutions and are more effective in approaching optimality. However, the flexible machine allocation in FJSP poses significant challenges to the application of this framework, including accurate state representation, effective policy learning, and efficient search strategies. To address these challenges, this paper proposes a $\textbf{M}$emory-enhanced $\textbf{I}$mprovement $\textbf{S}$earch framework with he$\textbf{t}$erogeneous gr$\textbf{a}$ph $\textbf{r}$epresentation—$\textit{MIStar}$. It employs a novel heterogeneous disjunctive graph that explicitly models the operation sequences on machines to accurately represent scheduling solutions. Moreover, a memory-enhanced heterogeneous graph neural network (MHGNN) is designed for feature extraction, leveraging historical trajectories to enhance the decision-making capability of the policy network. Finally, a parallel greedy search strategy is adopted to explore the solution space, enabling superior solutions with fewer iterations. Extensive experiments on synthetic data and public benchmarks demonstrate that $\textit{MIStar}$ significantly outperforms both traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.
PseuZO: Pseudo-Zeroth-Order Algorithm for Training Deep Neural Networks
Pengyun Yue · Xuanlin Yang · Mingqing Xiao · Zhouchen Lin
Zeroth-order Optimization (ZO) has received wide attention in machine learning, especially when computing the full gradient is expensive or even impossible. Recently, ZO has emerged as an important paradigm for memory-efficient fine-tuning of large language models (LLMs), circumventing the memory overhead of backpropagation. However, existing ZO gradient estimators exhibit dimension-dependent variance scaling as $\Theta(d)$, leading to dimension-dependent convergence rates without further assumptions on the objective function, which is prohibitive for large-scale LLM parameters. To address this problem, we present a Pseudo-Zeroth-Order (PseuZO) framework for optimizing composite objective functions, especially large-scale models: $ \min_{\mathbf{x} \in \mathcal{X}} \mathcal{F}(\mathbf{x})= \mathbb{E}_{\mathbf{z}} g\circ h(\mathbf{x};\mathbf{z}) $, where $h$ represents complex, high-dimensional representations and $g$ is a task-specific loss. While existing zeroth-order methods estimate gradients with final loss functions, our PseuZO algorithm estimates the Jacobian matrix of $h(\mathbf{x})$ with the model output $\mathbf{o}= h(\mathbf{x})$ and the gradient of the loss function on the model output $\mathbf{e} = \nabla_{\mathbf{o}} g(\mathbf{o})$, and applies an exponential moving average to the Jacobian estimators to reduce the variance. Moreover, we use a sliding window technique to reduce memory costs. Our algorithm achieves an $O( \max \lbrace \alpha_1 L\epsilon^{-2}, \alpha_1 L \sigma_2^2\epsilon^{-4} \rbrace )$ convergence rate, where $\alpha_1$ is the effective dimension of $\mathcal{F}$. Experimental results demonstrate that PseuZO outperforms MeZO and MeZO-SVRG on classification, multiple-choice, and generation tasks in both full-parameter and PEFT fine-tuning settings by boosting convergence in the early stages of training. For instance, under the same computation time on the SST-2 task, PseuZO achieves 9.8\% higher accuracy than MeZO (91.2\% vs. 82.4\%).
With the sliding window technique, PseuZO achieves a $70\%\sim80\%$ memory reduction compared to FO-SGD across model sizes, as it introduces only a small, dimension-independent memory overhead, enabling efficient scaling of the model size. The code is available at https://github.com/YangBigMn/PseuZO.
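For context, the classical two-point estimator whose $\Theta(d)$ variance motivates PseuZO can be sketched as follows. This is the baseline MeZO-style estimator operating on the final loss, not PseuZO's Jacobian/EMA machinery; the quadratic test function is an illustrative assumption.

```python
import numpy as np

def zo_grad(f, x, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate along one random direction u."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(size=x.shape)                        # random perturbation direction
    return (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

f = lambda x: 0.5 * np.sum(x ** 2)                      # true gradient is x itself
x = np.ones(10)

# Unbiased in expectation, but each sample has variance growing with dim d.
est = np.mean([zo_grad(f, x, rng=np.random.default_rng(i)) for i in range(2000)],
              axis=0)
print(np.round(est, 2))  # each entry close to 1.0, the true gradient
```

Averaging 2000 directions is needed here just to tame the variance in 10 dimensions; for billions of LLM parameters this dimension-dependent noise is exactly the bottleneck the abstract describes.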
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Dmitriy Shopkhoev · Ammar Ali · Magauiya Zhussip · Valentin Malykh · Stamatios Lefkimmiatis · Nikos Komodakis · Sergey Zagoruyko
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model’s performance on open benchmarks—without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.
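The linear-replacement step can be illustrated with ordinary least squares on synthetic calibration activations: collect the inputs X and outputs Y of the block to be pruned, fit one matrix W minimizing the Frobenius-norm residual, and use XW in place of the block. The stand-in "block" and all shapes below are assumptions for illustration, not the released library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1024
X = rng.normal(size=(n, d))                  # calibration inputs to the pruned block
M = 0.1 * rng.normal(size=(d, d))
Y = np.tanh(X @ M) + X                       # stand-in block output (residual + nonlinearity)

# Closed-form least-squares fit of the linear replacement for the block.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
rel_err = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(f"relative fit error: {rel_err:.3f}")
```

Because W is a single matrix, it can be folded into the weights of the adjacent remaining layer, which is why the method adds no extra parameters at inference time.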
MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization
Rizhen Hu · Yutong He · Ran Yan · Mou Sun · Binhang Yuan · Kun Yuan
As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources. To address this challenge, we propose **Me**mory- and **C**omputation- **e**fficient **F**ault-tolerant **O**ptimization (**MeCeFO**), a novel algorithm that ensures robust training with minimal overhead. When a computing node fails, MeCeFO seamlessly transfers its training task to a neighboring node while employing memory- and computation-efficient algorithmic optimizations to minimize the extra workload imposed on the neighboring node handling both tasks. MeCeFO leverages three key algorithmic designs: (i) Skip-connection, which drops the multi-head attention (MHA) module during backpropagation for memory- and computation-efficient approximation; (ii) Recomputation, which reduces activation memory in feedforward networks (FFNs); and (iii) Low-rank gradient approximation, enabling efficient estimation of FFN weight matrix gradients. Theoretically, MeCeFO matches the convergence rate of conventional distributed training, with a rate of $\mathcal{O}(1/\sqrt{nT})$, where $n$ is the data parallelism size and $T$ is the number of iterations. Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18\% drop in throughput, demonstrating $5.0\times$ to $6.7\times$ greater resilience than previous SOTA approaches.
Real-Time Hyper-Personalized Generative AI Should Be Regulated to Prevent the Rise of "Digital Heroin"
Raad Khraishi · Cristovao Iglesias de Oliveira · Devesh Batra · Peter Gostev · Giulio Pelosio · Ramin Okhrati · Greig Cowan
This position paper argues that real-time generative AI has the potential to become the next wave of addictive digital media, creating a new class of digital content akin to ``digital heroin'' with severe implications for mental health and youth development. By shortening the content-generation feedback loop to mere seconds, these advanced models will soon be able to hyper-personalize outputs on the fly. When paired with misaligned incentives (e.g., maximizing user engagement), this will fuel unprecedented compulsive consumption patterns with far-reaching consequences for mental health, cognitive development, and social stability. Drawing on interdisciplinary research, from clinical observations of social media addiction to neuroscientific studies of dopamine-driven feedback, we illustrate how real-time tailored content generation may erode user autonomy, foment emotional distress, and disproportionately endanger vulnerable groups, such as adolescents. Due to the rapid advancement of generative AI and its potential to induce severe addiction-like effects, we call for strong government oversight akin to existing controls on addictive substances, particularly for minors. We further urge the machine learning community to act proactively by establishing robust design guidelines, collaborating with public health experts, and supporting targeted policy measures to ensure responsible and ethical deployment, rather than paving the way for another wave of unregulated digital dependence.
Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor
Alexandra Olteanu · Su Lin Blodgett · Agathe Balayn · Angelina Wang · Fernando Diaz · Flavio Calmon · Margaret Mitchell · Michael Ekstrand · Reuben Binns · Solon Barocas
In AI research and practice, rigor remains largely understood in terms of methodological rigor---such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about the capabilities of AI systems. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception---in addition to a more expansive understanding of 1) methodological rigor---should include aspects related to 2) what background knowledge informs what to work on (epistemic rigor); 3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); 4) how clearly articulated the theoretical constructs under use are (conceptual rigor); 5) what is reported and how (reporting rigor); and 6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also provide useful language and a framework for much needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.
World Models Should Prioritize the Unification of Physical and Social Dynamics
Xiaoyuan Zhang · Chengdong Ma · Yizhe Huang · Weidong Huang · Siyuan Qi · Song-Chun Zhu · Xue Feng · Yaodong Yang
World models, which explicitly learn environmental dynamics to lay the foundation for planning, reasoning, and decision-making, are rapidly advancing in predicting both physical dynamics and aspects of social behavior, yet predominantly in separate silos. This division results in a systemic failure to model the crucial interplay between physical environments and social constructs, rendering current models fundamentally incapable of adequately addressing the true complexity of real-world systems where physical and social realities are inextricably intertwined. This position paper argues that the systematic, bidirectional unification of physical and social predictive capabilities is the next crucial frontier for world model development. We contend that comprehensive world models must holistically integrate objective physical laws with the subjective, evolving, and context-dependent nature of social dynamics. Such unification is paramount for AI to robustly navigate complex real-world challenges and achieve more generalizable intelligence. This paper substantiates this imperative by analyzing core impediments to integration, proposing foundational guiding principles (ACE Principles), and outlining a conceptual framework alongside a research roadmap towards truly holistic world models.
KAIROS: Scalable Model-Agnostic Data Valuation
Jiongli Zhu · Parjanya Prashant · Alex Cloninger · Babak Salimi
Data valuation techniques quantify each training example's contribution to model performance, providing a principled basis for data cleaning, acquisition, and selection. Existing valuation methods remain inadequate: \emph{model-based} techniques depend on a single fitted model and inherit its biases, while \emph{algorithm-based} approaches like Data Shapley scale poorly due to their need to train multiple models. Recent work has proposed model-agnostic alternatives based on Wasserstein distance between the training set and a clean reference set, but exact computation is expensive and approximations often misrank examples. We introduce KAIROS, a model-agnostic framework that values examples by their contribution to the Maximum Mean Discrepancy (MMD) between the training set and a clean reference distribution. Unlike Wasserstein methods, MMD admits a closed-form solution that requires no approximations and is scalable to large datasets. Additionally, KAIROS enables efficient online valuation: adding a new batch of $m$ examples requires only $O(mN)$ computation to update all scores, compared to $O(N^2)$ in prior work where $N$ is the training set size. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art baselines in both accuracy and runtime. On ImageNet, KAIROS achieves up to 15 $\times$ speedup over the fastest baseline while maintaining superior data valuation quality. Our results demonstrate that model-agnostic methods can match or exceed model-based approaches in performance while scaling to large datasets.
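A minimal sketch of MMD-based scoring with an RBF kernel: take each training example's score to be its first-order contribution to the squared MMD, i.e. its average kernel similarity to the training set minus its average similarity to the clean reference set. The kernel choice, bandwidth, and toy data are assumptions, and KAIROS's exact scoring rule and $O(mN)$ online updates are not reproduced.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd_scores(train, ref):
    """Per-example contribution to the squared MMD(train, ref)."""
    k_tt = rbf(train, train).mean(axis=1)    # similarity to the training mass
    k_tr = rbf(train, ref).mean(axis=1)      # similarity to the clean reference
    return k_tt - k_tr                       # high score = pulls train away from ref

rng = np.random.default_rng(0)
clean_ref = rng.normal(size=(200, 2))
train = np.vstack([rng.normal(size=(80, 2)),
                   rng.normal(loc=5.0, scale=0.1, size=(20, 2))])  # 20 corrupted points
scores = mmd_scores(train, clean_ref)
print(scores[80:].min() > scores[:80].max())  # True: every corrupted point outranks every clean one
```

Because the score is a sum of kernel evaluations, appending a new example only requires its kernel row against the existing sets, which is the structural reason an online update can avoid recomputing everything from scratch.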
Knowledge Distillation Detection for Open-weights Models
Qin Shi · Amber Yijia Zheng · Qifan Song · Raymond A. Yeh
We propose the task of knowledge distillation detection, which aims to determine whether a student model has been distilled from a given teacher, under a practical setting where only the student’s weights and the teacher’s API are available. This problem is motivated by growing concerns about model provenance and unauthorized replication through distillation. To address this task, we introduce a model-agnostic framework that combines data-free input synthesis and statistical score computation for detecting distillation. Our approach is applicable to both classification and generative models. Experiments on diverse architectures for image classification and text-to-image generation show that our method improves detection accuracy over the strongest baselines by 59.6\% on CIFAR-10, 71.2\% on ImageNet, and 20.0\% for text-to-image generation. The code is available at https://github.com/shqii1j/distillation_detection.
Incentivizing Desirable Effort Profiles in Strategic Classification: The Role of Causality and Uncertainty
Valia Efthymiou · Chara Podimata · Diptangshu Sen · Juba Ziani
We study strategic classification in binary decision-making settings where agents can modify their features in order to improve their classification outcomes. Importantly, our work considers the causal structure across different features, acknowledging that effort in one feature may affect other features. The main goal of our work is to understand when and how much agent effort is invested towards desirable features, and how this is influenced by the deployed classifier, the causal structure of the agent's features, their ability to modify them, and the information available to the agent about the classifier and the feature causal graph. We characterize conditions under which agents with full information about the causal structure and the principal's classifier align with the principal's goals of incentivizing effort mostly in ``desirable'' features, and identify cases where designing such classifiers (from the principal's side) is still tractable despite general non-convexity. Under incomplete information, we show that uncertainty leads agents to prioritize features with high expected impact and low variance, which may often be misaligned with the principal's goals. Finally, using numerical experiments based on a cardiovascular disease risk study, we illustrate how to incentivize desirable modifications even under uncertainty.
The Future Unmarked: Watermark Removal in AI-Generated Images via Next-Frame Prediction
Huming Qiu · Zhaoxiang Wang · Mi Zhang · Xiaohan Zhang · Xiaoyu You · Min Yang
Image watermarking embeds imperceptible signals into AI-generated images for deepfake detection and provenance verification. Although recent semantic-level watermarking methods demonstrate strong resistance against conventional pixel-level removal attacks, their robustness against more advanced removal strategies remains underexplored, raising concerns about their reliability in practical scenarios. Existing removal attacks primarily operate in the pixel domain without altering image semantics, which limits their effectiveness against semantic-level watermarks. In this paper, we propose Next Frame Prediction Attack (NFPA), the first semantic-level removal attack. Unlike pixel-level attacks, NFPA formulates watermark removal as a video generation task: it treats the watermarked image as the initial frame and aims to subtly manipulate the image semantics to generate the next-frame image, i.e., the unwatermarked image. We conduct a comprehensive evaluation on eight state-of-the-art image watermarking schemes, demonstrating that NFPA consistently outperforms thirteen removal attack baselines in terms of the trade-off between watermark removal and image quality. Our results reveal the vulnerabilities of current image watermarking methods and highlight the urgent need for more robust watermarks.
A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Xiang Li · Feng Ruan · Huiyuan Wang · Qi Long · Weijie Su
Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key---provided by the LLM to the verifier---to control the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks---one of which has been internally implemented at OpenAI---and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments.
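The pivot-and-threshold recipe can be made concrete with the exponential pivot used by Gumbel-type watermarks: under human-written text the keyed pseudo-uniforms $u_t$ are i.i.d. Uniform(0,1), so $Y_t = -\log(1-u_t)$ is Exp(1) regardless of the text, and thresholding the sum at a null quantile controls the false positive rate. The Beta(3, 1) distribution standing in for watermarked tokens below is purely illustrative, and the paper's optimal rules replace this simple sum statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300                                      # tokens per document

# Null distribution of the summed pivot is Gamma(n, 1); approximate its
# 99th percentile by simulation to set the detection threshold.
null = -np.log(1.0 - rng.uniform(size=(20_000, n))).sum(axis=1)
threshold = np.quantile(null, 0.99)          # targets a 1% false positive rate

# Human text: u_t ~ Uniform(0,1), so the pivot sum follows the null.
human = -np.log(1.0 - rng.uniform(size=(2_000, n))).sum(axis=1)
# Watermarked text: the sampler biases u_t of emitted tokens toward 1;
# Beta(3, 1) is an illustrative stand-in for that biased distribution.
marked = -np.log(1.0 - rng.beta(3.0, 1.0, size=(2_000, n))).sum(axis=1)

fpr = (human > threshold).mean()
power = (marked > threshold).mean()
print(f"FPR ~ {fpr:.3f}, power = {power:.3f}")
```

The key property is that the false positive guarantee depends only on the pivot's null distribution, not on the language model, while the power depends on how strongly the watermark biases the pivot, which is what the framework's closed-form false negative rates quantify.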
We Should Chart an Atlas of All the World's Models
Eliahu Horwitz · Nitzan Kurer · Jonathan Kahana · Liel Amar · Yedid Hoshen
Public model repositories now contain millions of models, yet most remain undocumented and effectively lost: their capabilities, provenance, and constraints cannot be reliably determined. As a result, the field wastes training time and compute, propagates hidden biases, faces intellectual-property risks, and misses opportunities for model reuse and transfer. In this position paper, we advocate charting the world's model population in a unified structure we call the Model Atlas: a graph that captures models, their attributes, and the weight transformations connecting them. The Model Atlas enables applications in model forensics, meta-ML research, and model discovery, challenging tasks given today's unstructured model repositories. However, because most models lack documentation, large atlas regions remain uncharted. Addressing this gap motivates new machine learning methods that treat models themselves as data and infer properties such as functionality, performance, and lineage directly from their weights. We argue that a scalable path forward is to bypass the unique parameter symmetries that plague model weights. Charting all the world's models will require a community effort, and we hope its broad utility will rally researchers toward this goal.
Military AI Needs Technically-Informed Regulation to Safeguard AI Research and its Applications
Riley Simmons-Edler · Jean Dong · Paul Lushenko · Kanaka Rajan · Ryan Badman
Military weapon systems and command-and-control infrastructure augmented by artificial intelligence (AI) have seen rapid development and deployment in recent years. However, the sociotechnical impacts of AI on combat systems, military decision-making, and the norms of warfare have been understudied. We focus on a specific subset of lethal autonomous weapon systems (LAWS) that use AI for targeting or battlefield decisions. We refer to this subset as AI-powered lethal autonomous weapon systems (AI-LAWS) and argue that they introduce novel risks—including unanticipated escalation, poor reliability in unfamiliar environments, and erosion of human oversight—all of which threaten both military effectiveness and the openness of AI research. These risks cannot be addressed by high-level policy alone; effective regulation must be grounded in the technical behavior of AI models. We argue that AI researchers must be involved throughout the regulatory lifecycle. Thus, we propose a clear, behavior-based definition of AI-LAWS—systems that introduce unique risks through their use of modern AI—as a foundation for technically grounded regulation, given that existing frameworks do not distinguish them from conventional LAWS. Using this definition, we propose several technically-informed policy directions and invite greater participation from the AI research community in military AI policy discussions.
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Anna Sokol · Elizabeth Daly · Michael Hind · David Piorkowski · Xiangliang Zhang · Nuno Moniz · Nitesh Chawla
Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different domains. However, finding suitable benchmarks is difficult given the many available options. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that standardizes critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies involving benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating informed decision-making in evaluating LLMs. Data & Code: github.com/SokolAnn/BenchmarkCards and huggingface.co/datasets/ASokol/BenchmarkCards
Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models
Dilxat Muhtar · Enzhuo Zhang · Zhenshi Li · Feng Gu · Yanglangxing He · Pengfeng Xiao · Xueliang Zhang
Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantics. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS imagery lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available.
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Dylan Sam · Ayan Chakrabarti · Afshin Rostamizadeh · Srikumar Ramalingam · Gui Citovsky · Sanjiv Kumar
Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
EnzyControl: Adding Functional and Substrate-Specific Control for Enzyme Backbone Generation
Chao Song · ZHIYUAN LIU · Han Huang · Liang Wang · Qiong Wang · Jian-Yu Shi · Hui Yu · Yihang Zhou · Yang Zhang
Designing enzyme backbones with substrate-specific functionality is a critical challenge in computational protein engineering. Current generative models excel in protein design but face limitations in binding data, substrate-specific control, and flexibility for de novo enzyme backbone generation. To address this, we introduce EnzyBind, a dataset with 11,100 experimentally validated enzyme-substrate pairs specifically curated from PDBbind. Building on this, we propose EnzyControl, a method that enables functional and substrate-specific control in enzyme backbone generation. Our approach generates enzyme backbones conditioned on MSA-annotated catalytic sites and their corresponding substrates, which are automatically extracted from curated enzyme-substrate data. At the core of EnzyControl is EnzyAdapter, a lightweight, modular component integrated into a pretrained motif-scaffolding model, allowing it to become substrate-aware. A two-stage training paradigm further refines the model's ability to generate accurate and functional enzyme structures. Experiments show that our EnzyControl achieves the best performance across structural and functional metrics on EnzyBind and EnzyBench benchmarks, with particularly notable improvements of 13% in designability and 13% in catalytic efficiency compared to the baseline models. The code is released at https://github.com/Vecteur-libre/EnzyControl.
Graph Data Selection for Domain Adaptation: A Model-Free Approach
Ting-Wei Li · Ruizhong Qiu · Hanghang Tong
Graph domain adaptation (GDA) is a fundamental task in graph machine learning, with techniques like shift-robust graph neural networks (GNNs) and specialized training procedures to tackle the distribution shift problem. Although these model-centric approaches show promising results, they often struggle with severe shifts and constrained computational resources. To address these challenges, we propose a novel model-free framework, GRADATE (GRAph DATa sElection), that selects the best training data from the source domain for the classification task on the target domain. GRADATE picks training samples without relying on any GNN model’s predictions or training recipes, leveraging optimal transport theory to capture and adapt to distribution changes. GRADATE is data-efficient and scalable, and it complements existing model-centric GDA approaches. Through comprehensive empirical studies on several real-world graph-level datasets and multiple covariate shift types, we demonstrate that GRADATE outperforms existing selection methods and enhances off-the-shelf GDA methods with much less training data.
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Ibragim Badertdinov · Alexander Golubev · Maksim Nekrashevich · Anton Shevtsov · Simon Karasik · Andrei Andriushchenko · Maria Trofimova · Daria Litvintseva · Boris Yangel
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use the continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination issues.
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Andrew M. Bean · Ryan Othniel Kearns · Angelika Romanou · Franziska Sofia Hafner · Harry Mayne · Jan Batzner · Negar Foroutan Eghlidi · Chris Schmitz · Karolina Korgul · Hunar Batra · Oishi Deb · Emma Beharry · Cornelius Emde · Thomas Foster · Anna Gausen · María Grandury · Sophia Han · Valentin Hofmann · Lujain Ibrahim · Hazel Kim · Hannah Rose Kirk · Fangru Lin · Gabrielle Liu · Lennart Luettgau · Jabez Magomere · Jonathan Rystrøm · Anna Sotnikova · Yushi Yang · Yilun Zhao · Adel Bibi · Antoine Bosselut · Ronald Clark · Arman Cohan · Jakob Foerster · Yarin Gal · Scott Hale · Deborah Raji · Christopher Summerfield · Philip Torr · Cozmin Ududec · Luc Rocher · Adam Mahdi
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.
Measuring Fingerprints of Web-filtered Text Datasets and Fingerprint Propagation Through Training
Youssef Mansour · Reinhard Heckel
We investigate fingerprints in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of fingerprints or biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl, including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar curation steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that small differences in filtering and processing pipelines induce fingerprints, which we find are evident in formatting, vocabulary, and content distributions. Such fingerprints can negatively impact cross-dataset generalization. Additionally, we show that these fingerprints propagate through training: sequences generated by models trained on those datasets can be accurately classified by a classifier trained on the original datasets. This can offer insights into data characteristics that are typically undisclosed by LLM developers, including pretraining mixture proportions and finetuning data sources.
AGI-Elo: How Far Are We From Mastering A Task?
Shuo Sun · Yimin Zhao · Christina Lee · JIAWEI SUN · Chengran Yuan · Zefan Huang · Dongen Li · Justin Yeoh · Alok Prakash · Thomas Malone · Marcelo Ang Jr
As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery. We have made our code and results publicly available at https://ss47816.github.io/AGI-Elo/.
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri · Melissa Z Pan · Shuyi Yang · Lakshya A Agrawal · Bhavya Chopra · Rishabh Tiwari · Kurt Keutzer · Aditya Parameswaran · Dan Klein · Kannan Ramchandran · Matei A Zaharia · Joseph Gonzalez · Ion Stoica
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (κ = 0.88). This process identifies 14 unique failure modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating headroom for improvement from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST taxonomy, and our LLM annotator to facilitate widespread research and development in MAS.
Dense SAE Latents Are Features, Not Bugs
Xiaoqing Sun · Alessandro Stolfo · Joshua Engels · Ben Wu · Senthooran Rajamanoharan · Mrinmaya Sachan · Max Tegmark
Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are dense), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs---suggesting that high-density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.
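As an illustrative sketch (not the authors' code), the antipodal-pair observation above can be checked by comparing SAE decoder directions pairwise; the cosine threshold and the toy decoder matrix below are assumptions for demonstration only.

```python
import numpy as np

def antipodal_pairs(decoder, thresh=-0.9):
    """Return latent index pairs whose decoder directions are nearly
    opposite, i.e. cosine similarity below `thresh` (an assumed cutoff)."""
    W = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    C = W @ W.T                           # pairwise cosine similarities
    n = len(W)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if C[i, j] < thresh]

# Toy decoder: latents 0 and 1 share one direction with opposite signs,
# so they should be flagged as an antipodal pair.
rng = np.random.default_rng(0)
v = rng.normal(size=64)
decoder = np.vstack([v, -v, rng.normal(size=(4, 64))])
assert (0, 1) in antipodal_pairs(decoder)
```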
EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
Lin Zhang · Wenshuo Dong · Zhuoran Zhang · Shu Yang · Lijie Hu · Ninghao Liu · Pan Zhou · Di Wang
Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is affected by either the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path, starting from the input and adaptively following the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements of up to 17.7\%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.
Collective Counterfactual Explanations: Balancing Individual Goals and Collective Dynamics
Ahmad-Reza Ehyaei · Ali Shirali · Samira Samadi
Counterfactual explanations provide individuals with cost-optimal recommendations to achieve their desired outcomes. However, when a significant number of individuals seek similar state modifications, this individual-centric approach can inadvertently create competition and introduce unforeseen costs. Additionally, disregarding the underlying data distribution may lead to recommendations that individuals perceive as unusual or impractical. To address these challenges, we propose a novel framework that extends standard counterfactual explanations by incorporating a population dynamics model. This framework penalizes deviations from equilibrium after individuals follow the recommendations, effectively mitigating externalities caused by correlated changes across the population. By balancing individual modification costs with their impact on others, our method ensures a more equitable and efficient outcome. We show how this approach reframes the counterfactual explanation problem from an individual-centric task to a collective optimization problem. Augmenting our theoretical insights, we design and implement scalable algorithms for computing collective counterfactuals, showcasing their effectiveness and advantages over existing recourse methods, particularly in aligning with collective objectives.
CHiQPM: Calibrated Hierarchical Interpretable Image Classification
Thomas Norrenbrock · Timo Kaiser · Sovan Biswas · Neslihan Kose · Ramesh Manuvinakurike · Bodo Rosenhahn
Globally interpretable models are a promising approach for trustworthy AI in safety-critical domains. Alongside global explanations, detailed local explanations are a crucial complement to effectively support human experts during inference. This work proposes the Calibrated Hierarchical QPM (CHiQPM), which offers uniquely comprehensive global and local interpretability, paving the way for human-AI complementarity. CHiQPM achieves superior global interpretability by contrastively explaining the majority of classes and offers novel hierarchical explanations that are more similar to how humans reason and can be traversed to offer a built-in interpretable Conformal prediction (CP) method. Our comprehensive evaluation shows that CHiQPM achieves state-of-the-art accuracy as a point predictor, maintaining 99% of the accuracy of non-interpretable models. This demonstrates a substantial improvement, where interpretability is incorporated without sacrificing overall accuracy. Furthermore, its calibrated set prediction is competitive in efficiency with other CP methods, while providing interpretable predictions of coherent sets along its hierarchical explanation.
Layer-Wise Modality Decomposition for Interpretable Multimodal Sensor Fusion
Jaehyun Park · Konyul Park · Daehun Kim · Junseo Park · Jun Won Choi
In autonomous driving, transparency in the decision-making of perception models is critical, as even a single misperception can be catastrophic. Yet with multi-sensor inputs, it is difficult to determine how each modality contributes to a prediction because sensor information becomes entangled within the fusion network. We introduce Layer-Wise Modality Decomposition (LMD), a post-hoc, model-agnostic interpretability method that disentangles modality-specific information across all layers of a pretrained fusion model. To our knowledge, LMD is the first approach to attribute the predictions of a perception model to individual input modalities in a sensor-fusion system for autonomous driving. We evaluate LMD on pretrained fusion models under camera–radar, camera–LiDAR, and camera–radar–LiDAR settings for autonomous driving. Its effectiveness is validated using structured perturbation-based metrics and modality-wise visual decompositions, demonstrating practical applicability to interpreting high-capacity multimodal architectures. Code is available at https://github.com/detxter-jvb/Layer-Wise-Modality-Decomposition.
Angular Steering: Behavior Control via Rotation in Activation Space
Minh Hieu Vu · Tan Nguyen
Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering and emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at \url{https://github.com/lone17/angular-steering/}.
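The geometric rotation described above can be sketched in a few lines: a minimal illustration (not the authors' implementation) of rotating an activation vector within a plane spanned by two assumed orthonormal directions, leaving the orthogonal complement untouched.

```python
import numpy as np

def angular_steer(h, u, v, theta):
    """Rotate activation h by angle theta within the plane spanned by
    orthonormal directions u and v; the component of h outside this
    plane is left unchanged."""
    a, b = h @ u, h @ v                  # coordinates of h in the (u, v) plane
    h_perp = h - a * u - b * v           # component orthogonal to the plane
    a2 = a * np.cos(theta) - b * np.sin(theta)
    b2 = a * np.sin(theta) + b * np.cos(theta)
    return h_perp + a2 * u + b2 * v

# Toy check with standard basis directions: a full 2*pi rotation
# returns the original activation, and rotation preserves the norm.
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
h = np.array([0.5, -1.2, 2.0])
assert np.allclose(angular_steer(h, u, v, 2 * np.pi), h)
assert np.isclose(np.linalg.norm(angular_steer(h, u, v, 1.0)), np.linalg.norm(h))
```

Because the operation is a pure rotation, the activation norm is preserved regardless of the steering angle, which is one way such a method can avoid distorting unrelated directions.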
Transferring Linear Features Across Language Models With Model Stitching
Alan Chen · Jack Merullo · Alessandro Stolfo · Ellie Pavlick
In this work, we demonstrate that affine mappings between the residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the \textit{weights} of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
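An affine map between two residual streams can be fit by ordinary least squares on paired activations. The sketch below is a minimal illustration under that assumption (not the paper's procedure), verified on synthetic "activations" where the true map is known.

```python
import numpy as np

def fit_affine_map(X_src, X_tgt):
    """Least-squares affine map (W, b) with X_src @ W + b ~= X_tgt,
    fit from paired activation matrices (samples x dims)."""
    A = np.hstack([X_src, np.ones((len(X_src), 1))])   # append bias column
    sol, *_ = np.linalg.lstsq(A, X_tgt, rcond=None)
    return sol[:-1], sol[-1]                           # weights W, bias b

# Toy check: recover a known affine map from paired synthetic activations.
rng = np.random.default_rng(0)
X_src = rng.normal(size=(200, 8))
W_true, b_true = rng.normal(size=(8, 16)), rng.normal(size=16)
W, b = fit_affine_map(X_src, X_src @ W_true + b_true)
assert np.allclose(W, W_true) and np.allclose(b, b_true)
```

With such a map in hand, components trained on the source model (SAE weights, probes, steering vectors) can be pushed through `W, b` into the target model's activation space.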
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values
Tyler Chen · Akshay Seshadri · Mattia Jacopo Villani · Pradeep Niroula · Shouvanik Chakrabarti · Archan Ray · Pranav Deshpande · Romina Yalovetzky · Marco Pistoia · Niraj Kumar
Shapley values have emerged as a critical tool for explaining which features impact the decisions made by machine learning models. However, computing exact Shapley values is difficult, generally requiring an exponential (in the feature dimension) number of model evaluations. To address this, many model-agnostic randomized estimators have been developed, the most influential and widely used being the KernelSHAP method (Lundberg & Lee, 2017). While related estimators such as unbiased KernelSHAP (Covert & Lee, 2021) and LeverageSHAP (Musco & Witter, 2025) are known to satisfy theoretical guarantees, bounds for KernelSHAP have remained elusive. We describe a broad and unified framework that encompasses KernelSHAP and related estimators constructed using both with- and without-replacement sampling strategies. We then prove strong non-asymptotic theoretical guarantees that apply to all estimators from our framework. This provides, to the best of our knowledge, the first theoretical guarantees for KernelSHAP and sheds further light on tradeoffs between existing estimators. Through comprehensive benchmarking on small- and medium-dimensional datasets with decision-tree models, we validate our approach against exact Shapley values, consistently achieving low mean squared error with modest sample sizes. Furthermore, we make specific implementation improvements to enable scalability of our methods to high-dimensional datasets. Our methods, tested on datasets such as MNIST and CIFAR10, provide consistently better results compared to the KernelSHAP library.
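To make the exponential cost and the sampling workaround concrete, here is a minimal sketch of exact Shapley computation by subset enumeration next to the classic permutation-sampling estimator. This is baseline material for illustration, not the paper's unified framework or KernelSHAP itself.

```python
import itertools
import math
import random

def exact_shapley(value, n):
    """Exact Shapley values by enumerating all subsets -- exponential in n."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

def mc_shapley(value, n, n_perm=200, seed=0):
    """Monte Carlo estimate: average marginal contributions over random
    player orderings (the classic permutation-sampling estimator)."""
    rng = random.Random(seed)
    phi = [0.0] * n
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        S = set()
        for i in perm:
            phi[i] += value(S | {i}) - value(S)
            S.add(i)
    return [p / n_perm for p in phi]

# Additive toy game: Shapley values should equal the per-feature weights.
weights = [1.0, 2.0, 3.0]
game = lambda S: sum(weights[i] for i in S)
assert all(abs(a - b) < 1e-9 for a, b in zip(exact_shapley(game, 3), weights))
assert all(abs(a - b) < 1e-9 for a, b in zip(mc_shapley(game, 3), weights))
```

KernelSHAP-style estimators instead sample coalitions and solve a weighted regression, but the additive-game sanity check above applies to any correct estimator.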
Contimask: Explaining Irregular Time Series via Perturbations in Continuous Time
Max Moebus · Björn Braun · Christian Holz
Explaining black-box models for time series data is critical for the wide-scale adoption of deep learning techniques across domains such as healthcare. Recently, explainability methods for deep time series models have seen significant progress by adopting saliency methods that perturb masked segments of time series to uncover their importance towards the prediction of black-box models. Thus far, such methods have been largely restricted to regular time series. However, irregular time series, sampled at uneven time intervals and potentially containing missing values, are the dominant form of time series in various critical domains (e.g., hospital records). In this paper, we conduct the first evaluation of saliency methods for the interpretation of irregular time series models. We first translate techniques for regular time series into the continuous-time realm of irregular time series and show under which circumstances such techniques are still applicable. However, existing perturbation techniques neglect the timing and structure of observed data, e.g., informative missingness when data is not missing at random. Thus, we propose Contimask, a simple framework to also apply non-differentiable perturbations, such as simulating that parts of the data had not been observed, using NeuroEvolution. Doing so, we successfully detect how structural differences in the data can bias irregular time series models on a real-world sepsis prediction task where 90% of the data is missing. Source code is available on GitHub.
ElliCE: Efficient and Provably Robust Algorithmic Recourse via the Rashomon Sets
Bohdan Turbal · Iryna Voitsitska · Lesia Semenova
Machine learning models now influence decisions that directly affect people’s lives, making it important to understand not only their predictions, but also how individuals could act to obtain better results. Algorithmic recourse provides actionable input modifications to achieve more favorable outcomes, typically relying on counterfactual explanations to suggest such changes. However, when the Rashomon set - the set of near-optimal models - is large, standard counterfactual explanations can become unreliable, as a recourse action valid for one model may fail under another. We introduce ElliCE, a novel framework for robust algorithmic recourse that optimizes counterfactuals over an ellipsoidal approximation of the Rashomon set. The resulting explanations are provably valid over this ellipsoid, with theoretical guarantees on uniqueness, stability, and alignment with key feature directions. Empirically, ElliCE generates counterfactuals that are not only more robust but also more flexible, adapting to user-specified feature constraints while being substantially faster than existing baselines. This provides a principled and practical solution for reliable recourse under model uncertainty, ensuring stable recommendations for users even as models evolve.
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
Cathy Jiao · Yijun Pan · Emily Xiao · Daisy Sheng · Niket Jain · Hanzhang Zhao · Ishita Dasgupta · Jiaqi Ma · Chenyan Xiong
Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, and data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks — training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement. We hope DATE-LM serves as a foundation for future data attribution research in LLMs.
FLAME: Fast Long-context Adaptive Memory for Event-based Vision
Biswadeep Chakraborty · Saibal Mukhopadhyay
We propose Fast Long-range Adaptive Memory for Event (FLAME), a novel scalable architecture that combines neuro-inspired feature extraction with robust structured sequence modeling to efficiently process asynchronous and sparse event camera data. As a departure from conventional input encoding methods, FLAME presents the Event Attention Layer, a novel feature extractor that leverages neuromorphic dynamics (Leaky Integrate-and-Fire, LIF) to directly capture multi-timescale features from event streams. The feature extractor integrates with a structured state-space model featuring a novel Event-Aware HiPPO (EA-HiPPO) mechanism that dynamically adapts memory retention based on inter-event intervals, capturing relationships across varying temporal scales and event sequences. A Normal Plus Low Rank (NPLR) decomposition reduces the computational complexity of the state update from $\mathcal{O}(N^2)$ to $\mathcal{O}(Nr)$, where $N$ represents the dimension of the core state vector and $r$ is the rank of a low-rank component (with $r \ll N$). FLAME demonstrates state-of-the-art accuracy for event-by-event processing on complex event camera datasets.
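The complexity reduction from the structured decomposition can be illustrated with a small sketch. For simplicity this uses a diagonal-plus-low-rank matrix rather than the normal-plus-low-rank form named in the abstract; the shapes and the $\mathcal{O}(Nr)$ update pattern are the point, not the paper's actual implementation.

```python
import numpy as np

# With a structured state matrix A = diag(d) + U @ V.T, the state update
# A @ x costs O(N*r) if we apply the pieces separately, instead of the
# O(N^2) cost of multiplying by a dense N x N matrix.
N, r = 512, 8
rng = np.random.default_rng(0)
d = rng.normal(size=N)                       # structured (here: diagonal) part
U, V = rng.normal(size=(N, r)), rng.normal(size=(N, r))  # low-rank factors
x = rng.normal(size=N)                       # state vector

y_dense = (np.diag(d) + U @ V.T) @ x         # materializes the full matrix: O(N^2)
y_fast = d * x + U @ (V.T @ x)               # never forms the full matrix: O(N*r)

assert np.allclose(y_dense, y_fast)
```

The key step is associativity: `U @ (V.T @ x)` first projects the state into the rank-$r$ subspace (cost $Nr$), then expands back (another $Nr$), avoiding the $N \times N$ product entirely.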
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin · Dana Arad · Yossi Gandelsman · Yonatan Belinkov
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits---the task-specific computational sub-graphs---in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie · Urja Pawar · Phil Blandfort · William Bankes · David Krueger · Ekdeep S Lubana · Dmitrii Krasheninnikov
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting ``high-stakes'' interactions---where the text indicates that the interaction might lead to significant harm---as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. These savings are enabled by reusing activations of the model that is being monitored. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and the codebase at \url{https://github.com/arrrlex/models-under-pressure}.
Robust Explanations of Graph Neural Networks via Graph Curvatures
Yazheng Liu · Xi Zhang · Sihong Xie · Hui Xiong
Explaining graph neural networks (GNNs) is a key approach to improving the trustworthiness of GNNs in high-stakes applications, such as finance and healthcare. However, existing methods are vulnerable to perturbations, raising concerns about explanation reliability. Prior methods enhance explanation robustness via model retraining or explanation ensembling, each with weaknesses. Retraining leads to models that differ from the original target model and thus to misleading explanations, while ensembling can produce contradictory results due to different inputs or models. To improve explanation robustness without these weaknesses, we take an unexplored route and exploit two edge geometry properties, curvature and resistance, to enhance explanation robustness. We are the first to prove that these geometric notions can be used to bound explanation robustness. We design a general optimization algorithm that incorporates these geometric properties into a wide spectrum of base GNN explanation methods to enhance the robustness of base explanations. We empirically show that our method outperforms six base explanation methods in robustness across nine datasets spanning node classification, link prediction, and graph classification tasks, improving fidelity in 80\% of the cases and achieving up to a 10\% relative improvement in robust performance. The code is available at https://github.com/yazhengliu/Robustexplanationcurvature.
IPAD: Inverse Prompt for AI Detection - A Robust and Interpretable LLM-Generated Text Detector
Zheng CHEN · Yushi Feng · Jisheng Dang · Changyang He · Yue Deng · Hongxi Pu · Haoxuan Li · Bo Li
Large Language Models (LLMs) have attained human-level fluency in text generation, which complicates distinguishing between human-written and LLM-generated texts. This increases the risk of misuse and highlights the need for reliable detectors. Yet, existing detectors exhibit poor robustness on out-of-distribution (OOD) and attacked data, which is critical for real-world scenarios. They also struggle to provide interpretable evidence to support their decisions, undermining reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and two Distinguishers that examine the probability that the input texts align with the predicted prompts. Empirical evaluations demonstrate that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on out-of-distribution (OOD) data, and 5.48% (AUROC) on attacked data. IPAD also performs robustly on structured datasets. Furthermore, an interpretability assessment illustrates that IPAD enhances AI-detection trustworthiness by allowing users to directly examine the decision-making evidence, providing interpretable support for its state-of-the-art detection results.
Additive Models Explained: A Computational Complexity Approach
Shahaf Bassan · Michal Moshkovitz · Guy Katz
Generalized Additive Models (GAMs) are commonly considered *interpretable* within the ML community, as their structure makes the relationship between inputs and outputs relatively understandable. Therefore, it may seem natural to hypothesize that obtaining meaningful explanations for GAMs could be performed efficiently and would not be computationally infeasible. In this work, we challenge this hypothesis by analyzing the *computational complexity* of generating different explanations for various forms of GAMs across multiple contexts. Our analysis reveals a surprisingly diverse landscape of both positive and negative complexity outcomes. Particularly, under standard complexity assumptions such as P$\neq$NP, we establish several key findings: (1) in stark contrast to many other common ML models, the complexity of generating explanations for GAMs is heavily influenced by the structure of the input space; (2) the complexity of explaining GAMs varies significantly with the types of component models used - but interestingly, these differences only emerge under specific input domain settings; (3) significant complexity distinctions appear for obtaining explanations in regression tasks versus classification tasks in GAMs; and (4) expressing complex models like neural networks additively (e.g., as neural additive models) can make them easier to explain, though interestingly, this benefit appears only for certain explanation methods and input domains. Collectively, these results shed light on the feasibility of computing diverse explanations for GAMs, offering a rigorous theoretical picture of the conditions under which such computations are possible or provably hard.
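The intuition the paper challenges, that additivity makes explanation easy, can be sketched with a toy GAM. The shape functions below are illustrative stand-ins, not from the paper:

```python
# A minimal generalized additive model: the prediction is a sum of
# per-feature shape functions, so each feature's additive contribution
# can be read off exactly, without sampling or approximation.
shape_fns = [
    lambda x: 2.0 * x,        # f_1: linear component
    lambda x: x * x,          # f_2: quadratic component
    lambda x: max(0.0, x),    # f_3: ReLU-like component
]

def gam_predict(x):
    return sum(f(xi) for f, xi in zip(shape_fns, x))

def contributions(x):
    # Exact per-feature attributions, a direct consequence of additivity.
    return [f(xi) for f, xi in zip(shape_fns, x)]

x = [1.0, -2.0, 3.0]
pred = gam_predict(x)          # 2.0 + 4.0 + 3.0 = 9.0
attr = contributions(x)        # [2.0, 4.0, 3.0]
```

The paper's complexity results show that richer explanation queries over such models can nonetheless be hard, depending on the input domain and component classes.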
Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment
Samuel (Min-Hsuan) Yeh · Sharon Li
Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. Our framework offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality—highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
QUT-DV25: A Dataset for Dynamic Analysis of Next-Gen Software Supply Chain Attacks
Sk Tanzir Mehedi · Raja Jurdak · Chadni Islam · Gowri Sankar Ramachandran
Securing software supply chains is a growing challenge due to the inadequacy of existing datasets in capturing the complexity of next-gen attacks, such as multiphase malware execution, remote access activation, and dynamic payload generation. Existing datasets, which rely on metadata inspection and static code analysis, are inadequate for detecting such attacks. This creates a critical gap because these datasets do not capture what happens during and after a package is installed. To address this gap, we present QUT-DV25, a dynamic analysis dataset specifically designed to support and advance research on detecting and mitigating supply chain attacks within the Python Package Index (PyPI) ecosystem. This dataset captures install-time and post-install-time traces from 14,271 Python packages, of which 7,127 are malicious. The packages are executed in an isolated sandbox environment using extended Berkeley Packet Filter (eBPF) kernel- and user-level probes. It captures 36 real-time features, including system calls, network traffic, resource usage, directory access patterns, dependency logs, and installation behaviors, enabling the study of next-gen attack vectors. ML analysis using the QUT-DV25 dataset identified four malicious PyPI packages previously labeled as benign, each with thousands of downloads. These packages deployed covert remote access and multi-phase payloads, were reported to PyPI maintainers, and subsequently removed. This highlights the practical value of QUT-DV25, as it outperforms reactive, metadata, and static datasets, offering a robust foundation for developing and benchmarking advanced threat detection within the evolving software supply chain ecosystem.
Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers, and Gradient Clipping
Martin Pelikan · Shams Azam · Vitaly Feldman · Jan Silovsky · Kunal Talwar · Christopher Brinton · Tatiana Likhomanenko
While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge, no existing work establishes a competitive, practical recipe for FL with DP in the context of ASR. To address this gap, we establish **the first benchmark for FL with DP** in end-to-end ASR. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. Consistent with these theoretical insights, our empirical results show that FL with DP is viable under strong privacy guarantees, provided a population of at least several million users. Specifically, we achieve user-level ($7.2$, $10^{-9}$)-DP (resp. ($4.5$, $10^{-9}$)-DP) with only a 1.3\% (resp. 4.6\%) absolute drop in word error rate when extrapolating to high (resp. low) population scales for FL with DP in ASR. Although our experiments focus on ASR, the underlying principles we uncover — particularly those concerning gradient heterogeneity and layer-wise gradient normalization — offer broader guidance for designing scalable, privacy-preserving FL algorithms for large models across domains. Code of all experiments and benchmarks is available at https://github.com/apple/ml-pfl4asr.
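The per-layer clipping at the core of the recipe can be sketched in a few lines. This is an illustrative sketch with an assumed clipping bound and toy gradients, not the released code:

```python
import numpy as np

# Per-layer clipping: each layer's gradient is rescaled to norm at most C
# before noise addition, instead of clipping the full concatenated
# gradient once -- mitigating gradient heterogeneity across layers.
def clip_per_layer(grads, C=1.0):
    clipped = []
    for g in grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, C / max(norm, 1e-12)))
    return clipped

# Toy example: one large-gradient layer, one small-gradient layer.
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
out = clip_per_layer(grads)   # first layer scaled down to norm 1, second untouched
```

Clipping layer by layer keeps a single dominant layer from consuming the entire clipping budget, which is the heterogeneity issue the theoretical analysis addresses.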
Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction
Jiahao Ma · Lei Wang · Miaomiao Liu · David Ahmedt-Aristizabal · Chuong Nguyen
Multi-view 3D reconstruction remains a core challenge in computer vision. Recent methods, such as DUSt3R and its successors, directly regress pointmaps from image pairs without relying on known scene geometry or camera parameters. However, the performance of these models is constrained by the diversity and scale of available training data. In this work, we introduce Puzzles, a data augmentation strategy that synthesizes an unbounded volume of high-quality, posed video-depth data from just a single image or video clip. By simulating diverse camera trajectories and realistic scene geometry through targeted image transformations, Puzzles significantly enhances data variety. Extensive experiments show that integrating Puzzles into existing video‑based 3D reconstruction pipelines consistently boosts performance, all without modifying the underlying network architecture. Notably, models trained on only 10% of the original data, augmented with Puzzles, achieve accuracy comparable to those trained on the full dataset.
MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching
Liang Yue · Yihong Tang · Kehai Chen · Jie Liu · Min Zhang
Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models' instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.
FACE: Faithful Automatic Concept Extraction
Dipkamal Bhusal · Michael Clifford · Sara Rampazzi · Nidhi Rastogi
Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align these extracted concepts with the model’s true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that combines Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model’s original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.
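The objective described above can be sketched as NMF reconstruction error plus a KL term tying concept-based predictions to the original ones. All names, shapes, and the weight below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((8, 16)))   # nonnegative activations (n x d)
W = np.abs(rng.standard_normal((8, 3)))    # concept coefficients (n x k)
H = np.abs(rng.standard_normal((3, 16)))   # concept directions  (k x d)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

classifier = rng.standard_normal((16, 4))  # stand-in linear head
p_orig = softmax(A @ classifier)           # predictions from raw activations
p_conc = softmax((W @ H) @ classifier)     # predictions from the reconstruction

# FACE-style objective: reconstruction error plus KL(p_orig || p_conc),
# which enforces predictive consistency of the learned concepts.
recon = np.linalg.norm(A - W @ H) ** 2
kl = np.sum(p_orig * (np.log(p_orig) - np.log(p_conc)), axis=-1).mean()
lam = 0.1                                  # illustrative trade-off weight
objective = recon + lam * kl
```

Minimizing the KL term is what distinguishes this from plain NMF on encoder activations: the factorization is supervised by the classifier's output distribution.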
Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks
Gavin McCracken · Gabriela Moisescu-Pareja · Vincent Létourneau · Doina Precup · Jonathan Love
We propose a testable universality hypothesis, asserting that seemingly disparate neural network solutions observed in the simple task of modular addition actually reflect a common abstract algorithm. While prior work interpreted variations in neuron-level representations as evidence for distinct algorithms, we demonstrate---through multi-level analyses spanning neurons, neuron clusters, and entire networks---that multilayer perceptrons and transformers universally implement the abstract algorithm we call the approximate Chinese Remainder Theorem. Crucially, we introduce approximate cosets and show that neurons activate exclusively on them. Furthermore, our theory works for deep neural networks (DNNs). It predicts that universally learned solutions in DNNs with trainable embeddings or more than one hidden layer require only $\mathcal{O}(\log n)$ features, a result we empirically confirm. This work thus provides the first theory‑backed interpretation of \textit{multilayer} networks solving modular addition. It advances generalizable interpretability and opens a testable universality hypothesis for group multiplication beyond modular addition.
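The exact Chinese Remainder Theorem that the networks are claimed to approximate can be illustrated on a toy modulus. This is standard number theory, not the paper's learned representation:

```python
# Addition mod n = p*q decomposes into independent additions mod p and
# mod q; CRT reconstructs the result from the residue pair.
p, q = 3, 5
n = p * q
a, b = 7, 11

# Add in the residue representation (x mod p, x mod q).
res = ((a % p + b % p) % p, (a % q + b % q) % q)

# Reconstruct via brute-force CRT search (fine for tiny n).
recon = next(x for x in range(n) if (x % p, x % q) == res)
# recon equals (a + b) % n, i.e. 18 mod 15 = 3
```

The paper's claim is that networks implement an *approximate* analogue of this decomposition, with neurons activating on approximate cosets rather than exact residues.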
Collective Bargaining in the Information Economy Can Address AI-Driven Power Concentration
Nicholas Vincent · Matt Prewitt · Hanlin Li
This position paper argues that there is an urgent need to restructure markets for the information that goes into AI systems. Specifically, small-to-medium sized producers of information (such as journalists, news organizations, researchers, and creative professionals) need to be able to appoint representatives who can carry out "collective bargaining" with AI product builders in order to receive reasonable terms and a fair return on the informational value they contribute. Obstacles to this market structure can be removed through technical work that facilitates collective bargaining in the information economy (e.g., explainable data value estimation and federated data management tools) and regulatory/policy interventions (e.g., support for trusted data intermediary organizations that represent guilds or syndicates of information producers). We argue that without collective bargaining in the information economy, AI will exacerbate a large-scale "information market failure" that will lead not only to undesirable concentration of capital, but also to a potential "ecological collapse" in the informational commons. On the other hand, collective bargaining in the information economy can create market conditions necessary for a pro-social AI future. We provide concrete actions that can be taken to support a coalition-based approach to achieve this.
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang · Jiaxin Song · Yifeng Gao · Xin Wang · YANG YAO · Yan Teng · Xingjun Ma · Yingchun Wang · Yu-Gang Jiang
As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs in complex multimodal scenarios.
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Xuannan Liu · Zekun Li · Zheqi He · Peipei Li · shuhan xia · Xing Cui · Huaibo Huang · Xi Yang · Ran He
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
Path-specific effects for pulse-oximetry guided decisions in critical care
Kevin Zhang · Yonghan Jung · Divyat Mahajan · Karthikeyan Shanmugam · Shalmali Joshi
Identifying and measuring biases associated with sensitive attributes is a crucial consideration in healthcare to prevent treatment disparities. One prominent issue is inaccurate pulse oximeter readings, which tend to overestimate oxygen saturation for dark-skinned patients and misrepresent supplemental oxygen needs. Most existing research has revealed statistical disparities linking device measurement errors to patient outcomes in intensive care units (ICUs) without causal formalization. This study causally investigates how racial discrepancies in oximetry measurements affect invasive ventilation in ICU settings. We employ a causal inference-based approach using path-specific effects to isolate the impact of bias by race on clinical decision-making. To estimate these effects, we leverage a doubly robust estimator, propose its self-normalized variant for improved sample efficiency, and provide novel finite-sample guarantees. Our methodology is validated on semi-synthetic data and applied to two large real-world health datasets: MIMIC-IV and eICU. Contrary to prior work, our analysis reveals minimal impact of racial discrepancies on invasive ventilation rates. However, path-specific effects mediated by oxygen saturation disparity are more pronounced on ventilation duration, and the severity differs by dataset. Our work provides a novel pipeline for investigating potential disparities in clinical decision-making and, more importantly, highlights the necessity of causal methods to robustly assess fairness in healthcare.
On Fairness of Unified Multimodal Large Language Model for Image Generation
Ming Liu · Hao Chen · Jindong Wang · Liwen Wang · Bhiksha Raj · Wensheng Zhang
Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in end-to-end visual understanding and generation tasks. However, compared to generation-only systems (e.g., Stable Diffusion), the unified architecture of U-MLLMs introduces new risks of propagating demographic stereotypes. In this paper, we benchmark several state-of-the-art U-MLLMs and show that they exhibit significant gender and race biases in the generated outputs. To diagnose the source of these biases, we propose a locate-then-fix framework: we first audit the vision and language components — using techniques such as linear probing and controlled generation — and find that the language model appears to be a primary origin of the observed generative bias. Moreover, we observe a ``partial alignment'' phenomenon, where the U-MLLMs exhibit less bias in understanding tasks yet produce substantially biased images. To address this, we introduce a novel \emph{balanced preference loss} that enforces uniform generation probabilities across demographics by leveraging a synthetically balanced dataset. Extensive experiments show that our approach significantly reduces demographic bias while preserving semantic fidelity and image quality. Our findings underscore the need for targeted debiasing strategies in unified multimodal systems and introduce a practical approach to mitigate biases.
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang · Changle Zhou · Jiawei Kong · Kuofeng Gao · Bin Chen · Tao Liang · Guojun Ma · Shu-Tao Xia
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
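The PMI intuition, scoring tokens by how much the image shifts their probability relative to the language prior, can be sketched with toy distributions. The numbers below are illustrative, not the paper's decoding rule:

```python
import numpy as np

# Toy next-token distributions over a tiny vocabulary.
vocab = ["cat", "dog", "the"]
p_with_image = np.array([0.70, 0.10, 0.20])   # p(y | image, context)
p_text_only = np.array([0.30, 0.30, 0.40])    # p(y | context): language prior

# Pointwise mutual information between token and image given context:
# high values mark tokens whose probability is grounded in the image.
pmi = np.log(p_with_image) - np.log(p_text_only)
best = vocab[int(np.argmax(pmi))]
```

A decoder calibrated this way favors "cat" (image-supported) over "the" (prior-driven), even though both have nonzero probability; the paper additionally refines the visual tokens themselves.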
CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
Francis Ward · Teun van der Weij · Hanna Gábor · Sam Martin · Raja Moreno · Harel Lidar · Louis Makower · Thomas Jodrell · Lauren Robson
AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R\&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on our sabotage tasks. In addition, we study agent capabilities to sandbag on MLE-Bench. Agents can calibrate their performance to specified target levels below their actual capability. To mitigate sabotage, we use LM monitors to detect suspicious agent behaviour, and we measure model capability to sabotage and sandbag without being detected by these monitors. Overall, monitors are capable of detecting code-sabotage attempts, but our results suggest that detecting sandbagging is more difficult. Additionally, aggregating multiple monitor predictions works well, but monitoring may not be sufficiently reliable to mitigate sabotage in high-stakes domains. Our benchmark is implemented in the UK AISI’s Inspect framework and we make our code publicly available.
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
Xiaorui Wu · Fei Li · Xiaofeng Mao · Xin Zhang · Li Zheng · Yuxiang Peng · Chong Teng · Donghong Ji · Zhuang Li
Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with an 85.34% higher average refusal-triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT, supervised fine-tuned on EVOREFUSE-ALIGN, achieves up to 29.85% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context. Our code and datasets are available at https://github.com/FishT0ucher/EVOREFUSE.
Fairness-aware Anomaly Detection via Fair Projection
Feng Xiao · Xiaoying Tang · Jicong Fan
Unsupervised anomaly detection is a critical task in many high-social-impact applications such as finance, healthcare, social media, and cybersecurity, where demographics involving age, gender, race, disease, etc. are used frequently. In these scenarios, possible bias from anomaly detection systems can lead to unfair treatment of different groups and even exacerbate social bias. In this work, first, we thoroughly analyze the feasibility of and necessary assumptions for ensuring group fairness in unsupervised anomaly detection. Second, we propose a novel fairness-aware anomaly detection method, FairAD. From the normal training data, FairAD learns a projection to map data of different demographic groups to a common target distribution that is simple and compact, and hence provides a reliable base to estimate the density of the data. The density can be directly used to identify anomalies while the common target distribution ensures fairness between different groups. Furthermore, we propose a threshold-free fairness metric that provides a global view of a model's fairness, eliminating dependence on manual threshold selection. Experiments on real-world benchmarks demonstrate that our method achieves an improved trade-off between detection accuracy and fairness under both balanced and skewed data across different groups.
Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection
Herun Wan · Jiaying Wu · Minnan Luo · Zhi Zeng · Zhixiong Su
Misinformation detectors often rely on superficial cues (i.e., shortcuts) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation using simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose the Shortcut Mitigation Framework (SMF), an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, forcing models to rely on deeper semantic understanding rather than shortcut cues.
Unifying Proportional Fairness in Centroid and Non-Centroid Clustering
Benjamin Cookson · Nisarg Shah · Ziqi Yu
Proportional fairness criteria inspired by democratic ideals of proportional representation have received growing attention in the clustering literature. Prior work has investigated them in two separate paradigms. Chen et al. [ICML 2019] study centroid clustering, in which each data point's loss is determined by its distance to a representative point (centroid) chosen in its cluster. Caragiannis et al. [NeurIPS 2024] study non-centroid clustering, in which each data point's loss is determined by its maximum distance to any other data point in its cluster. We generalize both paradigms by introducing semi-centroid clustering, in which each data point's loss is a combination of its centroid and non-centroid losses, and study two proportional fairness criteria: the core and its relaxation, fully justified representation (FJR). Our main result is a novel algorithm that achieves a constant approximation to the core, in polynomial time, even when the distance metrics used for centroid and non-centroid loss measurements are different. We also derive improved results for more restricted loss functions and the weaker FJR criterion, and establish lower bounds in each case.
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Xin Zhao · Xiaojun Chen · Bingshan Liu · Haoyu Gao · Zhendong Zhao · Yilong Chen
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this sparse routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose \textbf{BadSwitch}, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying the S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch embeds malicious triggers directly into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to a 100\% attack success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07\% ASR and 87.18\% ACC on the AGNews dataset. Our analysis of expert activation patterns reveals fundamental insights into MoE vulnerabilities. We anticipate this work will expose security risks in MoE systems and contribute to advancing AI safety.
DataSIR: A Benchmark Dataset for Sensitive Information Recognition
Fan Mo · Bo Liu · Yuan Fan · Kun Qin · Yizhou Zhao · Jinhe Zhou · Jia Sun · Jinfei Liu · Kui Ren
With the rapid development of artificial intelligence technologies, the demand for training data has surged, exacerbating the risks of data leakage. Despite the increasing incidents and costs associated with such leaks, data leakage prevention (DLP) technologies lag behind evolving evasion techniques that bypass existing sensitive information recognition (SIR) models. Current datasets lack comprehensive coverage of these adversarial transformations, limiting the evaluation of robust SIR systems. To address this gap, we introduce DataSIR, a benchmark dataset specifically designed to evaluate SIR models on sensitive data subjected to diverse format transformations. We curate 26 sensitive data categories based on multiple international regulations, and collect 131,890 original samples accordingly. Through empirical analysis of real-world evasion tactics, we implement 21 format transformation methods, which are applied to the original samples, expanding the dataset to 1,647,501 samples to simulate adversarial scenarios. We evaluate DataSIR using four traditional NLP models and four large language models (LLMs). For LLMs, we design structured prompts with varying degrees of contextual hints to assess the impact of prior knowledge on recognition accuracy. These evaluations demonstrate that our dataset effectively differentiates the performance of various SIR algorithms. Combined with its rich category and format diversity, the dataset can serve as a benchmark for evaluating related models and help develop more advanced SIR models. Our dataset and experimental code are publicly available at https://www.kaggle.com/datasets/fanmo1/datasir and https://github.com/Fan-Mo-ZJU/DataSIR.
FracFace: Breaking The Visual Clues—Fractal-Based Privacy-Preserving Face Recognition
Wanying Dai · Beibei Li · Naipeng Dong · Guangdong Bai · Jin Song Dong
Face recognition is essential for identity authentication, but the rich visual clues in facial images pose significant privacy risks, highlighting the critical importance of privacy-preserving solutions. For instance, numerous studies have shown that generative models can effectively perform reconstruction attacks that restore the original visual clues. To mitigate this threat, we introduce FracFace, a fractal-based privacy-preserving face recognition framework. By disrupting the spatial structure of frequency-domain features, this approach weakens the visual clues that reconstruction attacks can exploit, while retaining the vital clues required for identity recognition. To achieve this, we craft a Frequency Channels Refining module that reduces sparsity in the frequency domain. It suppresses visual clues that could be exploited by reconstruction attacks, while preserving features indispensable for recognition, thus making these attacks more challenging. More significantly, we design a Frequency Fractal Mapping module that obfuscates deep representations by remapping refined frequency channels into a fractal-based privacy structure. By leveraging the self-similarity of fractals, this module preserves identity-relevant features while enhancing defense capabilities, thereby improving the overall robustness of the protection scheme. Experiments on multiple public face recognition benchmarks demonstrate that FracFace significantly reduces the visual recoverability of facial features while maintaining high recognition accuracy, and that it outperforms state-of-the-art privacy-protection approaches.
Non-exchangeable Conformal Prediction with Optimal Transport: Tackling Distribution Shift with Unlabeled Data
Alvaro Correia · Christos Louizos
Conformal prediction is a distribution-free uncertainty quantification method that has gained popularity in the machine learning community due to its finite-sample guarantees and ease of use. Its most common variant, dubbed split conformal prediction, is also computationally efficient as it boils down to collecting statistics of the model predictions on some calibration data not yet seen by the model. Nonetheless, these guarantees only hold if the calibration and test data are exchangeable, a condition that is difficult to verify and often violated in practice due to so-called distribution shifts. The literature is rife with methods to mitigate the loss in coverage in this non-exchangeable setting, but these methods require some prior information on the type of distribution shift to be expected at test time. In this work, we study this problem via a new perspective, through the lens of optimal transport, and show that it is possible to estimate the loss in coverage and mitigate arbitrary distribution shifts, offering a principled and broadly applicable solution.
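For background, the split conformal procedure this abstract builds on reduces to a few lines. The sketch below is ours, for a regression setting with absolute-error nonconformity scores; it shows the exchangeable baseline whose coverage guarantee the paper extends to distribution shift:

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Split conformal prediction interval with absolute-error scores.

    Collect nonconformity scores on held-out calibration data, take a
    finite-sample-corrected quantile, and pad the test prediction with it.
    The >= 1 - alpha marginal coverage guarantee holds only if calibration
    and test points are exchangeable.
    """
    scores = np.abs(np.asarray(cal_labels) - np.asarray(cal_preds))
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n      # finite-sample correction
    q_hat = np.quantile(scores, min(q_level, 1.0), method="higher")
    return test_pred - q_hat, test_pred + q_hat
```

With 99 calibration points and alpha = 0.1, the corrected quantile level is 90/99, so the interval is calibrated conservatively rather than at the naive 90th percentile.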
KScope: A Framework for Characterizing the Knowledge Status of Language Models
Yuxin Xiao · Shan Chen · Jack Gallifant · Danielle Bitterman · Tom Hartvigsen · Marzyeh Ghassemi
Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.
Theoretically Grounded Framework for LLM Watermarking: A Distribution-Adaptive Approach
Haiyun He · Yepeng Liu · Ziqiao Wang · Yongyi Mao · Yuheng Bu
Watermarking has emerged as a crucial method to distinguish AI-generated text from human-created text. Current watermarking approaches often lack formal optimality guarantees or address the scheme and detector design separately. In this paper, we introduce a novel, unified theoretical framework for watermarking Large Language Models (LLMs) that jointly optimizes both the watermarking scheme and detector. Our approach aims to maximize detection performance while maintaining control over the worst-case false positive rate (FPR) and distortion on text quality. We derive closed-form optimal solutions for this joint design and characterize the fundamental trade-off between watermark detectability and distortion. Notably, we reveal that the optimal watermarking schemes should be adaptive to the LLM’s generative distribution. Building on our theoretical insights, we propose a distortion-free, distribution-adaptive watermarking algorithm (DAWA) that leverages a surrogate model for model-agnosticism and efficiency. Experiments on Llama2-13B and Mistral-8$\times$7B models confirm the effectiveness of our approach, particularly at ultra-low FPRs. Our code is available at \url{https://github.com/yepengliu/DAWA}.
Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective
Chenwang Wu · Yiu-ming Cheung · Bo Han · Defu Lian
Existing machine-generated text (MGT) detection methods implicitly treat labels as the "golden standard". However, we reveal boundary ambiguity in MGT detection, implying that traditional training paradigms are inexact. Moreover, the limitations of human cognition and the superintelligence of detectors make inexact learning widespread and inevitable. To this end, we propose an easy-to-hard enhancement framework that provides reliable supervision under such inexact conditions. Distinct from knowledge distillation, our framework employs an easy supervisor targeting relatively simple longer-text detection tasks (despite its weaker capabilities) to enhance the more challenging target detector. First, the longer texts targeted by the supervisor theoretically alleviate the impact of inexact labels, laying the foundation for reliable supervision. Second, by structurally incorporating the detector into the supervisor, we theoretically model the supervisor as a lower performance bound for the detector. Thus, optimizing the supervisor indirectly optimizes the detector, ultimately approximating the underlying "golden" labels. Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed-text, and paraphrase-attack settings, demonstrate the framework's strong detection effectiveness. The code is available at: \url{https://github.com/tmlr-group/Easy2Hard}.
DualOptim: Enhancing Efficacy and Stability in Machine Unlearning with Dual Optimizers
Xuyang Zhong · Haochen Luo · Chen Liu
Existing machine unlearning (MU) approaches exhibit significant sensitivity to hyperparameters, requiring meticulous tuning that limits practical deployment. In this work, we first empirically demonstrate the instability and suboptimal performance of existing popular MU methods when deployed in different scenarios. To address this issue, we propose Dual Optimizer (DualOptim), which incorporates adaptive learning rate and decoupled momentum factors. Empirical and theoretical evidence demonstrates that DualOptim contributes to effective and stable unlearning. Through extensive experiments, we show that DualOptim can significantly boost MU efficacy and stability across diverse tasks, including image classification, image generation, and large language models, making it a versatile approach to empower existing MU algorithms.
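The abstract names two ingredients: an adaptive learning rate and decoupled momentum factors. A toy sketch of the general idea follows; this is our illustrative construction, not the paper's actual update rule, and the per-branch normalization is an assumption we make for the illustration:

```python
import numpy as np

class DualOptim:
    """Two decoupled momentum buffers for machine unlearning (sketch).

    The forgetting objective and the retention objective each get their own
    momentum state, so noisy forget gradients do not pollute the retain
    direction. Hyperparameters and the normalization scheme are illustrative.
    """
    def __init__(self, dim, lr=0.1, beta=0.9):
        self.lr, self.beta = lr, beta
        self.m_forget = np.zeros(dim)
        self.m_retain = np.zeros(dim)

    def step(self, w, g_forget, g_retain):
        # decoupled momentum: each objective keeps its own buffer
        self.m_forget = self.beta * self.m_forget + (1 - self.beta) * g_forget
        self.m_retain = self.beta * self.m_retain + (1 - self.beta) * g_retain
        # adaptive per-branch scaling: normalize each momentum separately so
        # neither objective dominates by gradient magnitude alone
        upd = self.m_forget / (np.linalg.norm(self.m_forget) + 1e-12) \
            + self.m_retain / (np.linalg.norm(self.m_retain) + 1e-12)
        return w - self.lr * upd
```

On a 1-D toy problem with conflicting quadratic objectives (minima at +1 and -1), the iterate settles between the two minima instead of being dragged to either extreme.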
SteerConf: Steering LLMs for Confidence Elicitation
Ziang Zhou · Tianyuan Jin · Jieming Shi · Qing Li
Large Language Models (LLMs) exhibit impressive performance across diverse domains but often suffer from overconfidence, limiting their reliability in critical applications. We propose SteerConf, a novel framework that systematically steers LLMs' confidence scores to improve their calibration and reliability. SteerConf introduces three key components: (1) a steering prompt strategy that guides LLMs to produce confidence scores in specified directions (e.g., conservative or optimistic) by leveraging prompts with varying steering levels; (2) a steered confidence consistency measure that quantifies alignment across multiple steered confidences to enhance calibration; and (3) a steered confidence calibration method that aggregates confidence scores using consistency measures and applies linear quantization for answer selection. SteerConf operates without additional training or fine-tuning, making it broadly applicable to existing LLMs. Experiments on seven benchmarks spanning professional knowledge, common sense, ethics, and reasoning tasks, using advanced LLMs (GPT-3.5, LLaMA 3, GPT-4), demonstrate that SteerConf consistently outperforms existing methods, often by a significant margin. Our findings highlight the potential of steering the confidence of LLMs to enhance their reliability for safer deployment in real-world applications. The implementation is at \url{https://github.com/scottjiao/SteerConf}.
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Basani · Xiao Zhang
LLMs have demonstrated impressive capabilities across various natural language processing tasks yet remain vulnerable to carefully designed prompts, known as jailbreak attacks, that bypass safety guardrails and elicit harmful responses. Traditional methods rely on manual heuristics that suffer from limited generalizability. Optimization-based attacks, though automatic, often produce unnatural jailbreak prompts that are easily detected by safety filters, or incur high computational costs due to discrete token optimization. This paper introduces the Generative Adversarial Suffix Prompter (GASP), a novel automated framework that can efficiently generate human-readable jailbreak prompts in a fully black-box setting. In particular, GASP leverages latent Bayesian optimization to craft adversarial suffixes by efficiently exploring continuous latent spaces, gradually optimizing the suffix generator to improve attack efficacy while balancing prompt coherence via a targeted iterative refinement procedure. Through comprehensive experiments, we show that GASP can produce natural adversarial prompts, significantly improving jailbreak success, reducing training times, and accelerating inference speed, making it an efficient and scalable solution for red-teaming LLMs.
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng · Erjia Xiao · Jing Shao · Yichi Wang · Le Yang · Chao Shen · Philip Torr · Jindong Gu · Renjing Xu
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory inputs. However, these advanced capabilities may also pose significant safety risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of Large Audio-Language Models (LALMs) to audio-specific jailbreaks remains largely underexplored. To address this gap, we introduce \textbf{Jailbreak-AudioBench}, which consists of a Toolbox, a curated Dataset, and a comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also various editing techniques for injecting hidden audio semantics. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs and establish the most comprehensive jailbreak benchmark to date for the audio modality. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALM safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.
Position: If Innovation in AI Systematically Violates Fundamental Rights, Is It Innovation at All?
Josu Eguíluz · Axel Brando · Migle Laukyte · Marc Serra-Vidal
Artificial intelligence (AI) now permeates critical infrastructures and decision-making systems where failures produce social, economic, and democratic harm. This position paper challenges the entrenched belief that regulation and innovation are opposites. As evidenced by analogies from aviation, pharmaceuticals, and welfare systems, and by recent cases of synthetic misinformation, bias, and unaccountable decision-making, the absence of well-designed regulation has already caused immense damage. Regulation, when thoughtful and adaptive, is not a brake on innovation; it is its foundation. This position paper examines the EU AI Act as a model of risk-based, responsibility-driven regulation that addresses the Collingridge Dilemma: acting early enough to prevent harm, yet flexibly enough to sustain innovation. Its adaptive mechanisms, including regulatory sandboxes, support for small and medium enterprises (SMEs), real-world testing, and fundamental rights impact assessments (FRIA), demonstrate how regulation can responsibly accelerate, rather than delay, technological progress. The paper summarises how governance tools transform perceived burdens into tangible advantages: legal certainty, consumer trust, and ethical competitiveness. Ultimately, the paper reframes progress: innovation and regulation advance together. By embedding transparency, impact assessments, accountability, and AI literacy into design and deployment, the EU framework defines what responsible innovation truly means: technological ambition disciplined by democratic values and fundamental rights.
Differentially Private Quantiles with Smaller Error
Jacob Imola · Fabrizio Boninsegna · Hannah Keller · Anders Aamand · Amrita Roy Chowdhury · Rasmus Pagh
In the approximate quantiles problem, the goal is to output $m$ quantile estimates, the ranks of which are as close as possible to $m$ given quantiles $0 \leq q_1 \leq\dots \leq q_m \leq 1$. We present a mechanism for approximate quantiles that satisfies $\varepsilon$-differential privacy for a dataset of $n$ real numbers where the ratio between the distance between the closest pair of points and the size of the domain is bounded by $\psi$. As long as the minimum gap between quantiles is sufficiently large, $|q_i-q_{i-1}|\geq \Omega\left(\frac{m\log(m)\log(\psi)}{n\varepsilon}\right)$ for all $i$, the maximum rank error of our mechanism is $O\left(\frac{\log(\psi) + \log^2(m)}{\varepsilon}\right)$ with high probability. Previously, the best known algorithm under pure DP was due to Kaplan, Schnapp, and Stemmer (ICML '22), who achieved a bound of $O\left(\frac{\log(\psi)\log^2(m) + \log^3(m)}{\varepsilon}\right)$. Our improvement stems from the use of continual counting techniques which allows the quantiles to be randomized in a correlated manner. We also present an $(\varepsilon,\delta)$-differentially private mechanism that relaxes the gap assumption without affecting the error bound, improving on existing methods when $\delta$ is sufficiently close to zero. We provide experimental evaluation which confirms that our mechanism performs favorably compared to prior work in practice, in particular when the number of quantiles $m$ is large.
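As background, the classical single-quantile building block in this line of work is the exponential mechanism over inter-point intervals (Smith, 2011). A minimal sketch of that baseline follows; the paper's actual contribution, correlating the randomness across the $m$ quantiles via continual counting, is not reproduced here:

```python
import numpy as np

def dp_quantile_exp_mech(data, q, eps, lo, hi, rng=None):
    """One eps-DP quantile via the exponential mechanism.

    data: 1-D array of reals, clipped into the known domain [lo, hi];
    q in (0, 1). Each interval between adjacent sorted points is a candidate
    output; its utility is minus the rank error, which has sensitivity 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.sort(np.clip(data, lo, hi))
    n = len(x)
    edges = np.concatenate(([lo], x, [hi]))      # n + 1 candidate intervals
    lengths = np.diff(edges)
    ranks = np.arange(n + 1)                     # rank of any point in interval i
    utility = -np.abs(ranks - q * n)             # sensitivity-1 utility
    # log-weights: length of interval times exp(eps * utility / 2)
    logw = np.where(lengths > 0,
                    np.log(np.maximum(lengths, 1e-300)),
                    -np.inf) + eps * utility / 2
    logw -= logw.max()                           # numerical stability
    w = np.exp(logw)
    i = rng.choice(n + 1, p=w / w.sum())
    return rng.uniform(edges[i], edges[i + 1])   # uniform within chosen interval
```

With a large privacy budget the mechanism concentrates on the intervals adjacent to the true quantile; for small eps the output spreads over nearby ranks, which is exactly the rank-error cost the paper's correlated-noise construction reduces.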
Efficient Verified Unlearning For Distillation
Yijun Quan · Zushu Li · Giovanni Montana
Growing data privacy demands, driven by regulations like GDPR and CCPA, require machine unlearning methods capable of swiftly removing the influence of specific training points. Although verified approaches like SISA, using data slicing and checkpointing, achieve efficient unlearning for single models by reverting to intermediate states, these methods struggle in teacher-student knowledge distillation settings. Unlearning in the teacher typically forces costly, complete student retraining due to pervasive information propagation during distillation. Our primary contribution is PURGE (Partitioned Unlearning with Retraining Guarantee for Ensembles), a novel framework integrating verified unlearning with distillation. We introduce constituent mapping and an incremental multi-teacher strategy that partitions the distillation process, confines each teacher constituent’s impact to distinct student data subsets, and crucially maintains data isolation. The PURGE framework substantially reduces retraining overhead—requiring only partial student updates—when teacher-side unlearning occurs. We provide both theoretical analysis, quantifying significant speed-ups in the unlearning process, and empirical validation on multiple datasets, demonstrating that PURGE achieves these efficiency gains while maintaining student accuracy comparable to standard baselines.
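The SISA-style slicing and checkpointing that PURGE builds on can be illustrated with a small accounting sketch (ours, not the paper's code): only the shard containing the deleted point is retrained, and only from the slice holding that point onward, since training resumes from the checkpoint saved just before that slice.

```python
def sisa_unlearn_cost(shards, removed):
    """Count slices retrained to unlearn one point under SISA-style slicing.

    shards: list of shards, each shard a list of slices (lists of items).
    Training within a shard is sequential over its slices, with a checkpoint
    before each slice; deleting an item requires retraining only the slice
    that contained it and the later slices of the same shard. PURGE's
    constituent mapping for teacher-student distillation is not shown.
    """
    for shard in shards:
        for i, sl in enumerate(shard):
            if removed in sl:
                sl.remove(removed)
                return len(shard) - i    # slices retrained in this shard
    return 0                             # point was not in the training data
```

For two shards of three slices each, deleting a point in the middle slice of one shard costs two slice retrainings out of six, which is the kind of partial-update saving PURGE extends to the student side of distillation.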
Quantifying Cross-Modality Memorization in Vision-Language Models
Yuxin Wen · Yangsibo Huang · Tom Goldstein · Ravi Kumar · Badih Ghazi · Chiyuan Zhang
Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.
Private Statistical Estimation via Truncation
Manolis Zampetakis · Felix Zhou
We introduce a novel framework for differentially private (DP) statistical estimation via data truncation, addressing a key challenge in DP estimation when the data support is unbounded. Traditional approaches rely on problem-specific sensitivity analysis, limiting their applicability. By leveraging techniques from truncated statistics, we develop computationally efficient DP estimators for exponential family distributions, including Gaussian mean and covariance estimation, achieving near-optimal sample complexity. Previous works on exponential families only consider bounded or one-dimensional families. Our approach mitigates sensitivity through truncation while carefully correcting for the introduced bias using maximum likelihood estimation and DP stochastic gradient descent. Along the way, we establish improved uniform convergence guarantees for the log-likelihood function of exponential families, which may be of independent interest. Our results provide a general blueprint for DP algorithm design via truncated statistics.
Controlling The Spread of Epidemics on Networks with Differential Privacy
Dũng Nguyen · Aravind Srinivasan · Renata Valieva · Anil Vullikanti · Jiayi Wu
Designing effective strategies for controlling epidemic spread by vaccination is an important question in epidemiology, especially in the early stages when vaccines are limited. This is a challenging question when the contact network is very heterogeneous, and strategies based on controlling network properties, such as the degree and spectral radius, have been shown to be effective. Implementation of such strategies requires detailed information on the contact structure, which might be sensitive in many applications. Our focus here is on choosing effective vaccination strategies when the edges are sensitive and differential privacy guarantees are needed. Our main contributions are $(\varepsilon,\delta)$-differentially private algorithms for designing vaccination strategies by reducing the maximum degree and spectral radius. Our key technique is a private algorithm for the multi-set multi-cover problem, which we use for controlling network properties. We evaluate privacy-utility tradeoffs of our algorithms on multiple synthetic and real-world networks, and show their effectiveness.
Multi-Class Support Vector Machine with Differential Privacy
Jinseong Park · Yujin Choi · Jaewook Lee
With the increasing need to safeguard data privacy in machine learning models, differential privacy (DP) has become one of the major frameworks for building privacy-preserving models. Support Vector Machines (SVMs) are widely used traditional machine learning models due to their robust margin guarantees and strong empirical performance in binary classification. However, directly applying DP to multi-class SVMs is inefficient, as the standard one-versus-rest (OvR) and one-versus-one (OvO) approaches repeatedly query each data sample when building multiple binary classifiers, thus consuming the privacy budget in proportion to the number of classes. To overcome this limitation, we explore all-in-one SVM approaches for DP, which access each data sample only once to construct multi-class SVM boundaries with margin-maximization properties. We propose a novel differentially Private Multi-class SVM (PMSVM) with weight and gradient perturbation methods, providing rigorous sensitivity and convergence analyses to ensure DP in all-in-one SVMs. Empirical results demonstrate that our approach surpasses existing DP-SVM methods in multi-class scenarios.
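As a point of reference for the weight-perturbation method mentioned, the classical output-perturbation recipe for a binary DP-SVM (in the style of Chaudhuri et al.) can be sketched as below. This is our hedged sketch of the binary baseline setting, not the paper's PMSVM algorithm:

```python
import numpy as np

def private_linear_svm(X, y, eps, lam=0.1, epochs=50, lr=0.1, rng=None):
    """Binary linear SVM with eps-DP output perturbation.

    Train a regularized hinge-loss SVM, then add noise calibrated to the
    L2 sensitivity of the regularized ERM minimizer. Assumes rows of X
    have L2 norm <= 1 and y in {-1, +1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):                  # subgradient descent on hinge loss
        margins = y * (X @ w)
        viol = margins < 1                   # margin-violating examples
        grad = lam * w - (X[viol] * y[viol, None]).sum(axis=0) / n
        w -= lr * grad
    # Sensitivity of the minimizer is 2 / (n * lam); sample noise with
    # density proportional to exp(-eps * ||b|| / sens): uniform direction,
    # Gamma(d, sens/eps) radius.
    sens = 2.0 / (n * lam)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=d, scale=sens / eps)
    return w + radius * direction
```

The OvR/OvO issue the abstract raises is visible here: running this routine once per class repeats the data access, splitting eps across classes, which is what the all-in-one formulation avoids.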
Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
Beier Luo · Shuoyuan Wang · Sharon Li · Hongxin Wei
Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM when aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates the PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, reducing the average ECE of open-source and API-based LLMs (e.g., GPT-4o) by up to 15.08$\%$ on common benchmarks.
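The temperature-scaling step the abstract describes is easy to make concrete. The sketch below is our minimal grid-search illustration of calibrating $\tau$ on agreement examples only; the grid, the NLL objective, and the `labels` argument (standing in for whatever calibration signal is available) are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def fit_temperature(logits, labels, agree_mask, grid=None):
    """Post-hoc temperature scaling restricted to agreement examples.

    Only examples where PLM and PoLM predictions agree (agree_mask=True)
    enter the calibration objective, so disagreement examples cannot
    inflate the fitted temperature tau.
    """
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)     # candidate temperatures
    z, y = logits[agree_mask], labels[agree_mask]

    def nll(tau):                            # negative log-likelihood at tau
        s = z / tau
        s = s - s.max(axis=1, keepdims=True)              # stable softmax
        logp = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    return min(grid, key=nll)                # tau minimizing NLL on agreement set
```

For instance, if the logit gap is exactly twice what a calibrated model would produce, the fitted temperature comes out at 2.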
Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Chenchen Tan · Youyang Qu · Xinghao Li · Hui Zhang · Shujie Cui · Cunjian Chen · Longxiang Gao
Growing computing power and the need for AI-assisted decision-making have accelerated the adoption of large language models (LLMs). Along with this, the potential retention of sensitive data in LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting the LLM's linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when the model is queried about unlearned content. AS realizes these objectives through two attention-level interventions: importance-aware suppression, applied to the unlearning set to reduce reliance on memorized knowledge, and attention-guided retention enhancement, which reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. The two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over state-of-the-art unlearning methods, achieving up to 15\% higher accuracy on the ToFU benchmark and 10\% higher on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS strikes a superior balance between unlearning effectiveness, generalization, and response reliability.
Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
Jie Ren · Zhenwei Dai · Xianfeng Tang · Yue XING · Shenglai Zeng · Hui Liu · Jingying Zeng · Qiankun Peng · Samarth Varshney · Suhang Wang · Qi He · Charu Aggarwal · Hui Liu
Although Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks, growing concerns have emerged over the misuse of sensitive, copyrighted, or harmful data during training. To address these concerns, unlearning techniques have been developed to remove the influence of specific data without retraining from scratch. However, this paper reveals a critical vulnerability in fine-tuning-based unlearning: a malicious user can craft a manipulated forgetting request that stealthily degrades the model’s utility for benign users. We demonstrate this risk through a red-teaming Stealthy Attack (SA), which is inspired by two key limitations of existing unlearning—the inability to constrain the scope of unlearning effect and the failure to distinguish benign tokens from unlearning signals. Prior work has shown that unlearned models tend to memorize forgetting data as unlearning signals, and respond with hallucinations or feigned ignorance when unlearning signals appear in the input. By subtly increasing the presence of common benign tokens in the forgetting data, SA enhances the connection between benign tokens and unlearning signals. As a result, when normal users include such tokens in their prompts, the model exhibits unlearning behaviors, leading to unintended utility degradation. To address this vulnerability, we propose Scope-aware Unlearning (SU), a lightweight enhancement that introduces a scope term into the unlearning objective, encouraging the model to localize the forgetting effect. Our method requires no additional data processing, integrates seamlessly with existing fine-tuning frameworks, and significantly improves robustness against SA. Extensive experiments validate the effectiveness of both SA and SU.
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Csaba Dékány · Stefan Balauca · Dimitar I. Dimitrov · Robin Staab · Martin Vechev
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the *At Least One Attack Success Rate* (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR $< 20\%$) compared to prior defenses (ALO-ASR $> 50\%$), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
BAM-ICL: Causal Hijacking In-Context Learning with Budgeted Adversarial Manipulation
Rui Chu · Bingyin Zhao · Hanling Jiang · Shuchin Aeron · Yingjie Lao
Recent research shows that large language models (LLMs) are vulnerable to hijacking attacks under the scenario of in-context learning (ICL) where LLMs demonstrate impressive capabilities in performing tasks by conditioning on a sequence of in-context examples (ICEs) (i.e., prompts with task-specific input-output pairs). Adversaries can manipulate the provided ICEs to steer the model toward attacker-specified outputs, effectively ''hijacking'' the model's decision-making process. Unlike traditional adversarial attacks targeting single inputs, hijacking attacks in LLMs aim to subtly manipulate the initial few examples to influence the model's behavior across a range of subsequent inputs, which requires distributed and stealthy perturbations. However, existing approaches overlook how to effectively allocate the perturbation budget across ICEs. We argue that fixed budgets miss the potential of dynamic reallocation to improve attack success while maintaining high stealthiness and text quality. In this paper, we propose BAM-ICL, a novel budgeted adversarial manipulation hijacking attack framework for in-context learning. We also consider a more practical yet stringent scenario where ICEs arrive sequentially and only the current ICE can be perturbed. BAM-ICL mainly consists of two stages: In the offline stage, where we assume the adversary has access to data drawn from the same distribution as the target task, we develop a global gradient-based attack to learn optimal budget allocations across ICEs. In the online stage, where ICEs arrive sequentially, perturbations are generated progressively according to the learned budget profile. We evaluate BAM-ICL on diverse LLMs and datasets. The experimental results demonstrate that it achieves superior attack success rates and stealthiness, and the adversarial ICEs are highly transferable to other models.
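As a toy sketch of the budget-profile idea (the proportional rule and saliency inputs here are illustrative assumptions, not the learned allocation in BAM-ICL), a total perturbation budget can be split across ICEs in proportion to offline-estimated attack saliencies:

```python
def allocate_budget(saliencies, total_budget):
    """Split a total perturbation budget across in-context examples (ICEs)
    in proportion to a per-ICE saliency score estimated offline.
    Hypothetical helper for illustration only."""
    total = sum(saliencies)
    return [total_budget * s / total for s in saliencies]
```

ICEs that the offline stage deems more attack-relevant receive a larger share, while the sum of per-ICE budgets stays fixed.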
FedGPS: Statistical Rectification Against Data Heterogeneity in Federated Learning
Zhiqin Yang · Yonggang Zhang · Chenxin Li · Yiu-ming Cheung · Bo Han · Yixuan Yuan
Federated Learning (FL) confronts a significant challenge known as data heterogeneity, which impairs model performance and convergence. Existing methods have made notable progress in addressing this issue. However, an important question remains overlooked: how robust are these methods when deployed under diverse heterogeneity scenarios? To answer this, we conduct comprehensive evaluations across varied heterogeneity scenarios, showing that most existing methods exhibit limited robustness. Meanwhile, insights from these experiments highlight that sharing statistical information can mitigate heterogeneity by enabling clients to update with a global perspective. Motivated by this, we propose FedGPS (Federated Goal-Path Synergy), a novel framework that seamlessly integrates statistical distribution and gradient information from other clients. Specifically, FedGPS statically modifies each client’s learning objective to implicitly model the global data distribution using surrogate information, while dynamically adjusting local update directions with gradient information from other clients at each round. Extensive experiments show that FedGPS outperforms state-of-the-art methods across diverse heterogeneity scenarios, validating its effectiveness and robustness. The code is available at: .
DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images
Ozgur Kara · Harris Nisar · James Rehg
Numerous models have been developed for scanpath and saliency prediction. These models are typically trained on scanpaths, which represent eye movement as a sequence of discrete fixation points connected by saccades, so the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: https://diff-eye.github.io/
Fortifying Time Series: DTW-Certified Robust Anomaly Detection
Shijie Liu · Tansu Alpcan · Christopher Leckie · Sarah Erfani
Time-series anomaly detection is critical for ensuring safety in high-stakes applications, where robustness is a fundamental requirement rather than a mere performance metric. Addressing the vulnerability of these systems to adversarial manipulation is therefore essential. Existing defenses are largely heuristic or provide certified robustness only under $\ell_p$-norm constraints, which are incompatible with time-series data. In particular, $\ell_p$-norm fails to capture the intrinsic temporal structure in time series, causing small temporal distortions to significantly alter the $\ell_p$-norm measures. Instead, the similarity metric Dynamic Time Warping (DTW) is more suitable and widely adopted in the time-series domain, as DTW accounts for temporal alignment and remains robust to temporal variations. To date, however, there has been no certifiable robustness result in this metric that provides guarantees. In this work, we introduce the first DTW-certified robust defense in time-series anomaly detection by adapting the randomized smoothing paradigm. We develop this certificate by bridging the $\ell_p$-norm to DTW distance through a lower-bound transformation. Extensive experiments across various datasets and models validate the effectiveness and practicality of our theoretical approach. Results demonstrate significantly improved performance, e.g., up to 18.7\% in F1-score under DTW-based adversarial attacks compared to traditional certified models.
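For context, DTW itself is the classic dynamic program below (a minimal 1-D sketch of standard background material, not the paper's certificate):

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic-programming DTW with an
    absolute-difference local cost. D[i][j] is the cheapest cost of
    aligning x[:i] with y[:j] under a monotone warping path."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # A path step may advance in x, in y, or in both (diagonal).
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

For the time-shifted spikes `[0, 1, 0, 0]` and `[0, 0, 1, 0]`, the $\ell_1$ distance is 2, yet the DTW distance is 0: exactly the insensitivity to small temporal distortions that the abstract argues makes DTW the right metric to certify against.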
Unveiling Environmental Sensitivity of Individual Gains in Influence Maximization
Xinyan Su · Zhiheng Zhang · Jiyan Qiu · Zhaojuan Yue · Jun Li
Influence Maximization (IM) seeks a seed set to maximize information dissemination in a network. Elegant IM algorithms could naturally extend to cases where each node is equipped with a specific weight, reflecting individual gains to measure its importance. In prevailing literature, these gains are typically assumed to remain constant throughout diffusion and are solvable through explicit formulas based on node characteristics and network topology. However, this assumption is not always feasible due to two key challenges: 1) \textit{Unobservability}: The individual gains of each node are primarily evaluated by the difference between the outputs in the activated and non-activated states. In practice, we can only observe one of these states, with the other remaining unobservable post-propagation. 2) \textit{Environmental sensitivity}: Beyond nodes’ inherent properties, individual gains are also sensitive to the activation status of surrounding nodes, which change dynamically during propagation even when the network topology is fixed. To address these uncertainties, we introduce a Causal Influence Maximization (CauIM) framework, leveraging causal inference techniques to model dynamic individual gains. We propose two algorithms, G-CauIM and A-CauIM, where the latter incorporates a novel acceleration technique. Theoretically, we establish the generalized lower bound of influence spread and provide robustness analysis. Empirically, experiments on synthetic and real-world datasets validate the effectiveness and reliability of our approach.
Let a Neural Network be Your Invariant
Mirco Giacobbe · Daniel Kroening · Abhinandan Pal · Michael Tautschnig
Safety verification ensures that a system avoids undesired behaviour. Liveness complements safety, ensuring that the system also achieves its desired objectives. A complete specification of functional correctness must combine both safety and liveness. Proving with mathematical certainty that a system satisfies a safety property demands presenting an appropriate inductive invariant of the system, whereas proving liveness requires showing a measure of progress witnessed by a ranking function. Neural model checking has recently introduced a data-driven approach to the formal verification of reactive systems, albeit focusing on ranking functions and thus addressing liveness properties only. In this paper, we extend and generalise neural model checking to additionally encompass inductive invariants and thus safety properties as well. Given a system and a linear temporal logic specification of safety and liveness, our approach alternates a learning and a checking component towards the construction of a provably sound neural certificate. Our new method introduces a neural certificate architecture that jointly represents inductive invariants as proofs of safety, and ranking functions as proofs of liveness. Moreover, our new architecture is amenable to training using constraint solvers, accelerating prior neural model checking work otherwise based on gradient descent. We experimentally demonstrate that our method is orders of magnitude faster than the state-of-the-art model checkers on pure liveness and combined safety and liveness verification tasks written in SystemVerilog, while enabling the verification of richer properties than was previously possible for neural model checking.
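In standard notation (generic background, assuming a transition system with initial states $I$, transition relation $T$, safe states $S$, and goal states $G$, rather than the paper's exact definitions), the two certificate obligations the new architecture jointly represents read:

```latex
% Safety: the set Inv cut out by the neural certificate, e.g.
% Inv = \{ s : V(s) \ge 0 \}, must be an inductive invariant inside S:
I \subseteq \mathit{Inv}, \qquad
s \in \mathit{Inv} \wedge (s, s') \in T \;\Rightarrow\; s' \in \mathit{Inv}, \qquad
\mathit{Inv} \subseteq S.

% Liveness: a ranking function r over a well-founded order must strictly
% decrease along transitions until the goal G is reached:
s \in \mathit{Inv} \setminus G \wedge (s, s') \in T \;\Rightarrow\; r(s') < r(s).
```

A learner proposes $V$ and $r$, and a checking component (here, a constraint solver) either certifies these conditions or returns a counterexample used for further training.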
HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation
Haoran Luo · Haihong E · Guanting Chen · Yandan Zheng · Xiaobao Wu · Yikai Guo · Qika Lin · Yu Feng · Zemin Kuang · Meina Song · Yifan Zhu · Anh Tuan Luu
Standard Retrieval-Augmented Generation (RAG) relies on chunk-based retrieval, whereas GraphRAG advances this approach by graph-based knowledge representation. However, existing graph-based RAG approaches are constrained by binary relations, as each edge in an ordinary graph connects only two entities, limiting their ability to represent the n-ary relations (n >= 2) in real-world knowledge. In this work, we propose HyperGraphRAG, the first hypergraph-based RAG method that represents n-ary relational facts via hyperedges. HyperGraphRAG consists of a comprehensive pipeline, including knowledge hypergraph construction, retrieval, and generation. Experiments across medicine, agriculture, computer science, and law demonstrate that HyperGraphRAG outperforms both standard RAG and previous graph-based RAG methods in answer accuracy, retrieval efficiency, and generation quality.
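A minimal sketch of the underlying data structure (class and method names are illustrative, not HyperGraphRAG's API): each hyperedge stores one n-ary fact over an arbitrary set of entities, and retrieval ranks hyperedges by how many query entities they cover:

```python
from collections import defaultdict

class KnowledgeHypergraph:
    """Toy hypergraph: each hyperedge holds one n-ary fact (n >= 2 entities),
    unlike an ordinary graph edge, which connects exactly two."""

    def __init__(self):
        self.hyperedges = []               # hyperedge id -> (fact text, entities)
        self.incidence = defaultdict(set)  # entity -> ids of incident hyperedges

    def add_fact(self, fact_text, entities):
        eid = len(self.hyperedges)
        self.hyperedges.append((fact_text, frozenset(entities)))
        for entity in entities:
            self.incidence[entity].add(eid)
        return eid

    def retrieve(self, query_entities):
        """Rank hyperedges by the number of query entities they contain."""
        scores = defaultdict(int)
        for entity in query_entities:
            for eid in self.incidence[entity]:
                scores[eid] += 1
        ranked = sorted(scores, key=lambda eid: -scores[eid])
        return [self.hyperedges[eid][0] for eid in ranked]
```

A ternary fact such as "drug X treats condition Y at dose Z" lives on a single hyperedge here, whereas a binary graph would have to shred it into pairwise edges and lose the joint constraint.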
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?
Hao Yin · Guangzong Si · Zilei Wang
Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on the POPE benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model’s output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Mantas Mazeika · Xuwang Yin · Rishub Tamirisa · Jaehyuk Lim · Bruce W. Lee · Richard Ren · Long Phan · Norman Mu · Oliver Zhang · Dan Hendrycks
As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Look-Ahead Reasoning on Learning Platforms
Haiqing Zhu · Tijana Zrnic · Celestine Mendler-Dünner
Predictive models are often designed to minimize risk for the learner, yet their objectives do not always align with the interests of the users they affect. Thus, as a way to contest predictive systems, users might act strategically in order to achieve favorable outcomes. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to the deployed model, without considering the behavior of other users, or implications thereof for the deployed model. In contrast, *look-ahead reasoning* takes into account that user actions are coupled, and---at scale---impact future predictions. Within this framework, we first formalize level-$k$ thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner's and the users' utilities emerges as a key concept. We discuss connections to several related mathematical frameworks, including strategic classification, performative prediction, and algorithmic collective action.
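The level-$k$ result can be illustrated on a toy contraction dynamic (the retraining map and its fixed point below are assumptions for illustration, not the paper's model): if one retraining round under level-1 responses applies a map $f$, then level-$k$ users respond to the model they anticipate $k-1$ rounds ahead, so each round effectively applies $f$ $k$ times and reaches the same fixed point in fewer rounds:

```python
def retrain(theta, level=1):
    """One platform retraining round when users best-respond to the model
    they anticipate `level - 1` rounds ahead (toy level-k dynamics).
    The retraining map f is an assumed contraction with fixed point 2.0."""
    f = lambda t: 0.5 * t + 1.0
    for _ in range(level):
        theta = f(theta)
    return theta
```

Iterating from $\theta_0 = 0$, both level-1 and level-2 dynamics converge to the same equilibrium $\theta^* = 2$; level-2 simply gets there in about half the rounds, mirroring the abstract's finding that higher-level reasoning accelerates convergence without changing the equilibrium.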
Unveiling the Uncertainty in Embodied and Operational Carbon of Large AI Models through a Probabilistic Carbon Accounting Model
Xiaoyang Zhang · He Fang · Yang Deng · Dan Wang
The rapid growth of large AI models has raised significant environmental concerns due to their substantial carbon footprint. Existing carbon accounting methods for AI models are fundamentally deterministic and fail to account for inherent uncertainties in embodied and operational carbon emissions. Our work aims to investigate the effect of these uncertainties on embodied and operational carbon footprint estimates for large AI models. We propose a Probabilistic Carbon Accounting Model (PCAM), which quantifies uncertainties in the carbon accounting of large AI models. We develop parameter models to quantify key components (processors, memory, storage) in the carbon footprint of AI models. To characterize the distribution of the parameters, we develop a carbon dataset by aggregating related data from various sources. Then, we generate the probabilistic distribution of the parameters from the collected dataset. We compare the performance of PCAM with LLMCarbon, the state-of-the-art carbon accounting method for large AI models. PCAM achieves $\leq7.44\%$ error compared to LLMCarbon’s $\leq108.51\%$.
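A hedged sketch of the general probabilistic-accounting idea (the distributions, parameter values, and units below are invented for illustration and are not PCAM's fitted models): draw each uncertain parameter from a distribution and propagate the samples through an embodied-plus-operational carbon model, yielding a distribution over total emissions rather than a single point estimate:

```python
import random

def sample_carbon_footprint(n_samples=10_000, seed=0):
    """Monte Carlo propagation of parameter uncertainty through a toy
    embodied + operational carbon model. All distributions and constants
    are hypothetical placeholders (illustrative units: tCO2e)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_samples):
        chip_embodied = rng.lognormvariate(1.0, 0.3)  # per-accelerator embodied carbon
        n_chips = 1000                                 # accelerators in the cluster
        energy_mwh = rng.gauss(5000, 500)              # training energy draw
        grid_intensity = rng.uniform(0.2, 0.6)         # tCO2e per MWh of grid power
        totals.append(chip_embodied * n_chips + energy_mwh * grid_intensity)
    return totals
```

Percentiles of the returned samples give credible intervals for the footprint, in contrast to the single deterministic number a method like LLMCarbon reports.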
Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
Zhe Hu · Jing Li · Zhongzhu Pu · Hou Pong (Ken) Chan · Yu Yin
Vision Language Models exhibit impressive performance for various tasks, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.
Role-aware Multi-agent Reinforcement Learning for Coordinated Emergency Traffic Control
Ming Cheng · Hao Chen · Zhiqing Li · Jia Wang · Senzhang Wang
Emergency traffic control presents an increasingly critical challenge, requiring seamless coordination among emergency vehicles, regular vehicles, and traffic lights to ensure efficient passage for all vehicles. Existing models focus primarily on traffic light control, leaving emergency and regular vehicles prone to delays due to the lack of navigation strategies. To address this issue, we propose the Role-aware Multi-agent Traffic Control (RMTC) framework, which dynamically assigns appropriate roles to traffic components for better cooperation by considering their relations with emergency vehicles and adaptively adjusting their policies. Specifically, RMTC introduces a Heterogeneous Temporal Traffic Graph (HTTG) to model the spatial and temporal relationships among all traffic components (traffic lights, regular and emergency vehicles) at each time step. Furthermore, we develop a Dynamic Role Learning model to infer the evolving roles of traffic lights and regular vehicles based on the HTTG. Finally, we present a Role-aware Multi-agent Reinforcement Learning approach that learns traffic policies conditioned on the dynamically inferred roles. Extensive experiments across four public traffic scenarios show that RMTC outperforms existing traffic light control methods by significantly reducing emergency vehicle travel time, while effectively preserving traffic efficiency for regular vehicles. The code is released at https://anonymous.4open.science/r/RMTC-5E28.
A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis
Dongheng Lin · Mengxue Qu · Kunyang Han · Jianbo Jiao · Xiaojie Jin · Yunchao Wei
Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal, typically outputting only frame-wise anomaly scores without spatial or semantic context. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, our approach leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization in a fully zero-shot manner. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/UnifiedFrameVAA/.
CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization
Yichen Yan · Ming Zhong · Qi Zhu · Xiaoling Gu · Jinpeng Chen · Huan Li
Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling the scorer to assign CoIDO scores to all data points. This unified scoring approach allows for direct ranking and selection of the most valuable subsets, completely bypassing the need for specialized algorithms. In our experiments, we trained the CoIDO Scorer using only 20% of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20% subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2% of the performance of full-data fine-tuning, on average. Moreover, CoIDO outperforms all competitors in terms of both efficiency (lowest training FLOPs) and aggregated accuracy. Our code is available at: https://github.com/SuDIS-ZJU/CoIDO
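The "homoscedastic uncertainty-based formulation" plausibly follows the standard multi-task weighting of Kendall et al.; the sketch below shows that standard form under this assumption (it is not CoIDO's exact objective): each loss term is scaled by a learned precision and regularized so that neither objective is silently ignored:

```python
import math

def coupled_loss(l_imp, l_div, log_var_imp, log_var_div):
    """Homoscedastic-uncertainty weighting of an importance loss and a
    diversity loss. Each term is scaled by a learned precision
    exp(-log_var) and penalized by log_var, so the balance between the
    two objectives is learned rather than hand-tuned. Standard form,
    assumed here for illustration."""
    return (math.exp(-log_var_imp) * l_imp + log_var_imp
            + math.exp(-log_var_div) * l_div + log_var_div)
```

During training, the log-variances are optimized jointly with the scorer: raising a log-variance downweights a noisy objective but pays the additive regularization penalty, which prevents the trivial solution of ignoring both losses.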
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
Jaehyun Nam · Jinsung Yoon · Jiefeng Chen · Jinwoo Shin · Sercan Arik · Tomas Pfister
Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on the MLE-bench, significantly outperforming the best alternative.
Protein Design with Dynamic Protein Vocabulary
Nuowei Liu · Jiahao Kuang · Yanting Liu · Tao Ji · Changzhi Sun · Man Lan · Yuanbin Wu
Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.62%.
Ultra-high Resolution Watermarking Framework Resistant to Extreme Cropping and Scaling
Nan Sun · LuYu Yuan · Han Fang · Yuxing Lu · Hefei Ling · Sijing Xie · Chengxin Zhao
Recent developments in DNN-based image watermarking techniques have achieved impressive results in protecting digital content. However, most existing methods are constrained to low-resolution images as they need to encode the entire image, leading to prohibitive memory and computational costs when applied to high-resolution images. Moreover, they lack robustness to distortions prevalent in large-image transmission, such as extreme scaling and random cropping. To address these issues, we propose a novel watermarking method based on implicit neural representations (INRs). Leveraging the properties of INRs, our method employs resolution-independent coordinate sampling mechanism to generate watermarks pixel-wise, achieving ultra-high resolution watermark generation with fixed and limited memory and computational resources. This design ensures strong robustness in watermark extraction, even under extreme cropping and scaling distortions. Additionally, we introduce a hierarchical multi-scale coordinate embedding and a low-rank watermark injection strategy to ensure high-quality watermark generation and robust decoding. Experimental results demonstrate that our method significantly outperforms existing schemes in terms of both robustness and computational efficiency while preserving high image quality. Our approach achieves an accuracy greater than 98\% in watermark extraction with only 0.4\% of the image area in 2K images. These results highlight the effectiveness of our method, making it a promising solution for large-scale and high-resolution image watermarking applications.
MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans
Ahmet Karadeniz · Dimitrios Mallis · Danila Rukhovich · Kseniya Cherenkova · Anis Kacem · Djamila Aouada
Computer-Aided Design (CAD) plays a foundational role in modern manufacturing and product development, often requiring designers to modify or build upon existing models. Converting 3D scans into parametric CAD representations—a process known as CAD reverse engineering—remains a significant challenge due to the high precision and structural complexity of CAD models. Existing deep learning-based approaches typically fall into two categories: bottom-up, geometry-driven methods, which often fail to produce fully parametric outputs, and top-down strategies, which tend to overlook fine-grained geometric details. Moreover, current methods neglect an essential aspect of CAD modeling: sketch-level constraints. In this work, we introduce a novel approach to CAD reverse engineering inspired by how human designers manually perform the task. Our method leverages multi-plane cross-sections to extract 2D patterns and capture fine parametric details more effectively. It enables the reconstruction of detailed and editable CAD models, outperforming state-of-the-art methods and, for the first time, incorporating sketch constraints directly into the reconstruction process.
Tail-Optimized Caching for LLM Inference
Wenxin Zhang · Yueying Li · Ciamac C Moallemi · Tianyi Peng
Prompt caching is critical for reducing latency and cost in LLM inference---OpenAI and Anthropic report up to 50–90\% cost savings through prompt reuse. Despite its widespread success, little is known about what constitutes an optimal prompt caching policy, particularly when optimizing tail latency—a metric of central importance to practitioners. The widely used Least Recently Used (LRU) policy can perform arbitrarily poorly on this metric, as it is oblivious to the heterogeneity of conversation lengths. To address this gap, we propose Tail-Optimized LRU, a simple two-line modification that reallocates KV cache capacity to prioritize high-latency conversations by evicting cache entries that are unlikely to affect future turns. Though the implementation is simple, we prove its optimality under a natural stochastic model of conversation dynamics, providing the first theoretical justification for LRU in this setting---a result that may be of independent interest to the caching community. Experimentally, on real conversation data from WildChat~\citep{zhao2024wildchat}, Tail-Optimized LRU achieves up to 27.5\% reduction in P90 tail Time to First Token latency and 23.9\% in P95 tail latency compared to LRU, along with up to a 38.9\% decrease in violations of a 200ms SLO. We believe this provides a practical and theoretically grounded option for practitioners seeking to optimize tail latency in real-world LLM deployments.
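A minimal sketch conveys the flavor of the modification described (hedged: `TailOptimizedLRU` is not the authors' code, and the real policy's continuation predictor is derived from the paper's stochastic model of conversation dynamics; here `will_continue` is just a caller-supplied callable):

```python
from collections import OrderedDict

class TailOptimizedLRU:
    """Plain LRU, except eviction prefers entries whose conversation is
    judged unlikely to see another turn, falling back to ordinary LRU."""

    def __init__(self, capacity, will_continue):
        self.capacity = capacity
        self.will_continue = will_continue      # conv_id -> bool (hypothetical)
        self.cache = OrderedDict()              # conv_id -> kv_state, LRU order

    def get(self, conv_id):
        if conv_id in self.cache:
            self.cache.move_to_end(conv_id)     # mark as most recently used
            return self.cache[conv_id]
        return None

    def put(self, conv_id, kv_state):
        self.cache[conv_id] = kv_state
        self.cache.move_to_end(conv_id)
        while len(self.cache) > self.capacity:
            # The tweak: victim is the LRU entry among conversations unlikely
            # to continue; if none qualifies, evict the global LRU entry.
            victim = next((k for k in self.cache if not self.will_continue(k)),
                          next(iter(self.cache)))
            del self.cache[victim]
```

Under this sketch, a long-running conversation keeps its KV entries even when it is the least recently used, so its future turns avoid the recomputation that dominates tail latency.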
OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang · Bowen Wang · Dunjie Lu · Junlin Yang · Tianbao Xie · Junli Wang · Jiaqi Deng · Xiaole Guo · Yiheng Xu · Chen Wu · Zhennan Shen · Zhuokai Li · Ryan Li · Xiaochuan Li · Junda Chen · Boyuan Zheng · LI PEIHANG · Fangyu Lei · Ruisheng Cao · Yeqiao Fu · Dongchan Shin · Martin Shin · Hu Jiarui · Yuyan Wang · Jixuan Chen · Yuxiao Ye · Danyang Zhang · Yipu Wang · Heng Wang · Diyi Yang · Victor Zhong · Y.Charles · Zhilin Yang · Tao Yu
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld‑Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems
Jiahao Yu · Haozhuang Liu · Yeqiu Yang · Lu Chen · Jian Wu · Yuning Jiang · Bo Zheng
Regression models are crucial in recommender systems. However, the retransformation bias problem has been conspicuously neglected within the community. While many works in other fields have devised effective bias correction methods, all of them are post-hoc corrections external to the model, facing practical challenges when applied to real-world recommender systems. Hence, we propose a preemptive paradigm to eradicate the bias intrinsically from the models via minor model refinement. Specifically, a novel TranSUN method is proposed with a joint bias learning manner that offers theoretically guaranteed unbiasedness with empirically superior convergence. It is further generalized into a novel generic regression model family, termed Generalized TranSUN (GTS), which not only offers more theoretical insights but also serves as a generic framework for flexibly developing various bias-free models. Comprehensive experimental results demonstrate the superiority of our methods across data from various domains. The methods have been successfully deployed in two real-world industrial recommendation scenarios, i.e., the product and short-video recommendation scenarios in the Guess What You Like business domain on the homepage of Taobao App (a leading e-commerce platform with DAU > 300M), to serve the major online traffic.
CURV: Coherent Uncertainty-Aware Reasoning in Vision-Language Models for X-Ray Report Generation
Ziao Wang · Sixing Yan · Kejing Yin · Xiaofeng Zhang · William K. Cheung
Vision-language models have been explored for radiology report generation with promising results. Yet, uncertainty elaborated in findings and the reasoning process for reaching clinical impressions are seldom explicitly modeled, reducing the clinical accuracy and trustworthiness of the generated reports. We present CURV, a novel framework that alleviates these limitations through integrated awareness of uncertainty and explicit reasoning capabilities. Our approach consists of three key components: (1) an uncertainty modeling mechanism that teaches the model to recognize and express appropriate levels of diagnostic confidence, (2) a structured reasoning framework that generates intermediate explanatory steps connecting visual findings to clinical impressions, and (3) a reasoning coherence reward that ensures logical consistency among findings, reasoning, and impressions. We implement CURV through a three-stage training pipeline that combines uncertainty-aware fine-tuning, reasoning initialization, and reinforcement learning. In particular, we adopt a comprehensive reward function that addresses multiple aspects of report quality, incorporating medical term matching, uncertainty expression evaluation, and semantic coherence evaluation. Experimental results demonstrate that CURV generates clinically relevant reports with appropriate uncertainty expressions and transparent reasoning traces, significantly outperforming previous methods. CURV represents a substantial advancement toward interpretable and trustworthy AI-generated radiology reports, with broader implications for the deployment of vision-language models in high-stakes clinical environments where uncertainty awareness and reasoning transparency are essential.
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics
Jie Zhang · Cezara Petrui · Kristina Nikolić · Florian Tramer
Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions---failing to capture the nature of mathematics encountered in actual research environments. We introduce \textsc{RealMath}, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for working mathematicians despite limitations on highly challenging problems. The code and dataset for \textsc{RealMath} are publicly available.
Evaluating Program Semantics Reasoning with Type Inference in System $F$
Yifeng He · Luning Yang · Christopher Gonzalo · Hao Chen
Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute reasoning capabilities promise significant potential in understanding program logic and semantics beyond mere token recognition. However, current benchmarks evaluating reasoning LLMs for code lack a formal, program-centric deductive framework for sound evaluation, leaving them unable to assess whether models genuinely reason about program semantics or merely associate superficial connections between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as *program semantics reasoning*. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only $55.85\%$ accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess the robustness and effectiveness of extended reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
Crucible: Quantifying the Potential of Control Algorithms through LLM Agents
Lianchen Jia · Chaoyang Li · Qian Houde · Tianchi Huang · Jiangchuan Liu · Lifeng Sun
Control algorithms in production environments typically require domain experts to tune their parameters and logic for specific scenarios. However, existing research predominantly focuses on algorithmic performance under ideal or default configurations, overlooking the critical aspect of Tuning Potential. To bridge this gap, we introduce \texttt{Crucible}, an agent that employs an LLM-driven, multi-level expert simulation to tune algorithms and defines a formalized metric to quantitatively evaluate their Tuning Potential. We demonstrate \texttt{Crucible}'s effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real-world deployment. Our experimental results reveal that \texttt{Crucible} systematically quantifies the tunable space across different algorithms. Furthermore, \texttt{Crucible} provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements. Our code is available at https://github.com/thu-media/Crucible.
FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes
Christodoulos Constantinides · Dhaval Patel · Shuxin Lin · Claudio Guerrero · Sunil Patil · Jayant Kalagnanam
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through different lenses: Perturbation-Uncertainty-Complexity analysis, an expert evaluation study, asset-specific knowledge-gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals that performance drops significantly and is fragile under perturbations and distractions, exposing inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and a Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) ``LLMFeatureSelector'', an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
Position: Biology is the Challenge Physics-Informed ML Needs to Evolve
Julien Martinelli
Physics-Informed Machine Learning (PIML) has successfully integrated mechanistic understanding into machine learning, particularly in domains governed by well-known physical laws. This success has motivated efforts to apply PIML to biology, a field rich in dynamical systems but shaped by different constraints. Biological modeling, however, presents unique challenges: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex, high-dimensional networks. \textbf{In this position paper, we argue that these challenges should not be seen as obstacles to PIML, but as catalysts for its evolution. We propose Biology-Informed Machine Learning (BIML): a principled extension of PIML that retains its structural grounding while adapting to the practical realities of biology.} Rather than replacing PIML, BIML retools its methods to operate under softer, probabilistic forms of prior knowledge. We outline four foundational pillars as a roadmap for this transition: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Foundation Models and Large Language Models will be key enablers, bridging human expertise with computational modeling. We conclude with concrete recommendations to build the BIML ecosystem and channel PIML-inspired innovation toward challenges of high scientific and societal relevance.
When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration
Quan Shi · Carlos Jimenez · Shunyu Yao · Nick Haber · Diyi Yang · Karthik Narasimhan
As large language models (LLMs) increasingly serve as close collaborators for humans, it is crucial that they express their reasoning in ways that humans can understand and learn from. However, this capability remains relatively less understood and under-evaluated. To address this, we introduce a conceptual framework for such Human-AI knowledge transfer capabilities and conduct the first large-scale user study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an LLM on problem-solving strategies, then independently implement solutions, isolating the influence of model reasoning on human understanding. Our findings reveal that while model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent with significant outliers, highlighting that knowledge transfer is a distinct capability requiring dedicated optimization. Our analysis uncovers behavioral and strategic factors that mediate successful knowledge transfer, and we release our code, dataset, and evaluation framework to support future work on communicatively aligned models.
FuXi-Ocean: A Global Ocean Forecasting System with Sub-Daily Resolution
Qiusheng Huang · Yuan Niu · Xiaohui Zhong · AnboyuGuo · Lei Chen · dianjun zhang · Xuefeng Zhang · Hao Li
Accurate, high-resolution ocean forecasting is crucial for maritime operations and environmental monitoring. While traditional numerical models are capable of producing sub-daily, eddy-resolving forecasts, they are computationally intensive and face challenges in maintaining accuracy at fine spatial and temporal scales. In contrast, recent data-driven approaches offer improved computational efficiency and emerging potential, yet typically operate at daily resolution and struggle with sub-daily predictions due to error accumulation over time. We introduce FuXi-Ocean, the first data-driven global ocean forecasting model achieving six-hourly predictions at eddy-resolving 1/12° spatial resolution, reaching depths of up to 1500 meters. The model architecture integrates a context-aware feature extraction module with a predictive network employing stacked attention blocks. The core innovation is the Mixture-of-Time (MoT) module, which adaptively integrates predictions from multiple temporal contexts by learning variable-specific reliability, mitigating cumulative errors in sequential forecasting. Through comprehensive experimental evaluation, FuXi-Ocean demonstrates superior skill in predicting key variables, including temperature, salinity, and currents, across multiple depths.
Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go
Yichuan Ma · Linyang Li · Yongkang Chen · Peiji Li · Jiasheng Ye · Qipeng Guo · Dahua Lin · Kai Chen
Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs' general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human expert-level performance in Go.
QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
Changxin Ke · Rui Zhang · Shuo Wang · Li Ding · Guangli Li · Yuanbo Wen · Shuoming Zhang · Ruiyuan Xu · Jin Qin · Jiaming Guo · Chenxi Wang · Ling Li · Qi Guo · Yunji Chen
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates demand for automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose \textbf{QiMeng-MuPa}, a novel \textbf{Mu}tual-Supervised Learning framework for Sequential-to-\textbf{Pa}rallel code translation, to address the functional equivalence issue. QiMeng-MuPa consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91\% and boosts Tester performance by 68.90\%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at \url{https://github.com/kcxain/mupa}.
From Pose to Muscle: Multimodal Learning for Piano Hand Muscle Electromyography
RUOFAN LIU · YICHEN PENG · Takanori Oku · Chen-Chieh Liao · Erwin Wu · Shinichi Furuya · Hideki Koike
Muscle coordination is fundamental when humans interact with the world. Reliable estimation of hand muscle engagement can serve as a source of internal feedback, supporting the development of embodied intelligence and the acquisition of dexterous skills. However, contemporary electromyography (EMG) sensing techniques either require prohibitively expensive devices or are constrained to gross motor movements, which inherently involve large muscles. On the other hand, EMGs exhibit dependency on individual anatomical variability and task-specific contexts, resulting in limited generalization. In this work, we preliminarily investigate the latent pose-EMG correspondence using a general EMG gesture dataset. We further introduce a multimodal dataset, PianoKPM Dataset, and a hand muscle estimation framework, PianoKPM Net, to facilitate high-fidelity EMG inference. Subsequently, our approach is compared against reproducible competitive baselines. Generalization and adaptation to unseen users and tasks are evaluated while varying the training-set scale and the amount of included data.
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Xiao Yu · Yan Fang · Yao Zhao · Yunchao Wei
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the $\textbf{Pre}$dictive $\textbf{F}$uture $\textbf{M}$odeling (PreFM) framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding.
REDOUBT: Duo Safety Validation for Autonomous Vehicle Motion Planning
Shuguang Wang · Qian Zhou · Kui Wu · Dapeng Wu · Wei-Bin Lee · Jianping Wang
Safety validation, which assesses the safety of an autonomous system's motion planning decisions, is critical for the safe deployment of autonomous vehicles. Existing input validation techniques from other machine learning domains, such as image classification, face unique challenges in motion planning due to its contextual properties, including complex inputs and one-to-many mapping. Furthermore, current output validation methods in autonomous driving primarily focus on open-loop trajectory prediction, which is ill-suited for the closed-loop nature of motion planning. We introduce REDOUBT, the first systematic safety validation framework for autonomous vehicle motion planning that employs a duo mechanism, simultaneously inspecting input distributions and output uncertainty. REDOUBT identifies previously overlooked unsafe modes arising from the interplay of In-Distribution/Out-of-Distribution (OOD) scenarios and certain/uncertain planning decisions. We develop specialized solutions for both OOD detection via latent flow matching and decision uncertainty estimation via an energy-based approach. Our extensive experiments demonstrate that both modules outperform existing approaches, under both open-loop and closed-loop evaluation settings. Our codes are available at: https://github.com/sgNicola/Redoubt.
3D Human Pose Estimation with Muscles
Kevin Zhu · AliAsghar MohammadiNasrabadi · Alexander Wong · John McPhee
We introduce MusclePose as an end-to-end learnable physics-infused 3D human pose estimator that incorporates muscle-dynamics modeling to infer human dynamics from monocular video. Current physics pose estimators aim to predict physically plausible poses by enforcing the underlying dynamics equations that govern motion. Since this is an underconstrained problem without force-annotated data, methods often estimate kinetics with external physics optimizers that may not be compatible with existing learning frameworks, or are too slow for real-time inference. While more recent methods use a regression-based approach to overcome these issues, the estimated kinetics can be seen as auxiliary predictions, and may not be physically plausible. To this end, we build on existing regression-based approaches, and aim to improve the biofidelity of kinetic inference with a multihypothesis approach --- by inferring joint torques via Lagrange’s equations and via muscle dynamics modeling with muscle torque generators. Furthermore, MusclePose predicts detailed human anthropometrics based on values from biomechanics studies, in contrast to existing physics pose estimators that construct their human models with shape primitives. We show that MusclePose is competitive with existing 3D pose estimators in positional accuracy, while also able to infer plausible human kinetics and muscle signals consistent with values from biomechanics studies, without requiring an external physics engine.
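For reference, the Lagrangian route to joint torques mentioned above corresponds to standard rigid-body inverse dynamics; in the usual notation (standard symbols, not necessarily the paper's):

```latex
\tau \;=\; M(q)\,\ddot{q} \;+\; C(q,\dot{q})\,\dot{q} \;+\; g(q) \;+\; J(q)^{\top} F_{\mathrm{ext}}
```

where $q$ collects the joint angles, $M(q)$ is the mass matrix, $C(q,\dot{q})\dot{q}$ gathers Coriolis and centrifugal terms, $g(q)$ is gravity, and $J(q)^{\top}F_{\mathrm{ext}}$ maps external contact forces into joint space. MusclePose's second hypothesis instead composes $\tau$ from muscle torque generators.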
Which Algorithms Have Tight Generalization Bounds?
Michael Gastpar · Ido Nachum · Jonathan Shafer · Thomas Weinberger
We study which machine learning algorithms have tight generalization bounds with respect to a given collection of population distributions. Our results build on and extend the recent work of Gastpar et al. (2023). First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently loss-stable do have tight generalization bounds. We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm's loss.
A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification
Sebastian Ojeda · Rafael Velasquez · Nicolás Aparicio · Juanita Puentes · Paula Cárdenas · Nicolás Andrade · Gabriel González · Sergio Rincón · Carolina Muñoz-Camargo · Pablo Arbelaez
Antimicrobial peptides have emerged as promising molecules to combat antimicrobial resistance. However, fragmented datasets, inconsistent annotations, and the lack of standardized benchmarks hinder computational approaches and slow down the discovery of new candidates. To address these challenges, we present the Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE), an experimental framework integrating over 80,000 peptides from 27 validated repositories. Our dataset separates antimicrobial peptides from negative sequences and incorporates their functional annotations into a biologically coherent multilabel hierarchy, capturing activities across antibacterial, antifungal, antiviral, and antiparasitic classes. Building on ESCAPE, we propose a transformer-based model that leverages sequence and structural information to predict multiple functional activities of peptides. Our method achieves up to a 2.56% relative average improvement in mean Average Precision over the second-best method adapted for this task, establishing a new state of the art in multilabel peptide classification. ESCAPE provides a comprehensive and reproducible evaluation framework to advance AI-driven antimicrobial peptide research.
Rationalized All-Atom Protein Design with Unified Multi-Modal Bayesian Flow
Hanlin Wu · Yuxuan Song · Zhe Zhang · Zhilong Zhang · Hao Zhou · Wei-Ying Ma · Jingjing Liu
Designing functional proteins is a critical yet challenging problem due to the intricate interplay between backbone structures, sequences, and side-chains. Current approaches often decompose protein design into separate tasks, which can lead to accumulated errors, while recent efforts increasingly focus on all-atom protein design. However, we observe that existing all-atom generation approaches suffer from an information shortcut issue, where models inadvertently infer sequences from side-chain information, compromising their ability to accurately learn sequence distributions. To address this, we introduce a novel rationalized information flow strategy to eliminate the information shortcut. Furthermore, motivated by the advantages of Bayesian flows over differential equation–based methods, we propose the first Bayesian flow formulation for protein backbone orientations by recasting orientation modeling as an equivalent hyperspherical generation problem with antipodal symmetry. In validation, our method delivers consistently exceptional performance in both peptide and antibody design tasks.
DMol: A Highly Efficient and Chemical Motif-Preserving Molecule Generation Platform
Peizhi Niu · Yu-Hsiang Wang · Vishal Rana · Chetan Rupakheti · Abhishek Pandey · Olgica Milenkovic
We introduce a new graph diffusion model for small drug molecule generation which simultaneously offers a 10-fold reduction in the number of diffusion steps when compared to existing methods, preservation of small molecule graph motifs via motif compression, and an average 3\% improvement in SMILES validity over the DiGress model across all real-world molecule benchmarking datasets. Furthermore, our approach outperforms the state-of-the-art DeFoG method with respect to motif-conservation by roughly 4\%, as evidenced by high ChEMBL-likeness, QED and newly introduced shingles distance scores. The key ideas behind the approach are to use a combination of deterministic and random subgraph perturbations, so that the node and edge noise schedules are codependent; to modify the loss function of the training process in order to exploit the deterministic component of the schedule; and to ''compress'' a collection of highly relevant carbon ring and other motif structures into supernodes in a way that allows for simple subsequent integration into the molecular scaffold.
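The supernode idea can be pictured with a minimal graph utility (a sketch under assumed representations: an undirected edge set and a motif given as a node set; `compress_motif` and the supernode label are illustrative, not DMol's actual interface):

```python
def compress_motif(edges, motif_nodes, supernode):
    """Collapse the nodes of one motif (e.g., a carbon ring) into a single
    supernode: edges internal to the motif disappear inside the supernode,
    while edges connecting the motif to the rest of the scaffold are kept."""
    new_edges = set()
    for u, v in edges:
        u2 = supernode if u in motif_nodes else u
        v2 = supernode if v in motif_nodes else v
        if u2 != v2:  # drop edges fully inside the motif
            new_edges.add(tuple(sorted((u2, v2), key=str)))
    return new_edges

# A six-ring (nodes 0-5) with one substituent attached at node 0.
ring_plus_tail = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 6)}
compressed = compress_motif(ring_plus_tail, {0, 1, 2, 3, 4, 5}, "ring:C6")
```

Diffusion then runs on the much smaller compressed graph; re-expanding each supernode into its stored motif reinserts the ring intact, which is how motif structure can be preserved exactly.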
LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation
Subhojyoti Khastagir · KISHALAY DAS · Pawan Goyal · Seung-Cheol Lee · Satadeep Bhattacharjee · niloy ganguly
Recent advances in generative modeling have shown significant promise in designing novel periodic crystal structures. Existing approaches typically rely on either large language models (LLMs) or equivariant denoising models, each with complementary strengths: LLMs excel at handling discrete atomic types but often struggle with continuous features such as atomic positions and lattice parameters, while denoising models are effective at modeling continuous variables but encounter difficulties in generating accurate atomic compositions. To bridge this gap, we propose CrysLLMGen, a hybrid framework that integrates an LLM with a diffusion model to leverage their complementary strengths for crystal material generation. During sampling, CrysLLMGen first employs a fine-tuned LLM to produce an intermediate representation of atom types, atomic coordinates, and lattice structure. While retaining the predicted atom types, it passes the atomic coordinates and lattice structure to a pre-trained equivariant diffusion model for refinement. Our framework outperforms state-of-the-art generative models across several benchmark tasks and datasets. Specifically, CrysLLMGen not only achieves a balanced performance in terms of structural and compositional validity but also generates more stable and novel materials compared to LLM-based and denoising-based models. Furthermore, CrysLLMGen exhibits strong conditional generation capabilities, effectively producing materials that satisfy user-defined constraints. Code is available at https://github.com/kdmsit/crysllmgen.
Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets
Runhan Shi · Letian Chen · Gufeng Yu · Yang Yang
Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which computes molecular shingle differences to capture reaction-specific structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76\% on R$^2$ under permutation perturbations.
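The symmetric-difference shingle encoding above admits a compact illustration. Below, character n-grams of SMILES strings stand in for the substructure shingles a cheminformatics toolkit would extract; the function names and the n-gram stand-in are our assumptions, not ReaDISH's actual encoder:

```python
def shingles(smiles, k=3):
    """k-character shingles of a SMILES string -- a toy stand-in for
    substructure shingles from a cheminformatics toolkit."""
    return {smiles[i:i + k] for i in range(len(smiles) - k + 1)}

def symmetric_difference_shingles(reactants, products, k=3):
    """The union over each side is permutation-invariant; the symmetric
    difference keeps only shingles that appear or disappear across the
    reaction, i.e. the reaction-specific structural change."""
    r = set().union(*(shingles(m, k) for m in reactants))
    p = set().union(*(shingles(m, k) for m in products))
    return r ^ p

# Reordering the reactant molecules leaves the encoding unchanged.
a = symmetric_difference_shingles(["CCO", "CC(=O)O"], ["CC(=O)OCC"])
b = symmetric_difference_shingles(["CC(=O)O", "CCO"], ["CC(=O)OCC"])
```

Because both the union and the symmetric difference are set operations, the encoding is invariant to molecule ordering by construction, which is the property the abstract targets.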
Unified all-atom molecule generation with neural fields
Matthieu Kirchmeyer · Pedro O. Pinheiro · Emma Willett · Karolis Martinkus · Joseph Kleinhenz · Emily Makowski · Andrew Watkins · Vladimir Gligorijevic · Richard Bonneau · Saeed Saremi
Generative models for structure-based drug design are often limited to a specific modality, restricting their broader applicability. To address this challenge, we introduce FuncBind, a framework based on computer vision to generate target-conditioned, all-atom molecules across atomic systems. FuncBind uses neural fields to represent molecules as continuous atomic densities and employs score-based generative models with modern architectures adapted from the computer vision literature. This modality-agnostic representation allows a single unified model to be trained on diverse atomic systems, from small to large molecules, and to handle variable atom/residue counts, including non-canonical amino acids. FuncBind achieves competitive in silico performance in generating small molecules, macrocyclic peptides, and antibody complementarity-determining region loops, conditioned on target structures. FuncBind also generated novel antibody binders, validated in vitro, via de novo redesign of the complementarity-determining region H3 loop of two chosen co-crystal structures. As a final contribution, we introduce a new dataset and benchmark for structure-conditioned macrocyclic peptide generation.
Amortized Sampling with Transferable Normalizing Flows
Charlie Tan · Majdi Hassan · Leon Klein · Saifuddin Syed · Dominique Beaini · Michael Bronstein · Alexander Tong · Kirill Neklyudov
Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest in overcoming this limitation by learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Ensemble, a 200-million-parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Ensemble draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving previously intractable transferability across sequence lengths, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Ensemble as a proposal for a variety of sampling algorithms, finding that a simple importance-sampling-based finetuning procedure achieves superior performance to established methods such as sequential Monte Carlo. We open-source the Ensemble codebase, model weights, and training dataset to further stimulate research into amortized sampling methods and finetuning objectives.
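The importance-sampling correction underlying the finetuning procedure mentioned above can be sketched generically: proposal samples are reweighted toward the target using the two log-densities. A minimal self-normalized importance sampling example (a generic sketch, not Ensemble's objective):

```python
import numpy as np

def snis_expectation(xs, log_p, log_q, f):
    """Self-normalized importance sampling: reweight samples xs ~ q
    toward the target p, then average f under normalized weights."""
    log_w = log_p(xs) - log_q(xs)
    w = np.exp(log_w - log_w.max())   # subtract max for stability
    w /= w.sum()
    return np.sum(w * f(xs))

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 2.0, size=200_000)  # proposal q = N(0, 2^2)
log_q = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)
log_p = lambda x: -0.5 * (x - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)  # target N(1, 1)

est = snis_expectation(xs, log_p, log_q, lambda x: x)  # estimates E_p[x] = 1
```

This is exactly why efficient exact likelihoods matter for flow-based proposals: `log_q` must be evaluable for every sample the flow draws.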
HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing
Johann Flemming Gloy · Simon Olsson
Flow and diffusion-based models have emerged as powerful tools for scientific applications, particularly for sampling non-normalized probability distributions, as exemplified by Boltzmann Generators (BGs). A critical challenge in deploying these models is their reliance on sample likelihood computations, which scale prohibitively with system size $n$, often rendering them infeasible for large-scale problems. To address this, we introduce $\textit{HollowFlow}$, a flow-based generative model leveraging a novel non-backtracking graph neural network (NoBGNN). By enforcing a block-diagonal Jacobian structure, HollowFlow likelihoods are evaluated with a constant number of backward passes in $n$, yielding speed-ups of up to $\mathcal{O}(n^2)$: a significant step towards scaling BGs to larger systems. Crucially, our framework generalizes: $\textbf{any equivariant GNN or attention-based architecture}$ can be adapted into a NoBGNN. We validate HollowFlow by training BGs on two different systems of increasing size. For both systems, the sampling and likelihood evaluation time decreases dramatically, following our theoretical scaling laws. For the larger system we obtain a $10^2\times$ speed-up, clearly illustrating the potential of HollowFlow-based approaches for high-dimensional scientific problems previously hindered by computational bottlenecks.
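The cheap likelihood evaluation above follows from the enforced block-diagonal Jacobian: its log-determinant decomposes into a sum over blocks, so no full $n \times n$ determinant is ever needed. A minimal numerical check of that identity (an illustration only, not the NoBGNN itself):

```python
import numpy as np

def logdet_block_diagonal(blocks):
    """log|det J| of a block-diagonal Jacobian is the sum of the
    per-block log|det| terms, handled independently per block."""
    return sum(np.linalg.slogdet(B)[1] for B in blocks)

rng = np.random.default_rng(0)
# Four 3x3 blocks, shifted toward the identity so they are well-conditioned.
blocks = [rng.standard_normal((3, 3)) + 3.0 * np.eye(3) for _ in range(4)]

# Assemble the equivalent full 12x12 Jacobian for comparison.
full = np.zeros((12, 12))
for i, B in enumerate(blocks):
    full[3 * i:3 * i + 3, 3 * i:3 * i + 3] = B
```

In the flow setting, each block's Jacobian can be extracted with a fixed number of backward passes regardless of system size, which is the source of the reported speed-ups.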
MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation
Yang Han · Pengyu Wang · Kai Yu · xin chen · Lu Chen
Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint–molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is an order of magnitude faster than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness. We provide the data and code at https://github.com/OpenDFM/MS-BART.
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
Myeongsoo Kim · Shweta Garg · Baishakhi Ray · Varun Kumar · Anoop Deoras
Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings, with questions grounded in actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from GitHub issues tagged with questions using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 214 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB's issues from recent repositories (post-training cutoff). This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions. Our fully automated framework enables continuous benchmark expansion and is available at https://github.com/amazon-science/CodeAssistBench/.
AlphaFold Database Debiasing for Robust Inverse Folding
Cheng Tan · Zhenxiao Cao · Zhangyang Gao · Siyuan Li · Yufei Huang · Stan Z. Li
The AlphaFold Protein Structure Database (AFDB) offers unparalleled structural coverage at near-experimental accuracy, positioning it as a valuable resource for data-driven protein design. However, its direct use in training deep models that are sensitive to fine-grained atomic geometry—such as inverse folding—exposes a critical limitation. Comparative analysis of structural feature distributions reveals that AFDB structures exhibit distinct statistical regularities, reflecting a systematic geometric bias that deviates from the conformational diversity found in experimentally determined structures from the Protein Data Bank (PDB). While AFDB structures are cleaner and more idealized, PDB structures capture the intrinsic variability and physical realism essential for generalization in downstream tasks. To address this discrepancy, we introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native-like conformations from intentionally corrupted backbone geometries. By training the model to recover plausible structural states, DeSAE implicitly captures a more robust and natural structural manifold. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance across multiple benchmarks. This work highlights the critical impact of subtle systematic biases in predicted structures and presents a principled framework for debiasing, significantly boosting the performance of structure-based learning tasks like inverse folding.
Unsupervised Trajectory Optimization for 3D Registration in Serial Section Electron Microscopy using Neural ODEs
Zhenbang Zhang · Jingtong Feng · Hongjia Li · Haythem El-Messiry · Zhiqiang Xu · Renmin Han
Serial Section Electron Microscopy (ssEM) has emerged as a pivotal technology for deciphering nanoscale biological architectures. Three-dimensional (3D) registration is a critical step in ssEM, tasked with rectifying axial misalignments and nonlinear distortions introduced during serial sectioning. The core scientific challenge lies in achieving distortion mitigation without erasing the natural morphological deformations of biological tissues, thereby enabling faithful reconstruction of 3D ultrastructural organization. In this study, we present a paradigm-shifting optimization framework that rethinks 3D registration through the lens of manifold trajectory optimization. We propose the first continuous trajectory dynamics formulation for 3D registration and introduce a novel optimization strategy. Specifically, we introduce a dual optimization objective that inherently balances global trajectory smoothness with local structural preservation, while developing a solver that combines Gauss-Seidel iteration with Neural ODEs to systematically integrate biophysical priors with data-driven deformation compensation. A key strength of our method is its fully unsupervised training, which avoids reliance on ground truth and suits ssEM scenarios where annotations are difficult to obtain. Extensive experiments on multiple datasets spanning diverse tissue types demonstrate our method's superior performance in structural restoration accuracy and cross-tissue robustness.
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
Owen Queen · Harrison Zhang · James Zou
Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.
Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA
Mingyu Huang · Shasha Zhou · Ke Li
Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from diverse modalities (DNA, RNA, protein, and beyond), accommodating datasets up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and the respective advantages of different models. Additionally, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the code and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.
PIPE: Physics-Informed Position Encoding for Alignment of Satellite Images and Time Series in Typhoon Forecasting
Haobo Li · Eunseo Jung · Zixin CHEN · Zhaowei Wang · Yueya WANG · Huamin Qu · Alexis Lau
Multimodal time series forecasting is foundational in various fields, such as utilizing satellite imagery and numerical data for predicting typhoons in climate science. However, existing multimodal approaches primarily focus on utilizing text data to help time series forecasting, leaving the visual data in existing time series datasets underexplored. Furthermore, it is challenging for models to effectively capture the physical information embedded in visual data, such as satellite imagery's temporal and geospatial context, which extends beyond the images themselves. To address this gap, we propose physics-informed positional encoding (PIPE), a lightweight method that embeds physical information into vision language models (VLMs). PIPE introduces two key innovations: (1) a physics-informed positional indexing scheme for mapping physics to positional IDs, and (2) a variant-frequency positional encoding mechanism for encoding the frequency information of physical variables and the sequential order of tokens within the embedding space. By preserving both physical information and sequential order, PIPE significantly improves multimodal alignment and forecasting accuracy. Through experiments on the largest and most representative open-source satellite image dataset, PIPE achieves state-of-the-art performance against both deep learning and climate-domain forecasting methods, demonstrating superiority across benchmarks, including a 12\% improvement in typhoon intensity forecasting over prior work.
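The variant-frequency positional encoding described above can be sketched as a sinusoidal code with a per-variable frequency scale; the `variant_frequency_encoding` helper and its `freq_scale` parameterization below are our assumptions for illustration, not the paper's exact scheme:

```python
import numpy as np

def variant_frequency_encoding(value, dim=8, base=10000.0, freq_scale=1.0):
    """Sinusoidal code for a scalar physical quantity (e.g. latitude or
    observation time). `freq_scale` gives each physical variable its own
    frequency band so distinct variables map to distinct codes."""
    i = np.arange(dim // 2)
    angles = freq_scale * value / base ** (2 * i / dim)
    return np.concatenate([np.sin(angles), np.cos(angles)])

# The same numeric value encodes differently for different variables.
lat_code = variant_frequency_encoding(23.5, freq_scale=1.0)
time_code = variant_frequency_encoding(23.5, freq_scale=4.0)
```

Codes of this form can be added to (or concatenated with) token embeddings, letting the VLM distinguish, say, a latitude of 23.5 degrees from a timestamp of 23.5 hours.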
Causal Climate Emulation with Bayesian Filtering
Sebastian H. M. Hickman · Ilija Trajković · Julia Kaltenborn · Francis Pelletier · Alex Archibald · Yaniv Gurwicz · Peer Nowack · David Rolnick · Julien Boussard
Traditional models of climate change use complex systems of coupled equations to simulate physical processes across the Earth system. These simulations are highly computationally expensive, limiting our predictions of climate change and analyses of its causes and effects. Machine learning has the potential to quickly emulate data from climate models, but current approaches are not able to incorporate physically-based causal relationships. Here, we develop an interpretable climate model emulator based on causal representation learning. We derive a novel approach including a Bayesian filter for stable long-term autoregressive emulation. We demonstrate that our emulator learns accurate climate dynamics, and we show the importance of each of its components on a realistic synthetic dataset and on data from two widely deployed climate models.
NUTS: Eddy-Robust Reconstruction of Surface Ocean Nutrients via Two-Scale Modeling
Hao Zheng · Shiyu Liang · Yuting Zheng · Chaofan Sun · LEI BAI · Enhui Liao
Reconstructing ocean surface nutrients from sparse observations is critical for understanding long-term biogeochemical cycles. Most prior work focuses on reconstructing atmospheric fields and treats the reconstruction problem as image inpainting, assuming smooth, single-scale dynamics. In contrast, nutrient transport follows advection–diffusion dynamics under nonstationary, multiscale ocean flow. This mismatch leads to instability, as small errors in unresolved eddies can propagate through time and distort nutrient predictions. To address this, we introduce NUTS, a two-scale reconstruction model that decouples large-scale transport and mesoscale variability. The homogenized solver captures stable, coarse-scale advection under filtered flow. A refinement module then restores mesoscale detail conditioned on the residual eddy field. NUTS is stable, interpretable, and robust to mesoscale perturbations, with theoretical guarantees from homogenization theory. NUTS outperforms all data-driven baselines in global reconstruction and achieves site-wise accuracy comparable to numerical models. On real observations, NUTS reduces NRMSE by 79.9% for phosphate and 19.3% for nitrate over the best baseline. Ablation studies validate the effectiveness of each module.
OmniCast: A Masked Latent Diffusion Model for Weather Forecasting Across Time Scales
Tung Nguyen · Tuan Pham · Troy Arcomano · Rao Kotamarthi · Ian Foster · Sandeep Madireddy · Aditya Grover
Accurate weather forecasting across time scales is critical for anticipating and mitigating the impacts of climate change. Recent data-driven methods based on deep learning have achieved significant success in the medium range, but struggle at longer subseasonal-to-seasonal (S2S) horizons due to error accumulation in their autoregressive approach. In this work, we propose OmniCast, a scalable and skillful probabilistic model that unifies weather forecasting across timescales. OmniCast consists of two components: a VAE model that encodes raw weather data into a continuous, lower-dimensional latent space, and a diffusion-based transformer model that generates a sequence of future latent tokens given the initial conditioning tokens. During training, we mask random future tokens and train the transformer to estimate their distribution given conditioning and visible tokens using a per-token diffusion head. During inference, the transformer generates the full sequence of future tokens by iteratively unmasking random subsets of tokens. This joint sampling across space and time mitigates compounding errors from autoregressive approaches. The low-dimensional latent space enables modeling long sequences of future latent states, allowing the transformer to learn weather dynamics beyond initial conditions. OmniCast performs competitively with leading probabilistic methods at the medium-range timescale while being 10× to 20× faster, and achieves state-of-the-art performance at the subseasonal-to-seasonal scale across accuracy, physics-based, and probabilistic metrics. Furthermore, we demonstrate that OmniCast can generate stable rollouts up to 100 years ahead. Code and model checkpoints are available at https://github.com/tung-nd/omnicast.
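The iterative-unmasking sampler described above can be sketched generically: start from a fully masked sequence and fill in random subsets of positions over several rounds. The `predict` callback and NaN masking below are illustrative stand-ins for OmniCast's transformer and latent tokens, not its implementation:

```python
import numpy as np

def iterative_unmask(predict, seq_len, steps=4, seed=0):
    """Start fully masked (NaN) and, over `steps` rounds, fill in a
    random subset of still-masked positions with predictions that
    condition on everything visible so far."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, np.nan)
    order = rng.permutation(seq_len)
    for chunk in np.array_split(order, steps):
        values = predict(tokens)      # jointly predicts all positions
        tokens[chunk] = values[chunk]
    return tokens

# Toy "model": predicts each position's index regardless of visibility.
out = iterative_unmask(lambda t: np.arange(len(t), dtype=float), seq_len=10)
```

Because every round conditions jointly on all visible tokens across space and time, errors do not compound step by step the way they do in a strictly autoregressive rollout.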
Scalable and Cost-Efficient de Novo Template-Based Molecular Generation
Piotr Gaiński · Oussama Boussif · Andrei Rekesh · Dmytro Shevchuk · Ali Parviz · Mike Tyers · Robert Batey · Michał Koziarski
Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose Recursive Cost Guidance, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an Exploitation Penalty that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a Dynamic Library mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.
Sampling 3D Molecular Conformers with Diffusion Transformers
J. Thorben Frank · Winfried Ripken · Gregor Lied · Klaus-Robert Müller · Oliver Unke · Stefan Chmiela
Diffusion Transformers (DiTs) have demonstrated strong performance in generative modeling, particularly in image synthesis, making them a compelling choice for molecular conformer generation. However, applying DiTs to molecules introduces novel challenges, such as integrating discrete molecular graph information with continuous 3D geometry, handling Euclidean symmetries, and designing conditioning mechanisms that generalize across molecules of varying sizes and structures. We propose DiTMC, a framework that adapts DiTs to address these challenges through a modular architecture that separates the processing of 3D coordinates from conditioning on atomic connectivity. To this end, we introduce two complementary graph-based conditioning strategies that integrate seamlessly with the DiT architecture. These are combined with different attention mechanisms, including both standard non-equivariant and SO(3)-equivariant formulations, enabling flexible control over the trade-off between accuracy and computational efficiency. Experiments on standard conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL) demonstrate that DiTMC achieves state-of-the-art precision and physical validity. Our results highlight how architectural choices and symmetry priors affect sample quality and efficiency, suggesting promising directions for large-scale generative modeling of molecular structures. Code is available at https://github.com/ML4MolSim/dit_mc.
Flash Invariant Point Attention
Andrew Liu · Axel Elaldi · Nicholas Franklin · Nathan Russell · Gurinder Atwal · Yih-En Ban · Olivia Viessmann
Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at https://github.com/flagshippioneering/flash_ipa.
Flexible MOF Generation with Torsion-Aware Flow Matching
Nayoung Kim · Seongsu Kim · Sungsoo Ahn
Designing metal-organic frameworks (MOFs) with novel chemistries is a longstanding challenge due to their large combinatorial space and the complex 3D arrangements of their building blocks. While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known local 3D coordinates of the building blocks. These assumptions limit their ability to (1) design novel MOFs and (2) generate structures from novel building blocks. We propose a two-stage MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train a SMILES-based autoregressive model to generate metal and organic building blocks, paired with a cheminformatics toolkit for 3D structure initialization. Second, we introduce a flow matching model that predicts translations, rotations, and torsional angles to assemble the blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability to create novel building blocks. Our code is available at https://github.com/nayoung10/MOFFlow-2.
RiboFlow: Conditional De Novo RNA Co-Design via Synergistic Flow Matching
Runze Ma · Zhongyue Zhang · Zichen Wang · Chenqing Hua · Jiahua Rao · Zhuomin Zhou · Shuangjia Zheng
Ribonucleic acid (RNA) binds to molecules to achieve specific biological functions. While generative models are advancing biomolecule design, existing methods for designing RNA that target specific ligands face limitations in capturing RNA’s conformational flexibility, ensuring structural validity, and overcoming data scarcity. To address these challenges, we introduce RiboFlow, a synergistic flow matching model to co-design RNA structures and sequences based on target molecules. By integrating RNA backbone frames, torsion angles, and sequence features in a unified architecture, RiboFlow explicitly models RNA’s dynamic conformations while enforcing sequence-structure consistency to improve validity. Additionally, we curate RiboBind, a large-scale dataset of RNA-molecule interactions, to resolve the scarcity of high-quality structural data. Extensive experiments reveal that RiboFlow not only outperforms state-of-the-art RNA design methods by a large margin but also showcases controllable capabilities for achieving high binding affinity to target ligands. Our work bridges critical gaps in controllable RNA design, offering a framework for structure-aware, data-efficient generation.
Controllable 3D Molecular Generation for Structure-Based Drug Design Through Bayesian Flow Networks and Gradient Integration
Seungyeon Choi · Hwanhee Kim · Chihyun Park · Dahyeon Lee · Seungyong Lee · Yoonju Kim · Hyoungjoon Park · Sein Kwon · Youngwan Jo · Sanghyun Park
Recent advances in Structure-based Drug Design (SBDD) have leveraged generative models for 3D molecular generation, predominantly evaluating model performance by binding affinity to target proteins. However, practical drug discovery necessitates high binding affinity along with synthetic feasibility and selectivity, critical properties that were largely neglected in previous evaluations. To address this gap, we identify fundamental limitations of conventional diffusion-based generative models in effectively guiding molecule generation toward these diverse pharmacological properties. We propose $\texttt{CByG}$, a novel framework extending Bayesian Flow Networks into a gradient-based conditional generative model that robustly integrates property-specific guidance. Additionally, we introduce a comprehensive evaluation scheme incorporating practical benchmarks for binding affinity, synthetic feasibility, and selectivity, overcoming the limitations of conventional evaluation methods. Extensive experiments demonstrate that our proposed $\texttt{CByG}$ framework significantly outperforms baseline models across multiple essential evaluation criteria, highlighting its effectiveness and practicality for real-world drug discovery applications.
Ambient Proteins - Training Diffusion Models on Noisy Structures
Giannis Daras · Jeffrey Ouyang-Zhang · Krithika Ravishankar · Constantinos Daskalakis · Adam Klivans · Daniel Diaz
We present Ambient Protein Diffusion, a framework for training protein diffusion models that generates structures with unprecedented diversity and quality. State-of-the-art generative models are trained on computationally derived structures from AlphaFold2 (AF), as experimentally determined structures are relatively scarce. The resulting models are therefore limited by the quality of synthetic datasets. Since the accuracy of AF predictions degrades with increasing protein length and complexity, de novo generation of long, complex proteins remains challenging. Ambient Protein Diffusion overcomes this problem by treating low-confidence AF structures as corrupted data. Rather than simply filtering out low-quality AF structures, our method adjusts the diffusion objective for each structure based on its corruption level, allowing the model to learn from both high and low quality structures. Empirically, ambient protein diffusion yields major improvements: on proteins with 700 residues, diversity increases from 45% to 85% over the previous state-of-the-art, and designability improves from 70% to 88%.
E2Former: An Efficient and Equivariant Transformer with Linear-Scaling Tensor Products
Yunyang Li · Lin Huang · Zhihao Ding · Xinran Wei · Chu Wang · Han Yang · Zun Wang · Chang Liu · Yu Shi · Peiran Jin · Tao Qin · Mark Gerstein · Jia Zhang
Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them almost impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates a Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, Wigner $6j$ Conv reduces the complexity from $O(| \mathcal{E}|)$ to $O(| \mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x–30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable molecular modeling.
SAINT: Sequence-Aware Integration for Spatial Transcriptomics Multi-View Clustering
Zeyu Zhu · KE LIANG · Lingyuan Meng · Meng Liu · Suyuan Liu · Renxiang Guan · Miaomiao Li · Wanwei Liu · Xinwang Liu
Spatial transcriptomics (ST) technologies provide gene expression measurements with spatial resolution, enabling the dissection of tissue structure and function. A fundamental challenge in ST analysis is clustering spatial spots into coherent functional regions. While existing models effectively integrate expression and spatial signals, they largely overlook sequence-level biological priors encoded in the DNA sequences of expressed genes. To bridge this gap, we propose SAINT (Sequence-Aware Integration for Nucleotide-informed Transcriptomics), a unified framework that augments spatial representation learning with nucleotide-derived features. We construct sequence-augmented datasets across 14 tissue sections from three widely used ST benchmarks (DLPFC, HBC, and MBA), retrieving reference DNA sequences for each expressed gene and encoding them using a pretrained Nucleotide Transformer. For each spot, gene-level embeddings are aggregated via expression-weighted and attention-based pooling, then fused with spatial-expression representations through a late fusion module. Extensive experiments across multiple datasets demonstrate that SAINT consistently improves clustering performance and validate the effectiveness, sensitivity, and transferability of our framework, confirming the complementary value of incorporating sequence-level priors into spatial transcriptomics clustering.
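As background, the expression-weighted pooling the SAINT abstract mentions is a standard aggregation: each gene's embedding is weighted by its normalized expression in the spot. A minimal sketch with toy dimensions (function name and data are ours, not the authors' code):

```python
import numpy as np

def expression_weighted_pooling(gene_embeddings, expression):
    """Aggregate per-gene embeddings (G, D) into one spot embedding (D,),
    weighting each gene by its normalized expression level."""
    weights = expression / expression.sum()  # normalize weights to sum to 1
    return weights @ gene_embeddings         # (G,) @ (G, D) -> (D,)

# Toy example: 3 genes with 4-dimensional embeddings
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
expr = np.array([2.0, 1.0, 1.0])  # gene 0 expressed twice as strongly
spot = expression_weighted_pooling(emb, expr)  # -> [0.5, 0.25, 0.25, 0.0]
```

The attention-based variant described in the abstract would replace the fixed expression weights with learned attention scores.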
Variational Regularized Unbalanced Optimal Transport: Single Network, Least Action
Yuhao Sun · Zhenyi Zhang · Zihan Wang · Tiejun Li · Peijie Zhou
Recovering the dynamics from a few snapshots of a high-dimensional system is a challenging task in statistical physics and machine learning, with important applications in computational biology. Many algorithms have been developed to tackle this problem, based on frameworks such as optimal transport and the Schrödinger bridge. A notable recent framework is Regularized Unbalanced Optimal Transport (RUOT), which integrates both stochastic dynamics and unnormalized distributions. However, since many existing methods do not explicitly enforce optimality conditions, their solutions often struggle to satisfy the principle of least action and to converge in a stable and reliable way. To address these issues, we propose Variational RUOT (Var-RUOT), a new framework to solve the RUOT problem. By incorporating the necessary optimality conditions of the RUOT problem into both the parameterization of the search space and the loss function design, Var-RUOT only needs to learn a scalar field to solve the RUOT problem and can search for solutions with lower action. We also examine the challenge of selecting a growth penalty function in the widely used Wasserstein-Fisher-Rao metric and propose a solution that better aligns with biological priors in Var-RUOT. We validate the effectiveness of Var-RUOT on both simulated data and real single-cell datasets. Compared with existing algorithms, Var-RUOT can find solutions with lower action while exhibiting faster convergence and improved training stability. Our code is available at https://github.com/ZerooVector/VarRUOT.
Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL
Xingyu Chen · Shihao Ma · Runsheng Lin · Jiecong Lin · Bo Wang
Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.
HiPoNet: A Multi-View Simplicial Complex Network for High-Dimensional Point-Cloud and Single-Cell Data
Siddharth Viswanath · Hiren Madhu · Dhananjay Bhaskar · Jake Kovalic · Dave Johnson · Christopher Tape · Ian Adelstein · Rex Ying · Michael Perlmutter · Smita Krishnaswamy
In this paper, we propose HiPoNet, an end-to-end differentiable neural network for regression, classification, and representation learning on high-dimensional point clouds. Our work is motivated by single-cell data, which can have very high dimensionality, exceeding the capabilities of existing methods for point clouds, which are mostly tailored for 3D data. Moreover, modern single-cell and spatial experiments now yield entire cohorts of datasets (i.e., one dataset for every patient), necessitating models that can process large, high-dimensional point-clouds at scale. Most current approaches build a single nearest-neighbor graph, discarding important geometric and topological information. In contrast, HiPoNet models the point-cloud as a set of higher-order simplicial complexes, with each particular complex being created using a reweighting of features. This method thus generates multiple constructs corresponding to different views of high-dimensional data, which in biology offers the possibility of disentangling distinct cellular processes. It then employs simplicial wavelet transforms to extract multiscale features, capturing both local and global topology from each view. We show that geometric and topological information is preserved in this framework both theoretically and empirically. We showcase the utility of HiPoNet on point-cloud level tasks, involving classification and regression of entire point-clouds in data cohorts. Experimentally, we find that HiPoNet outperforms other point-cloud and graph-based models on single-cell data. We also apply HiPoNet to spatial transcriptomics datasets using spatial coordinates as one of the views. Overall, HiPoNet offers a robust and scalable solution for high-dimensional data analysis.
Securing the Language of Life: Inheritable Watermarks from DNA Language Models to Proteins
ZAIXI ZHANG · Ruofan Jin · Le Cong · Mengdi Wang
DNA language models have revolutionized our ability to design and manipulate DNA sequences—the fundamental language of life—with unprecedented precision, enabling transformative applications in therapeutics, synthetic biology, and gene editing. However, this capability also poses significant dual-use risks, including the potential creation of harmful biological agents. To address these biosecurity challenges, we introduce two innovative watermarking techniques: DNAMark and CentralMark. DNAMark employs synonymous codon substitutions to embed robust watermarks in DNA sequences while preserving the function of encoded proteins. CentralMark advances this by creating inheritable watermarks that transfer from DNA to translated proteins, leveraging protein embeddings to ensure detection across the central dogma. Both methods utilize state-of-the-art embeddings to generate watermark logits, enhancing resilience against natural mutations, synthesis errors, and adversarial attacks. Evaluated on a therapeutic DNA benchmark, DNAMark and CentralMark achieve F1 detection scores above 0.85 under diverse conditions, while maintaining over 60\% sequence similarity to ground truth and degeneracy scores below 15\%. A case study on a CRISPR-Cas9 system underscores CentralMark’s utility in real-world synthetic biology. This work establishes a vital framework for securing DNA language models, balancing innovation with accountability to mitigate biosecurity risks.
Learning Crossmodal Interaction Patterns via Attributed Bipartite Graphs for Single-Cell Omics
Xiaotang Wang · Xuanwei Lin · Yun Zhu · Hao Li · Yongqi Zhang
Crossmodal matching in single-cell omics is essential for explaining biological regulatory mechanisms and enhancing downstream analyses. However, current single-cell crossmodal models often suffer from three limitations: sparse modality signals, underutilization of biological attributes, and insufficient modeling of regulatory interactions. These challenges hinder generalization in data-scarce settings and restrict the ability to uncover fine-grained biologically meaningful crossmodal relationships. Here, we present a novel framework which reformulates crossmodal matching as a graph classification task on Attributed Bipartite Graphs (ABGs). It models single-cell ATAC-RNA data as an ABG, where each expressed ATAC and RNA feature is treated as a distinct node with a unique ID and biological features. To model crossmodal interaction patterns on the constructed ABG, we propose $\text{Bi}^2\text{Former}$, a **bi**ologically-driven **bi**partite graph trans**former** that learns interpretable attention over ATAC–RNA pairs. This design enables the model to effectively learn and explain biological regulatory relationships between ATAC and RNA modalities. Extensive experiments demonstrate that $\text{Bi}^2\text{Former}$ achieves state-of-the-art performance in crossmodal matching across diverse datasets, remains robust under sparse training data, generalizes to unseen cell types and datasets, and reveals biologically meaningful regulatory patterns. This work pioneers an ABG-based approach for single-cell crossmodal matching, offering a powerful framework for uncovering regulatory interactions at the single-cell omics level. Our code is available at: https://github.com/wangxiaotang0906/Bi2Former.
ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data
Yifeng Jiao · Yuchen Liu · Yu Zhang · Xin Guo · Yushuai Wu · Chen Jiang · Jiyang Li · Hongwei Zhang · LIMEI HAN · Xin Gao · Yuan Qi · Yuan Cheng
The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome. The implementation of ChromFound is available via https://github.com/JohnsonKlose/ChromFound.
GeneFlow: Translation of Single-cell Gene Expression to Histopathological Images via Rectified Flow
Mengbo Wang · Shourya Verma · Aditya Malusare · Luopin Wang · Yiyang Lu · Vaneet Aggarwal · Mario Sola · Ananth Grama · Nadia Lanman
Spatial transcriptomics technologies can be used to align transcriptomes with histopathological morphology, presenting exciting new opportunities for biomolecular discovery. Using spatial transcriptomic gene expression and corresponding histology data, we construct a novel framework, GeneFlow, to map single- and multi-cell gene expression onto paired cellular images. By combining an attention-based RNA encoder with a conditional UNet guided by rectified flow, we generate high-resolution images with different staining methods (e.g., H\&E, DAPI) to highlight various cellular/tissue structures. Rectified flow with high-order ODE solvers creates a continuous, bijective mapping between expression and image manifolds, addressing the many-to-one relationship inherent in this problem. Our method enables the generation of realistic cellular morphology features and spatially resolved intercellular interactions under genetic or chemical perturbations. This enables minimally invasive disease diagnosis by revealing dysregulated patterns in imaging phenotypes. Our rectified-flow-based method outperforms diffusion methods and baselines in all experiments. Code is available at https://github.com/wangmengbo/GeneFlow.
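As background, sampling from a rectified-flow model amounts to integrating a learned velocity field along (near-)straight paths from noise to data. A generic fixed-step Euler sketch (not GeneFlow's actual model; the idealized velocity field below stands in for a trained network):

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.
    `velocity` stands in for a trained rectified-flow network; the paper
    uses higher-order ODE solvers, but the principle is the same."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one Euler step along the flow
    return x

# With the ideal straight-line field v(x, t) = x1 - x0, which is what
# rectified flow regresses toward, Euler integration recovers the target.
x0 = np.zeros(4)                        # "noise" endpoint
x1 = np.array([1.0, -2.0, 0.5, 3.0])    # "data" endpoint
out = euler_sample(lambda x, t: x1 - x0, x0)  # -> x1 exactly
```

Straighter learned trajectories are precisely what lets rectified flow get away with few, large ODE steps at sampling time.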
RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
YANG SONGXIAO · Haolin Wang · Yao Fu · Ye Tian · Tamostu Kamishima · Masayuki Ikebe · Yafei Ou · Masatoshi Okutomi
Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophytes, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi-task dataset for wrist bones in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 1048 wrist conventional radiographs of 388 patients from six medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain. Benchmark \& Code: https://github.com/YSongxiao/RAM-W600 Data \& Dataset Card: https://huggingface.co/datasets/TokyoTechMagicYang/RAM-W600
Democratizing Clinical Risk Prediction with Cross-Cohort Cross-Modal Knowledge Transfer
Qiannan Zhang · Manqi Zhou · Zilong Bai · Chang Su · Fei Wang
Clinical risk prediction plays a crucial role in early disease detection and personalized intervention. While recent models increasingly incorporate multimodal data, their development typically assumes access to large-scale, multimodal datasets and substantial computational resources. In practice, however, most clinical sites operate under resource constraints, with access limited to EHR data alone and insufficient capacity to train complicated models. This gap highlights the urgent need to democratize clinical risk prediction by enabling effective deployment in data- and resource-limited local clinical settings. In this work, we propose a cross-cohort cross-modal knowledge transfer framework that leverages the multimodal model trained on a nationwide cohort and adapts it to local cohorts with only EHR data. We focus on EHR and genetic data as representative multimodal inputs and address two key challenges. First, to mitigate the influence of noisy or less informative biological signals, we propose a novel mixture-of-aggregations design to enhance the modeling of informative and relevant genetic features. Second, to support rapid model adaptation in low-resource sites, we develop a lightweight graph-guided fine-tuning method that adapts pretrained phenotypical EHR representations to target cohorts using limited patient data. Extensive experiments on real-world clinical data validate the effectiveness of our proposed model.
Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback
Janet Wang · Yunbei Zhang · Zhengming Ding · Jihun Hamm
The paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability cannot be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, deteriorating the model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting. Beyond image synthesis, MAGIC illustrates a task-centric alignment paradigm: instead of adapting MLLMs to niche medical tasks, it adapts tasks to the evaluative strengths of general-purpose MLLMs by decomposing domain knowledge into attribute-level checklists. This design offers a scalable and reliable path for leveraging foundation models in specialized domains.
RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
Haolin Li · Tianjie Dai · Zhe Chen · Siyuan Du · Jiangchao Yao · Ya Zhang · Yanfeng Wang
Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and a dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the model with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.
Dual-Res Tandem Mamba-3D: Bilateral Breast Lesion Detection and Classification on Non-contrast Chest CT
Jiaheng Zhou · Wei Fang · Luyuan Xie · Yanfeng Zhou · Lianyan Xu · Minfeng Xu · Ge Yang · Yuxing Tang
Breast cancer remains a leading cause of death among women, with early detection significantly improving prognosis. Non-contrast computed tomography (NCCT) scans of the chest, routinely acquired for thoracic assessments, often capture the breast region incidentally, presenting an underexplored opportunity for opportunistic breast lesion detection without additional imaging cost or radiation. However, the subtle appearance of lesions in NCCT and the difficulty of jointly modeling lesion detection and malignancy classification pose unique challenges. In this work, we propose Dual-Res Tandem Mamba-3D (DRT-M3D), a novel multitask framework for opportunistic breast cancer analysis on NCCT scans. DRT-M3D introduces a dual-resolution architecture, which captures fine-grained spatial details for segmentation-based lesion detection and global contextual features for breast-level cancer classification. It further incorporates a tandem input mechanism that models bilateral breast regions jointly through Mamba-3D blocks, enabling cross-breast feature interaction by leveraging subtle asymmetries between the two sides. Our approach achieves state-of-the-art performance in both tasks across multi-institutional NCCT datasets spanning four medical centers. Extensive experiments and ablation studies validate the effectiveness of each key component.
Timely Clinical Diagnosis through Active Test Selection
Silas Ruhrberg Estévez · Nicolás Astorga · Mihaela van der Schaar
There is growing interest in using machine learning (ML) to support clinical diagnosis, but most approaches rely on static, fully observed datasets and fail to reflect the sequential, resource-aware reasoning clinicians use in practice. Diagnosis remains complex and error-prone, especially in high-pressure or resource-limited settings, underscoring the need for frameworks that help clinicians make timely and cost-effective decisions. We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design), a diagnostic framework that integrates Bayesian Experimental Design (BED) with large language models (LLMs) to better emulate real-world diagnostic reasoning. At each step, ACTMED selects the test expected to yield the greatest reduction in diagnostic uncertainty for a given patient. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. Clinicians can remain in the loop, reviewing test suggestions, interpreting intermediate outputs, and applying clinical judgment throughout. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use. This represents a step toward transparent, adaptive, and clinician-aligned diagnostic systems that generalize across settings with reduced reliance on domain-specific data.
Towards Generalizable Retina Vessel Segmentation with Deformable Graph Priors
Ke Liu · Shangde Gao · Yichao Fu · Shangqi Gao
Retinal vessel segmentation is critical for medical diagnosis, yet existing models often struggle to generalize across domains due to appearance variability, limited annotations, and complex vascular morphology. We propose GraphSeg, a variational Bayesian framework that integrates anatomical graph priors with structure-aware image decomposition to enhance cross-domain segmentation. GraphSeg factorizes retinal images into structure-preserved and structure-degraded components, enabling domain-invariant representation. A deformable graph prior, derived from a statistical retinal atlas, is incorporated via a differentiable alignment and guided by an unsupervised energy function. Experiments on three public benchmarks (CHASE, DRIVE, HRF) show that GraphSeg consistently outperforms existing methods under domain shifts. These results highlight the importance of jointly modeling anatomical topology and image structure for robust, generalizable vessel segmentation.
EuroSpeech: A Multilingual Speech Corpus
Samuel Pfisterer · Florian Grötschla · Luca Lanzendörfer · Florian Yan · Roger Wattenhofer
Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for each language, leading models trained on these datasets to exhibit poor performance on most supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8\% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Mingyang Chen · Linzhuang Sun · Tianpeng Li · Haoze Sun · ZhouYijie · Chenzheng Zhu · Haofen Wang · Jeff Pan · Wen Zhang · Huajun Chen · Fan Yang · Zenan Zhou · weipeng chen
Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
MoonCast: High-Quality Zero-Shot Podcast Generation
Zeqian Ju · Dongchao Yang · Shen Kai · YICHONG LENG · Zhengtao Wang · Songxiang Liu · Xinyu Zhou · Tao Qin · Xiangyang Li · Jianwei Yu · Xu Tan
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize spontaneous podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To enable long audio generation, we employ a language model with parameter, data, and context scaling to process sequences in an innovative format designed for modeling entire multi-speaker, multi-turn speech interactions. To enhance spontaneity, we observe that ASR transcripts capture spontaneous speech details (e.g., filler words indicating hesitations, and specific punctuation and spaces reflecting breathing pauses), suggesting that these transcripts can serve as a partial indicator of speech spontaneity. Building upon this assumption, we utilize a script generation module to generate scripts incorporating these spontaneous elements. Experiments show MoonCast outperforms baselines, with notable improvements in contextual coherence and spontaneity.
ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
Zeyu Shen · Basileal Imana · Tong Wu · Chong Xiang · Prateek Mittal · Aleksandra Korolova
Retrieval-Augmented Generation (RAG) enhances Large Language Models by grounding their outputs in external documents. These systems, however, remain vulnerable to attacks on the retrieval corpus, such as prompt injection. RAG-based search systems (e.g., Google’s Search AI Overview) present an interesting setting for studying and protecting against such threats, as defense algorithms can benefit from built-in reliability signals—like document ranking—and represent a non-LLM challenge for the adversary due to decades of work to thwart SEO. Motivated by, but not limited to, this scenario, this work introduces ReliabilityRAG, a framework for adversarial robustness that explicitly leverages reliability information of retrieved documents. Our first contribution adopts a graph-theoretic perspective to identify a ``consistent majority'' among retrieved documents to filter out malicious ones. We introduce a novel algorithm based on finding a Maximum Independent Set (MIS) on a document graph where edges encode contradiction. Our MIS variant explicitly prioritizes higher-reliability documents and provides provable robustness guarantees against bounded adversarial corruption under natural assumptions. Recognizing the computational cost of exact MIS for large retrieval sets, our second contribution is a scalable weighted sample and aggregate framework. It explicitly utilizes reliability information, preserving some robustness guarantees while efficiently handling many documents. We present empirical results showing ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
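The "consistent majority" filtering can be illustrated with a greedy sketch (hypothetical code, not the paper's exact MIS variant or its robustness guarantees): visit documents from most to least reliable, and keep each one only if it contradicts none of the documents already kept.

```python
def consistent_majority(docs, reliability, contradicts):
    """Greedy sketch: keep a mutually consistent, high-reliability subset.

    docs: list of documents; reliability: score per document (higher is
    more reliable); contradicts(d1, d2): True if the two documents
    contradict each other (an edge in the contradiction graph).
    """
    # Visit documents from most to least reliable.
    order = sorted(range(len(docs)), key=lambda i: -reliability[i])
    kept = []
    for i in order:
        # Keep a document only if it is consistent with everything kept so far.
        if all(not contradicts(docs[i], docs[j]) for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]
```

Prioritizing reliability first mirrors the paper's use of ranking signals: a low-ranked injected document that contradicts higher-ranked ones is dropped rather than the other way around.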
Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior
Yue Gong · Raul Fernandez
As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships—correlations, trends, causal links—but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask—given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the observed correlation value, then that correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model’s raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs, where it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting the Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
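As a rough sketch of the idea (illustrative only; the paper's elicitation and calibration procedure differs in detail), one can treat LLM logits over discretized correlation bins as an unnormalized distribution, temperature-scale them into a calibrated predictive distribution, and read off a point estimate and a 95% credible interval:

```python
import math

def calibrated_prior(logits, bin_centers, temperature=2.0):
    """Sketch: logits over correlation bins -> (mean, 95% credible interval).

    logits: one raw logit per bin; bin_centers: the correlation value each
    bin represents, in ascending order. Temperature scaling stands in for
    the calibration step.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exp = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    z = sum(exp)
    probs = [e / z for e in exp]
    mean = sum(p * c for p, c in zip(probs, bin_centers))
    # Read the 2.5th and 97.5th percentiles off the discrete CDF.
    cdf, lo, hi = 0.0, bin_centers[0], bin_centers[-1]
    for p, c in zip(probs, bin_centers):
        if cdf < 0.025 <= cdf + p:
            lo = c
        if cdf < 0.975 <= cdf + p:
            hi = c
        cdf += p
    return mean, (lo, hi)
```

Uniform logits yield a mean near zero with a wide interval, while a logit mass concentrated on one bin pulls the mean toward that bin's correlation value.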
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
Rui Li · Zeyu Zhang · Xiaohe Bo · Zihang Tian · Xu Chen · Quanyu Dai · Zhenhua Dong · Ruiming Tang
Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory---structured schemata, flexible assimilation, and dynamic accommodation. This blueprint forges a clear path toward a more robust and efficient memory system for LLM-based reading comprehension. To this end, we develop CAM, a prototype implementation of Constructivist Agentic Memory that simultaneously embodies structurality, flexibility, and dynamicity. At its core, CAM is endowed with an incremental overlapping clustering algorithm for structured memory development, supporting both coherent hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information for contextual response, akin to the human associative process. Compared to existing approaches, our design demonstrates dual advantages in both performance and efficiency across diverse long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Xiaoxi Li · Jiajie Jin · Guanting Dong · Hongjin Qian · Yongkang Wu · Ji-Rong Wen · Yutao Zhu · Zhicheng Dou
Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate among web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue · Zhiqi Chen · Rui Lu · Andrew Zhao · Zhaokai Wang · Yang Yue · Shiji Song · Gao Huang
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly in mathematics and programming tasks. It is widely believed that, similar to how traditional RL helps agents to explore and learn new strategies, RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, we take a critical look at \textit{the current state of RLVR} by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across diverse model families, RL algorithms, and math/coding/visual reasoning benchmarks, using pass@\textit{k} at large \textit{k} values as the evaluation metric. While RLVR improves sampling efficiency towards the correct path, we surprisingly find that current training does \emph{not} elicit fundamentally new reasoning patterns. We observe that while RLVR-trained models outperform their base models at smaller values of $k$ (e.g., $k=1$), base models achieve higher pass@$k$ scores when $k$ is large. Moreover, we observe that the reasoning capability boundary of LLMs often narrows as RLVR training progresses. Further coverage and perplexity analysis shows that the reasoning paths generated by RLVR models are already included in the base models' sampling distribution, suggesting that their reasoning abilities originate from and are \textit{bounded} by the base model. From this perspective, treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in fully leveraging the potential of the base model. In contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model’s reasoning capabilities.
Taken together, our findings suggest that current RLVR methods have not fully realized the potential of RL to elicit genuinely novel reasoning abilities in LLMs. This underscores the need for improved RL paradigms—such as continual scaling and multi-turn agent-environment interaction—to unlock this potential.
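The pass@$k$ metric central to this analysis is conventionally computed with the unbiased estimator of Chen et al. (2021): draw $n$ samples per problem, count the $c$ correct ones, and estimate the probability that at least one of $k$ draws succeeds.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

At large $k$ the estimator approaches 1 whenever even a few of the $n$ samples are correct, which is why it probes the boundary of what a model can reach at all rather than what it reaches on the first try.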
Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
Hongjin Qian · Zheng Liu
Augmenting large language models (LLMs) with external retrieval has become a standard method to address their inherent knowledge cutoff limitations. However, traditional retrieval-augmented generation methods employ static, pre-inference retrieval strategies, making them inadequate for complex tasks involving ambiguous, multi-step, or evolving information needs. Recent advances in test-time scaling techniques have demonstrated significant potential in enabling LLMs to dynamically interact with external tools, motivating the shift toward adaptive inference-time retrieval. Inspired by Information Foraging Theory (IFT), we propose InForage, a reinforcement learning framework that formalizes retrieval-augmented reasoning as a dynamic information-seeking process. Unlike existing approaches, InForage explicitly rewards intermediate retrieval quality, encouraging LLMs to iteratively gather and integrate information through adaptive search behaviors. To facilitate training, we construct a human-guided dataset capturing iterative search and reasoning trajectories for complex, real-world web tasks. Extensive evaluations across general question answering, multi-hop reasoning tasks, and a newly developed real-time web QA dataset demonstrate InForage's superior performance over baseline methods. These results highlight InForage's effectiveness in building robust, adaptive, and efficient reasoning agents. We provide all codes and datasets in the supplementary materials.
Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection
Michelle Yuan · Khushbu Pahwa · Shuaichen Chang · Mustafa Kaba · MONICA SUNKARA · Jiarong Jiang · Xiaofei Ma · Yi Zhang
Designing effective agentic systems requires the seamless composition and integration of agents, tools, and models within dynamic and uncertain environments. Most existing methods rely on static, semantic retrieval approaches for tool or agent discovery. However, effective reuse and composition of existing components remain challenging due to incomplete capability descriptions and the limitations of retrieval methods. Component selection suffers when decisions are not grounded in capability, cost, and real-time utility. To address these challenges, we introduce a structured, automated framework for agentic system composition that is inspired by the knapsack problem. Our framework enables a composer agent to systematically identify, select, and assemble an optimal set of agentic components by jointly considering performance, budget constraints, and compatibility. By dynamically testing candidate components and modeling their utility in real time, our approach streamlines the assembly of agentic systems and facilitates scalable reuse of resources. Empirical evaluation with Claude 3.5 Sonnet across five benchmarking datasets shows that our online-knapsack-based composer consistently lies on the Pareto frontier, achieving higher success rates at significantly lower component costs compared to our baselines. In the single-agent setup, the online knapsack composer shows a success rate improvement of up to 31.6\% in comparison to the retrieval baselines. In multi-agent systems, the online knapsack composer increases success rate from 37\% to 87\% when agents are selected from an agent inventory of 100+ agents. The substantial performance gap confirms the robust adaptability of our method across diverse domains and budget constraints.
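As a toy illustration of the knapsack framing (not the paper's online algorithm), components can be ranked by utility per unit cost and selected greedily under a budget:

```python
def select_components(components, budget):
    """Greedy knapsack sketch over agentic components.

    components: list of (name, utility, cost) tuples, where utility is a
    measured success signal and cost is e.g. token or latency spend.
    Returns the names chosen without exceeding the budget.
    """
    chosen, spent = [], 0.0
    # Rank by utility per unit cost, best first.
    for name, utility, cost in sorted(components,
                                      key=lambda c: c[1] / c[2],
                                      reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen
```

The online setting the paper targets additionally has to estimate each component's utility on the fly by testing it, rather than assuming the (utility, cost) tuples are known up front.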
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett · Tyler Chang · Stella Biderman · Benjamin Bergen
The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. High token premiums reduce throughput during training and increase costs at inference. In this paper, we show that even after controlling for dataset size, vocabulary size, and data content, monolingual tokenizers exhibit a wide range of token premiums across languages. To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for a relationship between factors such as data similarity (between tokenizer training and evaluation), vocabulary size, and pre-tokenization. We also investigate the role of language-specific features such as writing system and word length. We find that similarity between training and test data does not impact token premiums, but vocabulary size and pre-tokenization do. While simply increasing vocabulary size does not lead to reduced token premium effects, we can determine an "optimal" vocabulary size for each language to achieve significantly reduced token premium effects. We also train superword tokenizers which allow merges over whitespaces, and we find that they both reduce token premium effects and improve compression overall. Thus, intervening on the vocabulary size or the pre-tokenizer significantly reduces crosslingual token premium effects.
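In its simplest form, a token premium can be measured as the ratio of token counts over parallel text relative to a reference language (a sketch; the paper controls for many more factors, such as vocabulary size and data content):

```python
def token_premium(tokenize_lang, tokenize_ref, parallel_pairs):
    """Ratio of a language's token count to a reference language's count
    on parallel text. tokenize_* are stand-ins for trained tokenizers;
    parallel_pairs is a list of (lang_sentence, ref_sentence) tuples.
    """
    lang_tokens = sum(len(tokenize_lang(s)) for s, _ in parallel_pairs)
    ref_tokens = sum(len(tokenize_ref(t)) for _, t in parallel_pairs)
    return lang_tokens / ref_tokens
```

A premium of 1.5 means the language needs 50% more tokens than the reference to encode the same content, translating directly into lower throughput and higher inference cost.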
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
Yunxiang Zhang · Muhammad Khalifa · Shitanshu Bhushan · Grant Murphy · Lajanugen Logeswaran · Jaekyeom Kim · Moontae Lee · Honglak Lee · Lu Wang
We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and agents' actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to continually grow with new ML competitions to encourage rigorous and objective evaluations of AI’s research capabilities. Our leaderboard and code are publicly available at https://huggingface.co/spaces/launch/MLRC_Bench.
A Controllable Examination for Long-Context Language Models
Yijun Yang · Zeyu Huang · Wenhao Zhu · Zihan Qiu · Fei Yuan · Jeff Pan · Ivan Titov
Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world applications (e.g., document summarization) and synthetic tasks (e.g., needle-in-a-haystack). Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks often involve complexity that makes interpretation challenging and suffer from data contamination, whereas synthetic tasks frequently lack meaningful coherence between the target information ("needle") and its surrounding context ("haystack"), undermining their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: 1) seamless context: coherent contextual integration between target information and its surrounding context; 2) controllable setting: an extensible task setup that enables controlled studies—for example, incorporating additional required abilities such as numerical reasoning; and 3) sound evaluation: avoiding LLM-as-Judge evaluation and conducting exact-match scoring to ensure deterministic and reproducible results. This study introduces $\textbf{LongBioBench}$, a benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of $\textit{understanding}$, $\textit{reasoning}$, and $\textit{trustworthiness}$.
Our experimental evaluation, which includes $\textbf{18}$ LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, render them unreliable tests of models' long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embeddings to accommodate extended context lengths, which in turn yields only marginal improvements in the model’s true capabilities. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.
Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs
Xingang Guo · Yaxin Li · XiangYi Kong · YILAN JIANG · Xiayu Zhao · Zhihua Gong · Yufan Zhang · Daixuan Li · Tianle Sang · Beixiao Zhu · Gregory Jun · Yingbing Huang · Yiqi Liu · Yuqi Xue · Rahul Dev Kundu · Qi Lim · Yizhou Zhao · Luke Granger · Mohamed Younis · Darioush Keivan · Nippun Sabharwal · Shreyanka Sinha · Prakhar Agarwal · Kojo Vandyck · Hanlin Mai · Zichen Wang · Aditya Venkatesh · Ayush Barik · Jiankun Yang · Chongying Yue · Jingjie He · Libin Wang · Licheng Xu · Hao Chen · Jinwen Wang · Liujun Xu · Rushabh Shetty · Ziheng Guo · Dahui Song · Manvi Jha · Weijie Liang · Weiman Yan · Bryan Zhang · Sahil Bhandary Karnoor · Jialiang Zhang · Rutva Pandya · Xinyi Gong · Mithesh Ganesh · Feize Shi · Ruiling Xu · Yifan Zhang · Yanfeng Ouyang · Lianhui Qin · Elyse Rosenbaum · Corey Snyder · Peter Seiler · Geir Dullerud · Xiaojia Zhang · Zuofu Cheng · Pavan Kumar Hanumolu · Jian Huang · Mayank Kulkarni · Mahdi Namazifar · Huan Zhang · Bin Hu
Modern engineering, spanning electrical, mechanical, aerospace, civil, and computer disciplines, stands as a cornerstone of human civilization and the foundation of our society. However, engineering design poses a fundamentally different challenge for large language models (LLMs) compared with traditional textbook-style problem solving or factual question answering. Although existing benchmarks have driven progress in areas such as language understanding, code synthesis, and scientific problem solving, real-world engineering design demands the synthesis of domain knowledge, navigation of complex trade-offs, and management of the tedious processes that consume much of practicing engineers' time. Despite these shared challenges across engineering disciplines, no benchmark currently captures the unique demands of engineering design work. In this work, we introduce EngDesign, an Engineering Design benchmark that evaluates LLMs' abilities to perform practical design tasks across nine engineering domains. Unlike existing benchmarks that focus on factual recall or question answering, EngDesign uniquely emphasizes LLMs' ability to synthesize domain knowledge, reason under constraints, and generate functional, objective-oriented engineering designs. Each task in EngDesign represents a real-world engineering design problem, accompanied by a detailed task description specifying design goals, constraints, and performance requirements. EngDesign pioneers a simulation-based evaluation paradigm that moves beyond textbook knowledge to assess genuine engineering design capabilities and shifts evaluation from static answer checking to dynamic, simulation-driven functional verification, marking a crucial step toward realizing the vision of engineering Artificial General Intelligence (AGI).
SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors
chen yang · Hui Wang · Shiyao Wang · Junyang Chen · Jiabei He · Jiaming Zhou · Xi Yang · Yequan Wang · Yonghua Lin · Yong Qin
While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group. Code is available at https://github.com/flageval-baai/SeniorTalk and data at https://huggingface.co/datasets/evan0617/seniortalk.
AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web
RUI CAO · Zifeng Ding · Zhijiang Guo · Michael Schlichtkrull · Andreas Vlachos
Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting the decomposed reasoning behind the verdict. We mitigate common challenges in fact-checking datasets, such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVerImaTeC via inter-annotator studies, achieving a $\kappa=0.742$ on verdicts and $74.7\%$ consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.
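The verdict agreement figure ($\kappa$) is a Cohen's kappa; for reference, the standard computation corrects raw agreement between two annotators for agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items the annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each annotator's
    # marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 0.742, as reported, indicates substantial agreement well above chance on the verdict labels.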
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Boyu Gou · Zanming Huang · Yuting Ning · Yu Gu · Michael Lin · Weijian Qi · Andrei Kopanev · Botao Yu · Bernal Jimenez Gutierrez · Yiheng Shu · Chan Hee (Luke) Song · Jiaman Wu · Shijie Chen · Hanane Moussa · TIANSHU ZHANG · Jian Xie · Yifei Li · Tianci Xue · Zeyi Liao · Kai Zhang · Boyuan Zheng · Zhaowei Cai · Viktor Rozgic · Morteza Ziyadi · Huan Sun · Yu Su
Agentic search, such as Deep Research systems in which agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
CoreaSpeech: Korean Speech Corpus via JAMO-based Coreset Selection for Efficient and Robust Korean Speech Generation
Ki-Joong Kwon · Junho So · Sang-Hoon Lee
While substantial advances have been achieved in TTS for languages such as English and Mandarin, Korean remains comparatively underrepresented due to the lack of rigorous preprocessing methods, systematically constructed datasets, a shortage of standardized Korean TTS benchmarks, and explicitly optimized models for Korean. To address these limitations, we propose a Korean-tailored data-refinement and coreset selection pipeline. It refines speech data and performs textual normalization especially for numerals and English terms, followed by a novel coreset selection strategy that leverages Jamo-based linguistic and phonological features unique to Korean. As a result, we release CoreaSpeech, an efficient and robust Korean speech corpus comprising 700 hours across 21,449 speakers. This refined core subset, evenly balanced across utterances ranging from 0 to 30 seconds, is derived from 2,058 hours of widely used Korean datasets. Building on this, we conducted extensive experiments via cross-lingual fine-tuning with our CoreaSpeech dataset. Furthermore, we introduce a new universal Korean TTS benchmark dataset including clean, noisy, and numeric subsets. Additionally, we demonstrate that our Korean-specific text normalization serves as a plug-and-play module, reliably improving performance regardless of the underlying TTS architecture. We publicly release our dataset, pipeline code, and evaluation benchmarks to support reproducible research and further advances in Korean and multilingual speech synthesis.
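Jamo-level features of the kind the coreset selection draws on can be extracted from Hangul syllables with the standard Unicode decomposition arithmetic (this sketch shows only the decomposition itself, not the paper's full feature pipeline): each precomposed syllable in U+AC00..U+D7A3 encodes an initial consonant, a medial vowel, and an optional final consonant.

```python
def to_jamo(text):
    """Decompose precomposed Hangul syllables into (initial, medial, final)
    Jamo indices via standard Unicode arithmetic; other characters pass
    through unchanged. There are 19 initials x 21 medials x 28 finals
    (final index 0 means no final consonant).
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:          # precomposed Hangul block
            offset = code - 0xAC00
            initial = offset // (21 * 28)
            medial = (offset % (21 * 28)) // 28
            final = offset % 28
            out.append((initial, medial, final))
        else:
            out.append(ch)
    return out
```

For example, the syllable 한 decomposes into initial ㅎ (index 18), medial ㅏ (index 0), and final ㄴ (index 4), giving the phonological granularity that syllable-level text statistics miss.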
Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning
Honglin Lin · Qizhi Pei · Zhuoshi Pan · Yu Li · Xin Gao · Juntao Li · Conghui He · Lijun Wu
Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose \texttt{Caco} (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, \texttt{Caco} first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created \texttt{Caco}-1.3M dataset demonstrate that \texttt{Caco}-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that \texttt{Caco}’s code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
Jiatong Shi · Yifan Cheng · Bo-Hao Su · Hye-jin Shim · Jinchuan Tian · Samuele Cornell · Yiwen Zhao · Siddhant Arora · Shinji Watanabe
Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships. Across tasks, ARECHO offers reference-free evaluation using its dynamic classifier chain to support subset queries (single or multiple metrics) and reduces error propagation via confidence-oriented decoding.
Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
Yuancheng Wang · Jiachen Zheng · Junan Zhang · Xueyao Zhang · Huan Liao · Zhizheng Wu
We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, (1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. (2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. (3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at https://metis-demo.github.io/. We release the code and model checkpoints at https://github.com/open-mmlab/Amphion.
This paper introduces Codified Profiles for role-playing, a novel approach that represents character logic as structured, executable functions for behavioral decision-making. Converted from textual profiles by a large language model (LLM), each codified profile defines a set of functions parse_by_scene(scene) that output multiple logic-grounded assertions according to the scene, using both explicit control structures (e.g., if-then-else) and flexible check_condition(scene, question) functions, where each question is a semantically meaningful prompt about the scene (e.g., "Is the character in danger?") discriminated by the role-playing LLM as true, false, or unknown. This explicit representation offers three key advantages over traditional prompt-based textual profiles, which append character descriptions directly into text prompts: (1) Persistence, by enforcing complete and consistent execution of character logic, rather than relying on the model's implicit reasoning; (2) Updatability, through systematic inspection and revision of behavioral logic, which is difficult to track or debug in prompt-only approaches; (3) Controllable Randomness, by supporting stochastic behavior directly within the logic, enabling fine-grained variability that prompting alone struggles to achieve. To validate these advantages, we introduce a new benchmark constructed from 83 characters and 5,141 scenes curated from Fandom, using natural language inference (NLI)-based scoring to compare character responses against ground truths. Our experiments demonstrate the significant benefits of codified profiles in improving persistence, updatability, and behavioral diversity. Notably, by offloading a significant portion of reasoning to preprocessing, codified profiles enable even 1B-parameter models to perform high-quality role-playing, providing an efficient, lightweight foundation for local deployment of role-play agents.
Distilling LLM Agent into Small Models with Retrieval and Code Tools
Minki Kang · Jongwon Jeong · Seanie Lee · Jaewoong Cho · Sung Ju Hwang
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents.
Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization
Jiaqi Wei · Hao Zhou · Xiang Zhang · Di Zhang · Zijie Qiu · Noah Wei · Jinzhe Li · Wanli Ouyang · Siqi Sun
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as \textit{Retrieval-Augmented Reasoning} and identify a central but underexplored problem: \textit{Reasoning Misalignment}—the divergence between an LLM's internal reasoning trajectory and the evidential constraints provided by retrieval. To address this issue, we propose \textsc{AlignRAG}, a novel iterative framework grounded in \textit{Critique-Driven Alignment (CDA)}. We further introduce \textsc{AlignRAG-auto}, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations. At the heart of \textsc{AlignRAG} lies a \textit{contrastive critique synthesis} mechanism that generates retrieval-sensitive critiques while mitigating self-bias. This mechanism trains a dedicated retrieval-augmented \textit{Critic Language Model (CLM)} using labeled critiques that distinguish between evidence-aligned and misaligned reasoning. Empirical evaluations show that our approach significantly improves reasoning fidelity. Our 8B-parameter CLM improves performance over the Self-Refine baseline by \textbf{12.1\%} on out-of-domain tasks and outperforms a standard 72B-parameter CLM by \textbf{2.2\%}. Furthermore, \textsc{AlignRAG-auto} achieves this state-of-the-art performance while dynamically determining the optimal number of refinement steps, enhancing efficiency and usability. \textsc{AlignRAG} remains compatible with existing RAG architectures as a \textit{plug-and-play} module and demonstrates strong robustness under both informative and noisy retrieval scenarios. 
Overall, \textsc{AlignRAG} offers a principled solution for aligning model reasoning with retrieved evidence, substantially improving the factual reliability and robustness of RAG systems. Our source code is provided at \href{https://github.com/upup-wei/RAG-ReasonAlignment}{link}.
Causal LLM Routing: End-to-End Regret Minimization from Observational Data
Asterios Tsiourvas · Wei Sun · Georgia Perakis
LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
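The softmax-weighted regret surrogate can be illustrated schematically: the router's per-model scores are softmax-normalized into a selection distribution, and regret for a query is the gap between the best model's reward and the policy's expected reward. This sketch omits the paper's causal corrections for observational (bandit-feedback) data and is only an illustrative form of the objective:

```python
import numpy as np

def softmax(z, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_regret(policy_scores, rewards):
    """Softmax-weighted regret for one query: expected shortfall of the
    policy's model distribution versus the single best model.
    Schematic only; the paper's surrogate handles observational data."""
    p = softmax(policy_scores)
    rewards = np.asarray(rewards, dtype=float)
    return rewards.max() - float(p @ rewards)
```

As the policy concentrates its scores on the best model, the surrogate regret tends to zero, which is the intuition behind recovering the optimal policy at convergence.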
Hyper-Modality Enhancement for Multimodal Sentiment Analysis with Missing Modalities
Yan Zhuang · Minhao Liu · Wei Bai · Yanru Zhang · Wei Li · Jiawen Deng · Fuji Ren
Multimodal Sentiment Analysis (MSA) aims to infer human emotions by integrating complementary signals from diverse modalities. However, in real-world scenarios, missing modalities are common due to data corruption, sensor failure, or privacy concerns, which can significantly degrade model performance. To tackle this challenge, we propose Hyper-Modality Enhancement (HME), a novel framework that avoids explicit modality reconstruction by enriching each observed modality with semantically relevant cues retrieved from other samples. This cross-sample enhancement reduces reliance on fully observed data during training, making the method better suited to scenarios with inherently incomplete inputs. In addition, we introduce an uncertainty-aware fusion mechanism that adaptively balances original and enriched representations to improve robustness. Extensive experiments on three public benchmarks show that HME consistently outperforms state-of-the-art methods under various missing modality conditions, demonstrating its practicality in real-world MSA applications.
Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Yanming Wan · Jiaxing Wu · Marwa Abdulhai · Lior Shani · Natasha Jaques
Effective conversational agents must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), often prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context-limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. This novel reward mechanism encourages the agent to actively infer user traits by optimizing conversations to improve its user model's accuracy. Consequently, the agent delivers more personalized interactions by learning more about the user. We demonstrate our method's effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting with improved generalization capabilities compared to traditional multi-turn RLHF, all while maintaining conversation quality. Our method offers a promising solution for creating more personalized, adaptive, and engaging conversational agents.
Mellow: a small audio language model for reasoning
Soham Deshmukh · Satvik Dixit · Rita Singh · Bhiksha Raj
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hours). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30\% of the data) and synthetically generated data (70\%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow’s reasoning ability, we benchmark it on a diverse set of tasks, assessing both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
Feedback-Aware MCTS for Goal-Oriented Information Seeking
Harshita Chopra · Chirag Shah
Effective decision-making and problem-solving in conversational systems require the ability to identify and acquire missing information through targeted questioning. A key challenge lies in efficiently narrowing down a large space of possible outcomes by posing questions that minimize uncertainty. To address this, we introduce a novel framework that leverages Large Language Models (LLMs) to generate information-seeking questions, with Monte Carlo Tree Search (MCTS) to strategically select questions that maximize information gain, as a part of inference-time planning. Our primary contribution includes a hierarchical feedback mechanism that exploits past interaction patterns to guide future strategy. Specifically, each new problem is mapped to a cluster based on semantic similarity, and our UCT (Upper Confidence bound for Trees) formulation employs a cluster-specific bonus reward to prioritize successful question trajectories that have proven effective for similar problems in the past. Extensive empirical evaluation across medical diagnosis and technical troubleshooting domains shows that our method achieves an average of 12\% improvement in success rates and about 10x reduction in the number of LLM calls made for planning per conversation, compared to the state of the art. An additional 8\% gain in success rate is observed on average when we start with a constrained set of possibilities. Our results underscore the efficacy of feedback-aware MCTS in enhancing information-seeking in goal-oriented dialogues.
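The cluster-specific bonus enters the node-selection score alongside the usual UCT exploitation and exploration terms. A schematic version is shown below; the exact weighting of the bonus in the paper's formulation may differ:

```python
import math

def uct_score(total_value, visits, parent_visits, cluster_bonus=0.0, c=1.414):
    """UCT node-selection score with an additive cluster-specific bonus
    that rewards question trajectories which succeeded on semantically
    similar past problems. Schematic form, not the paper's exact one."""
    if visits == 0:
        return float("inf")           # unvisited nodes are explored first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore + cluster_bonus
```

During search, the question whose node maximizes this score is posed next; the bonus biases selection toward question patterns proven effective within the problem's cluster.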
Hyperbolic Dataset Distillation
Wenyuan Li · Guang Li · Keisuke Maeda · Takahiro Ogawa · Miki Haseyama
To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20\% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness. To the best of our knowledge, this is the first work to incorporate the hyperbolic space into the dataset distillation process. 
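The geodesic distance used to compare synthetic and original centroids follows the standard Lorentz (hyperboloid) model formulas. A small sketch of those standard formulas (illustrative, not HDD's actual code):

```python
import numpy as np

def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + <x_rest, y_rest>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_lorentz(v):
    """Embed a Euclidean feature v on the hyperboloid by choosing the
    time coordinate so that <x, x>_L = -1."""
    x0 = np.sqrt(1.0 + np.dot(v, v))
    return np.concatenate(([x0], v))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    inner = -lorentz_inner(x, y)
    return float(np.arccosh(np.clip(inner, 1.0, None)))
```

Minimizing this distance between the centroids of synthetic and original feature embeddings is the matching objective described above.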
The code is available at https://github.com/Guang000/HDD.
Learning “Partner-Aware” Collaborators in Multi-Party Collaboration
Abhijnan Nath · Nikhil Krishnaswamy
Large Language Models (LLMs) are increasingly being deployed in agentic settings where they act as collaborators with humans. Therefore, it is increasingly important to be able to evaluate their abilities to collaborate effectively in multi-turn, multi-party tasks. In this paper, we build on the AI alignment and “safe interruptability” literature to offer novel theoretical insights on collaborative behavior between LLM-driven collaborator agents and an intervention agent. Our goal is to learn an ideal “partner-aware” collaborator that increases the group’s common ground (CG), i.e., alignment on task-relevant propositions, by intelligently collecting information provided in interventions by a partner agent. We show how LLM agents trained using standard RLHF and related approaches are naturally inclined to ignore possibly well-meaning interventions, which makes increasing group common ground non-trivial in this setting. We employ a two-player Modified-Action MDP to examine this suboptimal behavior of standard AI agents, and propose Interruptible Collaborative Roleplayer (ICR), a novel “partner-aware” learning algorithm to train CG-optimal collaborators. Experiments on multiple collaborative task environments show that ICR, on average, is more capable of promoting successful CG convergence and exploring more diverse solutions in such tasks.
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
Shangbin Feng · Zifeng Wang · Palash Goyal · Yike Wang · Weijia Shi · Huang Xia · Hamid Palangi · Luke Zettlemoyer · Yulia Tsvetkov · Chen-Yu Lee · Tomas Pfister
We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with topological message passing for collaborative generation. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as learning a DAG that specifies the flow of inputs and outputs between LLMs. Starting from a swarm of random continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order, evaluate on the utility function (e.g. accuracy on a task), and optimize the adjacency matrices with particle swarm optimization based on the utility score. For weight-step, we assess the contribution of individual LLMs in the multi-LLM systems and optimize model weights with swarm intelligence. We propose JFK-score to quantify the individual contribution of each LLM in the best-found DAG of the role-step, then optimize model weights with particle swarm optimization based on the JFK-score. Experiments demonstrate that Heterogeneous Swarms outperforms 17 role- and/or weight-based baselines by 18.5% on average across 12 tasks. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of language models.
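Decoding a continuous adjacency matrix into a DAG and calling the LLMs in topological order can be illustrated with thresholding plus Kahn's algorithm. This is one simple decoding scheme for illustration; the paper's decoder may differ:

```python
import numpy as np

def decode_dag(adj, threshold=0.5):
    """Threshold a continuous adjacency matrix and keep only
    strictly upper-triangular (forward) edges, guaranteeing acyclicity.
    One simple decoding scheme, used here for illustration."""
    a = (np.asarray(adj) > threshold).astype(int)
    return np.triu(a, k=1)

def topological_order(dag):
    """Kahn's algorithm: the order in which LLM nodes would be called."""
    n = dag.shape[0]
    indeg = dag.sum(axis=0).astype(int)
    queue = [i for i in range(n) if indeg[i] == 0]
    order = []
    while queue:
        u = queue.pop(0)
        order.append(u)
        for v in range(n):
            if dag[u, v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return order
```

In the role-step, each particle's continuous matrix would be decoded this way, the LLMs invoked in the resulting order, and the utility score fed back to the particle swarm update.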
CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing
Leonie Bossemeyer · Samuel Heinrich · Grant Van Horn · Oisin Mac Aodha
Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans remains challenging, and accurately inferring a human learner’s knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertise in complex fine-grained classification. More than 40,000 participants have engaged in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners' knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmarks of its kind, offering a substantially higher number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertise over time and across individuals.
EyeBench: Predictive Modeling from Eye Movements in Reading
Omer Shubi · David Robert Reich · Keren Gruteke Klein · Yuval Angel · Paul Prasse · Lena A. Jäger · Yevgeni Berzak
We present EyeBench, the first benchmark designed to evaluate machine learning models that decode cognitive and linguistic information from eye movements during reading. EyeBench offers an accessible entry point to the challenging and underexplored domain of modeling eye tracking data paired with text, aiming to foster innovation at the intersection of multimodal AI and cognitive science. The benchmark provides a standardized evaluation framework for predictive models, covering a diverse set of datasets and tasks, ranging from assessment of reading comprehension to detection of developmental dyslexia. Progress on the EyeBench challenge will pave the way for both practical real-world applications, such as adaptive user interfaces and personalized education, and scientific advances in understanding human language processing. The benchmark is released as an open-source software package which includes data downloading and harmonization scripts, baselines and state-of-the-art models, as well as evaluation code, publicly available at https://github.com/EyeBench/eyebench.
MUniverse: A Simulation and Benchmarking Suite for Motor Unit Decomposition
Pranav Mamidanna · Thomas Klotz · Dimitrios Chalatsis · Agnese Grison · Irene Mendez Guerra · Shihan Ma · Arnault Caillet · Simon Avrillon · Robin Rohlén · Dario Farina
Neural source separation enables the extraction of individual spike trains from complex electrophysiological recordings. When applied to electromyographic (EMG) signals, it provides a unique window into the motor output of the nervous system by isolating the spiking activity of motor units (MUs). MU decomposition from EMG signals is currently the only scalable neural interfacing approach available in behaving humans and has become foundational in motor neuroscience and neuroprosthetics. However, unlike related domains such as spike sorting or electroencephalography (EEG) analysis, decomposition of EMG signals lacks open benchmarks that reflect the diversity of muscles, movement contexts, and noise sources encountered in practice. To address this gap, we introduce MUniverse, a modular simulation and benchmarking suite for decomposing EMG signals into individual MU spiking activity. MUniverse provides: (1) a simulation stack with a user-friendly interface to a state-of-the-art EMG generator; (2) a curated library of datasets across synthetic, hybrid synthetic-real data with ground truth spikes, and experimental EMG; (3) a set of internal and external decomposition pipelines; and (4) a unified benchmark with well-defined tasks, standard evaluation metrics, and baseline results from established decomposition pipelines. MUniverse is designed for extensibility, reproducibility, and community use, and all datasets are distributed with standardised metadata (Croissant, BIDS). By standardising evaluation and enabling dataset simulation at scale, MUniverse aims to catalyze progress on this long-standing neural signal processing problem.
Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng · Zhenghai Xue · Tingcong Liu · Bo An
Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12\% on ALFWorld and > 9\% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1\% on 3B and 47.2\% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
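The two-level grouping can be sketched in a few lines: episode-level advantages normalize returns within a group of trajectories, while step-level groups collect the rewards of actions taken from the same anchor state across trajectories. This is a schematic of the mechanism, not the authors' implementation:

```python
import numpy as np

def group_relative_advantage(rewards):
    """Critic-free relative advantage: normalize rewards within a group
    (GRPO-style), used at both the episode and step level."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def step_level_groups(trajectories):
    """Anchor-state grouping: collect rewards of actions that stem from
    the same environment state across different trajectories.
    Each trajectory is a list of (state, reward) pairs."""
    groups = {}
    for traj in trajectories:
        for state, reward in traj:
            groups.setdefault(state, []).append(reward)
    return groups
```

Macro (episode) and micro (step) advantages computed this way can then be combined per token, retaining the critic-free, low-memory character of group-based RL.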
OPHR: Mastering Volatility Trading with Multi-Agent Deep Reinforcement Learning
Zeting Chen · Xinyu Cai · Molei Qin · Bo An
Options markets represent one of the most sophisticated segments of the financial ecosystem, with prices that directly reflect market uncertainty. In this paper, we introduce the first reinforcement learning (RL) framework specifically designed for volatility trading through options, focusing on profiting from the difference between implied volatility and realized volatility. Our multi-agent architecture consists of an Option Position Agent (OP-Agent) responsible for volatility timing by controlling long/short volatility positions, and a Hedger Routing Agent (HR-Agent) that manages risk and maximizes path-dependent profits by selecting optimal hedging strategies with different risk preferences. Evaluating our approach using cryptocurrency options data from 2021-2024, we demonstrate superior performance on BTC and ETH, significantly outperforming traditional strategies and machine learning baselines across all profit and risk-adjusted metrics while exhibiting sophisticated trading behavior. The code framework and sample data of this paper have been released on https://github.com/Edwicn/OPHR-MasteringVolatilityTradingwithMultiAgentDeepReinforcementLearning
Influence Guided Context Selection for Effective Retrieval-Augmented Generation
Jiale Deng · Yanyan Shen · Ziyuan Pei · Youmin Chen · Linpeng Huang
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
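The CI value itself is a leave-one-out quantity: the utility drop when a context is removed from the list, with positive-CI contexts retained. A toy sketch, where `utility` stands in for any scorer over a context list (a hypothetical interface, not the paper's API):

```python
def contextual_influence_values(contexts, utility):
    """CI value of each context: performance degradation when that
    context is removed from the list. `utility` maps a context list
    to a scalar (e.g., generator answer quality)."""
    full = utility(contexts)
    return [full - utility(contexts[:i] + contexts[i + 1:])
            for i in range(len(contexts))]

def select_positive(contexts, utility):
    """Selection rule from the abstract: keep contexts with CI > 0,
    with no selection hyperparameters to tune."""
    ci = contextual_influence_values(contexts, utility)
    return [c for c, v in zip(contexts, ci) if v > 0]
```

In practice the oracle CI values are expensive (one generator call per removal), which is why the paper trains a surrogate model to predict them at inference time.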
A Sustainable AI Economy Needs Data Deals That Work for Generators
Ruoxi Jia · Luis Oala · Wenjie Xiong · Suqin Ge · Jiachen (Tianhao) Wang · Feiyang Kang · Dawn Song
We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality: each stage in the data cycle from inputs to model weights to synthetic outputs refines technical signal but strips economic equity from data generators. We show, by analyzing seventy-three public data deals, that the majority of value accrues to aggregators, with documented creator royalties rounding to zero and widespread opacity of deal terms. This is not just an economic welfare concern: as data and its derivatives become economic assets, the feedback loop that sustains current learning algorithms is at risk. We identify three structural faults - missing provenance, asymmetric bargaining power, and non-dynamic pricing - as the operational machinery of this inequality. In our analysis, we trace these problems along the machine learning value chain and propose an Equitable Data-Value Exchange (EDVEX) Framework to enable a minimal market that benefits all participants. Finally, we outline research directions where our community can make concrete contributions to data deals and contextualize our position with related and orthogonal viewpoints.
Position: AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift
Eunsu Baek · Keondo Park · Jeonggil Ko · Min-hwan Oh · Taesik Gong · Hyung-Sin Kim
Current AI advances largely rely on scaling neural models and expanding training datasets to achieve generalization and robustness. Despite notable successes, this paradigm incurs significant environmental, economic, and ethical costs, limiting sustainability and equitable access. Inspired by biological sensory systems, where adaptation occurs dynamically at the input (e.g., adjusting pupil size, refocusing vision), we advocate for adaptive sensing as a necessary and foundational shift. Adaptive sensing proactively modulates sensor parameters (e.g., exposure, sensitivity, multimodal configurations) at the input level, significantly mitigating covariate shifts and improving efficiency. Empirical evidence from recent studies demonstrates that adaptive sensing enables small models (e.g., EfficientNet-B0) to surpass substantially larger models (e.g., OpenCLIP-H) trained with significantly more data and compute. We (i) outline a roadmap for broadly integrating adaptive sensing into real-world applications spanning humanoid, healthcare, autonomous systems, agriculture, and environmental monitoring, (ii) critically assess technical and ethical integration challenges, and (iii) propose targeted research directions, such as standardized benchmarks, real-time adaptive algorithms, multimodal integration, and privacy-preserving methods. Collectively, these efforts aim to transition the AI community toward sustainable, robust, and equitable artificial intelligence systems.
OctoNet: A Large-Scale Multi-Modal Dataset for Human Activity Understanding Grounded in Motion-Captured 3D Pose Labels
Dongsheng Yuan · Xie Zhang · Weiying Hou · Sheng Lyu · Yuemin Yu · Luca Jiang-Tao Yu · Chengxiao Li · Chenshu Wu
We introduce OctoNet, a large-scale, multi-modal, multi-view human activity dataset designed to advance human activity understanding and multi-modal learning. OctoNet comprises 12 heterogeneous modalities (including RGB, depth, thermal cameras, infrared arrays, audio, millimeter-wave radar, Wi-Fi, IMU, and more) recorded from 41 participants under multi-view sensor setups, yielding over 67.72M synchronized frames. The data encompass 62 daily activities spanning structured routines, freestyle behaviors, human-environment interaction, healthcare tasks, etc. Critically, all modalities are annotated by high-fidelity 3D pose labels captured via a professional motion-capture system, allowing precise alignment and rich supervision across sensors and views. OctoNet is one of the most comprehensive datasets of its kind, enabling a wide range of learning tasks such as human activity recognition, 3D pose estimation, multi-modal fusion, cross-modal supervision, and sensor foundation models. Extensive experiments with various baselines demonstrate its sensing capacity. OctoNet offers a unique and unified testbed for developing and benchmarking generalizable, robust models for human-centric perceptual AI.
Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs
Yifan Wei · Xiaoyan Yu · Tengfei Pan · Angsheng Li · Li Du
Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model’s true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structural Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements.
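As a concrete handle on the quantity SENATOR builds on, the snippet below computes the classic one-dimensional structural entropy of an undirected graph from its degree distribution. The paper's path-level, MCTS-guided usage is more involved; treat this as background only:

```python
import math

def one_dim_structural_entropy(edges):
    """One-dimensional structural entropy of an undirected graph.

    edges: list of (u, v) pairs. Each node contributes
    -(d_i / 2m) * log2(d_i / 2m), where d_i is its degree and
    m the number of edges (the classic degree-based variant).
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    two_m = 2 * len(edges)
    return -sum((d / two_m) * math.log2(d / two_m) for d in degree.values())
```

For a triangle (all degrees equal) this reduces to log2(3), the maximum for three nodes; less regular graphs score lower.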
A Clean Slate for Offline Reinforcement Learning
Matthew T Jackson · Uljad Berdica · Jarek Liesen · Shimon Whiteson · Jakob Foerster
Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches and enables development within a single, comprehensive hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms - TD3-AWR (model-free) and MoBRAC (model-based) - which substantially outperform established baselines. Our implementation is publicly available at https://github.com/EmptyJackson/unifloral.
Adaptable Safe Policy Learning from Multi-task Data with Constraint Prioritized Decision Transformer
Ruiqi Xue · Ziqian Zhang · Lihe Li · Cong Guan · Lei Yuan · Yang Yu
Learning safe reinforcement learning (RL) policies from offline multi-task datasets without direct environmental interaction is crucial for efficient and reliable deployment of RL agents. Benefiting from their scalability and strong in-context learning capabilities, recent approaches attempt to utilize Decision Transformer (DT) architectures for offline safe RL, demonstrating promising adaptability across varying safety budgets. However, these methods primarily focus on single-constraint scenarios and struggle with diverse constraint configurations across multiple tasks. Additionally, their reliance on heuristically defined Return-To-Go (RTG) inputs limits flexibility and reduces learning efficiency, particularly in complex multi-task environments. To address these limitations, we propose CoPDT, a novel DT-based framework designed to enhance adaptability to diverse constraints and varying safety budgets. Specifically, CoPDT introduces a constraint prioritized prompt encoder, which leverages sparse binary cost signals to accurately identify constraints, and a constraint prioritized Return-To-Go (CPRTG) token mechanism, which dynamically generates RTGs based on identified constraints and corresponding safety budgets. Extensive experiments on the OSRL benchmark demonstrate that CoPDT achieves superior efficiency and significantly enhanced safety compliance across diverse multi-task scenarios, surpassing state-of-the-art DT-based methods by satisfying safety constraints in more than twice as many tasks.
Flattening Hierarchies with Policy Bootstrapping
John Zhou · Jonathan Kao
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
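The advantage-weighted importance sampling mentioned above typically takes an exponential form. Below is a minimal sketch of such weights; the temperature beta and the normalization are generic choices from the AWR family, not necessarily the paper's exact estimator:

```python
import math

def advantage_weights(advantages, beta=1.0):
    """Exponential advantage weighting, normalized to sum to 1.

    advantages: per-sample advantage estimates
    beta: temperature controlling how sharply high-advantage
          samples are preferred (hypothetical default)
    """
    raw = [math.exp(a / beta) for a in advantages]
    total = sum(raw)
    return [w / total for w in raw]
```

Samples whose subgoal-conditioned advantage is higher receive exponentially more weight when bootstrapping the flat policy.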
Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
Subhojyoti Mukherjee · Viet Lai · Raghavendra Addanki · Ryan Rossi · Seunghyun Yoon · Trung Bui · Anup B. Rao · Jayakumar Subramanian · Branislav Kveton
Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using similar techniques to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare against them empirically and report major gains in both optimized rewards and language quality.
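Reward-weighted fine-tuning amounts to an SFT-style loss with per-trajectory weights. The toy objective below shows the shape of such a loss; the exact weighting and normalization in the paper may differ:

```python
def reward_weighted_nll(seq_log_probs, rewards):
    """Reward-weighted negative log-likelihood over trajectories.

    seq_log_probs: summed token log-probabilities, one per trajectory
    rewards: non-negative scalar reward per trajectory
    Trajectories with zero reward contribute nothing; high-reward
    trajectories dominate the gradient, exactly as in weighted SFT.
    """
    total = sum(rewards)
    return -sum(r * lp for lp, r in zip(seq_log_probs, rewards)) / total
```

Because the loss is an ordinary weighted likelihood, any existing SFT pipeline can optimize it by attaching a weight to each training sequence.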
Modeling Dynamic Neural Activity by combining Naturalistic Video Stimuli and Stimulus-independent Latent Factors
Finn Schmidt · Polina Turishcheva · Suhas Shrinivasan · Fabian Sinz
Neural activity in visual processing is influenced by both external stimuli and internal brain states. Ideally, a neural predictive model should account for both. Currently, there are no dynamic encoding models that explicitly model a latent state and the entire neuronal response distribution. We address this gap by proposing a probabilistic model that predicts the joint distribution of the neuronal responses from video stimuli and stimulus-independent latent factors. After training and testing our model on mouse V1 neuronal responses, we find that it outperforms video-only models in terms of log-likelihood and achieves improvements in likelihood and correlation when conditioned on responses from other neurons. Furthermore, we find that the learned latent factors strongly correlate with mouse behavior and that they exhibit patterns related to the neurons’ position on the visual cortex, although the model was trained without behavior and cortical coordinates. Our findings demonstrate that unsupervised learning of latent factors from population responses can reveal biologically meaningful structure that bridges sensory processing and behavior, without requiring explicit behavioral annotations during training. The code is attached to the submission.
Efficient Training of Minimal and Maximal Low-Rank Recurrent Neural Networks
Anushri Arora · Jonathan Pillow
Low-rank recurrent neural networks (RNNs) provide a powerful framework for characterizing how neural systems solve complex cognitive tasks. However, fitting and interpreting these networks remains an important open problem. In this paper, we develop new methods for efficiently fitting low-rank RNNs in ''teacher-training'' settings. In particular, we build upon the neural engineering framework (NEF), in which RNNs are viewed as approximating an ordinary differential equation (ODE) of interest using a set of random nonlinear basis functions. This view provides geometric insight into how the choice of neural nonlinearity (e.g. tanh, ReLU) and the distribution of model parameters affects an RNN's representational capacity. We adapt this framework for online training and demonstrate better performance with significantly smaller networks compared to FORCE. Additionally, we outperform backpropagation-trained networks of similar size, while requiring substantially less training time. Next, we ask: how many neurons---and what distribution over their parameters---are needed to approximate a given dynamical system? To address this, we introduce methods for finding the smallest low-rank RNN to approximate a given dynamical system using an extension of orthogonal matching pursuit (OMP). We then consider infinite unit low-rank RNNs, which converge to a Gaussian Process (GP) over ODEs. In particular, we show that we can optimize the distribution over RNN parameters using the marginal likelihood under the equivalent GP covariance function---which can be computed in closed form for particular choices of nonlinearity. This results in substantially better training performance even for finite low-rank RNNs. Finally, we describe active learning methods for low-rank RNNs, which speed up training through the selection of maximally informative activity patterns.
Seemingly Redundant Modules Enhance Robust Odor Learning in Fruit Flies
HaiYang Li · Liao Yu · Qiang Yu · Yunliang Zang
Biological circuits have evolved to incorporate multiple modules that perform similar functions. In the fly olfactory circuit, both lateral inhibition (LI) and neuronal spike frequency adaptation (SFA) are thought to enhance pattern separation for odor learning. However, it remains unclear whether these mechanisms play redundant or distinct roles in this process. In this study, we present a computational model of the fly olfactory circuit to investigate odor discrimination under varying noise conditions that simulate complex environments. Our results show that LI primarily enhances odor discrimination in low‑ and medium‑noise scenarios, but this benefit diminishes and may reverse under higher‑noise conditions. In contrast, SFA consistently improves discrimination across all noise levels. LI is preferentially engaged in low‑ and medium‑noise environments, whereas SFA dominates in high‑noise settings. When combined, these two sparsification mechanisms enable optimal discrimination performance. This work demonstrates that seemingly redundant modules in biological circuits can, in fact, be essential for achieving optimal learning in complex contexts.
BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals
Qinfan Xiao · Ziyun Cui · Chi Zhang · SiQi Chen · Wen Wu · Andrew Thwaites · Alexandra Woolgar · Bowen Zhou · Chao Zhang
Electroencephalography (EEG) and magnetoencephalography (MEG) measure neural activity non-invasively by capturing electromagnetic fields generated by dendritic currents. Although rooted in the same biophysics, EEG and MEG exhibit distinct signal patterns, further complicated by variations in sensor configurations across modalities and recording devices. Existing approaches typically rely on separate, modality- and dataset-specific models, which limits the performance and cross-domain scalability. This paper proposes BrainOmni, the first brain foundation model that generalises across heterogeneous EEG and MEG recordings. To unify diverse data sources, we introduce BrainTokenizer, the first tokeniser that quantises spatiotemporal brain activity into discrete representations. Central to BrainTokenizer is a novel Sensor Encoder that encodes sensor properties such as spatial layout, orientation, and type, enabling compatibility across devices and modalities. Building upon the discrete representations, BrainOmni learns unified semantic embeddings of brain signals by self-supervised pretraining. To the best of our knowledge, it is the first foundation model to support both EEG and MEG signals, as well as the first to incorporate large-scale MEG pretraining. A total of 1,997 hours of EEG and 656 hours of MEG data are curated and standardised from publicly available sources for pretraining. Experiments show that BrainOmni outperforms both existing foundation models and state-of-the-art task-specific models on a range of downstream tasks. It also demonstrates strong generalisation to unseen EEG and MEG devices. Further analysis reveals that joint EEG-MEG (EMEG) training yields consistent improvements across both modalities. Code and checkpoints are publicly available at https://github.com/OpenTSLab/BrainOmni
Quantifying Task-relevant Similarities in Representations Using Decision Variable Correlations
Yu (Eric) Qian · Wilson Geisler · Xue-Xin Wei
Previous studies have compared neural activities in the visual cortex to representations in deep neural networks trained on image classification. Interestingly, while some studies suggest that their representations are highly similar, others have argued the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the image-by-image correlation between the decoded decisions based on the internal neural representations in a classification task. Thus, it can capture task-relevant information rather than general representational alignment. We evaluate DVC using monkey V4/IT recordings and network models trained on image classification tasks. We find that model–model similarity is comparable to monkey–monkey similarity, whereas model–monkey similarity is consistently lower. Strikingly, DVC decreases with increasing network performance on ImageNet-1k. Adversarial training does not improve model–monkey similarity in task-relevant dimensions assessed using DVC, although it markedly increases the model–model similarity. Similarly, pre-training on larger datasets does not improve model–monkey similarity. These results suggest a divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.
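At its core, DVC is an image-by-image Pearson correlation between two observers' decision variables (e.g., linear-decoder outputs on the same images). A bare-bones sketch follows; the paper's estimator includes corrections for noise that are omitted here:

```python
def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def decision_variable_correlation(dv_obs1, dv_obs2):
    """DVC between two observers: correlate their per-image decision
    variables (decoded from internal representations on the same
    classification task)."""
    return pearson(dv_obs1, dv_obs2)
```

Because only the decoded decision variables enter the correlation, two observers can have very different internal representations yet still score high DVC if their decision strategies agree image by image.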
Flexible inference for animal learning rules using neural networks
Yuhan Helena Liu · Victor Geadah · Jonathan Pillow
Understanding how animals learn is a central challenge in neuroscience, with growing relevance to the development of animal- or human-aligned artificial intelligence. However, existing approaches tend to assume fixed parametric forms for the learning rule (e.g., Q-learning, policy gradient), which may not accurately describe the complex forms of learning employed by animals in realistic settings. Here we address this gap by developing a framework to infer learning rules directly from behavioral data collected during de novo task learning. We assume that animals follow a decision policy parameterized by a generalized linear model (GLM), and we model their learning rule—the mapping from task covariates to per-trial weight updates—using a deep neural network (DNN). This formulation allows flexible, data-driven inference of learning rules while maintaining an interpretable form of the decision policy itself. To capture more complex learning dynamics, we introduce a recurrent neural network (RNN) variant that relaxes the Markovian assumption that learning depends solely on covariates of the current trial, allowing for learning rules that integrate information over multiple trials. Simulations demonstrate that the framework can recover ground-truth learning rules. We applied our DNN and RNN-based methods to a large behavioral dataset from mice learning to perform a sensory decision-making task and found that they outperformed traditional RL learning rules at predicting the learning trajectories of held-out mice. The inferred learning rules exhibited reward-history–dependent learning dynamics, with larger updates following sequences of rewarded trials. Overall, these methods provide a flexible framework for inferring learning rules from behavioral data in de novo learning tasks, setting the stage for improved animal training protocols and the development of behavioral digital twins.
Spike-timing-dependent Hebbian learning as noisy gradient descent
Niklas Dexheimer · Sascha Gaudlitz · Johannes Schmidt-Hieber
Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.
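The selection effect described above can be illustrated with a toy noisy multiplicative update on the simplex. This is a caricature of "noisy gradient dynamics that identify the most active presynaptic neuron", not the paper's STDP rule or its convergence proof:

```python
import random

def noisy_simplex_ascent(activities, eta=0.1, noise=0.05, steps=500, seed=0):
    """Noisy multiplicative update on the probability simplex.

    Each weight grows at a rate set by its neuron's (noisy) activity,
    then the weight vector is renormalized so it stays on the simplex.
    Despite the injected noise, mass concentrates on the neuron with
    the highest activity.
    """
    rng = random.Random(seed)
    w = [1.0 / len(activities)] * len(activities)
    for _ in range(steps):
        w = [wi * (1.0 + eta * (a + rng.gauss(0.0, noise)))
             for wi, a in zip(w, activities)]
        total = sum(w)
        w = [wi / total for wi in w]
    return w
```

The exponential separation between growth rates explains why convergence to the winning neuron is fast even under constant noise.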
Neurons as Detectors of Coherent Sets in Sensory Dynamics
Joshua L Pughe-Sanford · Xuehao Ding · Jason Moore · Anirvan Sengupta · Charles Epstein · Philip Greengard · Dmitri Chklovskii
We model sensory streams as observations from high-dimensional stochastic dynamical systems and conceptualize sensory neurons as self-supervised learners of compact representations of such dynamics. From prior experience, neurons learn {\it coherent sets}—regions of stimulus state space whose trajectories evolve cohesively over finite times—and assign membership indices to new stimuli. Coherent sets are identified via spectral clustering of the {\it stochastic Koopman operator (SKO)}, where the sign pattern of a subdominant singular function partitions the state space into minimally coupled regions. For multivariate Ornstein–Uhlenbeck processes, this singular function reduces to a linear projection onto the dominant singular vector of the whitened state-transition matrix. Encoding this singular vector as a receptive field enables neurons to compute membership indices via the projection sign in a biologically plausible manner. Each neuron detects either a {\it predictive} coherent set (stimuli with common futures) or a {\it retrospective} coherent set (stimuli with common pasts), suggesting a functional dichotomy among neurons. Since neurons lack access to explicit dynamical equations, the requisite singular vectors must be estimated directly from data, for example, via past–future canonical correlation analysis on lag-vector representations—an approach that naturally extends to nonlinear dynamics. This framework provides a novel account of neuronal temporal filtering, the ubiquity of rectification in neural responses, and known functional dichotomies. Coherent-set clustering thus emerges as a fundamental computation underlying sensory processing and transferable to bio-inspired artificial systems.
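For the linear (Ornstein–Uhlenbeck) case described above, the requisite singular vector can be obtained with plain power iteration, and the membership index is just the sign of a projection. A minimal pure-Python sketch under those assumptions (the whitening step and the estimation of the transition matrix from data are omitted):

```python
def dominant_singular_vector(M, iters=200):
    """Dominant right singular vector of M via power iteration on
    M^T M (assumes a nondegenerate leading singular value)."""
    m, n = len(M), len(M[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(M[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def membership_index(v, stimulus):
    """Sign of the projection onto the singular vector: a toy version
    of the neuron's coherent-set membership readout."""
    proj = sum(vi * xi for vi, xi in zip(v, stimulus))
    return 1 if proj >= 0 else -1
```

Encoding `v` as a receptive field and thresholding the projection at zero is exactly the biologically plausible computation the abstract attributes to a single neuron.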
Brain-Inspired fMRI-to-Text Decoding via Incremental and Wrap-Up Language Modeling
Wentao Lu · Dong Nie · Pengcheng Xue · Zheng Cui · Piji Li · Daoqiang Zhang · Xuyun Wen
Decoding natural language text from non-invasive brain signals, such as functional magnetic resonance imaging (fMRI), remains a central challenge in brain-computer interface research. While recent advances in large language models (LLMs) have enabled open-vocabulary fMRI-to-text decoding, existing frameworks typically process the entire fMRI sequence in a single step, leading to performance degradation when handling long input sequences due to memory overload and semantic drift. To address this limitation, we propose a brain-inspired sequential fMRI-to-text decoding framework that mimics the human cognitive strategy of segmented and inductive language processing. Specifically, we divide long fMRI time series into consecutive segments aligned with optimal language comprehension length. Each segment is decoded incrementally, followed by a wrap-up mechanism that summarizes the semantic content and incorporates it as prior knowledge into subsequent decoding steps. This sequence-wise approach alleviates memory burden and ensures semantic continuity across segments. In addition, we introduce a text-guided masking strategy integrated with a masked autoencoder (MAE) framework for fMRI representation learning. This method leverages attention distributions over key semantic tokens to selectively mask the corresponding fMRI time points, and employs MAE to guide the model toward focusing on neural activity at semantically salient moments, thereby enhancing the capability of fMRI embeddings to represent textual information. Experimental results on two datasets demonstrate that our method significantly outperforms state-of-the-art approaches, with performance gains increasing as decoding length grows.
Toward Relative Positional Encoding in Spiking Transformers
Changze Lv · Yansen Wang · Dongqi Han · Yifei Shen · Xiaoqing Zheng · Xuanjing Huang · Dongsheng Li
Spiking neural networks (SNNs) are bio-inspired networks that mimic how neurons in the brain communicate through discrete spikes, which have great potential in various tasks due to their energy efficiency and temporal processing capabilities. SNNs with self-attention mechanisms (spiking Transformers) have recently shown great advancements in various tasks, and inspired by traditional Transformers, several studies have demonstrated that spiking absolute positional encoding can help capture sequential relationships in input data, enhancing the capabilities of spiking Transformers for tasks such as sequential modeling and image classification. However, how to incorporate relative positional information into SNNs remains a challenge. In this paper, we introduce several strategies to approximate relative positional encoding (RPE) in spiking Transformers while preserving the binary nature of spikes. Firstly, we formally prove that encoding relative distances with Gray Code ensures that the binary representations of positional indices maintain a constant Hamming distance whenever their decimal values differ by a power of two, and we propose Gray-PE based on this property. In addition, we propose another RPE method called Log-PE, which combines the logarithmic form of the relative distance matrix directly into the spiking attention map. Furthermore, we extend our RPE methods to a two-dimensional form, making them suitable for processing image patches. We evaluate our RPE methods on various tasks, including time series forecasting, text classification, and patch-based image classification, and the experimental results demonstrate clear performance gains from incorporating our RPE methods across many architectures. Our results provide fresh perspectives on designing spiking Transformers to advance their sequential modeling capability, thereby expanding their applicability across various domains. Our code is available at https://github.com/microsoft/SeqSNN.
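The Gray-code property the paper proves is easy to check numerically. The snippet below is a sanity check of that claim using the standard binary-reflected Gray code, not the paper's encoder:

```python
def gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def hamming(a, b):
    """Hamming distance between the binary representations of a and b."""
    return bin(a ^ b).count("1")
```

For any fixed power-of-two offset, every pair of positions at that offset has the same Hamming distance between their Gray codes (distance 1 for offset 1, distance 2 for larger powers of two), which is what makes the code useful as a binary-friendly relative-distance encoding.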
FANS: A Flatness-Aware Network Structure for Generalization in Offline Reinforcement Learning
Da Wang · Yi Ma · Ting Guo · Hongyao Tang · Wei Wei · Jiye Liang
Offline reinforcement learning (RL) aims to learn optimal policies from static datasets while enhancing generalization to out-of-distribution (OOD) data. To mitigate overfitting to suboptimal behaviors in offline datasets, existing methods often relax constraints on policy and data or extract informative patterns through data-driven techniques. However, there has been limited exploration into structurally guiding the optimization process toward flatter regions of the solution space that offer better generalization. Motivated by this observation, we present \textit{FANS}, a generalization-oriented structured network framework that promotes flatter and robust policy learning by guiding the optimization trajectory through modular architectural design. FANS comprises four key components: (1) Residual Blocks, which facilitate compact and expressive representations; (2) Gaussian Activation, which promotes smoother gradients; (3) Layer Normalization, which mitigates overfitting; and (4) Ensemble Modeling, which reduces estimation variance. By integrating FANS into a standard actor-critic framework, we highlight that this remarkably simple architecture achieves superior performance across various tasks compared to many existing advanced methods. Moreover, we validate the effectiveness of FANS in mitigating overestimation and promoting generalization, demonstrating the promising potential of architectural design in advancing offline RL.
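Two of the four FANS components can be caricatured in scalar form: a residual connection wrapped around a Gaussian-activated transform. Layer normalization and ensembling are omitted, and the scalar setting and weights are illustrative assumptions, not the paper's architecture:

```python
import math

def gaussian_activation(x):
    """Smooth, bounded activation exp(-x^2); its gradients decay
    smoothly away from zero, which is the flatness-friendly property
    the abstract highlights."""
    return math.exp(-x * x)

def residual_block(x, w_in, w_out):
    """Toy scalar residual block: identity path plus a
    Gaussian-activated transform."""
    return x + w_out * gaussian_activation(w_in * x)
```

The identity path keeps representations compact while the bounded, smooth nonlinearity avoids the sharp gradient cliffs associated with poor generalization.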
AI-Generated Video Detection via Perceptual Straightening
Christian Internò · Robert Geirhos · Markus Olhofer · Sunny Liu · Barbara Hammer · David Klindt
The rapid advancement of generative AI enables highly realistic synthetic video, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose $ReStraV$ ($Re$presentation $Stra$ightening for $V$ideo), a novel approach to distinguish natural from AI-generated videos. Inspired by the ``perceptual straightening'' hypothesis—which suggests real-world video trajectories become straighter in the neural representation domain—we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistical and signal descriptors of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., $97.17$ % accuracy and $98.63$ % AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
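The curvature and stepwise-distance measures can be sketched directly. The functions below operate on a list of per-frame representation vectors (e.g., DINOv2 embeddings); the descriptor aggregation and the downstream classifier are omitted:

```python
import math

def curvature_angles(traj):
    """Per-step curvature of a representation trajectory: the angle
    (in degrees) between successive displacement vectors. Straighter
    trajectories give angles near zero."""
    diffs = [[b - a for a, b in zip(u, v)] for u, v in zip(traj, traj[1:])]
    angles = []
    for d1, d2 in zip(diffs, diffs[1:]):
        dot = sum(a * b for a, b in zip(d1, d2))
        n1 = math.sqrt(sum(a * a for a in d1))
        n2 = math.sqrt(sum(a * a for a in d2))
        cos = max(-1.0, min(1.0, dot / (n1 * n2)))
        angles.append(math.degrees(math.acos(cos)))
    return angles

def stepwise_distances(traj):
    """Euclidean distance between consecutive frame representations."""
    return [math.sqrt(sum((b - a) ** 2 for a, b in zip(u, v)))
            for u, v in zip(traj, traj[1:])]
```

Statistics of these two signals (means, spreads, etc.) are the kind of per-video descriptors a lightweight classifier can then separate.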
Characterizing control between interacting subsystems with deep Jacobian estimation
Adam J. Eisen · Mitchell Ostrow · Sarthak Chandra · Leo Kozachkov · Earl Miller · Ila Fiete
Biological function arises through the dynamical interactions of multiple subsystems, including those between brain areas, within gene regulatory networks, and more. A common approach to understanding these systems is to model the dynamics of each subsystem and characterize communication between them. An alternative approach is through the lens of control theory: how the subsystems control one another. This approach involves inferring the directionality, strength, and contextual modulation of control between subsystems. However, methods for understanding subsystem control are typically linear and cannot adequately describe the rich contextual effects enabled by nonlinear complex systems. To bridge this gap, we devise a data-driven nonlinear control-theoretic framework to characterize subsystem interactions via the Jacobian of the dynamics. We address the challenge of learning Jacobians from time-series data by proposing the JacobianODE, a deep learning method that leverages properties of the Jacobian to directly estimate it for arbitrary dynamical systems from data alone. We show that JacobianODE models outperform existing Jacobian estimation methods on challenging systems, including high-dimensional chaos. Applying our approach to a multi-area recurrent neural network (RNN) trained on a working memory selection task, we show that the “sensory” area gains greater control over the “cognitive” area over learning. Furthermore, we leverage the JacobianODE to directly control the trained RNN, enabling precise manipulation of its behavior. Our work lays the foundation for a theoretically grounded and data-driven understanding of interactions among biological subsystems.
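As background for what JacobianODE estimates, the snippet below computes the Jacobian of a known vector field by finite differences. The point of the paper is precisely that this object can instead be learned from time-series data alone, without access to f:

```python
def finite_difference_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of a vector field f at state x.

    f: function mapping a state (list of floats) to its time
       derivative (list of floats)
    x: state at which to evaluate the Jacobian
    Returns J with J[i][j] approximating d f_i / d x_j.
    """
    fx = f(x)
    J = []
    for i in range(len(fx)):
        row = []
        for j in range(len(x)):
            xp = list(x)
            xp[j] += eps
            row.append((f(xp)[i] - fx[i]) / eps)
        J.append(row)
    return J
```

In the control-theoretic framing above, the off-diagonal blocks of this matrix quantify how strongly one subsystem's state drives another's dynamics.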
Identifying interactions across brain areas while accounting for individual-neuron dynamics with a Transformer-based variational autoencoder
Qi Xin · Robert E Kass
Advances in large-scale recording technologies now enable simultaneous measurements from multiple brain areas, offering new opportunities to study signal transmission across interacting components of neural circuits. However, neural responses exhibit substantial trial-to-trial variability, often driven by unobserved factors such as subtle changes in animal behavior or internal states. To prevent evolving background dynamics from contaminating identification of functional coupling, we developed a hybrid neural spike train model, GLM-Transformer, that incorporates flexible, deep latent variable models into a point process generalized linear model (GLM) having an interpretable component for cross-population interactions. A Transformer-based variational autoencoder captures nonstationary individual-neuron dynamics that vary across trials, while standard nonparametric regression GLM coupling terms provide estimates of directed interactions between neural populations. We incorporate a low-rank structure on population-to-population coupling effects to improve scalability. Across synthetic datasets and mechanistic simulations, GLM-Transformer recovers known coupling structure and remains robust to shared background fluctuations. When applied to the Allen Institute Visual Coding dataset, it identifies feedforward pathways consistent with established visual hierarchies. This work offers a step toward improved identification of neural population interactions, and contributes to ongoing efforts aimed at achieving interpretable results while harvesting the benefits of deep learning.
Anatomically inspired digital twins capture hierarchical object representations in visual cortex
Emanuele Luconi · Dario Liscai · Carlo Baldassi · Alessandro Marin Vargas · Alessandro Sanzeni
Invariant object recognition, the ability to identify objects despite changes in appearance, is a hallmark of visual processing in the brain, yet understanding it remains a central challenge in systems neuroscience. Artificial neural networks trained to predict neural responses to visual stimuli (“digital twins”) could provide a powerful framework for studying such complex computations in silico. However, while current models accurately capture single-neuron responses within individual visual areas, their ability to reproduce how populations of neurons represent object identity, and how these representations transform across the cortical hierarchy, remains largely unexplored. Here we examine key functional signatures observed experimentally and find that current models account for hierarchical changes in basic single-neuron properties, such as receptive field size, but fail to capture more complex population-level phenomena, particularly invariant object representations. To address this gap, we introduce a biologically inspired hierarchical readout scheme that mirrors cortical anatomy, modeling each visual area as a projection from a distinct depth within a shared core network. This approach significantly improves the prediction of population-level representational transformations, outperforming standard models that use only the final layer, as well as alternatives with modified architecture, regularization, and loss function. Our results suggest that incorporating anatomical information provides a strong inductive bias in digital twin models, enabling them to better capture general principles of brain function.
Self supervised learning for in vivo localization of microelectrode arrays using raw local field potential
Tianxiao He · Malhar Patel · Chenyi Li · Anna Maslarova · Mihály Vöröslakos · Nalini Ramanathan · Wei-Lun Hung · Gyorgy Buzsaki · Erdem Varol
Recent advances in large-scale neural recordings have enabled accurate decoding of behavior and cognitive states, yet decoding anatomical regions remains underexplored, despite being crucial for consistent targeting in multiday recordings and effective deep brain stimulation. Current approaches typically rely on external anatomical information, from atlas-based planning to post hoc histology, which are limited in precision, longitudinal applicability, and real-time feedback. In this work, we develop a self-supervised learning framework, Lfp2vec, to infer anatomical regions directly from the neural signal in vivo. We adapt an audio-pretrained transformer model by continuing self-supervised training on a large corpus of unlabeled local-field-potential (LFP) data, then fine-tuning for anatomical region decoding. Ablations show that combining out-of-domain initialization with in-domain self-supervision outperforms training from scratch. We demonstrate that our method achieves strong zero-shot generalization across different labs and probe geometries, and outperforms state-of-the-art self-supervised models on electrophysiology data. The learned embeddings form anatomically coherent clusters and transfer effectively to downstream tasks like disease classification with minimal fine-tuning. Altogether, our approach enables zero-shot prediction of brain regions in novel subjects, demonstrates that LFP signals encode rich anatomical information, and establishes self-supervised learning on raw LFP as a foundation to learn representations that can be tuned for diverse neural decoding tasks. Code to reproduce our results is available in the GitHub repository at https://github.com/tianxiao18/lfp2vec.
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
Yihao Li · Saeed Salehi · Lyle Ungar · Konrad Kording
Object binding, the brain’s ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high‑level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90\% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but is markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of “which parts belong together” emerges naturally in a connectionist system.
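A similarity probe of the kind described can be sketched in a few lines: decide whether two patch embeddings belong to the same object from their alignment. The synthetic "patch embeddings" and the cosine threshold below are illustrative assumptions, not the paper's probe or a real ViT's features:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "patch embeddings": patches of the same object share a latent direction
obj_dirs = 3.0 * np.eye(2, 16)            # one 16-d direction per object
def patch(obj):
    return obj_dirs[obj] + 0.3 * rng.normal(size=16)

def is_same_object(e1, e2, thresh=0.5):
    """Similarity probe: declare two patches part of the same object
    when their embeddings are sufficiently aligned (cosine similarity)."""
    cos = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))
    return cos > thresh

a, b, c = patch(0), patch(0), patch(1)    # a, b: same object; c: different
```

In the paper the probe is trained on actual ViT patch embeddings; the point of the sketch is only that same-object structure, if present, is linearly readable from pairwise similarity.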
Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization
Frank Röder · Jan Benad · Manfred Eppe · Pradeep Banerjee
Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI’s latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.
Leveraging Conditional Dependence for Efficient World Model Denoising
Shaowei Zhang · Jiahan Cao · Dian Cheng · Xunlan Zhou · Shenghua Wan · Le Gan · De-Chuan Zhan
Effective denoising is critical for managing complex visual inputs contaminated with noisy distractors in model-based reinforcement learning (RL). Current methods often oversimplify the decomposition of observations by neglecting the conditional dependence between task-relevant and task-irrelevant components given an observation. To address this limitation, we introduce CsDreamer, a model-based RL approach built upon the world model of Collider-structure Recurrent State-Space Model (CsRSSM). CsRSSM incorporates colliders to comprehensively model the denoising inference process and explicitly capture the conditional dependence. Furthermore, it employs a decoupling regularization to balance the influence of this conditional dependence. By accurately inferring a task-relevant state space, CsDreamer improves learning efficiency during rollouts. Experimental results demonstrate the effectiveness of CsRSSM in extracting task-relevant information, leading to CsDreamer outperforming existing approaches in environments characterized by complex noise interference.
Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision
Shilin Zhang · Zican Hu · Wenhao Wu · Xinyi Xie · Jianxiang Tang · Chunlin Chen · Daoyi Dong · Yu Cheng · Zhenhong Sun · Zhi Wang
Offline meta-RL usually tackles generalization by inferring task beliefs from high-quality samples or warmup explorations. This restricted form limits their generality and usability since these supervision signals are expensive and even infeasible to acquire in advance for unseen tasks. Learning directly from the raw text about decision tasks is a promising alternative to leverage a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises offline meta-RL with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings to comprehend the environment dynamics. After training the text-conditioned generalist policy, the agent can directly realize zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on MuJoCo and Meta-World benchmarks show that T2DA facilitates high-capacity zero-shot generalization and outperforms various types of baselines. Our code is available at https://github.com/NJU-RL/T2DA.
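The CLIP-inspired step, predicting which textual description goes with which decision embedding, is a symmetric cross-entropy over a text-decision similarity matrix. A generic numpy sketch of that objective (the temperature and toy embeddings are illustrative, not T2DA's actual configuration):

```python
import numpy as np

def clip_loss(text_emb, dec_emb, temp=0.07):
    """Symmetric contrastive loss: the i-th text and i-th decision
    embedding are positives; all other pairings are negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    d = dec_emb / np.linalg.norm(dec_emb, axis=1, keepdims=True)
    logits = t @ d.T / temp

    def xent_diag(l):
        # cross-entropy pushing each row's probability mass onto the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

emb = np.eye(4, 8)                                   # 4 perfectly matched pairs
loss_matched = clip_loss(emb, emb)
loss_shuffled = clip_loss(emb, np.roll(emb, 1, axis=0))
```

Minimizing this loss pulls each text embedding toward its paired decision embedding, which is what lets the policy later condition on language alone.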
Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning
Baiyuan Chen · Shinji Ito · Masaaki Imaizumi
Transformers have demonstrated exceptional performance across a wide range of domains. While their ability to perform reinforcement learning in-context has been established both theoretically and empirically, their behavior in non-stationary environments remains less understood. In this study, we address this gap by showing that transformers can achieve nearly optimal dynamic regret bounds in non-stationary settings. We prove that transformers are capable of approximating strategies used to handle non-stationary environments, and can learn the approximator in the in-context learning setup. Our experiments further show that transformers can match or even outperform existing expert algorithms in such environments.
Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning
Vittorio Giammarino · Ruiqi Ni · Ahmed Qureshi
Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To address these challenges, we propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE), which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks. Code is available at https://github.com/VittorioGiammarino/Eik-HIQL.
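The geometric intuition can be illustrated with the unit-speed Eikonal condition ||∇V(s)|| = 1, which a Euclidean distance-to-goal value function satisfies exactly. The finite-difference penalty below is an illustrative stand-in for the paper's regularized loss, not its implementation:

```python
import numpy as np

def eikonal_penalty(value_fn, states, eps=1e-4):
    """Penalize deviation of ||grad V(s)|| from 1, the unit-speed
    Eikonal condition satisfied by shortest-path cost-to-go functions."""
    pen = 0.0
    for s in states:
        g = np.array([(value_fn(s + eps * e) - value_fn(s - eps * e)) / (2 * eps)
                      for e in np.eye(s.size)])
        pen += (np.linalg.norm(g) - 1.0) ** 2
    return pen / len(states)

goal = np.array([1.0, 2.0])
dist_to_goal = lambda s: np.linalg.norm(s - goal)   # satisfies ||grad V|| = 1
states = [np.array([0.0, 0.0]), np.array([3.0, -1.0])]
```

In practice the penalty is added to a temporal-difference value loss and evaluated with automatic differentiation rather than finite differences; the sketch only shows which geometric structure the regularizer rewards.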
BrainFlow: A Holistic Pathway of Dynamic Neural System on Manifold
Zhixuan Zhou · Tingting Dan · Guorong Wu
A fundamental challenge in cognitive neuroscience is understanding how cognition emerges from the interplay between structural connectivity (SC) and dynamic functional connectivity (FC) in the brain. Network neuroscience has emerged as a powerful framework to understand brain function through a holistic perspective on structure-function relationships. In this context, current machine learning approaches typically seek to establish direct mappings between structural connectivity (SC) and functional connectivity (FC) associated with specific cognitive states. However, these state-independent methods often yield inconsistent results due to overlapping brain networks across cognitive states. To address this limitation, we propose to uncover the dendritic coupling mechanism between one static SC and multiple FCs by solving a flow problem that bridges the distribution of SC to a mixed distribution of FCs, conditioned on various cognitive states, on the Riemannian manifold of symmetric positive-definite (SPD) matrices. We further prove the equivalence between flow matching on the SPD manifold and on the computationally efficient Cholesky manifold. Since a subset of functional connections is shared across cognitive states, we introduce the notion of consensus control to promote the shared kinetic structures between multiple FC-to-SC pathways via synchronized coordination, yielding a biologically meaningful underpinning of the SC-FC coupling mechanism. Together, we present BrainFlow, a reversible generative model that achieves state-of-the-art performance on not only synthetic data but also large-scale neuroimaging datasets from UK Biobank and Human Connectome Project.
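The SPD-Cholesky equivalence exploited here rests on the smooth bijection between an SPD matrix and its lower-triangular Cholesky factor, which lets flow matching run on the cheaper Cholesky side. A minimal numpy round-trip of that bijection (the toy matrix is illustrative):

```python
import numpy as np

def to_cholesky(S):
    """Map an SPD matrix (e.g. a connectivity matrix) to its
    lower-triangular Cholesky factor."""
    return np.linalg.cholesky(S)

def from_cholesky(L):
    """Inverse map: reconstruct the SPD matrix from its factor."""
    return L @ L.T

# toy SPD "connectivity" matrix
S = np.array([[2.0, 0.5], [0.5, 1.0]])
S_back = from_cholesky(to_cholesky(S))
```

Because both maps are smooth and invertible (for factors with positive diagonal), a flow learned in Cholesky coordinates induces a valid flow on the SPD manifold.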
Dimensionality Mismatch Between Brains and Artificial Neural Networks
Santiago Galella · Maren Wehrheim · Matthias Kaschube
Biological and artificial vision systems both rely on hierarchical architectures, yet it remains unclear how their representational geometry evolves across processing stages, and what functional consequences may arise from potential differences. In this work, we systematically quantify and compare the linear and non-linear dimensionality of human brain activity (fMRI) and artificial neural networks (ANNs) during natural image viewing. In the human ventral visual stream, both dimensionality measures increase along the visual hierarchy, supporting the emergence of semantic and abstract representations. For linear dimensionality, most ANNs show a similar increase, but only for pooled features, emphasizing the importance of appropriate feature readouts in brain–model comparisons. In contrast, nonlinear dimensionality shows a collapse in the later layers of ANNs, pointing at a mismatch in representational geometry between the human and artificial visual systems. This mismatch may have functional consequences: while high-dimensional brain representations support flexible generalization to abstract features, ANNs appear to lose this capacity in later layers, where their representations become overly compressed. Overall, our findings propose dimensionality alignment as a benchmark for building more flexible and biologically grounded vision models.
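One standard estimator of linear dimensionality used in such brain-model comparisons is the participation ratio of the covariance spectrum; a generic numpy sketch (the paper's exact dimensionality measures may differ):

```python
import numpy as np

def participation_ratio(X):
    """Linear dimensionality of an activity matrix X (samples x features):
    PR = (sum lambda_i)^2 / (sum lambda_i^2) over covariance eigenvalues.
    An isotropic n-d cloud gives PR ~ n; a 1-d line gives PR ~ 1."""
    X = X - X.mean(axis=0)
    ev = np.linalg.eigvalsh(np.cov(X.T))
    return ev.sum() ** 2 / (ev ** 2).sum()

rng = np.random.default_rng(0)
pr_iso = participation_ratio(rng.normal(size=(2000, 3)))               # ~3
pr_line = participation_ratio(rng.normal(size=(2000, 1)) @ np.ones((1, 3)))  # ~1
```

A collapse in later ANN layers, as reported here, would show up as this ratio (or its nonlinear analogues) dropping while the brain's keeps rising along the hierarchy.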
TractoTransformer: Diffusion MRI Streamline Tractography using CNN and Transformer Networks
Itzik Waizman · Yakov Gusakov · Itay Benou · Tammy Riklin Raviv
White matter tractography is an advanced neuroimaging technique that reconstructs the 3D white matter pathways of the brain from diffusion MRI data. It can be framed as a pathfinding problem aiming to infer neural fiber trajectories from noisy and ambiguous measurements, facing challenges such as crossing, merging, and fanning white-matter configurations. In this paper, we propose a novel tractography method that leverages Transformers to model the sequential nature of white matter streamlines, enabling the prediction of fiber directions by integrating both the trajectory context and current diffusion MRI measurements. To incorporate spatial information, we utilize CNNs that extract microstructural features from local neighborhoods around each voxel. By combining these complementary sources of information, our approach improves the precision and completeness of neural pathway mapping compared to traditional tractography models. We evaluate our method with the Tractometer toolkit, achieving competitive performance against state-of-the-art approaches, and present qualitative results on the TractoInferno dataset, demonstrating strong generalization to real-world data. The code attached to this submission will be made publicly available upon acceptance.
BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics
Zhongju Yuan · Geraint Wiggins · Dick Botteldooren
Today’s deep learning architectures are primarily based on perceptron models, which do not capture the oscillatory dynamics characteristic of biological neurons. Although oscillatory systems have recently gained attention for their closer resemblance to neural behavior, they still fall short of modeling the intricate spatio-temporal interactions observed in natural neural circuits. In this paper, we propose a bio-inspired oscillatory state system (BioOSS) designed to emulate the wave-like propagation dynamics critical to neural processing, particularly in the prefrontal cortex (PFC), where complex activity patterns emerge. BioOSS comprises two interacting populations of neurons: p neurons, which represent simplified membrane-potential-like units inspired by pyramidal cells in cortical columns, and o neurons, which govern propagation velocities and modulate the lateral spread of activity. Through local interactions, these neurons produce wave-like propagation patterns. The model incorporates trainable parameters for damping and propagation speed, enabling flexible adaptation to task-specific spatio-temporal structures. We evaluate BioOSS on both synthetic and real-world tasks, demonstrating superior performance and enhanced interpretability compared to alternative architectures.
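A toy 1-D version of the two-population idea: p units hold membrane-potential-like values, o units act like velocities that modulate lateral spread through a discrete Laplacian. The update rule and constants below are illustrative assumptions, not the paper's equations:

```python
import numpy as np

def bioss_step(p, o, dt=0.1, damping=0.05):
    """One toy update: the discrete Laplacian spreads p-activity laterally,
    o integrates it like a velocity, and damping keeps the system stable."""
    lap = np.roll(p, 1) + np.roll(p, -1) - 2.0 * p
    o = (1.0 - damping) * o + dt * lap
    p = p + dt * o
    return p, o

p, o = np.zeros(32), np.zeros(32)
p[16] = 1.0                      # localized impulse
for _ in range(10):
    p, o = bioss_step(p, o)
# activity has propagated outward to neighboring units, wave-like
```

In BioOSS the damping and propagation-speed parameters are trainable, so the network can adapt this wave dynamics to task-specific spatio-temporal structure.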
Predicting Functional Brain Connectivity with Context-Aware Deep Neural Networks
Alexander Ratzan · Sidharth Goel · Junhao Wen · Christos Davatzikos · Erdem Varol
Spatial location and molecular interactions have long been linked to the connectivity patterns of neural circuits. Yet, at the macroscale of human brain networks, the interplay between spatial position, gene expression, and connectivity remains incompletely understood. Recent efforts to map the human transcriptome and connectome have yielded spatially resolved brain atlases; however, modeling the relationship between high-dimensional transcriptomic data and connectivity while accounting for inherent spatial confounds presents a significant challenge. In this paper, we present the first deep learning approaches for predicting whole-brain functional connectivity from gene expression and regional spatial coordinates, including our proposed Spatiomolecular Transformer (SMT). SMT explicitly models biological context by tokenizing genes based on their transcription start site (TSS) order to capture multi-scale genomic organization, and incorporating regional 3D spatial location via a dedicated context [CLS] token within its multi-head self-attention mechanism. We rigorously benchmark context-aware neural networks, including SMT and a single-gene resolution Multilayer-Perceptron (MLP), against established rules-based and bilinear methods. Crucially, to ensure that learned relationships in any model are not mere artifacts of spatial proximity, we introduce novel spatiomolecular null maps, preserving both spatial and transcriptomic autocorrelation. Context-aware neural networks outperform linear methods, significantly exceed our stringent null shuffle models, and generalize across diverse connectomic datasets and parcellation resolutions. Together, these findings demonstrate a strong, predictable link between the spatial distributions of gene expression and functional brain network architecture, and establish a rigorously validated deep learning framework for decoding this relationship. Code to reproduce our results is available at: github.com/neuroinfolab/GeneEx2Conn.
$\mu$PC: Scaling Predictive Coding to 100+ Layer Networks
Francesco Innocenti · El Mehdi Achour · Christopher L Buckley
The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024). Here, we show that 100+ layer PCNs can be trained reliably using a Depth-$\mu$P parameterisation (Yang et al., 2023; Bordelon et al., 2023) which we call "$\mu$PC". By analysing the scaling behaviour of PCNs, we reveal several pathologies that make standard PCNs difficult to train at large depths. We then show that, despite addressing only some of these instabilities, $\mu$PC allows stable training of very deep (up to 128-layer) residual networks on simple classification tasks with competitive performance and little tuning compared to current benchmarks. Moreover, $\mu$PC enables zero-shot transfer of both weight and activity learning rates across widths and depths. Our results serve as a first step towards scaling PC to more complex architectures and have implications for other local algorithms. Code for $\mu$PC is made available as part of a JAX library for PCNs.
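The depth-scaling idea behind Depth-$\mu$P can be illustrated by scaling each residual branch by $1/\sqrt{L}$ so activation magnitudes stay comparable as depth grows. This is a generic sketch of that parameterisation; $\mu$PC's actual treatment of predictive-coding networks (activities, learning rates) involves more than this:

```python
import numpy as np

def residual_forward(x, weights, branch_scale):
    """Residual stack: each branch contribution is scaled before being
    added back to the skip path."""
    for W in weights:
        x = x + branch_scale * np.tanh(W @ x)
    return x

rng = np.random.default_rng(0)
n, L = 16, 64
Ws = [rng.normal(scale=1.0 / np.sqrt(n), size=(n, n)) for _ in range(L)]
x0 = rng.normal(size=n)

out_scaled = residual_forward(x0, Ws, 1.0 / np.sqrt(L))   # Depth-muP style
out_plain = residual_forward(x0, Ws, 1.0)                 # unscaled baseline
```

With the $1/\sqrt{L}$ factor, the forward signal does not blow up with depth, which is one ingredient in training 100+ layer networks reliably and transferring hyperparameters across depths.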
Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning
Viet Anh Khoa Tran · Emre Neftci · Willem Wybo
Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.
Are Large Language Models Sensitive to the Motives Behind Communication?
Addison J. Wu · Ryan Liu · Kerem Oktar · Ted Sumers · Tom Griffiths
Human communication is $\textit{motivated}$: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source---for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for $\textit{motivational vigilance}$. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely---partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement
Linyang He · Tianjun Zhong · Richard Antonello · Gavin Mischler · Micah Goldblum · Nima Mesgarani
Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly "entangled," mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing. Our work provides compelling neural evidence for an abstract reasoning computation during language comprehension and offers a robust framework for mapping distinct cognitive functions from artificial models to the human brain.
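The iterative regress-out step at the heart of the disentanglement method can be sketched with ordinary least squares: remove from a higher-level embedding whatever a lower-level one linearly explains, keeping the residual. The synthetic "lexical" and "reasoning" features below are illustrative; the paper operates on LLM layer embeddings:

```python
import numpy as np

def regress_out(target, lower):
    """Remove the best linear reconstruction from `lower`, leaving a
    residual embedding (approximately) orthogonal to the lower-level one."""
    W, *_ = np.linalg.lstsq(lower, target, rcond=None)
    return target - lower @ W

rng = np.random.default_rng(0)
lex = rng.normal(size=(200, 8))                  # "lexical" features
deep = rng.normal(size=(200, 8))                 # independent "reasoning" signal
mixed = deep + lex @ rng.normal(size=(8, 8))     # entangled embedding

resid = regress_out(mixed, lex)                  # lexical content removed
recovery = np.corrcoef(resid.ravel(), deep.ravel())[0, 1]
```

Applying this step layer by layer (lexicon, then syntax, then meaning) yields the nearly orthogonal embeddings whose unique predictive power the paper attributes to reasoning.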
Modeling Neural Activity with Conditionally Linear Dynamical Systems
Victor Geadah · Amin Nejatbakhsh · David Lipshutz · Jonathan Pillow · Alex Williams
Neural population activity exhibits complex, nonlinear dynamics, varying in time, over trials, and across experimental conditions. Here, we develop Conditionally Linear Dynamical System (CLDS) models as a general-purpose method to characterize these dynamics. These models use Gaussian Process priors to capture the nonlinear dependence of circuit dynamics on task and behavioral variables. Conditioned on these covariates, the data is modeled with linear dynamics. This allows for transparent interpretation and tractable Bayesian inference. We find that CLDS models can perform well even in severely data-limited regimes (e.g. one trial per condition) due to their Bayesian formulation and ability to share statistical power across nearby task conditions. In example applications, we apply CLDS to model thalamic neurons that nonlinearly encode heading direction and to model motor cortical neurons during a cued-reaching task.
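In a CLDS, conditioning on a covariate (say, heading direction) fixes a linear dynamics matrix, so each rollout is purely linear. A toy sketch in which the covariate sets a rotation angle; this parametric form is an illustration, not the paper's Gaussian Process prior over dynamics:

```python
import numpy as np

def A_of_u(u):
    """Covariate-dependent dynamics matrix: here, a planar rotation whose
    angle varies with the (heading-like) covariate u."""
    th = 0.1 * u
    return np.array([[np.cos(th), -np.sin(th)],
                     [np.sin(th),  np.cos(th)]])

def rollout(x0, u, T):
    """Conditioned on u, the dynamics are linear: x_{t+1} = A(u) x_t."""
    xs = [x0]
    for _ in range(T):
        xs.append(A_of_u(u) @ xs[-1])
    return np.array(xs)

traj = rollout(np.array([1.0, 0.0]), u=2.0, T=5)
```

Conditional linearity is what keeps interpretation and Bayesian inference tractable while the covariate dependence of A carries the nonlinearity.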
FAIR Universe HiggsML Uncertainty Dataset and Competition
Wahid Bhimji · Ragansu Chakkappai · Po-Wen Chang · Yuan-Tang Chou · Sascha Diefenbacher · Jordan Dudley · Ibrahim Elsharkawy · Steven Farrell · Aishik Ghosh · Cristina Giordano · Isabelle Guyon · Chris Harris · Yota Hashizume · Shih-Chieh Hsu · Elham E Khoda · Claudius Krause · Ang Li · Benjamin Nachman · David Rousseau · Robert Schöfbeck · Maryam Shooshtari · Dennis Schwarz · Ihsan Ullah · Daohan Wang · Yulei Zhang
The FAIR Universe – HiggsML Uncertainty Challenge focused on measuring the physical properties of elementary particles with imperfect simulators. Participants were required to compute and report confidence intervals for a parameter of interest regarding the Higgs boson while accounting for various systematic (epistemic) uncertainties. The dataset is a tabular dataset of 28 features and 280 million instances. Each instance represents a simulated proton-proton collision as observed at CERN’s Large Hadron Collider in Geneva, Switzerland. The features of these simulations were chosen to capture key characteristics of different types of particles. These include primary attributes, such as the energy and three-dimensional momentum of the particles, as well as derived attributes, which are calculated from the primary ones using domain-specific knowledge. Additionally, a label feature designates each instance’s type of proton-proton collision, distinguishing the Higgs boson events of interest from three background sources. As outlined in this paper, the permanent dataset release allows long-term benchmarking of new techniques. The leading submissions, including Contrastive Normalising Flows and Density Ratios estimation through classification, are described. Our challenge has brought together the physics and machine learning communities to advance our understanding and methodologies in handling systematic uncertainties within AI techniques.
ConStellaration: A dataset of QI-like stellarator plasma boundaries and optimization benchmarks
Santiago Cadena · Andrea Merlo · Emanuel Laude · Alexander Bauer · Atul Agrawal · Maria Pascu · Marija Savtchouk · Lukas Bonauer · Enrico Guiraud · Stuart Hudson · Markus Kaiser
Stellarators are magnetic confinement devices under active development to deliver steady-state carbon-free fusion energy. Their design involves a high-dimensional, constrained optimization problem that requires expensive physics simulations and significant domain expertise. Recent advances in plasma physics and open-source tools have made stellarator optimization more accessible. However, broader community progress is currently bottlenecked by the lack of standardized optimization problems with strong baselines and datasets that enable data-driven approaches, particularly for quasi-isodynamic (QI) stellarator configurations, considered a promising path to commercial fusion due to their inherent resilience to current-driven disruptions. Here, we release an open dataset of diverse QI-like stellarator plasma boundary shapes, paired with their ideal magnetohydrodynamic (MHD) equilibria and performance metrics. We generated this dataset by sampling a variety of QI fields and optimizing corresponding stellarator plasma boundaries. We introduce three optimization benchmarks of increasing complexity: (1) a single-objective geometric optimization problem, (2) a "simple-to-build" QI stellarator, and (3) a multi-objective ideal-MHD stable QI stellarator that investigates trade-offs between compactness and coil simplicity. For every benchmark, we provide reference code, evaluation scripts, and strong baselines based on classical optimization techniques. Finally, we show how learned models trained on our dataset can efficiently generate novel, feasible configurations without querying expensive physics oracles.
By openly releasing the dataset (https://huggingface.co/datasets/proxima-fusion/constellaration) along with benchmark problems and baselines (https://github.com/proximafusion/constellaration), we aim to lower the entry barrier for optimization and machine learning researchers to engage in stellarator design and to accelerate cross-disciplinary progress toward bringing fusion energy to the grid.
GlobalTomo: A global dataset for physics-ML seismic wavefield modeling and FWI
Shiqian Li · Zhi Li · Zhancun Mu · Shiji Xin · Zhixiang Dai · Kuangdai Leng · Rita Zhang · Xiaodong Song · Yixin Zhu
Global seismic tomography, taking advantage of seismic waves from natural earthquakes, provides essential insights into the earth's internal dynamics. Advanced Full-Waveform Inversion (FWI) techniques, whose aim is to meticulously interpret every detail in seismograms, confront formidable computational demands in forward modeling and adjoint simulations on a global scale. Recent advancements in Machine Learning (ML) offer a transformative potential for accelerating the computational efficiency of FWI and extending its applicability to larger scales. This work presents the first 3D global synthetic dataset tailored for seismic wavefield modeling and full-waveform tomography, referred to as the Global Tomography (GlobalTomo) dataset. This dataset is comprehensive, incorporating explicit wave physics and robust geophysical parameterization at realistic global scales, generated through state-of-the-art forward simulations optimized for 3D global wavefield calculations. Through extensive analysis and the establishment of ML baselines, we illustrate that ML approaches are particularly suitable for global FWI, overcoming its limitations with rapid forward modeling and flexible inversion strategies. This work represents a cross-disciplinary effort to enhance our understanding of the earth's interior through physics-ML modeling.
STAR: A Benchmark for Astronomical Star Fields Super-Resolution
WU KUO-CHENG · Guohang Zhuang · Jinyang Huang · Xiang Zhang · Wanli Ouyang · Yan Lu
Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop settings, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) metric to evaluate SR models from a physical perspective. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that accurately infers flux-consistent high-resolution images from input photometry, surpassing several state-of-the-art SR methods by 24.84% on a newly designed flux-consistency metric and demonstrating the suitability of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.
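The abstract does not give the exact definition of the Flux Error metric; a minimal sketch of one plausible formulation is the relative deviation of total photometric flux between the super-resolved output and the low-resolution input (the metric in the STAR benchmark may differ):

```python
import numpy as np

def flux_error(sr_image: np.ndarray, lr_image: np.ndarray) -> float:
    """Hypothetical flux-consistency metric: relative deviation of total
    counts between an SR output and its LR input. Astronomical SR should
    conserve flux, so a physically faithful model scores near zero."""
    total_sr = sr_image.sum()
    total_lr = lr_image.sum()
    return abs(total_sr - total_lr) / (abs(total_lr) + 1e-12)

# A flux-preserving 4x upsampler spreads each LR pixel's counts over a
# 4x4 block, so the total flux is unchanged and the error is ~0.
lr = np.random.rand(16, 16)
sr = np.repeat(np.repeat(lr, 4, axis=0), 4, axis=1) / 16.0
```

Under this definition, an SR model that doubles every pixel's brightness would score a flux error of 1.0 even if the image looks visually sharper, which is the kind of physical failure mode pixel-space metrics miss.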
UniFoil: A Universal Dataset of Airfoils in Transitional and Turbulent Regimes for Subsonic and Transonic Flows
Rohit Kanchi · Benjamin Melanson · Nithin Somasekharan · Shaowu Pan · Sicheng He
We present UniFoil, the largest publicly available universal airfoil database based on Reynolds-Averaged Navier–Stokes (RANS) simulations. It contains over 500,000 samples spanning a wide range of Reynolds and Mach numbers, capturing both transitional and fully turbulent flows across incompressible to compressible regimes. UniFoil is designed to support machine learning research in fluid dynamics, particularly for modeling complex aerodynamic phenomena. Most existing datasets are limited to incompressible, fully turbulent flows with smooth field characteristics, thus overlooking the critical physics of laminar–turbulent transition and shock-wave interactions—features that exhibit strong nonlinearity and sharp gradients. UniFoil fills this gap by offering a broad spectrum of realistic flow conditions. In the database, turbulent simulations utilize the Spalart–Allmaras (SA) model, while transitional flows are modeled using an $e^N$-based transition prediction method coupled with the SA model. The database includes a comprehensive geometry set comprising over 4,800 natural laminar flow (NLF) airfoils and 30,000 fully turbulent (FT) airfoils, effectively covering the diversity of airfoil designs relevant to aerospace, wind energy, and marine applications. This database is also highly valuable for scientific machine learning (SciML), enabling the development of data-driven models that more accurately capture the transport processes associated with laminar–turbulent transition. UniFoil is freely available under a permissive CC-BY-SA license.
A Physics-preserved Transfer Learning Method for Differential Equations
Hao-Ran Yang · Chuan-Xian Ren
While data-driven methods such as neural operators have achieved great success in solving differential equations (DEs), they suffer from domain shift problems caused by different learning environments (with data bias or equation changes), which can be alleviated by transfer learning (TL). However, existing TL methods adopted in DE problems lack either generalizability to general DE problems or physics preservation during training. In this work, we focus on a general transfer learning method that adaptively corrects the domain shift and preserves the physical relations within the equation. Mathematically, we characterize the data domain as a product distribution and the essential problems as distribution bias and operator bias. A Physics-preserved Optimal Tensor Transport (POTT) method that simultaneously admits generalizability to common DEs and physics preservation of the specific problem is proposed to adapt the data-driven model to the target domain, utilizing the pushforward distribution induced by the POTT map. Extensive experiments on simulated and real-world datasets demonstrate the superior performance, generalizability and physics preservation of the proposed POTT method.
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World
MingMing Yu · Fei Zhu · Wenzhuo Liu · Yirong Yang · Qunbo Wang · wenjun wu · Jing Liu
Embodied agents are expected to perform object navigation in dynamic, open-world environments. However, existing approaches typically rely on static trajectories and a fixed set of object categories during training, overlooking the real-world requirement for continual adaptation to evolving scenarios. To facilitate related studies, we introduce the continual object navigation benchmark, which requires agents to acquire navigation skills for new object categories while avoiding catastrophic forgetting of previously learned knowledge. To tackle this challenge, we propose C-Nav, a continual visual navigation framework that integrates two key innovations: (1) A dual-path anti-forgetting mechanism, which comprises feature distillation that aligns multi-modal inputs into a consistent representation space to ensure representation consistency, and feature replay that retains temporal features within the action decoder to ensure policy consistency. (2) An adaptive sampling strategy that selects diverse and informative experiences, thereby reducing redundancy and minimizing memory overhead. Extensive experiments across multiple model architectures demonstrate that C-Nav consistently outperforms existing approaches, achieving superior performance even compared to baselines with full trajectory retention, while significantly lowering memory requirements. The code will be publicly available at \url{https://bigtree765.github.io/C-Nav-project}.
Real-World Reinforcement Learning of Active Perception Behaviors
Edward Hu · Jie Wang · Xingfang Yuan · Fiona Luo · Muyao Li · Gaspard Lambrechts · Oleh Rybkin · Dinesh Jayaraman
A robot's instantaneous sensory observations do not always reveal task-relevant state information. Under such partial observability, optimal behavior typically involves explicitly acting to gain the missing information. Today's standard robot learning techniques struggle to produce such active perception behaviors. We propose a simple real-world robot learning recipe to efficiently train active perception policies. Our approach, asymmetric advantage weighted regression (AAWR), exploits access to "privileged" extra sensors at training time. The privileged sensors enable training high-quality privileged value functions that aid in estimating the advantage of the target policy. Bootstrapping from a small number of potentially suboptimal demonstrations and an easy-to-obtain coarse policy initialization, AAWR quickly acquires active perception behaviors and boosts task performance. In evaluations on 8 manipulation tasks on 3 robots spanning varying degrees of partial observability, AAWR synthesizes reliable active perception behaviors that outperform all prior approaches. When initialized with a "generalist" robot policy that struggles with active perception tasks, AAWR efficiently generates information-gathering behaviors that allow it to operate under severe partial observability for manipulation tasks. Website: https://penn-pal-lab.github.io/aawr/
Dynamic Focused Masking for Autoregressive Embodied Occupancy Prediction
Yuan Sun · Julio Contreras · Jorge Ortiz
Visual autoregressive modeling has recently demonstrated potential in image tasks by enabling coarse-to-fine, next-level prediction. Most indoor 3D occupancy prediction methods, however, continue to rely on dense voxel grids and convolution-heavy backbones, which incur high computational costs when applying such coarse-to-fine frameworks. In contrast, cost-efficient alternatives based on Gaussian representations—particularly in the context of multi-scale autoregression—remain underexplored. To bridge this gap, we propose DFGauss, a dynamic focused masking framework for multi-scale 3D Gaussian representation. Unlike conventional approaches that refine voxel volumes or 2D projections, DFGauss directly operates in the 3D Gaussian parameter space, progressively refining representations across resolutions under hierarchical supervision. Each finer-scale Gaussian is conditioned on its coarser-level counterpart, forming a scale-wise autoregressive process. To further enhance efficiency, we introduce an importance-guided refinement strategy that selectively propagates informative Gaussians across scales, enabling spatially adaptive detail modeling. Experiments on 3D occupancy benchmarks demonstrate that DFGauss achieves competitive performance, highlighting the promise of autoregressive modeling for scalable 3D occupancy prediction.
Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation
Yuyang Li · Wenxin Du · Chang Yu · Puhao Li · Zihang Zhao · Tengyu Liu · Chenfanfu Jiang · Yixin Zhu · Siyuan Huang
Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. As a promising solution, Vision-based Tactile Sensors (VBTSs) offer high spatial resolution and cost-effectiveness, but present unique challenges in robotics for their complex physical characteristics and visual signal processing requirements. The lack of efficient and accurate simulation tools for VBTSs has significantly limited the scale and scope of tactile robotics research. We present Taccel, a high-performance simulation platform that integrates Incremental Potential Contact (IPC) and Affine Body Dynamics (ABD) to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving a total of 915 FPS with 4096 parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development, potentially transforming how robots interact with and understand their physical environment.
MutualVPR: A Mutual Learning Framework for Resolving Supervision Inconsistencies via Adaptive Clustering
Qiwen Gu · Xufei Wang · Junqiao Zhao · Siyue Tao · Tiantian Feng · Ziqiao Wang · Guang Chen
Visual Place Recognition (VPR) enables robust localization through image retrieval based on learned descriptors. However, drastic appearance variations of images at the same place caused by viewpoint changes can lead to inconsistent supervision signals, thereby degrading descriptor learning. Existing methods either rely on manually defined cropping rules or labeled data for view differentiation, but they suffer from two major limitations: (1) reliance on labels or handcrafted rules restricts generalization capability; (2) even within the same view direction, occlusions can introduce feature ambiguity. To address these issues, we propose MutualVPR, a mutual learning framework that integrates unsupervised view self-classification and descriptor learning. We first group images by geographic coordinates, then iteratively refine the clusters using K-means to dynamically assign place categories without manual labeling. Specifically, we adopt a DINOv2-based encoder to initialize the clustering. During training, the encoder and clustering co-evolve, progressively separating drastic appearance variations of the same place and enabling consistent supervision. Furthermore, we find that capturing fine-grained image differences at a place enhances robustness. Experiments demonstrate that MutualVPR achieves state-of-the-art (SOTA) performance across multiple datasets, validating the effectiveness of our framework in improving view-direction generalization and occlusion robustness.
CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification
Wei Li · Renshan Zhang · Rui Shao · Jie He · Liqiang Nie
Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. Existing sparsification strategies—such as Mixture-of-Depths, layer skipping, and early exit—fall short by neglecting the semantic coupling across vision-language-action modalities, and focusing narrowly on intra-LLM computation while overlooking end-to-end coherence from perception to control. To address these challenges, we propose **CogVLA**, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) **Encoder-FiLM based Aggregation Routing (EFA-Routing)** injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, **LLM-FiLM based Pruning Routing (LFP-Routing)** introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce **V‑L‑A Coupled Attention (CAtten)**, which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4\% and 70.0\%, respectively, while reducing training costs by 2.5$\times$ and decreasing inference latency by 2.8$\times$ compared to OpenVLA.
Towards Reliable LLM-based Robots Planning via Combined Uncertainty Estimation
Shiyuan Yin · Chenjia Bai · Zihao Zhang · Junwei Jin · Xinxin Zhang · Chi Zhang · Xuelong Li
Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. While researchers have explored uncertainty estimation to improve the reliability of LLM-based planning, existing studies have not sufficiently differentiated between epistemic and intrinsic uncertainty, limiting the effectiveness of uncertainty estimation. In this paper, we present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately. Furthermore, epistemic uncertainty is subdivided into task clarity and task familiarity for more accurate evaluation. The overall uncertainty assessments are obtained using random network distillation and multi-layer perceptron regression heads driven by LLM features. We validated our approach in two distinct experimental settings: kitchen manipulation and tabletop rearrangement experiments. The results show that, compared to existing methods, our approach yields uncertainty estimates that are more closely aligned with the actual execution outcomes. The code is at https://github.com/Firesuiry/CURE.
Fast-in-Slow: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning
Hao Chen · Jiaming Liu · Chenyang Gu · Zhuoyang Liu · Renrui Zhang · Xiaoqi Li · Xiao He · Yandong Guo · Chi-Wing Fu · Shanghang Zhang · Pheng-Ann Heng
Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches have been proposed to leverage a VLM-based System 2 module for handling high-level decision-making, and a separate System 1 action module for ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge from the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1, but also facilitates coordination between multimodal reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, a dual-aware co-training strategy is proposed that equips System 1 with action generation capabilities while preserving System 2’s contextual understanding to provide stable latent conditions for System 1. For evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: https://fast-in-slow.github.io.
Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
Zhizhen Zhang · Lei Zhu · Zhen Fang · Zi Huang · Yadan Luo
Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments in the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments on both simulated and real robots show that the pretrained features significantly enhance downstream manipulation tasks with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents. Our project page is at https://actol-pretrain.github.io/.
DEAL: Diffusion Evolution Adversarial Learning for Sim-to-Real Transfer
Wentao Xu · Huiqiao Fu · Haoyu Dong · Zhehao Zhou · Chunlin Chen
Training Reinforcement Learning (RL) controllers in simulation offers cost-efficiency and safety advantages. However, the resultant policies often suffer significant performance degradation during real-world deployment due to the reality gap. Previous works like System Identification (Sys-Id) have attempted to bridge this discrepancy by improving simulator fidelity, but encounter challenges including the collapse of high-dimensional parameter identification, low identification accuracy, and unstable convergence dynamics. To address these challenges, we propose a novel Sys-Id framework that combines Diffusion Evolution with Adversarial Learning (DEAL) to iteratively infer physical parameters with limited real-world data, which makes the state transitions between simulation and reality as similar as possible. Specifically, our method iteratively refines physical parameters through a dual mechanism: a discriminator network evaluates the similarity of state transitions between parameterized simulations and target environment as fitness guidance, while diffusion evolution adaptively modulates noise prediction and denoising processes to optimize parameter distributions. We validate DEAL in both simulated and real-world environments. Compared to baseline methods, DEAL demonstrates state-of-the-art stability and identification accuracy in high-dimensional parameter identification tasks, and significantly enhances sim-to-real transfer performance while requiring minimal real-world data.
Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective
Sifan Wang · Ananyae bhartari · Bowen Li · Paris Perdikaris
Physics-informed neural networks (PINNs) have shown significant promise in computational science and engineering, yet they often face optimization challenges and limited accuracy. In this work, we identify directional gradient conflicts during PINN training as a critical bottleneck. We introduce a novel gradient alignment score to systematically diagnose this issue through both theoretical analysis and empirical experiments. Building on these insights, we show that (quasi) second-order optimization methods inherently mitigate gradient conflicts, thereby consistently outperforming the widely used Adam optimizer. Among them, we highlight the effectiveness of SOAP \cite{vyas2024soap} by establishing its connection to Newton’s method. Empirically, SOAP achieves state-of-the-art results on 10 challenging PDE benchmarks, including the first successful application of PINNs to turbulent flows at Reynolds numbers up to 10,000. It yields 2–10x accuracy improvements over existing methods while maintaining computational scalability, advancing the frontier of neural PDE solvers for real-world, multi-scale physical systems. All code and datasets used in this work are publicly available at: \url{https://github.com/PredictiveIntelligenceLab/jaxpi/tree/pirate}.
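The paper's exact alignment score is not given in the abstract; a minimal sketch of the underlying idea is the cosine similarity between the parameter gradients of two PINN loss terms (e.g., the PDE-residual loss and the boundary loss), where negative values indicate a directional gradient conflict:

```python
import numpy as np

def gradient_alignment(grads_a, grads_b):
    """Cosine similarity between two flattened parameter-gradient lists.
    A plausible stand-in for the paper's gradient alignment score (the
    exact definition there may differ): values near +1 mean the two loss
    terms pull the parameters in the same direction, while negative
    values signal a directional conflict."""
    ga = np.concatenate([g.ravel() for g in grads_a])
    gb = np.concatenate([g.ravel() for g in grads_b])
    denom = np.linalg.norm(ga) * np.linalg.norm(gb) + 1e-12
    return float(ga @ gb / denom)

# Toy example: per-layer gradients of a residual loss vs. a boundary loss.
g_residual = [np.array([1.0, 0.0]), np.array([0.5])]
g_boundary = [np.array([-1.0, 0.0]), np.array([0.5])]
score = gradient_alignment(g_residual, g_boundary)  # conflicting here
```

In a training loop this would be computed from per-loss backward passes; a persistently negative score is the diagnostic that motivates switching from Adam to a (quasi) second-order method.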
SpiderSolver: A Geometry-Aware Transformer for Solving PDEs on Complex Geometries
KAI QI · Fan Wang · Zhewen Dong · Jian Sun
Transformers have demonstrated effectiveness in solving partial differential equations (PDEs). However, extending them to solve PDEs on complex geometries remains a challenge. In this work, we propose SpiderSolver, a geometry-aware transformer that introduces spiderweb tokenization for handling complex domain geometry and irregularly discretized points. Our method partitions the irregular spatial domain into spiderweb-like patches, guided by the domain boundary geometry. SpiderSolver leverages a coarse-grained attention mechanism to capture global interactions across spiderweb tokens and a fine-grained attention mechanism to refine feature interactions between the domain boundary and its neighboring interior points. We evaluate SpiderSolver on PDEs with diverse domain geometries across seven datasets, including cars, airfoils, blood flow in the human thoracic aorta, as well as canonical cases governed by the Navier-Stokes, Darcy flow, elasticity, and plasticity equations. Experimental results demonstrate that SpiderSolver consistently achieves state-of-the-art performance across different datasets and metrics, with better generalization ability in the OOD setting. The code is available at https://github.com/Kai-Qi/SpiderSolver.
GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations
Fabian Paischer · Gianluca Galletti · William Hornsby · Paul Setinek · Lorenzo Zanisi · Naomi Carey · Stanislas Pamela · Johannes Brandstetter
Nuclear fusion plays a pivotal role in the quest for reliable and sustainable energy production. A major roadblock to viable fusion power is understanding plasma turbulence, which significantly impairs plasma confinement, and is vital for next generation reactor design. Plasma turbulence is governed by the nonlinear gyrokinetic equation, which evolves a 5D distribution function over time. Due to its high computational cost, reduced-order models are often employed in practice to approximate turbulent transport of energy. However, they omit nonlinear effects unique to the full 5D dynamics. To tackle this, we introduce GyroSwin, the first scalable 5D neural surrogate that can model 5D nonlinear gyrokinetic simulations, thereby capturing the physical phenomena neglected by reduced models, while providing accurate estimates of turbulent heat transport. GyroSwin (i) extends hierarchical Vision Transformers to 5D, (ii) introduces cross-attention and integration modules for latent 3D$\leftrightarrow$5D interactions between electrostatic potential fields and the distribution function, and (iii) performs channelwise mode separation inspired by nonlinear physics. We demonstrate that GyroSwin outperforms widely used reduced numerics on heat flux prediction, captures the turbulent energy cascade, and reduces the cost of fully resolved nonlinear gyrokinetics by three orders of magnitude while remaining physically verifiable. GyroSwin shows promising scaling laws, tested up to one billion parameters, paving the way for scalable neural surrogates for gyrokinetic simulations of plasma turbulence.
SymMaP: Improving Computational Efficiency in Linear Solvers through Symbolic Preconditioning
Hong Wang · Jie Wang · Minghao Ma · Haoran Shao · Haoyang Liu
Matrix preconditioning is a critical technique to accelerate the solution of linear systems, where performance heavily depends on the selection of preconditioning parameters. Traditional parameter selection approaches often define fixed constants for specific scenarios. However, they rely on domain expertise and fail to consider the instance-wise features for individual problems, limiting their performance. In contrast, machine learning (ML) approaches, though promising, are hindered by high inference costs and limited interpretability. To combine the strengths of both approaches, we propose a symbolic discovery framework—namely, Symbolic Matrix Preconditioning (SymMaP)—to learn efficient symbolic expressions for preconditioning parameters. Specifically, we employ a neural network to search the high-dimensional discrete space for expressions that can accurately predict the optimal parameters. The learned expression allows for high inference efficiency and excellent interpretability (expressed in concise symbolic formulas), making it simple and reliable for deployment. Experimental results show that SymMaP consistently outperforms traditional strategies across various benchmarks.
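As an illustration of the kind of concise, interpretable expression SymMaP aims to discover (the expressions it actually learns are data-driven and not given in the abstract), the classical closed form for the optimal SOR relaxation parameter on the model Poisson problem is itself a symbolic formula of one matrix feature:

```python
import numpy as np

def sor_omega(rho_jacobi: float) -> float:
    """Classical optimal SOR relaxation parameter for consistently
    ordered matrices: omega* = 2 / (1 + sqrt(1 - rho^2)), where rho is
    the spectral radius of the Jacobi iteration matrix. Shown here only
    as an example of a symbolic preconditioning-parameter expression;
    SymMaP searches for such formulas automatically."""
    return 2.0 / (1.0 + np.sqrt(1.0 - rho_jacobi ** 2))
```

Evaluating a formula like this at inference time costs a handful of arithmetic operations, which is the efficiency and interpretability advantage the abstract claims over a neural-network forward pass.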
Learning non-equilibrium diffusions with Schrödinger bridges: from exactly solvable to simulation-free
Stephen Zhang · Michael Stumpf
We consider the Schrödinger bridge problem which, given ensemble measurements of the initial and final configurations of a stochastic dynamical system and some prior knowledge on the dynamics, aims to reconstruct the "most likely" evolution of the system compatible with the data. Most existing literature assumes Brownian reference dynamics and is implicitly limited to modelling systems driven by the gradient of a potential energy. We depart from this regime and consider reference processes described by a multivariate Ornstein-Uhlenbeck process with generic drift matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$. When $\mathbf{A}$ is asymmetric, this corresponds to a non-equilibrium system in which non-gradient forces are at play: this is important for applications to biological systems, which naturally exist out-of-equilibrium. In the case of Gaussian marginals, we derive explicit expressions that characterise exactly the solution of both the static and dynamic Schrödinger bridge. For general marginals, we propose mvOU-OTFM, a simulation-free algorithm based on flow and score matching for learning an approximation to the Schrödinger bridge. In application to a range of problems based on synthetic and real single cell data, we demonstrate that mvOU-OTFM achieves higher accuracy compared to competing methods, whilst being significantly faster to train.
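The reference process here is the linear SDE $\mathrm{d}X_t = -\mathbf{A} X_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t$; a minimal Euler–Maruyama sketch (illustrative only, not the authors' code) shows how an asymmetric drift matrix introduces the rotational, non-gradient forcing the abstract refers to:

```python
import numpy as np

def simulate_mv_ou(A, sigma, x0, dt=1e-3, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dX_t = -A X_t dt + sigma dW_t.
    When A is asymmetric, the drift field -A x has a curl component,
    so the stationary process is out of equilibrium (non-gradient
    forces), matching the reference-process setting of mvOU-OTFM."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - (A @ x) * dt + sigma * np.sqrt(dt) * noise
        path.append(x.copy())
    return np.stack(path)

# Asymmetric drift: eigenvalues 1 +/- 2i -> decaying spirals, not
# straight relaxation toward the origin as a symmetric A would give.
A = np.array([[1.0, -2.0],
              [2.0,  1.0]])
path = simulate_mv_ou(A, sigma=0.1, x0=[1.0, 0.0])
```

The symmetric part of $\mathbf{A}$ controls relaxation and the antisymmetric part the rotation; only the former is expressible as the gradient of a potential, which is why generic $\mathbf{A}$ strictly extends the Brownian/gradient setting.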
Local-Global Associative Frames for Symmetry-Preserving Crystal Structure Modeling
Haowei Hua · Wanyu Lin
Crystal structures are defined by the periodic arrangement of atoms in 3D space, inherently making them equivariant to the SO(3) group. A fundamental requirement for crystal property prediction is that the model's output should remain invariant to arbitrary rotational transformations of the input structure. One promising strategy to achieve this invariance is to align the given crystal structure into a canonical orientation with appropriately computed rotations, also called frames. However, existing work either only considers a global frame or solely relies on more advanced local frames based on atoms' local structure. A global frame is too coarse to capture the local structure heterogeneity of the crystal, while local frames may inadvertently disrupt crystal symmetry, limiting their expressivity. In this work, we revisit the frame design problem for crystalline materials and propose a novel approach to construct expressive {\bf S}ymmetry Preserving Frames, dubbed SPFrame, for modeling crystal structures. Specifically, this local-global associative frame constructs invariant local frames rather than equivariant ones, thereby preserving the symmetry of the crystal. In parallel, it integrates global structural information to construct an equivariant global frame to enforce SO(3) invariance. Extensive experimental results demonstrate that SPFrame consistently outperforms traditional frame construction techniques and existing crystal property prediction baselines across multiple benchmark tasks.
Hybrid Boundary Physics-Informed Neural Networks for Solving Navier-Stokes Equations with Complex Boundary
ChuYu Zhou · Tianyu Li · Chenxi Lan · Rongyu Du · Guoguo Xin · Wei Li · Guoqing Wang · Xun Liu · Hangzhou Yang
Physics-informed neural networks (PINN) have achieved notable success in solving partial differential equations (PDE), yet solving the Navier-Stokes equations (NSE) with complex boundary conditions remains a challenging task. In this paper, we introduce a novel Hybrid Boundary PINN (HB-PINN) method that combines a pretrained network for efficient initialization with a boundary-constrained mechanism. The HB-PINN method features a primary network focused on inner domain points and a distance metric network that enhances predictions at the boundaries, ensuring accurate solutions for both boundary and interior regions. Comprehensive experiments have been conducted on the NSE under complex boundary conditions, including the 2D cylinder wake flow and the 2D blocked cavity flow with a segmented inlet. The proposed method achieves state-of-the-art (SOTA) performance on these benchmark scenarios, demonstrating significantly improved accuracy over existing PINN-based approaches.
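The boundary-constrained mechanism described above can be illustrated with the standard hard-constraint construction, in which a distance function that vanishes on the boundary blends the primary network with the prescribed boundary values. This is a minimal sketch consistent with the abstract's description, not HB-PINN's actual architecture; `net`, `boundary_fn`, and `dist_fn` are stand-in callables.

```python
import numpy as np

def hybrid_prediction(net, boundary_fn, dist_fn, x):
    """Blend a primary network with a boundary term so the output
    satisfies the boundary condition exactly:
        u(x) = g(x) + d(x) * N(x),
    where d(x) = 0 on the boundary. All three callables are
    hypothetical stand-ins for illustration."""
    return boundary_fn(x) + dist_fn(x) * net(x)

# Example on the interval [0, 1] with u = 1 on both endpoints:
x = np.array([0.0, 0.25, 1.0])
u = hybrid_prediction(net=lambda x: np.sin(x) + 3.0,
                      boundary_fn=lambda x: np.ones_like(x),
                      dist_fn=lambda x: x * (1 - x),
                      x=x)
```

Because `d(0) = d(1) = 0`, the prediction equals the boundary value at the endpoints regardless of what the inner network outputs, so the PDE loss only needs to fit the interior.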
Hybrid Latent Representations for PDE Emulation
Ali Can Bekar · Siddhant Agarwal · Christian Hüttig · Nicola Tosi · David Greenberg
For classical PDE solvers, adjusting the spatial resolution and time step offers a trade-off between speed and accuracy. Neural emulators often achieve better speed-accuracy trade-offs by operating accurately on a compact representation of the PDE system. Coarsened PDE fields are a simple and effective representation, but cannot exploit fine spatial scales in the high-fidelity numerical solutions. Alternatively, unstructured latent representations provide efficient autoregressive rollouts, but cannot enforce local interactions or physical laws as inductive biases. To overcome these limitations, we introduce hybrid representations that augment coarsened PDE fields with spatially structured latent variables extracted from high-resolution inputs. Hybrid representations provide efficient rollouts, can be trained on a simple loss defined on coarsened PDE fields, and support hard physical constraints. When predicting fine- and coarse-scale features across multiple PDE emulation tasks, they outperform or match the speed-accuracy trade-offs of the best convolutional, attentional, Fourier operator-based and autoencoding baselines.
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper
Xinyue Zhu · Binghao Huang · Yunzhu Li
Handheld grippers are increasingly used to collect human demonstrations due to their ease of deployment and versatility. However, most existing designs lack tactile sensing, despite the critical role of tactile feedback in precise manipulation. We present a portable, lightweight gripper with integrated tactile sensors that enables synchronized collection of visual and tactile data in diverse, real-world, and in-the-wild settings. Building on this hardware, we propose a cross-modal representation learning framework that integrates visual and tactile signals while preserving their distinct characteristics. The learning procedure allows the emergence of interpretable representations that consistently focus on contacting regions relevant for physical interactions. When used for downstream manipulation tasks, these representations enable more efficient and effective policy learning, supporting precise robotic manipulation based on multimodal feedback. We validate our approach on fine-grained tasks such as test tube insertion and pipette-based fluid transfer, demonstrating improved accuracy and robustness under external disturbances. Our project page is available at \url{https://binghao-huang.github.io/touchinthe_wild/}.
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Qiao Gu · Yuanliang Ju · Shengxiang Sun · Igor Gilitschenski · Haruki Nishimura · Masha Itkina · Florian Shkurti
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/
PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning
Xiaogang Jia · Qian Wang · Anrui Wang · Han Wang · Balázs Gyenes · Emiliyan Gospodinov · Xinkai Jiang · Ge Li · Hongyi Zhou · Weiran Liao · Xi Huang · Maximilian Beck · Moritz Reuss · Rudolf Lioutikov · Gerhard Neumann
Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, hindering their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Moreover, because the points are structured in a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: https://point-map.github.io/Point-Map/
$\textit{Hyper-GoalNet}$: Goal-Conditioned Manipulation Policy Learning with HyperNetworks
Pei Zhou · Wanting Yao · Qian Luo · Xunzhe Zhou · Yanchao Yang
Goal-conditioned policy learning for robotic manipulation presents significant challenges in maintaining performance across diverse objectives and environments. We introduce Hyper-GoalNet, a framework that generates task-specific policy network parameters from goal specifications using hypernetworks. Unlike conventional methods that simply condition fixed networks on goal-state pairs, our approach separates goal interpretation from state processing -- the former determines network parameters while the latter applies these parameters to current observations. To enhance representation quality for effective policy generation, we implement two complementary constraints on the latent space: (1) a forward dynamics model that promotes state transition predictability, and (2) a distance-based constraint ensuring monotonic progression toward goal states. We evaluate our method on a comprehensive suite of manipulation tasks with varying environmental randomization. Results demonstrate significant performance improvements over state-of-the-art methods, particularly in high-variability conditions. Real-world robotic experiments further validate our method's robustness to sensor noise and physical uncertainties. Our code and trained models will be made publicly available.
Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies
Ziye Wang · Li Kang · Yiran Qin · Jiahua Ma · zhanglin peng · LEI BAI · Ruimao Zhang
Despite significant advances in robotic policy generation, effective coordination in embodied multi-agent systems remains a fundamental challenge—particularly in scenarios where agents must balance individual perspectives with global environmental awareness. Existing approaches often struggle to balance fine-grained local control with comprehensive scene understanding, resulting in limited scalability and compromised collaboration quality. In this paper, we present GauDP, a novel Gaussian-image synergistic representation that facilitates scalable, perception-aware imitation learning in multi-agent collaborative systems. Specifically, GauDP reconstructs a globally consistent 3D Gaussian field from local-view RGB images, allowing all agents to dynamically query task-relevant features from a shared scene representation. This design facilitates both fine-grained control and globally coherent behavior without requiring additional sensing modalities. We evaluate GauDP on the RoboFactory benchmark, which includes diverse multi-arm manipulation tasks. Our method achieves superior performance over existing image-based methods and approaches the effectiveness of point-cloud-driven methods, while maintaining strong scalability as the number of agents increases. Extensive ablations and visualizations further demonstrate the robustness and efficiency of our unified local-global perception framework for multi-agent embodied learning.
Human-assisted Robotic Policy Refinement via Action Preference Optimization
Wenke Xia · Yichu Yang · Hongtao Wu · Xiao Ma · Tao Kong · Di Hu
Establishing a reliable and iteratively refined robotic system is essential for deploying real-world applications. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their reliance on offline expert demonstrations critically limits their capacity for post-deployment refinement. To mitigate this limitation, we introduce Action Preference Optimization (APO), a method designed to refine VLA models via human-assisted preference alignment gathered through interaction with environments. This method begins with a human-robot collaboration framework for reliable failure correction and interaction trajectory collection through human intervention. However, directly leveraging these interaction trajectories for preference optimization is non-trivial due to the challenges of irreversible robotic actions and token distribution mismatch. To solve this, APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction, empowering VLA models to effectively suppress failure-prone actions while enhancing corrective action adaptation. Ultimately, APO equips VLA models with the crucial capability to learn from failure, paving the way for their iterative refinement and reliable deployment in dynamic environments. Experiments conducted in simulation and real-world scenarios demonstrate the superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. We believe this work could bring insights for efficient and stable optimization of VLA models through human-robot collaboration. The code and dataset are released at https://github.com/GeWu-Lab/Action-Preference-Optimization.
A Practical Guide for Incorporating Symmetry in Diffusion Policy
Dian Wang · Boce Hu · Shuran Song · Robin Walters · Robert Platt
Recently, equivariant neural networks for policy learning have shown promising improvements in sample efficiency and generalization, however, their wide adoption faces substantial barriers due to implementation complexity. Equivariant architectures typically require specialized mathematical formulations and custom network design, posing significant challenges when integrating with modern policy frameworks like diffusion-based models. In this paper, we explore a number of straightforward and practical approaches to incorporate symmetry benefits into diffusion policies without the overhead of full equivariant designs. Specifically, we investigate (i) invariant representations via relative trajectory actions and eye-in-hand perception, (ii) integrating equivariant vision encoders, and (iii) symmetric feature extraction with pretrained encoders using Frame Averaging. We first prove that combining eye-in-hand perception with relative or delta action parameterization yields inherent SE(3)-invariance, thus improving policy generalization. We then perform a systematic experimental study on those design choices for integrating symmetry in diffusion policies, and conclude that an invariant representation with equivariant feature extraction significantly improves the policy performance. Our method achieves performance on par with or exceeding fully equivariant architectures while greatly simplifying implementation.
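The symmetrized-feature idea in (iii) can be sketched by averaging an encoder's outputs over a group of transformed inputs, which makes the result invariant to that group. This toy uses a fixed C4 rotation subgroup as the "frame" purely for illustration; proper Frame Averaging computes data-dependent frames (e.g., from PCA) rather than a fixed group, and `encoder` here is a hypothetical stand-in.

```python
import numpy as np

def frame_average(encoder, points):
    """Average an encoder over the four planar rotations (C4).
    For any encoder, the averaged feature is invariant to rotating
    the input by a multiple of 90 degrees, since composing with a
    group element only permutes the terms of the average.
    Illustrative sketch, not the paper's Frame Averaging recipe."""
    feats = []
    for k in range(4):
        theta = k * np.pi / 2
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        feats.append(encoder(points @ R.T))  # encode rotated copy
    return np.mean(feats, axis=0)
```

Rotating the input points by 90 degrees leaves the averaged feature unchanged, which is exactly the invariance property that symmetric feature extraction supplies to a downstream diffusion policy.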
A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Arnav Kumar Jain · Vibhakar Mohta · Subin Kim · Atiksh Bhardwaj · Juntao Ren · Yunhai Feng · Sanjiban Choudhury · Gokul Swamy
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake that takes it out of the support of the demonstrations, it often doesn't know how to recover. In this sense, BC is akin to giving the agent a fish -- providing dense supervision across a narrow set of states -- rather than teaching it to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR.
DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding
Yue Jiang · Jichu Li · Yang Liu · Dingkang Yang · Feng Zhou · Quyu Kong
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods’ ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench.
MAESTRO : Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series
Payal Mohapatra · Yueyuan Sui · Akash Pandey · Stephen Xia · Qi Zhu
From clinical healthcare to daily living, continuous sensor monitoring across multiple modalities has shown great promise for real-world intelligent decision-making but also faces various challenges. In this work, we argue for modeling such heterogeneous data sources under the multimodal paradigm and introduce MAESTRO, a novel framework that overcomes key limitations of existing multimodal learning approaches: (1) reliance on a single primary modality for alignment, (2) pairwise modeling of modalities, and (3) assumption of complete modality observations. These limitations hinder the applicability of these approaches in real-world multimodal time-series settings, where primary modality priors are often unclear, the number of modalities can be large (making pairwise modeling impractical), and sensor failures often result in arbitrary missing observations. At its core, MAESTRO facilitates dynamic intra- and cross-modal interactions based on task relevance, and leverages symbolic tokenization and adaptive attention budgeting to construct long multimodal sequences, which are processed via sparse cross-modal attention. The resulting cross-modal tokens are routed through a sparse Mixture-of-Experts (MoE) mechanism, enabling black-box specialization under varying modality combinations. We evaluate MAESTRO against 10 baselines on four diverse datasets spanning three applications, and observe average relative improvements of 4% and 8% over the best existing multimodal and multivariate approaches, respectively, under complete observations. Under partial observations—with up to 40% of missing modalities—MAESTRO achieves an average 9% improvement. Further analysis also demonstrates the robustness and efficiency of MAESTRO's sparse, modality-aware design for learning from dynamic time series.
Delving into Large Language Models for Effective Time-Series Anomaly Detection
JUN WOO PARK · Kyudan Jung · Dohyun Lee · Hyuck Lee · DAEHOON GWAK · ChaeHun Park · Jaegul Choo · Jaewoong Cho
Recent efforts to apply Large Language Models (LLMs) to time-series anomaly detection (TSAD) have yielded limited success, often performing worse than even simple methods. While prior work has focused solely on downstream performance evaluation, the fundamental question—why do LLMs struggle with TSAD?—has remained largely unexplored. In this paper, we present an in-depth analysis that identifies two core challenges in understanding complex temporal dynamics and accurately localizing anomalies. To address these challenges, we propose a simple yet effective method that combines statistical decomposition with index-aware prompting. Our method outperforms 21 existing prompting strategies on the AnomLLM benchmark, achieving up to a 66.6\% improvement in F1 score. We further compare LLMs with 16 non-LLM baselines on the TSB-AD benchmark, highlighting scenarios where LLMs offer unique advantages via contextual reasoning. Our findings provide empirical insights into how and when LLMs can be effective for TSAD. The code is publicly available at: https://github.com/junwoopark92/LLM-TSAD
Eulerian Neural Network Informed by Chemical Transport for Air Quality Forecasting
Xukai Zhang · Shuliang Wang · Guangyin Jin · Ziqiang Yuan · Hanning Yuan · Sijie Ruan
Air pollution remains one of the most critical environmental challenges globally, posing severe threats to public health, ecological sustainability, and climate governance. While existing physics-based and data-driven models have made progress in air quality forecasting, they often struggle to jointly capture the complex spatiotemporal dynamics and ensure spatial continuity of pollutant distributions. In this study, we introduce CTENet, a novel chemical transport deep learning model that embeds the Advection-Diffusion-Reaction equation into a Physics-Informed Neural Network (PINN) framework using an Eulerian representation to model the spatiotemporal evolution of pollutants. Extensive experiments on two real-world datasets demonstrate that CTENet consistently outperforms state-of-the-art (SOTA) baselines, achieving a remarkable RMSE improvement of 45.8% on the USA dataset and 21.0% on the China dataset.
AliO: Output Alignment Matters in Long-Term Time Series Forecasting
Kwangryeol Park · Jaeho Kim · Seulki Lee
Long-term Time Series Forecasting (LTSF) tasks, which leverage the current data sequence as input to predict the future sequence, have become increasingly crucial in real-world applications such as weather forecasting and electricity consumption planning. However, state-of-the-art LTSF models often fail to achieve prediction output alignment for the same timestamps across lagged input sequences. Instead, these models exhibit low output alignment, resulting in fluctuations in prediction outputs for the same timestamps, undermining the model's reliability. To address this, we propose AliO (Align Outputs), a novel approach designed to improve the output alignment of LTSF models by reducing the discrepancies between prediction outputs for the same timestamps in both the time and frequency domains. To measure output alignment, we introduce a new metric, TAM (Time Alignment Metric), which quantifies the alignment between prediction outputs, whereas existing metrics such as MSE only capture the distance between prediction outputs and ground truths. Experimental results show that AliO effectively improves the output alignment, i.e., up to 58.2\% in TAM, while maintaining or enhancing the forecasting performance (up to 27.5\%). This improved output alignment increases the reliability of the LTSF models, making them more applicable in real-world scenarios. The code implementation is on an anonymous GitHub repository.
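The output-alignment idea can be sketched by forecasting from two lagged input windows and measuring the disagreement on the timestamps both forecasts cover. This is a minimal MSE-style sketch of the concept, not the paper's TAM definition; `model`, `horizon`, and `lag` are stand-in assumptions.

```python
import numpy as np

def output_alignment_gap(model, series, horizon, lag=1):
    """Mean squared disagreement between forecasts of the SAME
    timestamps made from two input windows offset by `lag` steps.
    A perfectly aligned forecaster scores 0. Illustrative sketch,
    not the TAM metric from the paper."""
    pred_a = model(series[:-lag])   # forecasts t+1 .. t+horizon
    pred_b = model(series)          # forecasts t+lag+1 .. t+lag+horizon
    # Overlapping timestamps: t+lag+1 .. t+horizon
    overlap_a = pred_a[lag:]
    overlap_b = pred_b[:horizon - lag]
    return float(np.mean((overlap_a - overlap_b) ** 2))
```

A naive persistence model (repeat the last observed value) illustrates the problem: shifting the input window by one step changes every overlapping prediction, yielding a nonzero gap even though both forecasts target the same future timestamps.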
ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models
Bosong Huang · Ming Jin · Yuxuan Liang · Johan Barthelemy · Debo Cheng · Qingsong Wen · Chenghao Liu · Shirui Pan
Explaining time series classification models is crucial, particularly in high-stakes applications such as healthcare and finance, where transparency and trust play a critical role. Although numerous time series classification methods have identified key subsequences, known as shapelets, as core features for achieving state-of-the-art performance and validating their pivotal role in classification outcomes, existing post-hoc time series explanation (PHTSE) methods primarily focus on timestep-level feature attribution. These explanation methods overlook the fundamental prior that classification outcomes are predominantly driven by key shapelets. To bridge this gap, we present ShapeX, an innovative framework that segments time series into meaningful shapelet-driven segments and employs Shapley values to assess their saliency. At the core of ShapeX lies the Shapelet Describe-and-Detect (SDD) framework, which effectively learns a diverse set of shapelets essential for classification. We further demonstrate that ShapeX produces explanations which reveal causal relationships instead of just correlations, owing to the atomicity properties of shapelets. Experimental results on both synthetic and real-world datasets demonstrate that ShapeX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations.
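The segment-level attribution described above can be sketched with exact Shapley values, treating each shapelet-driven segment as a player and masking absent segments with a baseline. This is a generic illustration of Shapley attribution over segments (feasible because shapelet segments are few), not ShapeX's Shapelet Describe-and-Detect pipeline; `predict`, the segment boundaries, and the baseline choice are stand-ins.

```python
from itertools import combinations
from math import factorial
import numpy as np

def segment_shapley(predict, x, segments, baseline=None):
    """Exact Shapley attribution over time-series segments.
    `predict` maps a 1-D series to a scalar score; `segments` is a
    list of (start, end) index pairs; masked segments are replaced
    by a baseline (series mean by default)."""
    if baseline is None:
        baseline = np.full_like(x, x.mean())

    def value(subset):
        # Coalition value: keep the segments in `subset`, mask the rest.
        z = baseline.copy()
        for j in subset:
            s, e = segments[j]
            z[s:e] = x[s:e]
        return predict(z)

    n = len(segments)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi
```

By the efficiency property the attributions sum to `predict(x) - predict(baseline)`, so each segment's share of the classification score is directly interpretable.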
Improving Time Series Forecasting via Instance-aware Post-hoc Revision
Zhiding Liu · Mingyue Cheng · Guanhao Zhao · Jiqian Yang · Qi Liu · Enhong Chen
Time series forecasting plays a pivotal role in various real-world applications and has attracted significant attention in recent decades. While recent methods have achieved remarkable accuracy by incorporating advanced inductive biases and training strategies, we observe that instance-level variations remain a significant challenge. These variations—stemming from distribution shifts, missing data, and long-tail patterns—often lead to suboptimal forecasts for specific instances, even when overall performance appears strong. To address this issue, we propose a model-agnostic framework, PIR, designed to enhance forecasting performance through Post-forecasting Identification and Revision. Specifically, PIR first identifies biased forecast instances by estimating their predictive accuracy. Based on this, the framework revises the forecasts using contextual information, including covariates and historical time series, from both local and global perspectives in a post-processing fashion. Extensive experiments on real-world datasets with mainstream forecasting models demonstrate that PIR effectively mitigates instance-level errors and significantly improves forecasting reliability.
FlowMixer: A Depth-Agnostic Neural Architecture for Interpretable Spatiotemporal Forecasting
Fares Mehouachi · Saif Eddin Jabari
We introduce FlowMixer, a single-layer neural architecture that leverages constrained matrix operations to model structured spatiotemporal patterns with enhanced interpretability. FlowMixer incorporates non-negative matrix mixing layers within a reversible mapping framework—applying transforms before mixing and their inverses afterward. This shape-preserving design enables a Kronecker-Koopman eigenmodes framework that bridges statistical learning with dynamical systems theory, providing interpretable spatiotemporal patterns and facilitating direct algebraic manipulation of prediction horizons without retraining. The architecture's semi-group property enables this single layer to mathematically represent any depth through composition, eliminating depth search entirely. Extensive experiments across diverse domains demonstrate FlowMixer's long-horizon forecasting capabilities while effectively modeling physical phenomena such as chaotic attractors and turbulent flows. Our results achieve performance matching state-of-the-art methods while offering superior interpretability through directly extractable eigenmodes. This work suggests that architectural constraints can simultaneously maintain competitive performance and enhance mathematical interpretability in neural forecasting systems.
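The non-negative mixing idea can be sketched as row-stochastic spatial and column-stochastic temporal mixing matrices applied to a (space, time) field, obtained here via softmax so every output is a convex combination of inputs. This is a minimal sketch of the constrained mixing layer only; the reversible pre/post maps and Kronecker-Koopman eigenmode machinery are omitted, and the parameter shapes are assumptions.

```python
import numpy as np

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def flow_mix(x, Ws, Wt):
    """One non-negative mixing step on a (space, time) array x.
    Softmax makes the spatial matrix row-stochastic and the temporal
    matrix column-stochastic, so the layer outputs convex combinations
    of the input field. Illustrative sketch, not the full FlowMixer."""
    Ms = softmax(Ws, axis=1)   # mixes across spatial rows
    Mt = softmax(Wt, axis=0)   # mixes across time columns
    return Ms @ x @ Mt
```

Because products of stochastic matrices are stochastic, composing two such layers is again a single layer of the same form, which is the semigroup property that lets one layer stand in for arbitrary depth.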
Topology-Aware Conformal Prediction for Stream Networks
Jifan Zhang · Fangxin Wang · Zihe Song · Philip S Yu · Kaize Ding · Shixiang Zhu
Stream networks, a unique class of spatiotemporal graphs, exhibit complex directional flow constraints and evolving dependencies, making uncertainty quantification a critical yet challenging task. Traditional conformal prediction methods struggle in this setting due to the need for joint predictions across multiple interdependent locations and the intricate spatio-temporal dependencies inherent in stream networks. Existing approaches either neglect dependencies, leading to overly conservative predictions, or rely solely on data-driven estimations, failing to capture the rich topological structure of the network. To address these challenges, we propose Spatio-Temporal Adaptive Conformal Inference (STACI), a novel framework that integrates network topology and temporal dynamics into the conformal prediction framework. STACI introduces a topology-aware nonconformity score that respects directional flow constraints and dynamically adjusts prediction sets to account for temporal distributional shifts. We provide theoretical guarantees on the validity of our approach and demonstrate its superior performance on both synthetic and real-world datasets. Our results show that STACI effectively balances prediction efficiency and coverage, outperforming existing conformal prediction methods for stream networks.
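As background for the abstract above, the split conformal baseline that STACI extends can be sketched in a few lines: calibrate a quantile of absolute residuals and widen point forecasts by that amount. This plain version deliberately omits what STACI adds, namely topology-aware nonconformity scores and adaptation to temporal shift; the variable names are assumptions.

```python
import numpy as np

def conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Split conformal prediction interval from calibration residuals,
    using the finite-sample-corrected (1 - alpha) empirical quantile.
    Generic sketch only: a topology-aware score would additionally
    weight residuals by network distance and recency."""
    n = len(cal_residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))        # quantile rank
    q = np.sort(np.abs(cal_residuals))[min(k, n) - 1]
    return y_pred - q, y_pred + q
```

Under exchangeability this interval covers the true value with probability at least `1 - alpha`; the stream-network setting breaks exchangeability across locations and time, which is precisely the gap the paper targets.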
Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs
Zhiyi Lyu · Jianguo Huang · Yanchen Deng · Steven Hoi · Bo An
Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and lack of anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose $\textbf{ReLoc}$, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
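The four components named above (drafting, neighborhood generation, candidate evaluation, incumbent update) can be sketched as a generic hill-climbing loop with stand-in callables; in the paper these are instantiated with LLM-generated code revisions and a revision reward model, so everything below is a schematic assumption, not ReLoc itself.

```python
def hill_climb(draft, neighbors, score, steps=50):
    """Best-neighbor hill climbing: keep an incumbent and move only
    when a neighbor strictly improves the score. `draft` plays the
    role of initial code drafting, `neighbors` of neighborhood code
    generation, and `score` of candidate evaluation (e.g., a revision
    reward model); all are hypothetical stand-ins."""
    incumbent, best = draft, score(draft)
    for _ in range(steps):
        cand = max(neighbors(incumbent), key=score)  # evaluate candidates
        s = score(cand)
        if s <= best:
            break              # local optimum reached
        incumbent, best = cand, s  # incumbent update
    return incumbent, best
```

Because the incumbent is always a complete candidate, the loop has the anytime property the abstract contrasts with construction-based tree search: stopping early still returns the best revision found so far.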
CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
Guangyi Chen · Yunlong Deng · Peiyuan Zhu · Yan Li · Yifan Shen · Zijian Li · Kun Zhang
Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering from a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Project page: https://causal-verse.github.io/ ; dataset: https://huggingface.co/CausalVerse
AutoData: A Multi-Agent System for Open Web Data Collection
Tianyi Ma · Yiyue Qian · Zheyuan Zhang · Zehong Wang · Xiaoye Qian · Feifan Bai · Yifan Ding · Xuwei Luo · Shinan Zhang · Keerthiram Murugesan · Chuxu Zhang · Yanfang Ye
The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data collection solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection that requires minimal human intervention, i.e., only a natural language instruction specifying the desired dataset. AutoData is built on a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Furthermore, we introduce a novel hypergraph cache system that advances the multi-agent collaboration process, enabling efficient automated data collection and mitigating the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability.
Photography Perspective Composition: Towards Aesthetic Perspective Recommendation
Lujian Yao · Siming Zheng · Xinbin Yuan · Zhuoxuan Cai · Pu Wu · Jinwei Chen · Bo Li · Peng-Tao Jiang
Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) an automated framework for building PPC datasets from expert photographs; (2) a video generation approach that demonstrates the transformation process from less favorable to aesthetically enhanced perspectives; (3) a perspective quality assessment (PQA) model constructed from human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun · Jingyan Shen · Yibin Wang · Tianyu Chen · Zhendong Wang · Mingyuan Zhou · Huan Zhang
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code repository is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl/.
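The similarity-based difficulty transfer described above can be sketched in a few lines. This toy version is an illustration, not the paper's implementation (which uses a learned attention-based framework): the function name, cosine-softmax weighting, and temperature are all assumptions. It propagates rollout-measured difficulties from a small reference set to new questions via their embedding similarity:

```python
import numpy as np

def estimate_difficulty(query_emb, ref_emb, ref_difficulty, temperature=0.1):
    """Attention-style estimate: each question's difficulty is a softmax-weighted
    average of reference difficulties, weighted by cosine similarity of
    embeddings. (Illustrative sketch of the idea only.)"""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = q @ r.T                            # cosine similarities
    w = np.exp(sim / temperature)
    w /= w.sum(axis=1, keepdims=True)        # softmax over reference questions
    return w @ ref_difficulty                # convex combination of difficulties

rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 16))               # embeddings of the reference set
diff = rng.uniform(0, 1, size=8)             # e.g., 1 - pass rate from rollouts
est = estimate_difficulty(rng.normal(size=(32, 16)), ref, diff)
```

Because the weights form a convex combination, every estimate stays within the range of the reference difficulties, so only the small reference set ever needs rollouts.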
Mitigating Instability in High Residual Adaptive Sampling for PINNs via Langevin Dynamics
Minseok Jeong · Giup Seo · Euiseok Hwang
Recently, physics-informed neural networks (PINNs) have gained attention in the scientific community for their potential to solve partial differential equations (PDEs). However, they face challenges related to resource efficiency and slow convergence. Adaptive sampling methods, which prioritize collocation points with high residuals, improve both efficiency and accuracy. However, these methods often neglect points with medium or low residuals, which can affect stability as the complexity of the model increases. In this paper, we investigate this limitation and show that high-residual-based approaches require stricter learning rate bounds to ensure stability. To address this, we propose a Langevin dynamics-based Adaptive Sampling (LAS) framework that is robust to various learning rates and model complexities. Our experiments demonstrate that the proposed method outperforms existing approaches in terms of relative $L^{2}$ error and stability across a range of environments, including high-dimensional PDEs where Monte Carlo integration-based methods typically suffer from instability.
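The Langevin-dynamics sampling idea can be illustrated on a 1-D toy problem. The residual function, step size, and update rule below are assumptions for illustration, not the paper's LAS algorithm: collocation points drift up the gradient of the log squared residual, while the injected noise keeps them from collapsing entirely onto high-residual regions:

```python
import numpy as np

def residual(x):
    # Toy stand-in for a PINN residual: peaks near x = 0.5.
    return np.exp(-50 * (x - 0.5) ** 2)

def langevin_sample_points(x, step=1e-3, n_steps=200, seed=0):
    """Drift collocation points toward high-residual regions via unadjusted
    Langevin dynamics on log r(x)^2, using a finite-difference gradient."""
    rng = np.random.default_rng(seed)
    h = 1e-5
    def log_density(z):
        return np.log(residual(z) ** 2 + 1e-12)
    for _ in range(n_steps):
        grad = (log_density(x + h) - log_density(x - h)) / (2 * h)
        x = x + step * grad + np.sqrt(2 * step) * rng.standard_normal(x.shape)
        x = np.clip(x, 0.0, 1.0)            # keep points inside the domain
    return x

pts = langevin_sample_points(np.random.default_rng(1).uniform(0, 1, 256))
```

The resulting point cloud concentrates around the residual peak but retains spread proportional to the noise scale, which is the stability mechanism the abstract alludes to.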
MIHC: Multi-View Interpretable Hypergraph Neural Networks with Information Bottleneck for Chip Congestion Prediction
Zeyue Zhang · Heng Ping · Peiyu Zhang · Nikos Kanakaris · Xiaoling LU · Paul Bogdan · Xiongye Xiao
With AI advancement and increasing circuit complexity, efficient chip design through Electronic Design Automation (EDA) is critical. Fast and accurate congestion prediction in chip layout and routing can significantly enhance automated design performance. Existing congestion modeling methods are limited by (i) ineffective processing and fusion of multi-view circuit data, and (ii) insufficient reliability and interpretability in the prediction process. To address these challenges, we propose Multi-view Interpretable Hypergraph for Chip (MIHC), a trustworthy multi-view hypergraph neural network framework that (i) processes both graph and image information in unified hypergraph representations, capturing topological and geometric circuit data, and (ii) implements a novel subgraph Information Bottleneck mechanism that identifies critical congestion-correlated regions to guide predictions. This represents the first attempt to incorporate such interpretability into congestion prediction through informative graph reasoning. Experiments show our model reduces NMAE by 16.67% and 8.57% in cell-based and grid-based predictions on ISPD2015, and by 5.26% and 2.44% on CircuitNet-N28, respectively, compared to state-of-the-art methods. Rigorous cross-design generalization experiments further validate our method’s capability to handle entirely unseen circuit designs.
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Hwiwon Lee · Ziqi Zhang · Hanxiao Lu · LINGMING ZHANG
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
How to Auto-optimize Prompts for Domain Tasks? Adaptive Prompting and Reasoning through Evolutionary Domain Knowledge Adaptation
Yang Zhao · Pu Wang · Hao Frank Yang
Designing optimal prompts and reasoning processes for large language models (LLMs) on domain-specific tasks is both necessary and challenging in real-world applications. Determining how to integrate domain knowledge, enhance reasoning efficiency, and provide domain experts with refined knowledge integration hints are particularly crucial yet unresolved tasks. In this research, we propose Evolutionary Graph Optimization for Prompting (EGO-Prompt), an automated framework for designing better prompts and efficient reasoning processes while providing an enhanced causal-informed process. EGO-Prompt begins with a general prompt and fault-tolerant initial Semantic Causal Graph (SCG) descriptions, constructed by human experts, which are then automatically refined and optimized to guide LLM reasoning. Recognizing that expert-defined SCGs may be partial or imperfect and that their optimal integration varies across LLMs, EGO-Prompt integrates a novel causal-guided textual gradient process in two steps: first, generating nearly deterministic reasoning guidance from the SCG for each instance, and second, adapting the LLM to effectively utilize the guidance alongside the original input. The iterative optimization algorithm further refines both the SCG and the reasoning mechanism using textual gradients against ground truth. We tested the framework on real-world public health, transportation, and human behavior tasks. EGO-Prompt achieves 7.32\%–12.61\% higher F1 than cutting-edge methods and allows small models to reach the performance of larger models at under 20\% of the original cost. It also outputs a refined, domain-specific SCG that improves interpretability.
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Mislav Balunovic · Jasper Dekoninck · Ivo Petrov · Nikola Jovanović · Martin Vechev
The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40\%, demonstrating both notable progress and significant room for improvement. So far, we have evaluated over $50$ models across seven competitions, totaling $162$ problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
Yu Shang · Peijie Liu · Yuwei Yan · Zijing Wu · Leheng Sheng · Yuanqing Yu · Chumeng Jiang · An Zhang · Fengli Xu · Yu Wang · Min Zhang · Yong Li
The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs’ advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing agentic recommender systems; and (3) the first comprehensive benchmark comparing over 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a maintained leaderboard at https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.
Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities
Can Rong · Xin Zhang · Yanxin Xi · HONGJIE SUI · Jingtao Ding · Yong Li
Commuting Origin-destination (OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98\% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen (Global-scale Origin-Destination Flow Generator), which can generate OD flow data for any city of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation. It has been released at https://github.com/tsinghua-fib-lab/generate-od-pubtools.
SGN: Shifted Window-Based Hierarchical Variable Grouping for Multivariate Time Series Classification
Zenan Ying · Zhi Zheng · huijun hou · Tong Xu · Qi Liu · Jinke Wang · Wei Chen
Multivariate time series (MTS) classification has attracted increasing attention across various domains. Existing methods either decompose MTS into separate univariate series, ignoring inter-variable dependencies, or jointly model all variables, which may lead to over-smoothing and loss of semantic structure. These limitations become particularly pronounced when dealing with complex and heterogeneous variable types. To address these challenges, we propose SwinGroupNet (SGN), which explores a novel perspective for constructing variable interaction and temporal dependency. Specifically, SGN processes multi-scale time series using (1) Variable Group Embedding (VGE), which partitions variables into groups and performs independent group-wise embedding; (2) Multi-Scale Group Window Mixing (MGWM), which reconstructs variable interactions by modeling both intra-group and inter-group dependencies while extracting multi-scale temporal features; and (3) Periodic Window Shifting and Merging (PWSM), which exploits inherent periodic patterns to enable hierarchical temporal interaction and feature aggregation. Extensive experiments on diverse benchmark datasets from multiple domains demonstrate that SGN consistently achieves state-of-the-art performance, with an average improvement of 4.2% over existing methods. We release the source code at https://anonymous.4open.science/r/SGN.
Investigating Hallucinations of Time Series Foundation Models through Signal Subspace Analysis
Yufeng Zou · Zijian Wang · Diego Klabjan · Han Liu
Time series foundation models (TSFMs) have emerged as a promising paradigm for time series analysis and forecasting, showing remarkable generalization performance across different domains. While hallucinations of foundation models have been studied in other settings, hallucinations of TSFMs remain underexplored. In this paper, we formally define TSFM hallucinations in the zero-shot forecasting setting by examining whether a generated forecast exhibits different dynamics from those of the context. Our study reveals that TSFM hallucinations primarily stem from the loss of context information in hidden states during forward propagation. Accordingly, we propose methodologies to identify signal subspaces in TSFMs and magnify their information through intervention. Extensive experiments demonstrate that our proposed intervention approach effectively mitigates hallucinations and improves forecasting performance. Furthermore, the signal strength measure we compute from signal subspaces has strong predictive power for hallucinations and the forecasting performance of the model. Our work contributes to a deeper understanding of TSFM trustworthiness that could foster future research in this direction.
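One simple way to picture a subspace-magnification intervention (the exact operator used by the authors is not specified in the abstract, so this form is an assumption) is to amplify the projection of hidden states onto an identified signal subspace:

```python
import numpy as np

def magnify_signal_subspace(h, basis, alpha=1.0):
    """Illustrative subspace intervention (form assumed): amplify the
    component of hidden states h lying in the signal subspace spanned by
    the orthonormal columns of `basis`, leaving the rest untouched."""
    P = basis @ basis.T                      # orthogonal projector onto the subspace
    return h + alpha * (h @ P)               # boost only the signal component

h = np.array([[1.0, 1.0]])                   # one hidden-state vector
e1 = np.array([[1.0], [0.0]])                # toy signal subspace: first axis
h_new = magnify_signal_subspace(h, e1)       # → [[2.0, 1.0]]
```

The intervention leaves the orthogonal complement unchanged, which matches the intuition of magnifying context-carrying directions without rewriting the whole state.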
On the Integration of Spatial-Temporal Knowledge: A Lightweight Approach to Atmospheric Time Series Forecasting
Yisong Fu · Fei Wang · Zezhi Shao · Boyu Diao · Lin Wu · Zhulin An · Chengqing Yu · Yujie Li · Yongjun Xu
Transformers have gained attention in atmospheric time series forecasting (ATSF) for their ability to capture global spatial-temporal correlations. However, their complex architectures lead to excessive parameter counts and extended training times, limiting their scalability to large-scale forecasting. In this paper, we revisit ATSF from a theoretical perspective of atmospheric dynamics and uncover a key insight: spatial-temporal position embedding (STPE) can inherently model spatial-temporal correlations even without attention mechanisms. Its effectiveness arises from integrating geographical coordinates and temporal features, which are intrinsically linked to atmospheric dynamics. Based on this, we propose STELLA, a Spatial-Temporal knowledge Embedded Lightweight modeL for ATSF, utilizing only STPE and an MLP architecture in place of Transformer layers. With 10k parameters and one hour of training, STELLA achieves superior performance on five datasets compared to other advanced methods. The paper emphasizes the effectiveness of spatial-temporal knowledge integration over complex architectures, providing novel insights for ATSF.
Glocal Information Bottleneck for Time Series Imputation
Jie Yang · Kexin Zhang · Guibin Zhang · Philip S Yu · Kaize Ding
Time Series Imputation (TSI), which aims to recover missing values in temporal data, remains a fundamental challenge due to the complex and often high-rate missingness in real-world scenarios. Existing models typically optimize the point-wise reconstruction loss, focusing on recovering numerical values (local information). However, we observe that under high missing rates, these models still perform well in the training phase yet produce poor imputations and distorted latent representation distributions (global information) in the inference phase. This reveals a critical optimization dilemma: current objectives lack global guidance, leading models to overfit local noise and fail to capture global information of the data. To address this issue, we propose a new training paradigm, Glocal Information Bottleneck (Glocal-IB). Glocal-IB is model-agnostic and extends the standard IB framework by introducing a Global Alignment loss, derived from a tractable mutual information approximation. This loss aligns the latent representations of masked inputs with those of their originally observed counterparts. It helps the model retain global structure and local details while suppressing noise caused by missing values, giving rise to better generalization under high missingness. Extensive experiments on nine datasets confirm that Glocal-IB leads to consistently improved performance and aligned latent representations under missingness. Our code implementation is available in https://github.com/Muyiiiii/NeurIPS-25-Glocal-IB.
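A minimal sketch of such a "glocal" objective, assuming a simple squared-error surrogate for the mutual-information-based Global Alignment term (the function name and weight `lam` are illustrative, not from the paper):

```python
import numpy as np

def glocal_ib_loss(z_masked, z_full, x_hat, x_true, obs_mask, lam=0.1):
    """Sketch of a Glocal-IB-style objective: point-wise reconstruction on
    observed entries (local information) plus an alignment term pulling the
    latent of the masked input toward the latent of the fully observed
    input (global information)."""
    n_obs = max(obs_mask.sum(), 1)
    recon = np.sum(obs_mask * (x_hat - x_true) ** 2) / n_obs  # local detail
    align = np.mean((z_masked - z_full) ** 2)                 # global structure
    return recon + lam * align

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 24))
mask = rng.random((4, 24)) > 0.5                              # observed entries
loss = glocal_ib_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)),
                      x + 0.1 * rng.normal(size=x.shape), x, mask)
```

When the latents of the masked and full inputs coincide, only the local reconstruction term remains, which is exactly the standard point-wise objective the paper argues is insufficient on its own.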
Selective Learning for Deep Time Series Forecasting
Yisong Fu · Zezhi Shao · Chengqing Yu · Yujie Li · Zhulin An · Qi Wang · Yongjun Xu · Fei Wang
Benefiting from high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to suffer from severe overfitting due to the inherent vulnerability of time series to noise and anomalies. The prevailing DL paradigm uniformly optimizes all timesteps through the MSE loss, fitting uncertain and anomalous timesteps indiscriminately and ultimately resulting in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning screens a subset of timesteps over which the MSE loss is computed during optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask leveraging residual entropy to filter uncertain timesteps, and (2) an anomaly mask employing residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning can significantly improve the predictive performance of typical state-of-the-art deep models, including a 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.
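The dual-mask idea can be sketched as follows. Note the assumptions: the uncertainty proxy here is residual spread across epochs rather than the paper's residual entropy, the anomaly bound is a median + MAD rule rather than the paper's lower-bound estimation, and all thresholds are illustrative:

```python
import numpy as np

def selective_mse(pred, target, residual_history, q_unc=0.9, k=3.0):
    """Sketch of a dual-mask selective loss:
    (1) uncertainty mask: drop timesteps whose residuals varied most across
        past epochs (proxy for residual entropy);
    (2) anomaly mask: drop timesteps whose current residual exceeds a robust
        median + k*MAD bound.
    The MSE is averaged over the retained timesteps only."""
    resid = pred - target
    spread = residual_history.std(axis=0)          # per-timestep uncertainty proxy
    unc_ok = spread <= np.quantile(spread, q_unc)
    med = np.median(np.abs(resid))
    mad = np.median(np.abs(np.abs(resid) - med)) + 1e-12
    anom_ok = np.abs(resid) <= med + k * 1.4826 * mad
    keep = unc_ok & anom_ok
    return np.mean(resid[keep] ** 2) if keep.any() else np.mean(resid ** 2)

target = np.zeros(100)
pred = np.full(100, 0.1)
pred[7] = 10.0                                     # one anomalous timestep
history = np.tile(pred - target, (5, 1))           # residuals from past epochs
loss = selective_mse(pred, target, history)        # anomaly excluded from the loss
```

In this toy run the single anomalous timestep is excluded, so the selective loss reflects only the typical 0.1 residuals rather than being dominated by the outlier.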
Learning to Factorize Spatio-Temporal Foundation Models
Siru Zhong · Junjie Qiu · Yangyu Wu · Xingchen Zou · Zhongwen Rao · Bin Yang · Chenjuan Guo · Hao Xu · Yuxuan Liang
Spatio-Temporal Foundation Models (STFMs) promise zero/few-shot generalization across various datasets, yet joint spatio-temporal pretraining is computationally prohibitive and struggles with domain-specific spatial correlations. To this end, we introduce FactoST, a factorized STFM that decouples universal temporal pretraining from spatio-temporal adaptation. The first stage pretrains a space-agnostic backbone with multi-frequency reconstruction and domain-aware prompting, capturing cross-domain temporal regularities at low computational cost. The second stage freezes or further fine-tunes the backbone and attaches an adapter that fuses spatial metadata, sparsifies interactions, and aligns domains with continual memory replay. Extensive forecasting experiments reveal that, in the few-shot setting, FactoST reduces MAE by up to 46.4% versus UniST, uses 46.2% fewer parameters, and achieves 68% faster inference than OpenCity, while remaining competitive with expert models. We believe this factorized view offers a practical and scalable path toward truly universal STFMs. The code will be released upon notification.
PMLF: A Physics-Guided Multiscale Loss Framework for Structurally Heterogeneous Time Series
Xinghong Chen · Weilin Wu · Kunping Yang · Guannan Chen
Forecasting real-world time series requires modeling both short-term fluctuations and long-term evolutions, as these signals typically exhibit multiscale temporal structures. A core challenge lies in reconciling such dynamics: high-frequency seasonality demands local precision, while low-frequency trends require global robustness. However, most existing methods adopt a unified loss function across all temporal components, overlooking their structural differences. This misalignment often causes overfitting to seasonal noise or underfitting of long-term trends, leading to suboptimal forecasting performance. To address this issue, we propose a Physics-guided Multiscale Loss Framework (PMLF) that decomposes time series into seasonal and trend components and assigns component-specific objectives grounded in the distinct energy responses of oscillatory and drift dynamics. Specifically, we assign a quadratic loss to seasonal components, reflecting the quadratic potential energy profile of molecular vibration, while a logarithmic loss is used for trend components to capture the sublinear energy profile of molecular drift under sustained external forces. Furthermore, we introduce a softmax-based strategy that adaptively balances the unequal energetic responses of these two physical processes. Experiments on different public benchmarks show that PMLF improves the performance of diverse baselines, demonstrating the effectiveness of physics-guided loss design in modeling structural heterogeneity in time series forecasting.
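One possible reading of the composite objective, with assumed functional forms (quadratic seasonal penalty, a `log1p` trend penalty standing in for the sublinear drift energy, and softmax-balanced weights; none of these exact choices are confirmed by the abstract):

```python
import numpy as np

def pmlf_loss(seasonal_err, trend_err):
    """Sketch of a physics-guided composite loss: quadratic penalty on
    seasonal errors (vibration-like quadratic potential), logarithmic penalty
    on trend errors (sublinear drift energy), combined with softmax weights
    that adaptively balance the two components."""
    l_seas = np.mean(seasonal_err ** 2)
    l_trend = np.mean(np.log1p(np.abs(trend_err)))
    w = np.exp([l_seas, l_trend])
    w = w / w.sum()                          # softmax balancing of components
    return w[0] * l_seas + w[1] * l_trend

zero = pmlf_loss(np.zeros(16), np.zeros(16))
```

The softmax step upweights whichever component currently incurs the larger loss, which is one simple way to reconcile the unequal energetic responses the abstract describes.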
Learning Counterfactual Outcomes Under Rank Preservation
Peng Wu · Haoxuan Li · Chunyuan Zheng · Yan Zeng · Jiawei Chen · Yang Liu · Ruocheng Guo · Kun Zhang
Counterfactual inference aims to estimate the counterfactual outcome at the individual level given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. Previous methods rely on a known structural causal model (SCM) or assume the homogeneity of the exogenous variable and strict monotonicity between the outcome and exogenous variable. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. We first introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the rank preservation assumption is no stronger than the homogeneity and strict monotonicity assumptions, that the proposed ideal loss is convex, and that the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.
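Under rank preservation, an individual's counterfactual outcome occupies the same quantile of the alternative treatment's outcome distribution as their factual outcome does under the received treatment. A toy empirical-CDF version of this mapping (the paper's estimator is kernel-based; this sketch conveys only the intuition):

```python
import numpy as np

def counterfactual_rank_preserving(y_factual, y_obs_t, y_obs_tprime):
    """Map the factual outcome to its empirical quantile under the received
    treatment, then read off the same quantile of the other treatment arm's
    outcome distribution (empirical-CDF analogue of rank preservation)."""
    rank = np.mean(y_obs_t <= y_factual)     # empirical CDF value of y_factual
    return np.quantile(y_obs_tprime, rank)   # same rank in the other arm

rng = np.random.default_rng(0)
y0 = rng.normal(0, 1, 10_000)                # control outcomes
y1 = y0 + 2.0                                # treated arm: constant shift of +2
cf = counterfactual_rank_preserving(0.0, y0, y1)
```

With a constant treatment shift, the rank-preserving counterfactual of a control unit at 0.0 recovers approximately 2.0, matching the true individual effect in this toy setting.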
A Cautionary Tale on Integrating Studies with Disparate Outcome Measures for Causal Inference
Harsh Parikh · Trang Nguyen · Elizabeth Stuart · Kara Rudolph · Caleb Miles
Data integration approaches are increasingly used to enhance the efficiency and generalizability of studies. However, a key limitation of these methods is the assumption that outcome measures are identical across datasets -- an assumption that often does not hold in practice. Consider the following opioid use disorder (OUD) studies: the XBOT trial and the POAT study, both evaluating the effect of medications for OUD on withdrawal symptom severity (not the primary outcome of either trial). While XBOT measures withdrawal severity using the subjective opiate withdrawal (SOW) scale, POAT uses the clinical opiate withdrawal (COW) scale. We analyze this realistic yet challenging setting where outcome measures differ across studies and where neither study records both types of outcomes. Our paper studies whether and when integrating studies with disparate outcome measures leads to efficiency gains. We introduce three sets of assumptions -- with varying degrees of strength -- linking both outcome measures. Our theoretical and empirical results highlight a cautionary tale: integration can improve asymptotic efficiency only under the strongest assumption linking the outcomes. However, misspecification of this assumption leads to bias. In contrast, a milder assumption may yield finite-sample efficiency gains, yet these benefits diminish as sample size increases. We illustrate these trade-offs via a case study integrating the XBOT and POAT datasets to estimate the comparative effect of two medications for opioid use disorder on withdrawal symptoms. By systematically varying the assumptions linking the SOW and COW scales, we show potential efficiency gains and the risks of bias. Our findings emphasize the need for careful assumption selection when fusing datasets with differing outcome measures, offering guidance for researchers navigating this common challenge in modern data integration.
An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation
Uzair Akbar · Niki Kilbertus · Hao Shen · Krikamol Muandet · Bo Dai
The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs)—sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.
Non-Stationary Structural Causal Bandits
Yeahoon Kwon · Yesong Choe · Soungmin Park · Neil Dhir · Sanghack Lee
We study the problem of sequential decision-making in environments governed by evolving causal mechanisms. Prior work on structural causal bandits--formulations that integrate causal graphs into multi-armed bandit problems to guide intervention selection--has shown that leveraging the causal structure can reduce unnecessary interventions by identifying possibly-optimal minimal intervention sets (POMISs). However, such formulations fall short in dynamic settings where reward distributions may vary over time, as their static, hence myopic, nature focuses on immediate rewards and overlooks the long-term effects of interventions. In this work, we propose a non-stationary structural causal bandit framework that leverages temporal structural causal models to capture evolving dynamics over time. We characterize how interventions propagate over time by developing graphical tools and assumptions, which form the basis for identifying non-myopic intervention strategies. Within this framework, we devise POMIS$^+$, which captures the existence of variables that contribute to maximizing both immediate and long-term rewards. Our framework provides a principled way to reason about temporally-aware interventions by explicitly modeling information propagation across time. Empirical results validate the effectiveness of our approach, demonstrating improved performance over myopic baselines.
Estimation of Treatment Effects in Extreme and Unobserved Data
Jiyuan Tan · Vasilis Syrgkanis · Jose Blanchet
Causal effect estimation seeks to determine the impact of an intervention from observational data. However, the existing causal inference literature primarily addresses treatment effects on frequently occurring events. But what if we are interested in estimating the effects of a policy intervention whose benefits, while potentially important, can only be observed and measured in rare yet impactful events, such as extreme climate events? The standard causal inference methodology is not designed for this type of inference since the events of interest may be scarce in the observed data and some degree of extrapolation is necessary. Extreme Value Theory (EVT) provides methodologies for analyzing statistical phenomena in such extreme regimes. We introduce a novel framework for assessing treatment effects in extreme data to capture the causal effect at the occurrence of rare events of interest. In particular, we employ the theory of multivariate regular variation to model extremities. We develop a consistent estimator for extreme treatment effects and present a rigorous non-asymptotic analysis of its performance. We illustrate the performance of our estimator using both synthetic and semi-synthetic data.
In this paper, we explore the universal properties underlying causal inference by formulating it in terms of a topos. More concretely, we introduce topos causal models (TCMs), a strict generalization of the popular structural causal models (SCMs). A topos category has several properties that make it attractive: a general theory for how to combine local functions that define "independent causal mechanisms" into a consistent global function building on the theory of sheaves in a topos; a generic way to define causal interventions using a subobject classifier in a topos category; and finally, an internal logical language for causal and counterfactual reasoning that emerges from the topos itself. A striking characteristic of subobject classifiers is that they induce an intuitionistic logic, whose semantics is based on the partially ordered lattice of subobjects. We show that the underlying subobject classifier for causal inference is not Boolean in general, but forms a Heyting algebra. We define the internal Mitchell-Bénabou language, a typed local set theory, associated with causal models, and its associated Kripke-Joyal intuitionistic semantics. We prove a universal property of TCM, namely that any causal functor mapping decomposable structure to probabilistic semantics factors uniquely through a TCM representation.
Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery
Jiyeon Kang · Songseong Kim · Chanhui Lee · Doyeong Hwang · Joanie Chung · Yunkyung Ko · Sumin Lee · Sungwoong Kim · Sungbin Lim
Ordering-based approaches to causal discovery identify topological orders of causal graphs, providing scalable alternatives to combinatorial search methods. Under the Additive Noise Models (ANMs) assumption, recent causal ordering methods based on score matching require an accurate estimation of the Hessian diagonal of the log-densities. However, previous approaches mainly use Stein gradient estimators, which are computationally expensive and memory-intensive. Although DiffAN addresses these limitations by substituting kernel-based estimates with diffusion models, it remains numerically unstable due to the second-order derivatives of score models. To alleviate these problems, we propose Score-informed Neural Operator (SciNO), a probabilistic generative model in smooth function spaces designed to stably approximate the Hessian diagonal and to preserve structural information during the score modeling. Empirical results show that SciNO reduces order divergence by 42.7% on synthetic graphs and by 31.5% in real-world datasets on average compared to DiffAN, while maintaining memory efficiency and scalability. Furthermore, we propose a probabilistic control algorithm for causal reasoning with autoregressive models that integrates SciNO's probability estimates with autoregressive model priors, enabling reliable data-driven causal ordering informed by semantic information. Consequently, the proposed method enhances causal reasoning abilities of LLMs without additional fine-tuning or prompt engineering.
Meta-D2AG: Causal Graph Learning with Interventional Dynamic Data
Tian Gao · Songtao Lu · Junkyu Lee · Elliot Nelson · Debarun Bhattacharjya · Yue Yu · Miao Liu
Causal discovery in the form of a directed acyclic graph (DAG) for dynamic time series data has been widely studied in various applications. Much of the existing work has focused on observational, offline, and/or stationary settings. In this work, we propose a dynamic DAG discovery algorithm, Meta-D$^2$AG, based on online meta-learning. Meta-D$^2$AG is designed to learn dynamic DAG structures from potentially nonlinear and non-stationary time series datasets, accounting for changes in both parameters and graph structures. Notably, Meta-D$^2$AG explicitly treats data collected at different time points with distribution shifts as distinct domains, where the shifts are assumed to occur as a result of external interventions. Moreover, Meta-D$^2$AG contains a new online meta-learning framework to take advantage of the temporal transition among existing domains such that it can quickly adapt to new domains with few measurements. A first-order optimization approach is utilized to efficiently solve the meta-learning framework, and theoretical analysis establishes the identifiability conditions and the convergence of the learning process. We demonstrate the promising performance of our method through better accuracy and sample efficiency on benchmark datasets against state-of-the-art baselines.
Disentangling misreporting from genuine adaptation in strategic settings: a causal approach
Dylan Zapzalka · Trenton Chang · Lindsay Warrenburg · Sae-Hwan Park · Daniel Shenfeld · Ravi Parikh · Jenna Wiens · Maggie Makar
In settings where ML models are used to inform the allocation of resources, agents affected by the allocation decisions might have an incentive to strategically change their features to secure better outcomes. While prior work has studied strategic responses broadly, disentangling misreporting from genuine adaptation remains a fundamental challenge. In this paper, we propose a causally-motivated approach to identify and quantify how much an agent misreports on average by distinguishing deceptive changes in their features from genuine adaptation. Our key insight is that, unlike genuine adaptation, misreported features do not causally affect downstream variables (i.e., causal descendants). We exploit this asymmetry by comparing the causal effect of misreported features on their causal descendants as derived from manipulated datasets against those from unmanipulated datasets. We formally prove identifiability of the misreporting rate and characterize the variance of our estimator. We empirically validate our theoretical results using a semi-synthetic and real Medicare dataset with misreported data, demonstrating that our approach can be employed to identify misreporting in real-world scenarios.
Counterfactual Image Editing with Disentangled Causal Latent Space
Yushu Pan · Elias Bareinboim
The process of editing an image can be naturally modeled as evaluating a counterfactual query: “What would an image look like if a particular feature had changed?” While recent advances in text-guided image editing leverage powerful pre-trained models to produce visually appealing images, they often lack counterfactual consistency -- ignoring how features are causally related and how changing one may affect others. In contrast, existing causal-based editing approaches offer solid theoretical foundations and perform well in specific settings, but remain limited in scalability and often rely on labeled data. In this work, we aim to bridge the gap between causal editing and large-scale text-to-image generation through two main contributions. First, we introduce Backdoor Disentangled Causal Latent Space (BD-CLS), a new class of latent spaces that allows for the encoding of causal inductive biases. One desirable property of this latent space is that, even under weak supervision, it can be shown to exhibit counterfactual consistency. Second, and building on this result, we develop BD-CLS-Edit, an algorithm capable of learning a BD-CLS from a (non-causal) pre-trained Stable Diffusion model. This enables counterfactual image editing without retraining. Our method ensures that edits respect the causal relationships among features, even when some features are unlabeled or unprompted and the original latent space is oblivious to the environment’s underlying cause-and-effect relationships.
DGCBench: A Deep Graph Clustering Benchmark
Benyu Wu · Yue Liu · Qiaoyu Tan · Xinwang Liu · Wei Du · Jun Wang · Guoxian Yu
Deep graph clustering (DGC) aims to partition graph nodes into distinct clusters in an unsupervised manner. Despite rapid advancements in this field, DGC remains inherently challenging due to the absence of ground-truth, which complicates the design of effective algorithms and impedes the establishment of standardized benchmarks. The lack of unified datasets, evaluation protocols, and metrics further exacerbates these challenges, making it difficult to systematically assess and compare DGC methods. To address these limitations, we introduce $\texttt{DGCBench}$, the first comprehensive and unified benchmark for DGC methods. It evaluates 12 state-of-the-art DGC methods across 12 datasets from diverse domains and scales, spanning 6 critical dimensions: $\textbf{discriminability}$, $\textbf{effectiveness}$, $\textbf{scalability}$, $\textbf{efficiency}$, $\textbf{stability}$, and $\textbf{robustness}$. Additionally, we develop $\texttt{PyDGC}$, an open-source Python library that standardizes the DGC training and evaluation paradigm. Through systematic experiments, we reveal persistent limitations in existing methods, specifically regarding the homophily bottleneck, training instability, vulnerability to perturbations, efficiency plateau, scalability challenges, and poor discriminability, thereby offering actionable insights for future research. We hope that $\texttt{DGCBench}$, $\texttt{PyDGC}$, and our analyses will collectively accelerate the progress in the DGC community. The code is available at https://github.com/Marigoldwu/PyDGC.
Theory-Driven Label-Specific Representation for Incomplete Multi-View Multi-Label Learning
Quanjiang Li · Tianxiang Xu · Tingjin Luo · Yan Zhong · Yang Li · Yiyun Zhou · Chenping Hou
Multi-view multi-label learning typically suffers from dual data incompleteness due to limitations in feature storage and annotation costs. The interplay of heterogeneous features, numerous labels, and missing information significantly degrades model performance. To tackle these complex yet highly practical challenges, we propose a Theory-Driven Label-Specific Representation (TDLSR) framework. Through constructing the view-specific sample topology and prototype association graph, we develop a proximity-aware imputation mechanism, while deriving class representatives that capture the label correlation semantics. To obtain semantically distinct view representations, we introduce principles of information shift, interaction, and orthogonality, which promote the disentanglement of representation information and mitigate message distortion and redundancy. Besides, label semantic-guided feature learning is employed to identify the discriminative shared and specific representations and refine the label preference across views. Moreover, we theoretically investigate the characteristics of representation learning and the generalization performance. Finally, extensive experiments on public datasets and real-world applications validate the effectiveness of TDLSR.
A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity
Giordano Cicchetti · Eleonora Grassucci · Danilo Comminiello
Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods do not provide indicators that all the involved modalities are effectively aligned. As a result, a set of modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE (TRI-modAl Neural Geometric LEarning), a novel similarity measure computed directly in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers. When incorporated in contrastive losses in place of cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling, while yielding interpretable alignment rationales. Extensive evaluation on three-modal tasks, such as video-text and audio-text retrieval and audio-video classification, demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving on cosine-based methods by up to 9 Recall@1 points.
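The geometric idea behind a triangle-area similarity can be sketched in a few lines: for three modality embeddings, compute the area of the triangle they span (well defined in any dimension via the Gram identity) and map it to a similarity score. This is only an illustrative sketch; the function names and the specific area-to-similarity normalization below are our assumptions, not the paper's exact formulation.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def triangle_area(a, b, c):
    """Area of the triangle spanned by embeddings a, b, c in R^d.

    Uses the Gram identity area^2 = (|u|^2 |v|^2 - (u.v)^2) / 4
    with u = b - a and v = c - a, valid in any dimension.
    """
    u = [x - y for x, y in zip(b, a)]
    v = [x - y for x, y in zip(c, a)]
    gram = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    return 0.5 * math.sqrt(max(gram, 0.0))

def triangle_similarity(a, b, c):
    # Hypothetical normalization: perfectly coincident embeddings
    # (area 0) score 1; larger triangles score closer to 0.
    return 1.0 / (1.0 + triangle_area(a, b, c))
```

For instance, three identical video/text/audio embeddings collapse the triangle to a point and score 1, whereas three mutually misaligned embeddings span a large triangle and score low.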
Connecting Neural Models Latent Geometries with Relative Geodesic Representations
Hanlin Yu · Berfin Inal · Georgios Arvanitidis · Søren Hauberg · Francesco Locatello · Marco Fumero
Neural models learn representations of high-dimensional data on low-dimensional manifolds. Multiple factors, including stochasticities in the training process, model architectures, and additional inductive biases, may induce different representations, even when learning the same task on the same data. However, it has recently been shown that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions. Building on this idea, we demonstrate that, by exploiting the differential-geometric structure of the latent spaces of neural models, it is possible to capture precisely the transformations between representational spaces trained on similar data distributions. Specifically, we assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric that captures the intrinsic structure of the latent space, while scaling efficiently to large models. We validate our method experimentally on model stitching and retrieval tasks, covering autoencoders and vision foundation discriminative models, across diverse architectures, datasets, pretraining schemes and modalities. Code is available at the following link.
Bit-swapping Oriented Twin-memory Multi-view Clustering in Lifelong Incomplete Scenarios
Shengju Yu · Pei Zhang · Siwei Wang · Suyuan Liu · Xinhang Wan · Zhibin Dong · Tiejun Li · Xinwang Liu
Although achieving notable improvements, current multi-view clustering (MVC) techniques generally rely on feature library mechanisms to propagate accumulated knowledge from historical views to newly-arrived data, which overlooks the information pertaining to basis embedding within each view. Moreover, the mapping paradigm inevitably alters the values of learned landmarks and built affinities due to its uninterrupted nature, accordingly disarraying the hierarchical cluster structures. To mitigate these two issues, in this paper we propose an algorithm named BSTM. Concretely, we first reconcile the distinct view dimensions by introducing a group of specialized projectors, and then establish unified anchors for all views collected so far to capture intrinsic patterns. Afterwards, departing from per-view architectures, we devise a shared bipartite graph construction via indicators to quantify similarity, which not only avoids redundant data-recalculations but also alleviates the representation distortion caused by fusion. Crucially, these two components are optimized within an integrated framework, and collectively facilitate knowledge transfer upon encountering incoming views. Subsequently, to flexibly transform anchors while maintaining numerical consistency, we develop a bit-swapping scheme operating exclusively on 0 and 1. It harmonizes anchors on the current view with those on previous views through one-hot encoded row and column attributes, and the graph structures are correspondingly reordered to reach a matched configuration. Furthermore, a computationally efficient four-step updating strategy with linear complexity is designed to minimize the associated loss. Extensive experiments on publicly available benchmark datasets with varying missing percentages confirm the superior effectiveness of our BSTM.
DUAL: Learning Diverse Kernels for Aggregated Two-sample and Independence Testing
Zhijian Zhou · Xunye Tian · Liuhua Peng · Chao Lei · Antonin Schrab · Danica J. Sutherland · Feng Liu
To adapt kernel two-sample and independence testing to complex structured data, aggregation of multiple kernels is frequently employed to boost testing power compared to single-kernel tests. However, we observe that directly maximizing multiple kernel-based statistics may result in highly similar kernels that capture highly overlapping information, limiting the effectiveness of aggregation. To address this, we propose an aggregated statistic that explicitly incorporates kernel diversity based on the covariance between different kernels. Moreover, we identify a fundamental challenge: a trade-off between the diversity among kernels and the test power of individual kernels, i.e., the selected kernels should be both effective and diverse. This motivates a testing framework with selection inference, which leverages information from the training phase to select kernels with strong individual performance from the learned diverse kernel pool. We provide rigorous theoretical statements and proofs showing consistency of the test power and control of the Type-I error, along with asymptotic analysis of the proposed statistics. Lastly, we conduct extensive empirical experiments demonstrating the superior performance of our proposed approach across various benchmarks for both two-sample and independence testing.
Harnessing Feature Resonance under Arbitrary Target Alignment for Out-of-Distribution Node Detection
Shenzhi Yang · Junbo Zhao · Sharon Li · Shouqing Yang · Dingyu Yang · Xiaofang Zhang · Haobo Wang
Out-of-distribution (OOD) node detection in graphs is a critical yet challenging task. Most existing approaches rely heavily on fine-grained labeled data to obtain a pre-trained supervised classifier, inherently assuming the existence of a well-defined pretext classification task. However, when such a task is ill-defined or absent, their applicability becomes severely limited. To overcome this limitation, there is an urgent need for a more scalable OOD detection method that is independent of both pretext tasks and label supervision. We harness a new phenomenon called Feature Resonance, focusing on the feature space rather than the label space. We observe that, ideally, during the optimization of known ID samples, unknown ID samples undergo more significant representation changes than OOD samples, even when the model is trained to align arbitrary targets. The rationale behind it is that even without gold labels, the local manifold may still exhibit smooth resonance. Based on this, we further develop a novel graph OOD framework, dubbed Resonance-based Separation and Learning (RSL), which comprises two core modules: (i) a more practical micro-level proxy of feature resonance that measures the movement of feature vectors in one training step, and (ii) a synthetic OOD node strategy used to train an effective OOD classifier. Theoretically, we derive an error bound showing the superior separability of OOD nodes during the resonance period. Extensive experiments on a total of thirteen real-world graph datasets empirically demonstrate that RSL achieves state-of-the-art performance.
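The micro-level proxy of feature resonance can be illustrated with a toy linear feature map: take one gradient step of an alignment loss on the known-ID samples, then score each probe node by how far its feature vector moves under the updated map. The linear setup and function names below are our assumptions for illustration; RSL itself operates on deep graph representations.

```python
def feature_movement(W, train_xs, targets, probe_xs, lr=0.1):
    """Toy micro-level resonance proxy.

    Takes one gradient step of the mean squared alignment loss
    ||W x - t||^2 over known-ID samples, then scores each probe x
    by the movement of its feature vector, ||(W_new - W) x||.
    Probes lying along the trained directions move more than
    probes orthogonal to them (which, in this toy, move not at all).
    """
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for x, t in zip(train_xs, targets):
        f = [sum(W[i][j] * x[j] for j in range(d_in)) for i in range(d_out)]
        for i in range(d_out):
            for j in range(d_in):
                grad[i][j] += 2.0 * (f[i] - t[i]) * x[j] / len(train_xs)
    W_new = [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
             for i in range(d_out)]
    scores = []
    for x in probe_xs:
        moved = 0.0
        for i in range(d_out):
            delta = sum((W_new[i][j] - W[i][j]) * x[j] for j in range(d_in))
            moved += delta * delta
        scores.append(moved ** 0.5)
    return scores
```

In this toy, a probe resembling the training data (unknown ID) resonates and receives a high movement score, while an orthogonal probe (OOD-like) scores zero.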
Hypergraph-Enhanced Contrastive Learning for Multi-View Clustering with Hyper-Laplacian Regularization
Zhibin Gu · Weili Wang
Deep multi-view clustering (DMVC) has emerged as a promising paradigm for integrating information from multiple views by leveraging the representation power of deep neural networks. However, most existing DMVC methods primarily focus on modeling pairwise relationships between samples, while neglecting higher-order structural dependencies among multiple samples, which may hinder further improvements in clustering performance. To address this limitation, we propose a hypergraph neural network (HGNN)-driven multi-view clustering framework, termed Hypergraph-enhanced cOntrastive learning with hyPEr-Laplacian regulaRization (HOPER), a novel model that jointly captures high-order correlations and preserves local manifold structures across views. Specifically, we first construct view-specific hypergraph structures and employ the HGNN to learn node representations, thereby capturing high-order relationships among samples. Furthermore, we design a hypergraph-driven dual contrastive learning mechanism that integrates inter-view contrastive learning with intra-hyperedge contrastive learning, promoting cross-view consistency while maintaining discriminability within hyperedges. Finally, a hyper-Laplacian manifold regularization is introduced to preserve the local geometric structure within each view, thereby enhancing the structural fidelity and discriminative power of the learned representations. Extensive experiments on diverse datasets demonstrate the effectiveness of our approach.
IA-GGAD: Zero-shot Generalist Graph Anomaly Detection via Invariant and Affinity Learning
Xiong Zhang · Zhenli He · Changlong Fu · Cheng Xie
Generalist Graph Anomaly Detection (GGAD) extends traditional Graph Anomaly Detection (GAD) from one-for-one to one-for-all scenarios, posing significant challenges due to Feature Space Shift (FSS) and Graph Structure Shift (GSS). This paper first formalizes these challenges and proposes quantitative metrics to measure their severity. To tackle FSS, we develop an anomaly-driven graph invariant learning module that learns domain-invariant node representations. To address GSS, a novel structure-insensitive affinity learning module is introduced, capturing cross-domain structural correspondences via affinity-based features. Our unified framework, IA-GGAD, integrates these modules, enabling anomaly prediction on unseen graphs without target-domain retraining or fine-tuning. Extensive experiments on benchmark datasets from varied domains demonstrate IA-GGAD’s superior performance, significantly outperforming state-of-the-art methods (e.g., achieving up to +12.28\% AUROC over ARC on ACM). Ablation studies further confirm the effectiveness of each proposed module. The code is available at \url{https://github.com/kg-cc/IA-GGAD/}.
Heterogeneous Adversarial Play in Interactive Environments
Manjie Xu · Xinyi Yang · Jiayu Zhan · Wei Liang · Chi Zhang · Yixin Zhu
Self-play constitutes a fundamental paradigm for autonomous skill acquisition, whereby agents iteratively enhance their capabilities through self-directed environmental exploration. Conventional self-play frameworks exploit agent symmetry within zero-sum competitive settings, yet this approach proves inadequate for open-ended learning scenarios characterized by inherent asymmetry. Human pedagogical systems exemplify asymmetric instructional frameworks wherein educators systematically construct challenges calibrated to individual learners' developmental trajectories. The principal challenge resides in operationalizing these asymmetric, adaptive pedagogical mechanisms within artificial systems capable of autonomously synthesizing appropriate curricula without predetermined task hierarchies. Here we present Heterogeneous Adversarial Play (HAP), an adversarial Automatic Curriculum Learning framework that formalizes teacher-student interactions as a minimax optimization wherein task-generating instructor and problem-solving learner co-evolve through adversarial dynamics. In contrast to prevailing automatic curriculum learning methodologies that employ static curricula or unidirectional task selection mechanisms, HAP establishes a bidirectional feedback system wherein instructors continuously recalibrate task complexity in response to real-time learner performance metrics. Experimental validation across multi-task learning domains demonstrates that our framework achieves performance parity with SOTA baselines while generating curricula that enhance learning efficacy in both artificial agents and human subjects.
Are Greedy Task Orderings Better Than Random in Continual Linear Regression?
Matan Tsipory · Ran Levinstein · Itay Evron · Mark Kong · Deanna Needell · Daniel Soudry
We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100 classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after $k$ iterations is bounded by $\mathcal{O}(1/\sqrt{k})$—we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate $\mathcal{O}(1/\sqrt[3]{k})$. Overall, we reveal nuances within and between greedy and random orderings.
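One concrete way to formalize "greedily maximize dissimilarity" in realizable continual linear regression, borrowed from the Kaczmarz literature the abstract draws on, is the max-residual rule: always revisit the task whose constraint the current iterate violates most. The sketch below (function names are ours) shows the with-repetition variant; the paper's exact dissimilarity criterion may differ.

```python
def kaczmarz_step(w, a, b):
    """Project w onto the solution set {x : a.x = b} of one task (row a, label b)."""
    r = sum(ai * wi for ai, wi in zip(a, w)) - b
    nrm2 = sum(ai * ai for ai in a)
    return [wi - r * ai / nrm2 for wi, ai in zip(w, a)]

def greedy_with_repetition(rows, rhs, w, iters):
    """Greedy ordering with repetition: at each step, pick the task
    whose constraint is currently most violated (max-residual rule)
    and take a Kaczmarz projection step on it."""
    for _ in range(iters):
        residuals = [abs(sum(ai * wi for ai, wi in zip(a, w)) - b)
                     for a, b in zip(rows, rhs)]
        k = max(range(len(rows)), key=lambda i: residuals[i])
        w = kaczmarz_step(w, rows[k], rhs[k])
    return w
```

On a jointly realizable system, repetition lets the rule return to earlier tasks whose constraints were disturbed by later projections, which is exactly the freedom the single-pass variant lacks.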
Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning
Dipam Goswami · Simone Magistri · Kai Wang · Bartłomiej Twardowski · Andrew Bagdanov · Joost van de Weijer
Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have explored training-free methods using first- and second-order statistics to aggregate local client data distributions at the server and achieve high performance without any training. In this work, we propose a training-free method based on an unbiased estimator of class covariance matrices which only uses first-order statistics in the form of class means communicated by clients to the server. We show how these estimated class covariances can be used to initialize the global classifier, thus exploiting the covariances without actually sharing them. We also show that using only within-class covariances results in a better classifier initialization. Our approach improves performance in the range of 4-26% with exactly the same communication cost when compared to methods sharing only class means and achieves performance competitive or superior to methods sharing second-order statistics with dramatically less communication overhead. The proposed method is much more communication-efficient than federated prompt-tuning methods and still outperforms them. Finally, using our method to initialize classifiers and then performing federated fine-tuning or linear probing again yields better performance. Code is available at https://github.com/dipamgoswami/FedCOF.
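The statistical intuition behind recovering covariances from means alone can be shown in a toy i.i.d. setting with equal client sizes: each client's class mean has covariance $\Sigma/n$, so scaling the sample covariance of the client means by $n$ recovers $\Sigma$ without communicating any second-order statistics. This is only a sketch of the idea under those simplifying assumptions; the paper's unbiased estimator handles the general federated setting.

```python
def estimate_cov_from_client_means(client_means, n_per_client):
    """Toy covariance estimator from first-order statistics only.

    For i.i.d. data split into K equal-size clients, each client mean
    has covariance Sigma / n, so n times the sample covariance of the
    client means is an unbiased estimate of Sigma.
    """
    K = len(client_means)
    d = len(client_means[0])
    mu = [sum(m[j] for m in client_means) / K for j in range(d)]
    scatter = [[0.0] * d for _ in range(d)]
    for m in client_means:
        dev = [m[j] - mu[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                scatter[i][j] += dev[i] * dev[j]
    return [[n_per_client * scatter[i][j] / (K - 1) for j in range(d)]
            for i in range(d)]
```

A server receiving only per-class client means could apply this per class to approximate within-class covariances and initialize a classifier, which is the spirit of the communication saving described above.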
A compressive-expressive communication framework for compositional representations
Rafael Elberg · Felipe del Río · Mircea Petrache · Denis Parra
Compositionality in knowledge and language—the ability to represent complex concepts as a combination of simpler ones—is a hallmark of human cognition and communication. Despite recent advances, deep neural networks still struggle to acquire this property reliably. Neural models for emergent communication look to endow artificial agents with compositional language by simulating the pressures that form human language. In this work, we introduce CELEBI (Compressive-Expressive Language Emergence through a discrete Bottleneck and Iterated learning), a novel self-supervised framework for inducing compositional representations through a reconstruction-based communication game between a sender and a receiver. Building on theories of language emergence and the iterated learning framework, we integrate three mechanisms that jointly promote compressibility, expressivity, and efficiency in the emergent language. First, Progressive Decoding incentivizes intermediate reasoning by requiring the receiver to produce partial reconstructions after each symbol. Second, Final-State Imitation trains successive generations of agents to imitate reconstructions rather than messages, enforcing a tighter communication bottleneck. Third, Pairwise Distance Maximization regularizes message diversity by encouraging high distances between messages, with formal links to entropy maximization. Our method significantly improves both the efficiency and compositionality of the learned messages on the Shapes3D and MPI3D datasets, surpassing prior discrete communication frameworks in both reconstruction accuracy and topographic similarity. This work provides new theoretical and empirical evidence for the emergence of structured, generalizable communication protocols from simplicity-based inductive biases.
Minimal Semantic Sufficiency Meets Unsupervised Domain Generalization
Tan Pan · Kaiyu Guo · Dongli Xu · Zhaorui Tan · Chen Jiang · Deshu Chen · Xin Guo · Brian Lovell · LIMEI HAN · Yuan Cheng · Mahsa Baktashmotlagh
The generalization ability of deep learning has been extensively studied in supervised settings, yet it remains less explored in unsupervised scenarios. Recently, the Unsupervised Domain Generalization (UDG) task has been proposed to enhance the generalization of models trained with prevalent unsupervised learning techniques, such as Self-Supervised Learning (SSL). UDG confronts the challenge of distinguishing semantics from variations without category labels. Although some recent methods have employed domain labels to tackle this issue, such domain labels are often unavailable in real-world contexts. In this paper, we address these limitations by formalizing UDG as the task of learning a Minimal Sufficient Semantic Representation: a representation that (i) preserves all semantic information shared across augmented views (sufficiency), and (ii) maximally removes information irrelevant to semantics (minimality). We theoretically ground these objectives from the perspective of information theory, demonstrating that optimizing representations to achieve sufficiency and minimality directly reduces out-of-distribution risk. Practically, we implement this optimization through Minimal-Sufficient UDG (MS-UDG), a learnable model by integrating (a) an InfoNCE-based objective to achieve sufficiency; (b) two complementary components to promote minimality: a novel semantic-variation disentanglement loss and a reconstruction-based mechanism for capturing adequate variation. Empirically, MS-UDG sets a new state-of-the-art on popular unsupervised domain-generalization benchmarks, consistently outperforming existing SSL and UDG methods, without category or domain labels during representation learning.
Thought Communication in Multiagent Collaboration
Yujia Zheng · Zhuokai Zhao · Zijian Li · Yaqi Xie · Mingze Gao · Lizhu Zhang · Kun Zhang
Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.
Exploring Structural Degradation in Dense Representations for Self-supervised Learning
Siran Dai · Qianqian Xu · Peisong Wen · Yang Liu · Qingming Huang
In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by $3.0\%$ on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at \url{https://github.com/EldercatSAM/SSL-Degradation}.
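The effective-dimensionality half of DSE can be illustrated with one common spectral measure, the perplexity of the normalized covariance eigenvalues; the paper's exact estimator, including its class-relevance term, is not reproduced here, so treat this as a hedged sketch:

```python
import numpy as np

def effective_dimensionality(features):
    """Effective dimensionality as the perplexity of the normalized
    covariance spectrum (a common choice; DSE's actual estimator and its
    class-relevance component may differ)."""
    Z = features - features.mean(axis=0)
    evals = np.linalg.eigvalsh(Z.T @ Z / len(Z))   # covariance spectrum
    p = np.clip(evals, 1e-12, None)
    p = p / p.sum()
    return float(np.exp(-(p * np.log(p)).sum()))   # exp(entropy) = perplexity

rng = np.random.default_rng(0)
collapsed = rng.normal(size=(512, 2)) @ rng.normal(size=(2, 64))  # rank-2 features
healthy = rng.normal(size=(512, 64))
# structurally degraded (collapsed) dense representations score far lower
assert effective_dimensionality(collapsed) < effective_dimensionality(healthy)
```

A measure of this kind is annotation-free, which is the property the abstract emphasizes: it can be tracked during training to select a checkpoint before dense degradation sets in.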
Understanding Contrastive Learning via Gaussian Mixture Models
Parikshit Bansal · Ali Kavis · Sujay Sanghavi
Contrastive learning involves learning representations via a loss function that encourages each (unlabeled) sample to be far from other samples, but close to its own augmentation. In this paper, we aim to understand why this simple idea performs remarkably well, by theoretically analyzing it for a simple, natural problem setting: dimensionality reduction in Gaussian Mixture Models (GMMs). Note that the standard GMM setup lacks the concept of augmentations. We study an intuitive extension: we define the pair of data sample and its augmentation as a coupled random draw from the GMM such that the marginal over the "noisy" augmentation is biased towards the component of the data sample. For this setup, we show that vanilla contrastive loss, e.g., InfoNCE, is able to find the optimal lower-dimensional subspace even when the Gaussian components are non-isotropic. In particular, we show that InfoNCE can match the performance of a fully supervised algorithm such as LDA (where each data point is labeled with the mixture component it comes from), even when the augmentations are "noisy". We further extend our setup to the multi-modal case, and develop a GMM-like setting to study the contrastive CLIP loss. We corroborate our theoretical findings with real-data experiments on CIFAR100; representations learned by the InfoNCE loss match the performance of LDA on clustering metrics.
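A minimal numerical sketch of the setup: draw coupled (sample, noisy augmentation) pairs from a two-component GMM and evaluate the vanilla InfoNCE loss under a linear projection. The sampling scheme, separation, and projection dimensions here are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm_pairs(n, d=10, sep=4.0, noise=0.5):
    """Coupled draw: x from a 2-component GMM, and a 'noisy' augmentation
    x_aug biased towards the same component (illustrative construction)."""
    comp = rng.integers(0, 2, size=n)
    means = np.stack([np.full(d, -sep / 2), np.full(d, sep / 2)])
    x = means[comp] + rng.normal(size=(n, d))
    x_aug = means[comp] + noise * rng.normal(size=(n, d))
    return x, x_aug

def info_nce(z, z_aug, temp=0.5):
    """Vanilla InfoNCE: each sample's positive is its own augmentation;
    the other augmentations in the batch act as negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_aug = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
    logits = z @ z_aug.T / temp
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))   # positives sit on the diagonal

x, x_aug = sample_gmm_pairs(256)
W = rng.normal(size=(10, 2))          # linear map: the subspace being learned
loss = info_nce(x @ W, x_aug @ W)     # minimized over W in the actual analysis
```

The paper's claim concerns the minimizer of this loss over the linear map `W`: it recovers the optimal low-dimensional subspace, matching supervised LDA.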
CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning
Ronghao Lin · Qiaolin He · Sijie Mai · Ying Zeng · Aolin Xiong · Li Huang · Yap-Peng Tan · Haifeng Hu
Multimodal machine learning, mimicking the human brain's ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, the presence of modalities is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and lose robustness under dynamically missing modalities. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by adopting token- and label-level Information Bottleneck (IB) cyclically among various modalities. Capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the missing information caused by incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through forward and reverse propagation processes. With the help of the extracted and reconstructed informative latents, CyIN succeeds in jointly optimizing complete and incomplete multimodal learning in one unified model. Extensive experiments on 4 multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
Auto-Compressing Networks
Evangelos Dorovatas · Georgios Paraskevopoulos · Alexandros Potamianos
Deep neural networks with short residual connections have demonstrated remarkable success across domains, but increasing depth often introduces computational redundancy without corresponding improvements in representation quality. We introduce Auto-Compressing Networks (ACNs), an architectural variant where additive long feedforward connections from each layer to the output replace traditional short residual connections. By analyzing the distinct dynamics induced by this modification, we reveal a unique property we coin as auto-compression—the ability of a network to organically compress information during training with gradient descent, through architectural design alone. Through auto-compression, information is dynamically "pushed" into early layers during training, enhancing their representational quality and revealing potential redundancy in deeper ones. We theoretically show that this property emerges from layer-wise training patterns found only in ACNs, where layers are dynamically utilized during training based on task requirements. We also find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, improved transfer learning capabilities, and reduced catastrophic forgetting, suggesting that they learn representations that generalize better despite using fewer parameters. Our results demonstrate up to 18\% reduction in catastrophic forgetting and 30-80\% architectural compression while maintaining accuracy across vision transformers, MLP-mixers, and BERT architectures. These findings establish ACNs as a practical approach to developing efficient neural architectures that automatically adapt their computational footprint to task complexity, while learning robust representations suitable for noisy real-world tasks and continual learning scenarios.
Place Cells as Multi-Scale Position Embeddings: Random Walk Transition Kernels for Path Planning
Minglu Zhao · Dehong Xu · Deqian Kong · Wen-Hao Zhang · Ying Nian Wu
The hippocampus supports spatial navigation by encoding cognitive maps through collective place cell activity. We model the place cell population as non-negative spatial embeddings derived from the spectral decomposition of multi-step random walk transition kernels. In this framework, inner products (equivalently, Euclidean distances) between embeddings encode the similarity between locations in terms of their transition probabilities across multiple scales, forming a cognitive map of adjacency. The combination of non-negativity and inner-product structure naturally induces sparsity, providing a principled explanation for the localized firing fields of place cells without imposing explicit constraints. The temporal parameter that defines the diffusion scale also determines field size, aligning with the hippocampal dorsoventral hierarchy. Our approach constructs global representations efficiently through recursive composition of local transitions, enabling smooth, trap-free navigation and preplay-like trajectory generation. Moreover, theta phase arises intrinsically as the angular relation between embeddings, linking spatial and temporal coding within a single representational geometry.
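The core construction, spectral embeddings of a multi-step transition kernel whose inner products recover multi-scale transition similarity, can be sketched on a toy ring graph. Note this sketch omits the non-negativity constraint the paper imposes on the embeddings:

```python
import numpy as np

def ring_transition(n):
    """Single-step symmetric random-walk kernel on a ring of n locations."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.5
    return P

def multi_scale_embedding(P, t, k):
    """Rank-k spectral embedding of the t-step kernel P^t: inner products of
    embeddings approximate t-step transition similarity, and larger t gives
    coarser maps (larger place fields, per the dorsoventral hierarchy)."""
    Pt = np.linalg.matrix_power(P, t)   # symmetric here, and PSD for even t
    evals, evecs = np.linalg.eigh(Pt)
    idx = np.argsort(evals)[::-1][:k]   # keep the top-k spectral modes
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0.0, None))

P = ring_transition(40)
E = multi_scale_embedding(P, t=8, k=10)   # one 10-dim embedding per location
gram = E @ E.T                            # approximates the 8-step kernel P^8
```

Varying `t` reproduces the paper's scale parameter: small `t` yields sharply localized similarity profiles, large `t` diffuse ones.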
Equivariant Eikonal Neural Networks: Grid-Free, Scalable Travel-Time Prediction on Homogeneous Spaces
Alejandro García-Castellanos · David Wessels · Nicky J. van den Berg · Remco Duits · Daniel M. Pelt · Erik Bekkers
We introduce Equivariant Neural Eikonal Solvers, a novel framework that integrates Equivariant Neural Fields (ENFs) with Neural Eikonal Solvers. Our approach employs a single neural field where a unified shared backbone is conditioned on signal-specific latent variables – represented as point clouds in a Lie group – to model diverse Eikonal solutions. The ENF integration ensures equivariant mapping from these latent representations to the solution field, delivering three key benefits: enhanced representation efficiency through weight-sharing, robust geometric grounding, and solution steerability. This steerability allows transformations applied to the latent point cloud to induce predictable, geometrically meaningful modifications in the resulting Eikonal solution. By coupling these steerable representations with Physics-Informed Neural Networks (PINNs), our framework accurately models Eikonal travel-time solutions while generalizing to arbitrary Riemannian manifolds with regular group actions. This includes homogeneous spaces such as Euclidean, position–orientation, spherical, and hyperbolic manifolds. We validate our approach through applications in seismic travel-time modeling of 2D, 3D, and spherical benchmark datasets. Experimental results demonstrate superior performance, scalability, adaptability, and user controllability compared to existing Neural Operator-based Eikonal solver methods.
The Structure of Relation Decoding Linear Operators in Large Language Models
Miranda Anna Christ · Adrián Csiszárik · Gergely Becsó · Dániel Varga
This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both instances of the country-of-X property). This property-centric structure explains both the operators' compressibility and why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
Soft Task-Aware Routing of Experts for Equivariant Representation Learning
Jaebyeong Jeon · Hyunseo Jang · Jy-yong Sohn · Kibok Lee
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce \textbf{S}oft \textbf{T}ask-\textbf{A}ware \textbf{R}outing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at \url{https://github.com/YonseiML/star}.
Random Forest Autoencoders for Guided Representation Learning
Adrien Aumon · Shuang Ni · Myriam Lizotte · Guy Wolf · Kevin Moon · Jake Rhodes
Extensive research has produced robust methods for unsupervised data visualization. Yet supervised visualization—where expert labels guide representations—remains underexplored, as most supervised approaches prioritize classification over visualization. Recently, RF-PHATE, a diffusion-based manifold learning method leveraging random forests and information geometry, marked significant progress in supervised visualization. However, its lack of an explicit mapping function limits scalability and its application to unseen data, posing challenges for large datasets and label-scarce scenarios. To overcome these limitations, we introduce Random Forest Autoencoders (RF-AE), a neural network-based framework for out-of-sample kernel extension that combines the flexibility of autoencoders with the supervised learning strengths of random forests and the geometry captured by RF-PHATE. RF-AE enables efficient out-of-sample supervised visualization and outperforms existing methods, including RF-PHATE's standard kernel extension, in both accuracy and interpretability. Additionally, RF-AE is robust to the choice of hyperparameters and generalizes to any kernel-based dimensionality reduction method.
Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings
Yanru Wu · Jianning Wang · Xiangyu Chen · Aurora · Yang Tan · Hanbing Liu · Yang Li
Continual learning (CL) has been a critical topic in contemporary deep neural network applications, where higher levels of both forward and backward transfer are desirable for an effective CL performance. Existing CL strategies primarily focus on task models — either by regularizing model updates or by separating task-specific and shared components — while often overlooking the potential of leveraging inter-task relationships to enhance transfer. To address this gap, we propose a transferability-aware task embedding, termed H-embedding, and construct a hypernet framework under its guidance to learn task-conditioned model weights for CL tasks. Specifically, H-embedding is derived from an information theoretic measure of transferability and is designed to be online and easy to compute. Our method is also characterized by notable practicality, requiring only the storage of a low-dimensional task embedding per task and supporting efficient end-to-end training. Extensive evaluations on benchmarks including CIFAR-100, ImageNet-R, and DomainNet show that our framework performs prominently compared to various baseline and SOTA approaches, demonstrating strong potential in capturing and utilizing intrinsic task relationships. Our code is publicly available at \url{https://github.com/viki760/H-embedding-Guided-Hypernet}.
Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation
Yuyang Huang · Yabo Chen · Junyu Zhou · Wenrui Dai · XIAOPENG ZHANG · Junni Zou · Hongkai Xiong · Qi Tian
Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6\% in scenarios with large source-target gaps.
Transfer Learning for Benign Overfitting in High-Dimensional Linear Regression
Yeichan Kim · Ilmun Kim · Seyoung Park
Transfer learning is a key component of modern machine learning, enhancing the performance of target tasks by leveraging diverse data sources. Simultaneously, overparameterized models such as the minimum-$\ell_2$-norm interpolator (MNI) in high-dimensional linear regression have garnered significant attention for their remarkable generalization capabilities, a property known as *benign overfitting*. Despite their individual importance, the intersection of transfer learning and MNI remains largely unexplored. Our research bridges this gap by proposing a novel two-step Transfer MNI approach and analyzing its trade-offs. We characterize its non-asymptotic excess risk and identify conditions under which it outperforms the target-only MNI. Our analysis reveals *free-lunch* covariate shift regimes, where leveraging heterogeneous data yields the benefit of knowledge transfer at limited cost. To operationalize our findings, we develop a data-driven procedure to detect *informative* sources and introduce an ensemble method incorporating multiple informative Transfer MNIs. Finite-sample experiments demonstrate the robustness of our methods to model and data heterogeneity, confirming their advantage.
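One plausible instantiation of a two-step Transfer MNI (a sketch; the paper's exact procedure may differ): interpolate the source with the minimum-$\ell_2$-norm interpolator, then correct with the MNI fit of the target residuals. The resulting estimator still interpolates the scarce target sample:

```python
import numpy as np

def mni(X, y):
    """Minimum-ell2-norm interpolator: the pseudoinverse solution, which
    interpolates the data when d > n (the benign-overfitting regime)."""
    return np.linalg.pinv(X) @ y

def transfer_mni(X_src, y_src, X_tgt, y_tgt):
    """Two-step sketch (assumed form): fit the source, then correct with
    the MNI fit of the target residuals."""
    beta_src = mni(X_src, y_src)
    return beta_src + mni(X_tgt, y_tgt - X_tgt @ beta_src)

rng = np.random.default_rng(0)
d, n_src, n_tgt = 200, 80, 20                 # overparameterized: d > n
beta_true = rng.normal(size=d) / np.sqrt(d)
X_src, X_tgt = rng.normal(size=(n_src, d)), rng.normal(size=(n_tgt, d))
y_src, y_tgt = X_src @ beta_true, X_tgt @ beta_true   # shared signal, no shift
beta_hat = transfer_mni(X_src, y_src, X_tgt, y_tgt)
# the combined estimator interpolates the target data exactly
```

Because the residual fit interpolates the target, the second step never loses target accuracy; the paper's analysis concerns when the source step helps (informative sources) versus hurts (negative transfer).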
Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds
Fan Wang · Pengtao Shao · Yiming Zhang · Bo Yu · Shaoshan Liu · Ning Ding · Yang Cao · Yu Kang · Haifeng Wang
In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and incorporate prior information into the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.
Mixed-Sample SGD: an End-to-end Analysis of Supervised Transfer Learning
Yuyang Deng · Samory Kpotufe
Theoretical works on supervised transfer learning (STL)---where the learner has access to labeled samples from both source and target distributions---have for the most part focused on statistical aspects of the problem, while efficient optimization has received less attention. We consider the problem of designing an SGD procedure for STL that alternates sampling between source and target data, while maintaining statistical transfer guarantees without prior knowledge of the quality of the source data. A main algorithmic difficulty is in understanding how to design such an adaptive sub-sampling mechanism at each SGD step, to automatically gain from the source when it is informative, or bias towards the target and avoid negative transfer when the source is less informative. We show that, such a mixed-sample SGD procedure is feasible for general prediction tasks with convex losses, rooted in tracking an abstract sequence of constrained convex programs that serve to maintain the desired transfer guarantees. We instantiate these results in the concrete setting of linear regression with square loss, and show that the procedure converges, with $1/\sqrt{T}$ rate, to a solution whose statistical performance on the target is adaptive to the a priori unknown quality of the source. Experiments with synthetic and real datasets support the theory.
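The alternating-sampling idea can be sketched on a scalar toy problem; the paper's adaptive sub-sampling mechanism (which tracks constrained convex programs) is replaced here by a fixed source probability, which is an assumption made purely for illustration:

```python
import numpy as np

def mixed_sample_sgd(grad_src, grad_tgt, w0, steps, lr, p_src):
    """SGD that alternates sampling between source and target gradients.
    p_src is a fixed sampling probability (the paper adapts this choice
    to the a priori unknown quality of the source)."""
    rng = np.random.default_rng(0)
    w = float(w0)
    for t in range(steps):
        g = grad_src(w, rng) if rng.random() < p_src else grad_tgt(w, rng)
        w -= lr / np.sqrt(t + 1) * g   # 1/sqrt(T)-style step size
    return w

# both domains share the optimum w* = 3; the source is cleaner (informative)
grad_src = lambda w, rng: (w - 3) + 0.5 * rng.normal()
grad_tgt = lambda w, rng: (w - 3) + 2.0 * rng.normal()
w = mixed_sample_sgd(grad_src, grad_tgt, w0=0.0, steps=2000, lr=0.5, p_src=0.8)
```

When the source is informative, as here, leaning on its lower-variance gradients speeds convergence; the paper's contribution is choosing the sampling bias adaptively so that an uninformative source does not cause negative transfer.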
A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning
Qingyue Zhang · Haohao Fu · Guanbo Huang · Yaoyuan Liang · Chang Chu · Tianren Peng · Yanru Wu · Qi Li · Yang Li · Shao-Lun Huang
Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi-source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code is available at https://github.com/zqy0126/OTQMS.
MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation
Shen Yuan · Yin Zheng · Taifeng Wang · BINBINLIU · Hongteng Xu
Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel "model MoE-ization" strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code is available at https://github.com/DaShenZi721/MoORE.
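A simplified sketch of the SVD-based MoE-ization: factor the frozen weight, treat each rank-one singular pair as an expert, and let a router rescale the singular values per input. The sigmoid router below is a hypothetical form, and the learnable orthogonal transform on the right singular vectors is omitted:

```python
import numpy as np

def moore_forward(x, W, router_weights):
    """Sketch of SVD-based MoE-ization (simplified from MoORE). The frozen
    weight is factored W = U diag(s) V^T; each rank-one pair (u_i, v_i) is
    an orthogonal expert, and a router (hypothetical sigmoid form here)
    rescales the singular values per sample."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    gate = 1.0 / (1.0 + np.exp(-x @ router_weights))   # per-sample expert gates
    coeff = (x @ Vt.T) * s * gate                      # rescaled singular values
    return coeff @ U.T   # outputs stay in the column space of the original W

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
router = rng.normal(size=(32, 16))      # one gate per rank-one expert
out = moore_forward(rng.normal(size=(4, 32)), W, router)   # shape (4, 16)
```

Because the output is a combination of the fixed left singular vectors, the adapted layer cannot leave the original column space, which is the mechanism behind the claimed oblivion-resistance.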
Self-Training with Dynamic Weighting for Robust Gradual Domain Adaptation
Zixi Wang · Yushe Cao · Yubo Huang · Jinzhu Wei · Jingzehua Xu · Shuai Zhang · Xin Lai
In this paper, we propose a new method called \textit{Self-Training with Dynamic Weighting} (STDW), which aims to enhance robustness in Gradual Domain Adaptation (GDA) by addressing the challenge of smooth knowledge migration from the source to the target domain. Traditional GDA methods mitigate domain shift through intermediate domains and self-training but often suffer from inefficient knowledge migration or incomplete intermediate data. Our approach introduces a dynamic weighting mechanism that adaptively balances the loss contributions of the source and target domains during training. Specifically, we design an optimization framework governed by a time-varying hyperparameter $\varrho$ (progressing from 0 to 1), which controls the strength of domain-specific learning and ensures stable adaptation. The method leverages self-training to generate pseudo-labels and optimizes a weighted objective function for iterative model updates, maintaining robustness across intermediate domains. Experiments on rotated MNIST, color-shifted MNIST, portrait datasets, and the Cover Type dataset demonstrate that STDW outperforms existing baselines. Ablation studies further validate the critical role of $\varrho$'s dynamic scheduling in achieving progressive adaptation, confirming its effectiveness in reducing domain bias and improving generalization. This work provides both theoretical insights and a practical framework for robust gradual domain adaptation, with potential applications in dynamic real-world scenarios.
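The dynamic weighting can be sketched as a convex combination of source and target losses governed by $\varrho$; the linear schedule below is an assumption for illustration, as STDW's actual scheduling of $\varrho$ may differ:

```python
def stdw_loss(loss_src, loss_tgt, step, total_steps):
    """Weighted objective with a time-varying rho in [0, 1] (linear
    schedule assumed here): early training emphasizes the source domain,
    late training the pseudo-labeled target domain."""
    rho = step / max(total_steps, 1)
    return (1 - rho) * loss_src + rho * loss_tgt

early = stdw_loss(1.0, 10.0, step=0, total_steps=100)    # source only: 1.0
late = stdw_loss(1.0, 10.0, step=100, total_steps=100)   # target only: 10.0
```

In the gradual-adaptation loop, `step` would index intermediate domains, so the objective hands off smoothly from source supervision to self-training on the target.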
SMRS: advocating a unified reporting standard for surrogate models in the artificial intelligence era.
Elizaveta Semenova · Siobhan Mackenzie Hall · Timothy James Hitge · Alisa Sheinkman · Jon Cockayne
Surrogate models are widely used to approximate complex systems across science and engineering to reduce computational costs. Despite their widespread adoption, the field lacks standardisation across key stages of the modelling pipeline, including data sampling, model selection, evaluation, and downstream analysis. This fragmentation limits reproducibility and cross-domain utility – a challenge further exacerbated by the rapid proliferation of AI-driven surrogate models. We argue for the urgent need to establish a structured reporting standard, the Surrogate Model Reporting Standard (SMRS), that systematically captures essential design and evaluation choices while remaining agnostic to implementation specifics. By promoting a standardised yet flexible framework, we aim to improve the reliability of surrogate modelling, foster interdisciplinary knowledge transfer, and, as a result, accelerate scientific progress in the AI era.
TabArena: A Living Benchmark for Machine Learning on Tabular Data
Nick Erickson · Lennart Purucker · Andrej Tschalzev · David Holzmüller · Prateek Desai · David Salinas · Frank Hutter
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning
Zhonghao He · Tianyi (Alex) Qiu · Hirokazu Shirado · Maarten Sap
Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
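A hedged sketch of the regression-based score: regress belief updates on current beliefs and report the OLS slope, which is zero in expectation under the Martingale property and positive under entrenchment. The exact regression specification in the paper may differ:

```python
import numpy as np

def martingale_score(beliefs):
    """Sketch of a regression-based Martingale Score (assumed form).
    Under rational updating, E[b_{t+1} | b_t] = b_t, so the update
    b_{t+1} - b_t should be unpredictable from b_t; a positive OLS slope
    of updates on current beliefs signals belief entrenchment."""
    b = np.asarray(beliefs, dtype=float)
    x, y = b[:-1], np.diff(b)            # current belief, subsequent update
    x_c, y_c = x - x.mean(), y - y.mean()
    return float(x_c @ y_c / (x_c @ x_c))  # OLS slope

# entrenched trajectory: beliefs drift further in the direction they lean
rng = np.random.default_rng(1)
b = [0.55]
for _ in range(20):
    b.append(b[-1] + 0.1 * (b[-1] - 0.5) + 0.01 * rng.normal())
score = martingale_score(b)   # positive slope: entrenchment detected
```

The score needs no ground-truth labels, only the belief trajectory elicited during reasoning, which is what makes it usable in open-ended domains.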
WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents
Siyu Zhou · Tianyi Zhou · Yijun Yang · Guodong Long · Deheng Ye · Jing Jiang · Chengqi Zhang
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training-free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable codes to regulate LLM agents' policies. We further propose an RL-free, model-based agent "WALL-E" through the model-predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look-ahead optimizer of future steps' actions by interacting with the neurosymbolic world model. While the LLM agent's strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. They together considerably improve learning efficiency in a new environment. On open-world challenges in Mars (Minecraft-like) and ALFWorld (embodied indoor environments), WALL-E significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%–51.6% in success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.
Improving Decision Trees through the Lens of Parameterized Local Search
Juha Harviainen · Frank Sommer · Manuel Sorge
Algorithms for learning decision trees often include heuristic local-search operations such as (1) adjusting the threshold of a cut or (2) also exchanging the feature of that cut. We study minimizing the number of classification errors by performing a fixed number of operations of a single type. Although we discover that the corresponding problems are NP-complete in general, we provide a comprehensive parameterized-complexity analysis with the aim of determining those properties of the problems that explain the hardness and those that make the problems tractable. For instance, we show that the problems remain hard for a small number $d$ of features or small domain size $D$ but the combination of both yields fixed-parameter tractability. That is, the problems are solvable in $(D + 1)^{2d} \cdot |\mathcal{I}|^{O(1)}$ time, where $|\mathcal{I}|$ is the size of the input. We also provide a proof-of-concept implementation of this algorithm and report on empirical results.
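The first operation, adjusting the threshold of a single cut to minimize classification errors, can be sketched with a one-dimensional scan (a toy illustration under the assumptions of distinct feature values and binary labels, not the paper's algorithm):

```python
def best_threshold(values, labels):
    """Adjust the threshold of one cut to minimize 0/1 errors.

    Points with value <= threshold are predicted class 0, the rest
    class 1. Assumes distinct feature values and binary labels;
    a toy sketch of one local-search operation, not the paper's method.
    """
    pairs = sorted(zip(values, labels))
    # threshold below all values: everything is predicted class 1
    err = sum(1 for y in labels if y == 0)
    best_err, best_t = err, float("-inf")
    for v, y in pairs:
        err += -1 if y == 0 else 1  # v moves to the 0-predicted side
        if err < best_err:
            best_err, best_t = err, v
    return best_t, best_err
```

The scan visits each candidate threshold once, so a single operation costs $O(n \log n)$ time dominated by the sort.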
Semi-supervised Vertex Hunting, with Applications in Network and Text Analysis
Yicong Jiang · Zheng Tracy Ke
Vertex hunting (VH) is the task of estimating a simplex from noisy data points and has many applications in areas such as network and text analysis. We introduce a new variant, semi-supervised vertex hunting (SSVH), in which partial information is available in the form of barycentric coordinates for some data points, known only up to an unknown transformation. To address this problem, we develop a method that leverages properties of orthogonal projection matrices, drawing on novel insights from linear algebra. We establish theoretical error bounds for our method and demonstrate that it achieves a faster convergence rate than existing unsupervised VH algorithms. Finally, we apply SSVH to two practical settings---semi-supervised network mixed membership estimation and semi-supervised topic modeling---resulting in efficient and scalable algorithms.
Provable Scaling Laws for the Test-Time Compute of Large Language Models
Yanxi Chen · Xuchen Pan · Yaliang Li · Bolin Ding · Jingren Zhou
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregates them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
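The knockout-style stage structure can be sketched abstractly, with `generate` and the pairwise `better(a, b)` oracle as user-supplied stand-ins for LLM calls (a generic single-elimination sketch, not the paper's exact procedure):

```python
def knockout(problem, generate, better, n=8):
    """Two-stage knockout-style aggregation.

    Stage 1: sample n candidate solutions for the problem.
    Stage 2: single-elimination tournament via pairwise comparison.
    `generate` and `better` are hypothetical placeholders for the
    black-box LLM's generation and comparison calls.
    """
    candidates = [generate(problem) for _ in range(n)]
    while len(candidates) > 1:
        winners = [a if better(a, b) else b
                   for a, b in zip(candidates[::2], candidates[1::2])]
        if len(candidates) % 2:  # an odd leftover candidate gets a bye
            winners.append(candidates[-1])
        candidates = winners
    return candidates[0]
```

With $n$ candidates the tournament uses $n-1$ comparisons, so scaling test-time compute means growing $n$ and/or repeating each pairwise comparison to boost its reliability.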
Two‑Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion
Haoyu Li · Xiangru Zhong · Bin Hu · Huan Zhang
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov‑inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend the state-of-the-art neural network verifier $\alpha,\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield regions of attraction with volumes $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
Matching Markets Meet LLMs: Algorithmic Reasoning with Ranked Preferences
Hadi Hosseini · Samarth Khanna · Ronak Singh
The rise of Large Language Models (LLMs) has driven progress in reasoning tasks, from program synthesis to scientific hypothesis generation, yet their ability to handle ranked preferences and structured algorithms in combinatorial domains remains underexplored. We study matching markets, a core framework behind applications like resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. We evaluate seven state‐of‐the‐art models on a hierarchy of preference‐based reasoning tasks---ranging from stable‐matching generation to instability detection, instability resolution, and fine-grained preference queries---to systematically expose their logical and algorithmic limitations in handling ranked inputs. Surprisingly, even top-performing models with advanced reasoning struggle to resolve instability in large markets, often failing to identify blocking pairs or execute algorithms iteratively. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets, but fails to bring about a similar improvement on large instances, suggesting the need for more sophisticated strategies to improve LLMs' reasoning with larger-context inputs.
GUARD: Constructing Realistic Two-Player Matrix and Security Games for Benchmarking Game-Theoretic Algorithms
Noah Krever · Jakub Cerny · Moise Blanchard · Christian Kroer
Game-theoretic algorithms are commonly benchmarked on recreational games, classical constructs from economic theory such as congestion and dispersion games, or entirely random game instances. While the past two decades have seen the rise of security games -- grounded in real-world scenarios like patrolling and infrastructure protection -- their practical evaluation has been hindered by limited access to the datasets used to generate them. In particular, although the structural components of these games (e.g., patrol paths derived from maps) can be replicated, the critical data defining target values -- central to utility modeling -- remain inaccessible. In this paper, we introduce a flexible framework that leverages open-access datasets to generate realistic matrix and security game instances. These include animal movement data for modeling anti-poaching scenarios and demographic and infrastructure data for infrastructure protection. Our framework allows users to customize utility functions and game parameters, while also offering a suite of preconfigured instances. We provide theoretical results highlighting the degeneracy and limitations of benchmarking on random games, and empirically compare our generated games against random baselines across a variety of standard algorithms for computing Nash and Stackelberg equilibria, including linear programming, incremental strategy generation, and self-play with no-regret learners.
Parameter Dynamics of Online Machine Learning and Test-time Adaptation
Jae-Hong Lee
Pre-trained models based on deep neural networks hold strong potential for cross-domain adaptability. However, this potential is often impeded in online machine learning (OML) settings, where the breakdown of the independent and identically distributed (i.i.d.) assumption leads to unstable adaptation. While recent advances in test-time adaptation (TTA) have addressed aspects of this challenge under unsupervised learning, most existing methods focus exclusively on unsupervised objectives and overlook the risks posed by non-i.i.d. environments and the resulting dynamics of model parameters. In this work, we present a probabilistic framework that models the adaptation process using stochastic differential equations, enabling a principled analysis of parameter distribution dynamics over time. Within this framework, we find that the log-variance of the parameter transition distribution aligns closely with an inverse-gamma distribution under stable and high-performing adaptation conditions. Motivated by this insight, we propose Structured Inverse-Gamma Model Alignment (SIGMA), a novel algorithm that dynamically regulates parameter evolution to preserve inverse-gamma alignment throughout adaptation. Extensive experiments across diverse models, datasets, and adaptation scenarios show that SIGMA consistently enhances the performance of state-of-the-art TTA methods, highlighting the critical role of parameter dynamics in ensuring robust adaptation.
SPACE: SPike-Aware Consistency Enhancement for Test-Time Adaptation in Spiking Neural Networks
Xinyu Luo · Kecheng Chen · Pao-Sheng Sun · Chris Xing TIAN · Arindam Basu · Haoliang Li
Spiking Neural Networks (SNNs), as a biologically plausible alternative to Artificial Neural Networks (ANNs), have demonstrated advantages in terms of energy efficiency, temporal processing, and biological plausibility. However, SNNs are highly sensitive to distribution shifts, which can significantly degrade their performance in real-world scenarios. Traditional test-time adaptation (TTA) methods designed for ANNs often fail to address the unique computational dynamics of SNNs, such as sparsity and temporal spiking behavior. To address these challenges, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method specifically designed for SNNs. SPACE leverages the inherent spike dynamics of SNNs to maximize the consistency of spike-behavior-based local feature maps across augmented versions of a single test sample, enabling robust adaptation without requiring source data. We evaluate SPACE on multiple datasets. Furthermore, SPACE exhibits robust generalization across diverse network architectures, consistently enhancing the performance of SNNs on CNNs, Transformer, and ConvLSTM architectures. Experimental results show that SPACE outperforms state-of-the-art ANN methods while maintaining lower computational cost, highlighting its effectiveness and robustness for SNNs in real-world settings. The code will be available at https://github.com/ethanxyluo/SPACE.
Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation
Jing Wang · Wonho Bae · Jiahong Chen · Wenxu Wang · Junhyug Noh
Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature’s label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature’s latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier’s accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve. Code is available on our Github: https://github.com/JingWang18/DVD-SFDA.
Controlled Visual Hallucination via Thalamus-Driven Decoupling Network for Domain Adaptation of Black-Box Predictors
Yuwu Lu · Chunzhi Liu
Domain Adaptation of Black-box Predictors (DABP) transfers knowledge from a labeled source domain to an unlabeled target domain, without requiring access to either source data or source model. Common practices of DABP leverage reliable samples to suppress negative information about unreliable samples. However, there are still some problems: i) Excessive attention to reliable sample aggregation leads to premature overfitting; ii) Valuable information in unreliable samples is often overlooked. To address them, we propose a novel spatial learning approach, called Controlled Visual Hallucination via Thalamus-driven Decoupling Network (CVH-TDN). Specifically, CVH-TDN is the first work that introduces the thalamus-driven decoupling network in the visual task, relying on its connection with hallucination to control the direction of sample generation in feature space. CVH-TDN is composed of Hallucination Generation (HG), Hallucination Alignment (HA), and Hallucination Calibration (HC), aiming to explore the spatial relationship information between samples and hallucinations. Extensive experiments confirm that CVH-TDN achieves SOTA performance on four standard benchmarks.
Vertical Federated Feature Screening
Huajun Yin · Liyuan Wang · Yingqiu Zhu · Liping Zhu · Danyang Huang
With the rapid development of the big data era, Vertical Federated Learning (VFL) has been widely applied to enable data collaboration while ensuring privacy protection. However, the ultrahigh dimensionality of features and the sparse data structures inherent in large-scale datasets introduce significant computational complexity. In this paper, we propose the Vertical Federated Feature Screening (VFS) algorithm, which effectively reduces computational, communication, and encryption costs. VFS is a two-stage feature screening procedure that proceeds from coarse to fine: the first stage quickly filters out irrelevant feature groups, followed by a more refined screening of individual features. It significantly reduces the resource demands of downstream tasks such as secure joint modeling or federated feature selection. This efficiency is particularly beneficial in scenarios with ultrahigh feature dimensionality or severe class imbalance in the response variable. The statistical and computational properties of VFS are rigorously established. Numerical simulations and real-world applications demonstrate its superior performance.
CTSketch: Compositional Tensor Sketching for Scalable Neurosymbolic Learning
Seewon Choi · Alaia Solko-Breslin · Rajeev Alur · Eric Wong
Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decomposing the symbolic program into sub-programs and summarizing each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and the sketches. We provide theoretical insight into the maximum approximation error. Furthermore, we evaluate CTSketch on benchmarks from the neurosymbolic learning literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that were previously unattainable, with neural predictors obtaining high accuracy on tasks with one thousand inputs, despite supervision only on the final output.
Exploiting Dynamic Sparsity in Einsum
Christoph Staudt · Mark Blacher · Tim Hoffmann · Kaspar Kasche · Olaf Beyersdorff · Joachim Giesen
Einsum expressions specify an output tensor in terms of several input tensors. They offer a simple yet expressive abstraction for many computational tasks in artificial intelligence and beyond. However, evaluating einsum expressions poses hard algorithmic problems that depend on the representation of the tensors. Two popular representations are multidimensional arrays and coordinate lists. The latter is a more compact representation for sparse tensors, that is, tensors where a significant proportion of the entries are zero. So far, however, most of the popular einsum implementations use the multidimensional array representation for tensors. Here, we show on a non-trivial example that, when evaluating einsum expressions, coordinate lists can be exponentially more efficient than multidimensional arrays. In practice, however, coordinate lists can also be significantly less efficient than multidimensional arrays, but it is hard to decide from the input tensors whether this will be the case. Sparsity evolves dynamically in intermediate tensors during the evaluation of an einsum expression. Therefore, we introduce a hybrid solution where the representation is switched on the fly from multidimensional arrays to coordinate lists depending on the sparsity of the remaining tensors. In our experiments on established benchmark einsum expressions, the hybrid solution is consistently competitive with or outperforms the better of the two static representations.
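The on-the-fly representation switch can be illustrated for a single matrix-multiply contraction (a toy sketch of the dense-vs-coordinate-list decision; the threshold value and the plain-dict coordinate list are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

DENSITY_SWITCH = 0.1  # hypothetical switching threshold

def to_coo(a):
    """Coordinate-list representation: {(i, j): value} for nonzeros."""
    return {idx: float(v) for idx, v in np.ndenumerate(a) if v != 0}

def coo_matmul(a_coo, b_coo):
    """'ik,kj->ij' contraction on coordinate lists; the cost scales
    with the number of nonzeros rather than the full array sizes."""
    b_rows = {}
    for (k, j), v in b_coo.items():
        b_rows.setdefault(k, []).append((j, v))
    out = {}
    for (i, k), u in a_coo.items():
        for j, v in b_rows.get(k, ()):
            out[(i, j)] = out.get((i, j), 0.0) + u * v
    return out

def hybrid_matmul(a, b):
    """Pick the representation on the fly from the operands' sparsity
    (toy version of the hybrid idea for one contraction)."""
    density = min(np.count_nonzero(a) / a.size,
                  np.count_nonzero(b) / b.size)
    if density > DENSITY_SWITCH:
        return np.einsum('ik,kj->ij', a, b)  # dense array path
    dense = np.zeros((a.shape[0], b.shape[1]))
    for (i, j), v in coo_matmul(to_coo(a), to_coo(b)).items():
        dense[i, j] = v
    return dense
```

In a full einsum evaluation the same density check would be reapplied to each intermediate tensor, since sparsity evolves during contraction.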
SHGR: A Generalized Maximal Correlation Coefficient
Samuel Stocksieker · Denys Pommeret
Traditional correlation measures, such as Pearson’s and Spearman’s coefficients, are limited in their ability to capture complex relationships, particularly nonlinear and multivariate dependencies. The Hirschfeld–Gebelein–Rényi (HGR) maximal correlation offers a powerful alternative by seeking the highest Pearson correlation attainable through nonlinear transformations of two random variables. However, estimating the HGR remains challenging due to the complexity of optimizing arbitrary nonlinear functions. We introduce a new coefficient inspired by the HGR but grounded in the Spearman rank correlation, which we call the Spearman HGR (SHGR). We propose a neural network-based estimator tailored to (i) bivariate correlation matrix, (ii) multivariate correlations between a set of variables and another one, and (iii) full correlation between two sets of variables. The SHGR satisfies Rényi’s axioms, effectively detects nonlinear dependencies, and demonstrates robustness to noise, outliers, and spurious correlations (hallucinations). Additionally, it achieves competitive computational efficiency through tailored neural architectures. Comprehensive numerical experiments and feature selection tasks confirm that SHGR outperforms existing state-of-the-art methods.
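The Spearman foundation of the coefficient is just Pearson correlation applied to ranks (the textbook definition SHGR builds on, shown here for reference; this is not the paper's neural estimator):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks.

    Uses the double-argsort rank trick, which is exact only when
    there are no ties in x or y (an assumption of this sketch).
    """
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]
```

Because it depends only on ranks, the coefficient is invariant to any strictly monotone transformation of either variable, which is the robustness property SHGR inherits.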
Improving Monte Carlo Tree Search for Symbolic Regression
Zhengyao Huang · Daniel Huang · Tiannan Xiao · Dina Ma · Zhenyu Ming · Hao Shi · Yuanhui Wen
Symbolic regression aims to discover concise, interpretable mathematical expressions that satisfy desired objectives, such as fitting data, which poses a highly combinatorial optimization problem. While genetic programming has been the dominant approach, recent efforts have explored reinforcement learning methods for improving search efficiency. Monte Carlo Tree Search (MCTS), with its ability to balance exploration and exploitation through guided search, has emerged as a promising technique for symbolic expression discovery. However, its traditional bandit strategies and sequential symbol construction often limit performance. In this work, we propose an improved MCTS framework for symbolic regression that addresses these limitations through two key innovations: (1) an extreme bandit allocation strategy tailored for identifying globally optimal expressions, with finite-time performance guarantees under polynomial reward decay assumptions; and (2) evolution-inspired state-jumping actions such as mutation and crossover, which enable non-local transitions to promising regions of the search space. These state-jumping actions also reshape the reward landscape during the search process, improving both robustness and efficiency. We conduct a thorough numerical study of the impact of these improvements and benchmark our approach against existing symbolic regression methods on a variety of datasets, including both ground-truth and black-box datasets. Our approach achieves competitive performance with state-of-the-art libraries in terms of recovery rate, and attains favorable positions on the Pareto frontier of accuracy versus model complexity.
A$^3$E: Towards Compositional Model Editing
Hongming Piao · Hao Wang · Dapeng Wu · Ying Wei
Model editing has become a *de-facto* practice to address hallucinations and outdated knowledge of large language models (LLMs). However, existing methods are predominantly evaluated in isolation, i.e., one edit at a time, failing to consider a critical scenario of compositional model editing, where multiple edits must be integrated and jointly utilized to answer real-world multifaceted questions. For instance, in medical domains, if one edit informs LLMs that COVID-19 causes "fever" and another that it causes "loss of taste", a qualified compositional editor should enable LLMs to answer the question "What are the symptoms of COVID-19?" with both "fever" and "loss of taste" (and potentially more). In this work, we define and systematically benchmark this compositional model editing (CME) task, identifying three key undesirable issues that existing methods struggle with: *knowledge loss*, *incorrect preceding* and *knowledge sinking*. To overcome these issues, we propose A$^3$E, a novel compositional editor that (1) ***a**daptively combines and **a**daptively regularizes* pre-trained foundation knowledge in LLMs in the stage of edit training and (2) ***a**daptively merges* multiple edits to better meet compositional needs in the stage of edit composing. Extensive experiments demonstrate that A$^3$E improves the composability by at least 22.45\% without sacrificing the performance of non-compositional model editing.
Conformal Online Learning of Deep Koopman Linear Embeddings
Ben Gao · Jordan Patracone · Stephane Chretien · Olivier Alata
We introduce Conformal Online Learning of Koopman embeddings (COLoKe), a novel framework for adaptively updating Koopman-invariant representations of nonlinear dynamical systems from streaming data. Our modeling approach combines deep feature learning with multistep prediction consistency in the lifted space, where the dynamics evolve linearly. To prevent overfitting, COLoKe employs a conformal-style mechanism that shifts the focus from evaluating the conformity of new states to assessing the consistency of the current Koopman model. Updates are triggered only when the current model’s prediction error exceeds a dynamically calibrated threshold, allowing selective refinement of the Koopman operator and embedding. Empirical results on benchmark dynamical systems demonstrate the effectiveness of COLoKe in maintaining long-term predictive accuracy while significantly reducing unnecessary updates and avoiding overfitting.
Gradient-Guided Epsilon Constraint Method for Online Continual Learning
Song Lai · Changyi Ma · Fei Zhu · Zhe Zhao · Xi Lin · GAOFENG MENG · Qingfu Zhang
Online Continual Learning (OCL) requires models to learn sequentially from data streams with limited memory. Rehearsal-based methods, particularly Experience Replay (ER), are commonly used in OCL scenarios. This paper revisits ER through the lens of $\epsilon$-constraint optimization, revealing that ER implicitly employs a soft constraint on past task performance, with its weighting parameter post-hoc defining a slack variable. While effective, ER's implicit and fixed slack strategy has limitations: it can inadvertently lead to updates that negatively impact generalization, and its fixed trade-off between plasticity and stability may not optimally balance current streaming with memory retention, potentially overfitting to the memory buffer. To address these shortcomings, we propose the \textbf{G}radient-Guided \textbf{E}psilon \textbf{C}onstraint (\textbf{GEC}) method for online continual learning. GEC explicitly formulates the OCL update as an $\epsilon$-constraint optimization problem that minimizes the loss on the current task data while treating the stability objective as a constraint, and uses a gradient-guided method to dynamically adjust the update direction based on whether the performance on memory samples violates a predefined slack tolerance $\bar{\varepsilon}$: if forgetting exceeds this tolerance, GEC prioritizes constraint satisfaction; otherwise, it focuses on the current task while controlling the rate of increase in memory loss. Empirical evaluations on standard OCL benchmarks demonstrate GEC's ability to achieve a superior trade-off, leading to improved overall performance.
HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
Han Liu · Jiaqi Li · Zhi Xu · Xiaotong Zhang · Xiaoming Xu · Fenglong Ma · Yuanman Li · Hong Yu
Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.
Fourier Clouds: Fast Bias Correction for Imbalanced Semi-Supervised Learning
Jiawei Gu · Yidi Wang · Qingqiang Sun · Xinming Li · Xiao Luo · Ziyue Qiao
Pseudo-label-based Semi-Supervised Learning (SSL) often suffers from classifier bias, particularly under class imbalance, as inaccurate pseudo-labels tend to exacerbate existing biases towards majority classes. Existing methods, such as \textit{CDMAD}\cite{cdmad}, utilize simplistic reference inputs—typically uniform or blank-colored images—to estimate and correct this bias. However, such simplistic references fundamentally ignore realistic statistical information inherent to real datasets, specifically typical color distributions, texture details, and frequency characteristics. This lack of \emph{statistical representativeness} can lead the model to inaccurately estimate its inherent bias, limiting the effectiveness of bias correction, particularly under severe class imbalance or substantial distribution mismatches between labeled and unlabeled datasets. To overcome these limitations, we introduce the \textbf{FARAD} (Fourier-Adapted Reference for Accurate Debiasing) System. This system utilizes random-phase images, constructed by preserving the amplitude spectrum of real data while randomizing the phase spectrum. This strategy ensures two critical properties: (1) \textbf{Semantic Irrelevance}, as randomizing phase removes any structural or recognizable semantic cues, and (2) \textbf{Statistical Representativeness}, as preserving the amplitude spectrum maintains realistic textures, color distributions, and frequency characteristics. Grounded theoretically in classical Fourier analysis, the FARAD System provides a robust, accurate estimation of per-class biases. Furthermore, computational efficiency is enhanced through optimized real-to-complex (R2C) batched Fast Fourier Transforms (FFTs). Comprehensive experiments demonstrate that our approach significantly improves minority-class accuracy and overall SSL performance, particularly under challenging imbalance scenarios, compared with existing reference-based bias correction methods.
You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering
Hanyang Li · Yuheng Jia · Hui LIU · Junhui Hou
Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose **DCBoost, a parameter-free plug-in** designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3\% and amplifies the silhouette coefficient by over $7\times$. **Code is available at [https://github.com/l-h-y168/DCBoost](https://github.com/l-h-y168/DCBoost).**
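The high-confidence selection step can be sketched as a k-nearest-neighbor label-agreement filter. This is an illustrative simplification: the paper's filter adapts $k$, whereas the fixed threshold here is a hypothetical choice:

```python
import numpy as np

def knn_consistency_filter(features, pseudo_labels, k=4, agree_ratio=0.7):
    """Keep a sample only if enough of its k nearest neighbors share its
    pseudo-label; the survivors serve as trustworthy anchors."""
    d = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                # exclude self-matches
    idx = np.argsort(d, axis=1)[:, :k]         # k nearest neighbors
    agree = (pseudo_labels[idx] == pseudo_labels[:, None]).mean(axis=1)
    return np.where(agree >= agree_ratio)[0]   # indices of reliable samples
```

Samples whose local neighborhood disagrees with their pseudo-label are excluded from the discriminative loss.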
Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing
Alex Clinton · Thomas Zeng · Yiding Chen · Jerry Zhu · Kirthevasan Kandasamy
Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent’s data against others’ to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g. Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel two-sample test statistic inspired by the Cramér–von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in two canonical data sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.
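For reference, the classical two-sample Cramér–von Mises statistic that inspires the mechanism can be computed from ECDF differences over the pooled sample. The paper's reward uses a novel variant; this is only the textbook form:

```python
from bisect import bisect_right

def cvm_two_sample(x, y):
    """Classic two-sample Cramér-von Mises statistic: squared difference of
    the two empirical CDFs, summed over all pooled observations."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    total = 0.0
    for z in xs + ys:
        f_n = bisect_right(xs, z) / n      # ECDF of sample x at z
        g_m = bisect_right(ys, z) / m      # ECDF of sample y at z
        total += (f_n - g_m) ** 2
    return n * m / (n + m) ** 2 * total
```

Identical samples score zero, while data fabricated from a different distribution inflates the statistic, which a discrepancy-based reward can then penalize.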
Certifying Concavity and Monotonicity in Games via Sum-of-Squares Hierarchies
Vincent Leon · Iosif Sakos · Ryann Sim · Antonios Varvitsiotis
Concavity and its refinements underpin tractability in multiplayer games, where players independently choose actions to maximize their own payoffs which depend on other players’ actions. In concave games, where players' strategy sets are compact and convex, and their payoffs are concave in their own actions, strong guarantees follow: Nash equilibria always exist and decentralized algorithms converge to equilibria. If the game is furthermore monotone, an even stronger guarantee holds: Nash equilibria are unique under strictness assumptions. Unfortunately, we show that certifying concavity or monotonicity is NP-hard, already for games with multivariate polynomial utilities and compact, convex, basic semialgebraic strategy sets—an expressive class that captures extensive-form games with imperfect recall. On the positive side, we develop two hierarchies of sum-of-squares programs that certify concavity and monotonicity of a given game, and each level of the hierarchies can be solved in polynomial time. We show that almost all concave/monotone games are certified at some finite level of the hierarchies. Subsequently, we introduce the classes of SOS-concave/monotone games, which globally approximate concave/monotone games, and show that for any given game we can compute the closest SOS-concave/monotone game in polynomial time. Finally, we apply our techniques to canonical examples of extensive-form games with imperfect recall.
Clustering via Hedonic Games: New Concepts and Algorithms
Gergely Csáji · Alexander Gundert · Jörg Rothe · Ildikó Schlotter
We study fundamental connections between coalition formation games and clustering, illustrating the cross-disciplinary relevance of these concepts. We focus on graphical hedonic games where agents' preferences are compactly represented by a friendship graph and an enemy graph. In the context of clustering, friendship relations naturally align with data point similarities, whereas enmity corresponds to dissimilarities. We consider two stability notions based on single-agent deviations: local popularity and local stability. Exploring these concepts from an algorithmic viewpoint, we design efficient mechanisms for finding locally stable or locally popular partitions. Besides gaining theoretical insight into the computational complexity of these problems, we perform simulations that demonstrate how our algorithms can be successfully applied in clustering and community detection. Our findings highlight the interplay between coalition formation games and data-driven clustering techniques, offering fresh perspectives and applications in both areas.
Learning from Delayed Feedback in Games via Extra Prediction
Yuma Fujimoto · Kenshi Abe · Kaito Ariu
This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, a prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, time delay in observing past rewards hinders this prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of social regret and convergence. We then propose weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is weighted $n$ times, capturing the intuition that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers the desirable guarantees: social regret is constant in general-sum normal-form games, and the strategies exhibit last-iterate convergence to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
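One (W)OFTRL step on the probability simplex with an entropic regularizer can be sketched as follows. This is a hedged illustration, not the paper's exact algorithm: the prediction vector is taken to be the most recently observed (possibly delayed) reward gradient, counted `weight` times; `weight=1` recovers vanilla OFTRL, while WOFTRL uses a weight exceeding the delay:

```python
import math

def woftrl_step(observed_grads, eta=0.1, weight=2, dim=3):
    """One weighted-optimistic FTRL update for a reward-maximizing player:
    softmax over cumulative observed gradients plus a weighted prediction."""
    cum = [0.0] * dim
    for g in observed_grads:
        for i in range(dim):
            cum[i] += g[i]
    if observed_grads:
        pred = observed_grads[-1]        # last observed (delayed) gradient
        for i in range(dim):
            cum[i] += weight * pred[i]
    scores = [math.exp(eta * c) for c in cum]   # entropic mirror step
    z = sum(scores)
    return [s / z for s in scores]
```

A larger optimistic weight pushes the strategy further toward the predicted reward direction, which is how it can compensate for stale observations.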
Beyond Last-Click: An Optimal Mechanism for Ad Attribution
Nan An · Weian Li · Qi Qi · Changyuan Yu · Liang Zhang
Accurate attribution across multiple platforms is critical for evaluating performance-based advertising. However, existing attribution methods rely heavily on heuristics, e.g., the Last-Click Mechanism (LCM), which always allocates the attribution to the platform with the latest report, lacking theoretical guarantees for attribution accuracy. In this work, we propose a novel theoretical model for the advertising attribution problem, in which we aim to design optimal dominant strategy incentive compatible (DSIC) mechanisms and evaluate their performance. We first show that LCM is not DSIC and performs poorly in terms of accuracy and fairness. To address this limitation, we introduce the Peer-Validated Mechanism (PVM), a DSIC mechanism in which a platform's attribution depends solely on the reports of other platforms. We then examine the accuracy of PVM across both homogeneous and heterogeneous settings, and provide provable accuracy bounds for each case. Notably, we show that PVM is the optimal DSIC mechanism in the homogeneous setting. Finally, numerical experiments are conducted to show that PVM consistently outperforms LCM in terms of attribution accuracy and fairness.
Online Strategic Classification With Noise and Partial Feedback
Tianrun Zhao · Xiaojie Mao · Yong Liang
In this paper, we study an online strategic classification problem, where a principal aims to learn an accurate binary linear classifier from sequentially arriving agents. For each agent, the principal announces a classifier. The agent can strategically exercise costly manipulations of its features to be classified into the favorable positive class. The principal is unaware of the true feature-label distribution, and observes all reported features but only the labels of positively classified agents. We assume that the true feature-label distribution is given by a halfspace model subject to arbitrary feature-dependent bounded noise (i.e., Massart noise). This problem faces the combined challenges of agents' strategic feature manipulations, partial label observations, and label noise. We tackle these challenges with a novel learning algorithm. We show that the proposed algorithm yields classifiers that converge to the clairvoyant optimal one and attains a regret rate of $O(\sqrt{T})$ up to poly-logarithmic and constant factors over $T$ cycles.
Multidimensional Bayesian Utility Maximization: Tight Approximations to Welfare
Kira Goldner · Taylor Lundy
We initiate the study of multidimensional Bayesian utility maximization, focusing on the unit-demand setting where values are i.i.d. across both items and buyers. The seminal result of Hartline and Roughgarden '08 studies simple, information-robust mechanisms that maximize utility for $n$ i.i.d. agents and $m$ identical items via an approximation to social welfare as an upper bound, and they prove this gap between optimal utility and social welfare is $\Theta(1+\log{n/m})$ in this setting. We extend these results to the multidimensional setting. To do so, we develop simple, prior-independent, approximately-optimal mechanisms, targeting the simplest benchmark of optimal welfare. We give a $(1-1/e)$-approximation when there are more items than buyers, and a $\Theta(\log{n/m})$-approximation when there are more buyers than items, and we prove that this bound is tight in both $n$ and $m$ by reducing the i.i.d. unit-demand setting to the identical items setting. Finally, we include an extensive discussion section on why Bayesian utility maximization is a promising research direction. In particular, we characterize complexities in this setting that defy our intuition from the welfare and revenue literature, and motivate why coming up with a better benchmark than welfare is a hard problem itself.
Strategic Classification with Non-Linear Classifiers
Benyamin Trachtenberg · Nir Rosenfeld
In strategic classification, the standard supervised learning setting is extended to support the notion of strategic user behavior in the form of costly feature manipulations made in response to a classifier. While standard learning supports a broad range of model classes, the study of strategic classification has, so far, been dedicated mostly to linear classifiers. This work aims to expand the horizon by exploring how strategic behavior manifests under non-linear classifiers and what this implies for learning. We take a bottom-up approach showing how non-linearity affects decision boundary points, classifier expressivity, and model class complexity. Our results show how, unlike the linear case, strategic behavior may either increase or decrease effective class complexity, and that the complexity decrease may be arbitrarily large. Another key finding is that universal approximators (e.g., neural nets) are no longer universal once the environment is strategic. We demonstrate empirically how this can create performance gaps even on an unrestricted model class.
Metritocracy: Representative Metrics for Lite Benchmarks
Ariel Procaccia · Ben Schiffer · Serena Wang · Shirley Zhang
A common problem in LLM evaluation is how to choose a subset of metrics from a full suite of possible metrics. Subset selection is usually done for efficiency or interpretability reasons, and the goal is often to select a "representative" subset of metrics. However, "representative" is rarely clearly defined. In this work, we use ideas from social choice theory to formalize two notions of representation for the selection of a subset of evaluation metrics. We first introduce positional representation, which guarantees every alternative is sufficiently represented at every position cutoff. We then introduce positional proportionality, which guarantees no alternative is proportionally over- or under-represented by more than a small error at any position. We prove upper and lower bounds on the smallest number of metrics needed to guarantee either of these properties in the worst case. We also study a generalized form of each property that allows for additional input on groups of metrics that must be represented. Finally, we tie theory to practice through real-world case studies on both LLM evaluation and hospital quality evaluation.
Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks
Fanghui Liu · Leello Tadesse Dadi · Volkan Cevher
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions represented by neural networks, as the curse of dimensionality (CoD) cannot be evaded when trying to approximate even a single ReLU neuron. In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms (e.g., the path norm, the Barron norm) from the perspective of sample complexity and generalization properties. First, we show that the path norm (as well as the Barron norm) yields width-independent sample complexity bounds, which allow for uniform convergence guarantees. Based on this result, we derive an improved metric entropy bound for $\epsilon$-covering up to $O(\epsilon^{-\frac{2d}{d+2}})$ ($d$ is the input dimension and the constant depends at most polynomially on $d$) via the convex hull technique, which demonstrates a separation from kernel methods, which require $\Omega(\epsilon^{-d})$, in learning a target function in a Barron space. Second, this metric entropy result allows for building a sharper generalization bound under a general moment hypothesis setting, achieving the rate $O(n^{-\frac{d+2}{2d+2}})$. Our analysis is novel in that it offers a sharper and more refined estimate of the metric entropy (with a clear dependence on the dimension $d$) and handles unbounded sampling in the estimation of the sample error and the output error.
Memory-Enhanced Neural Solvers for Routing Problems
Felix Chalumeau · Refiloe Shabe · Noah De Nicola · Arnu Pretorius · Tom Barrett · Nathan Grinsztajn
Routing Problems are central to many real-world applications, yet remain challenging due to their (NP-)hard nature. Amongst existing approaches, heuristics often offer the best trade-off between quality and scalability, making them suitable for industrial use. While Reinforcement Learning (RL) offers a flexible framework for designing heuristics, its adoption over handcrafted heuristics remains incomplete. Existing learned methods still lack the ability to adapt to specific instances and fully leverage the available computational budget. Current best methods either rely on a collection of pre-trained policies, or on RL fine-tuning; hence failing to fully utilize newly available information within the constraints of the budget. In response, we present MEMENTO, an approach that leverages memory to improve the search of neural solvers at inference. MEMENTO updates the action distribution dynamically based on the outcome of previous decisions. We validate its effectiveness on Traveling Salesman and Capacitated Vehicle Routing problems, demonstrating its superiority over tree-search and policy-gradient fine-tuning; and showing that it can be zero-shot combined with diversity-based solvers. We successfully train all RL auto-regressive solvers on large instances, and verify MEMENTO's scalability and data-efficiency: pushing the state-of-the-art on 11 out of 12 evaluated tasks.
Reconstruction and Secrecy under Approximate Distance Queries
Shay Moran · Elizaveta Nesterova
Consider the task of locating an unknown target point using approximate distance queries: in each round, a reconstructor selects a reference point and receives a noisy version of its distance to the target. This problem arises naturally in various contexts—from localization in GPS and sensor networks to privacy-aware data access—making it relevant from the perspective of both the reconstructor (seeking accurate recovery) and the responder (aiming to limit information disclosure, e.g., for privacy or security reasons). We study this reconstruction game through a learning-theoretic lens, focusing on the rate and limits of the best possible reconstruction error. Our first result provides a tight geometric characterization of the optimal error in terms of the Chebyshev radius, a classical concept from geometry. This characterization applies to all compact metric spaces (in fact, to all totally bounded spaces) and yields explicit formulas for natural subsets of the Euclidean metric. Our second result addresses the asymptotic behavior of reconstruction, distinguishing between pseudo-finite spaces, where the optimal error is attained after finitely many queries, and spaces where the approximation curve exhibits a nontrivial decay. We characterize pseudo-finiteness for convex subsets of Euclidean spaces.
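The Chebyshev radius that characterizes the optimal reconstruction error can be illustrated with a toy finite computation: it is the best worst-case error a single guess can achieve against any target in a set. The function below is our own finite sketch, not the paper's general characterization for compact metric spaces:

```python
def chebyshev_radius(candidate_centers, target_set, dist=lambda a, b: abs(a - b)):
    """Smallest worst-case distance from any candidate center to the
    target set: min over centers of the max distance to a target."""
    return min(max(dist(c, t) for t in target_set) for c in candidate_centers)
```

For example, for targets {0, 4} on the real line, the midpoint 2 is the optimal center and the radius is 2: no single guess can guarantee error below 2.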
A solvable model of learning generative diffusion: theory and insights
Hugo Cui · Cengiz Pehlevan · Yue Lu
In this manuscript, we analyze a solvable model of flow- or diffusion-based generative models. We consider the problem of learning a model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise, and lead to model collapse when the generative model is re-trained on generated synthetic data.
On Agnostic PAC Learning in the Small Error Regime
Julian Asilis · Mikael Møller Høgsgaard · Grigoris Velegkas
Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are more difficult to learn than realizable distributions -- even when one discounts a learner's error by $\mathrm{err}(h^\ast_\mathcal{D})$, i.e., the error of the best hypothesis in $\mathcal{H}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\mathcal{D}$ for which $\tau = \mathrm{err}(h^\ast_\mathcal{D})$ is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS '24) addresses this shortcoming by including $\tau$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $\tau + \Omega\left(\sqrt{\frac{\tau (d + \log(1/\delta))}{m}} + \frac{d + \log(1/\delta)}{m}\right)$ in a regime where $\tau > d/m$, and leave open the question of whether there may be a higher lower bound when $\tau \approx d/m$, with $d$ denoting $\mathrm{VC}(\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \cdot \tau + O\left(\sqrt{\frac{\tau (d + \log(1/\delta))}{m}} + \frac{d + \log(1/\delta)}{m}\right)$ for a constant $c \leq 2.1$, matching the lower bound and demonstrating optimality when $\tau = O(d/m)$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS '24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.
Selective Omniprediction and Fair Abstention
Sílvia Casacuberta · Varun Kanade
We propose new learning algorithms for building selective classifiers, which are predictors that are allowed to abstain on some fraction of the domain. We study the model where a classifier may abstain from predicting at a fixed cost. Building on the recent framework on multigroup fairness and omniprediction, given a pre-specified class of loss functions, we provide an algorithm for building a single classifier that learns abstentions and predictions optimally for every loss in the entire class, where the abstentions are decided efficiently for each specific loss function by applying a fixed post-processing function. Our algorithm and theoretical guarantees generalize the previously-known algorithms for learning selective classifiers in formal learning-theoretic models. We then extend the traditional multigroup fairness algorithms to the selective classification setting and show that we can use a calibrated and multiaccurate predictor to efficiently build selective classifiers that abstain optimally not only globally but also locally within each of the groups in any pre-specified collection of possibly intersecting subgroups of the domain, and are also accurate when they do not abstain. We show how our abstention algorithms can be used as conformal prediction methods in the binary classification setting to achieve both marginal and group-conditional coverage guarantees for an intersecting collection of groups. We provide empirical evaluations for all of our theoretical results, demonstrating the practicality of our learning algorithms for abstaining optimally and fairly.
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent
Jiaqi Li · Zhipeng Lou · Johannes Schmidt-Hieber · Wei Biao Wu
Stochastic Gradient Descent (SGD) and its Ruppert–Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings remain poorly understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q\ge2$ in general $\ell^s$-norms, and, in particular, the $\ell^{\infty}$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide sharp high-probability concentration analysis which entails the probabilistic bound of high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
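The coupling idea behind geometric-moment contraction can be illustrated on a toy least-squares problem: two constant-learning-rate SGD chains started from different points but driven by the same data stream contract toward each other geometrically. This is only an illustrative sketch of the phenomenon, not the paper's construction:

```python
import random

def coupled_sgd_distances(steps=200, lr=0.1, dim=3, seed=0):
    """Run two SGD chains on a linear-regression loss with shared samples
    and return the per-step distance between them."""
    rng = random.Random(seed)
    theta_star = [1.0] * dim
    a = [0.0] * dim          # chain started at the origin
    b = [5.0] * dim          # chain started far away
    dists = []
    for _ in range(steps):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        noise = rng.gauss(0, 0.1)
        y = sum(xi * si for xi, si in zip(x, theta_star)) + noise
        for th in (a, b):    # identical sample drives both chains
            pred = sum(xi * ti for xi, ti in zip(x, th))
            for i in range(dim):
                th[i] -= lr * (pred - y) * x[i]
        dists.append(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5)
    return dists
```

Because the chains forget their initializations at a geometric rate, the iterates behave like a stationary time series, which is what the paper's analysis exploits.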
Principled Long-Tailed Generative Modeling via Diffusion Models
Pranoy Das · Kexin Fu · Abolfazl Hashemi · Vijay Gupta
Deep generative models, particularly diffusion models, have achieved remarkable success but face significant challenges when trained on real-world, long-tailed datasets, where few "head" classes dominate and many "tail" classes are underrepresented. This paper develops a theoretical framework for long-tailed learning via diffusion models through the lens of deep mutual learning. We introduce a novel regularized training objective that combines the standard diffusion loss with a mutual learning term, enabling balanced performance across all class labels, including the underrepresented tails. Our approach to learning via the proposed regularized objective is to formulate it as a multi-player game, with Nash equilibrium serving as the solution concept. We derive a non-asymptotic first-order convergence result for the individual gradient descent algorithm to find the Nash equilibrium. We show that the Nash gap of the score network obtained from the algorithm is upper bounded by $\mathcal{O}(\frac{1}{\sqrt{T_{train}}}+\beta)$ where $\beta$ is the regularizing parameter and $T_{train}$ is the number of iterations of the training algorithm. Furthermore, we theoretically establish hyper-parameters for the training and sampling algorithms that ensure that we find conditional score networks (under our model) with a worst case sampling error $\mathcal{O}(\epsilon+1), \forall \epsilon>0$ across all class labels. Our results offer insights and guarantees for training diffusion models on imbalanced, long-tailed data, with implications for fairness, privacy, and generalization in real-world generative modeling scenarios.
On the sample complexity of semi-supervised multi-objective learning
Tobias Wegel · Geelon So · Junhyung Park · Fanny Yang
In multi-objective learning (MOL), several possibly competing prediction tasks must be solved jointly by a single model. Achieving good trade-offs may require a model class $\mathcal{G}$ with larger capacity than what is necessary for solving the individual tasks. This, in turn, increases the statistical cost, as reflected in known MOL bounds that depend on the complexity of $\mathcal{G}$. We show that this cost is unavoidable for some losses, even in an idealized semi-supervised setting, where the learner has access to the Bayes-optimal solutions for the individual tasks as well as the marginal distributions over the covariates. On the other hand, for objectives defined with Bregman losses, we prove that the complexity of $\mathcal{G}$ may come into play only in terms of unlabeled data. Concretely, we establish sample complexity upper bounds, showing precisely when and how unlabeled data can significantly alleviate the need for labeled data. This is achieved by a simple pseudo-labeling algorithm.
Understanding Data Influence in Reinforcement Finetuning
Haoru Tan · Xiuzhe Wu · Sitong Wu · Shaofeng Zhang · Yanfeng Chen · Xingwu Sun · Jeanne Shen · Xiaojuan Qi
Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning and generalization capabilities of large language models, but its success heavily relies on the quality of the training data. While data selection has been extensively studied in supervised learning, its role in reinforcement learning, particularly during the RFT stage, remains largely underexplored. In this work, we introduce RFT-Inf, the first influence estimator designed for data in reinforcement learning. RFT-Inf quantifies the importance of each training example by measuring how its removal affects the final training reward, offering a direct estimate of its contribution to model learning. To ensure scalability, we propose a first-order approximation of the RFT-Inf score by backtracking through the optimization process and applying temporal differentiation to the sample-wise influence term, along with a first-order Taylor approximation to adjacent time steps. This yields a lightweight, gradient-based estimator that evaluates the alignment between an individual sample’s gradient and the average gradient direction of all training samples, where a higher degree of alignment implies greater training utility. Extensive experiments demonstrate that RFT-Inf consistently improves reward performance and accelerates convergence in reinforcement fine-tuning.
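The alignment quantity at the core of the first-order estimator can be sketched in a few lines: each sample is scored by the dot product of its gradient with the mean gradient over all training samples. This omits the backtracking through optimization steps that the full estimator performs:

```python
def gradient_alignment_scores(sample_grads):
    """Score each sample's gradient by its dot product with the mean
    gradient; higher alignment suggests greater training utility."""
    n, d = len(sample_grads), len(sample_grads[0])
    mean = [sum(g[i] for g in sample_grads) / n for i in range(d)]
    return [sum(gi * mi for gi, mi in zip(g, mean)) for g in sample_grads]
```

Samples whose gradients point against the average direction receive negative scores, flagging them as candidates for removal.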
Uni-RL: Unifying Online and Offline RL via Implicit Value Regularization
Haoran Xu · Liyuan Mao · Hui Jin · Weinan Zhang · Xianyuan Zhan · Amy Zhang
The practical use of reinforcement learning (RL) requires handling diverse settings, including online, offline, and offline-to-online learning. Instead of developing separate algorithms for each setting, we propose Uni-RL, a unified model-free RL framework that addresses all these scenarios within a single formulation. Uni-RL builds on the Implicit Value Regularization (IVR) framework and generalizes its dataset behavior constraint to a constraint with respect to a reference policy, yielding a unified value learning objective for general settings. The reference policy is chosen to be the target policy in the online setting and the behavior policy in the offline setting. Using an iteratively refined behavior policy solves the over-constrained problem of directly applying IVR in the online setting, and provides an implicit trust-region-style update through the value function while remaining off-policy. Uni-RL also introduces a unified policy extraction objective that estimates the in-sample policy gradient using only actions from the reference policy. This supports various policy classes and theoretically guarantees smaller value estimation error and larger performance improvement over the reference policy. We evaluate Uni-RL on a range of standard RL benchmarks across online, offline, and offline-to-online settings. In online RL, Uni-RL achieves higher sample efficiency than both off-policy methods without trust-region updates and on-policy methods with trust-region updates. In offline RL, Uni-RL retains the benefits of in-sample learning while outperforming IVR through better policy extraction. In offline-to-online RL, Uni-RL beats both constraint-based methods and unconstrained approaches by effectively balancing stability and adaptability.
Horizon Reduction Makes RL Scalable
Seohong Park · Kevin Frans · Deepinder Mann · Benjamin Eysenbach · Aviral Kumar · Sergey Levine
In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000× larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL.
Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning
Wenlin Zhang · Xiangyang Li · Kuicai Dong · Yichao Wang · Pengyue Jia · Xiaopeng Li · Yingyi Zhang · Derong Xu · Zhaocheng Du · Huifeng Guo · Ruiming Tang · Xiangyu Zhao
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge, yet traditional RAG systems struggle with static workflows and limited adaptability for complex, multistep reasoning tasks. Agentic RAG systems, such as DeepResearch, address these issues through dynamic retrieval, iterative context refinement, and adaptive workflows. However, recent methods like Search-R1, which rely on outcome-based reinforcement learning, face challenges such as low exploration efficiency, gradient conflict, and sparse reward signals. To tackle these limitations, we introduce ReasonRAG, a novel method that leverages RAG-ProGUIDE—a high-quality dataset providing fine-grained, process-level rewards for query generation, evidence extraction, and answer generation. By employing process-supervised reinforcement learning, ReasonRAG enhances LLMs’ autonomous capabilities in search, query generation, evidence extraction, and answer synthesis. Experimental results show that ReasonRAG, utilizing RAG-ProGUIDE, outperforms existing approaches like Search-R1 and traditional RAG systems, achieving superior performance on five benchmark datasets with only 5k training instances—significantly fewer than the 90k required by Search-R1. Our code is available at https://github.com/Applied-Machine-Learning-Lab/ReasonRAG.
Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models
Jiajun Fan · Tong Wei · Chaoran Cheng · Yuxin Chen · Ge Liu
Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge, as existing approaches rely on fixed divergence regularization that creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates, reducing regularization for high-value samples while applying stronger regularization to poor samples, enabling policies to navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation, attaining better semantic alignment and diversity than offline methods like DPO and online methods with fixed regularization like ORW-CFM-W2. ADRPO enables a 2B parameter SD3 model to surpass much larger models with 4.8B and 12B parameters in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods like GRPO while requiring no additional networks or complex architectural changes. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, while in multi-modal audio reasoning it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to outperform substantially larger commercial models, including Gemini 2.5 Pro and GPT-4o Audio. ADRPO thus offers an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach
woohyeon Byeon · Giseung Park · Jongseong Chae · Amir Leshem · Youngchul Sung
In this paper, we propose a provably convergent and practical framework for multi-objective reinforcement learning with max-min criterion. From a game-theoretic perspective, we reformulate max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game and introduce an efficient algorithm based on mirror descent. Our approach simplifies the policy update while ensuring global last-iterate convergence. We provide a comprehensive theoretical analysis on our algorithm, including iteration complexity under both exact and approximate policy evaluations, as well as sample complexity bounds. To further enhance performance, we modify the proposed algorithm with adaptive regularization. Our experiments demonstrate the convergence behavior of the proposed algorithm in tabular settings, and our implementation for deep reinforcement learning significantly outperforms previous baselines in many MORL environments.
Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Reinforcement Learning
Younggyo Seo · Pieter Abbeel
Predicting a sequence of actions has been crucial in the success of recent behavior cloning algorithms in robotics. Can similar ideas improve reinforcement learning (RL)? We answer affirmatively by observing that incorporating action sequences when predicting ground-truth return-to-go leads to lower validation loss. Motivated by this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network that outputs Q-values over a sequence of actions, i.e., explicitly training the value function to learn the consequence of executing action sequences. Our experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks from BiGym and RLBench.
LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models
Qianyue Hao · Yiwen Song · Qingmin Liao · Jian Yuan · Yong Li
Policy exploration is critical in reinforcement learning (RL), where existing approaches include $\epsilon$-greedy, Gaussian process, etc. However, these approaches utilize preset stochastic processes and are indiscriminately applied to all kinds of RL tasks without considering task-specific features that influence policy exploration. Moreover, during RL training, the evolution of such stochastic processes is rigid, typically incorporating only a decay in the variance and failing to adjust flexibly according to the agent's real-time learning status. Inspired by the analysis and reasoning capabilities of large language models (LLMs), we design **LLM-Explorer** to adaptively generate task-specific exploration strategies with LLMs, enhancing policy exploration in RL. In our design, we sample the learning trajectory of the agent during RL training on a given task and prompt the LLM to analyze the agent's current policy learning status and then generate a probability distribution for future policy exploration. Updating the probability distribution periodically, we derive a stochastic process specialized for the particular task and dynamically adjusted to adapt to the learning process. Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any variants developed based on them. Through extensive experiments on the Atari and MuJoCo benchmarks, we demonstrate LLM-Explorer's capability to enhance RL policy exploration, achieving an average performance improvement of up to 37.27%. Our code is open-source at https://github.com/tsinghua-fib-lab/LLM-Explorer for reproducibility.
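The plug-in mechanism can be illustrated with a minimal sketch (hypothetical names; in LLM-Explorer the exploration distribution is periodically regenerated by an LLM from the agent's learning trajectory, which is reduced here to a single exploit probability):

```python
import random

def choose_action(q_values, p_exploit, rng):
    # Sample an exploration mode from an externally supplied distribution.
    # In LLM-Explorer this distribution is periodically regenerated by an
    # LLM analyzing the agent's learning status; here it is a constant.
    if rng.random() < p_exploit:
        # exploit: greedy action under the current Q-estimates
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # explore: uniform random action
    return rng.randrange(len(q_values))

rng = random.Random(0)
a = choose_action([0.1, 0.9, 0.3], 0.9, rng)  # -> 1 (greedy branch)
```

Replacing `p_exploit` with a richer, periodically updated distribution over noise scales is what makes the module compatible with DQN-style and DDPG/TD3-style algorithms alike.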
Improving Regret Approximation for Unsupervised Dynamic Environment Generation
Harry Mead · Bruno Lacerda · Jakob Foerster · Nick Hawes
Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: \url{https://github.com/HarryMJMead/Dynamic-Environment-Generation-for-UED}.
Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control
Chengxiu HUA · Jiawen Gu · Yushun Tang
Reinforcement learning (RL) has achieved significant success across a wide range of domains; however, most existing methods are formulated in discrete time. In this work, we introduce a novel RL method for continuous-time control, where stochastic differential equations govern state-action dynamics. Departing from traditional value function-based approaches, our key contribution is the characterization of continuous-time Q-functions via a martingale condition and the linking of diffusion policy scores to the action gradient of a learned continuous Q-function by the dynamic programming principle. This insight motivates Continuous Q-Score Matching (CQSM), a score-based policy improvement algorithm. Notably, our method addresses a long-standing challenge in continuous-time RL: preserving the action-evaluation capability of Q-functions without relying on time discretization. We further provide theoretical closed-form solutions for linear-quadratic (LQ) control problems within our framework. Numerical results in simulated environments demonstrate the effectiveness of our proposed method and compare it to popular baselines.
Minimax Adaptive Online Nonparametric Regression over Besov spaces
Paul Liautaud · Pierre Gaillard · Olivier Wintenberger
We study online adversarial regression with convex losses against a rich class of continuous yet highly irregular competitor functions, modeled by Besov spaces $B_{pq}^s$ with general parameters $1 \leq p,q \leq \infty$ and smoothness $s > \tfrac{d}{p}$. We introduce an adaptive wavelet-based algorithm that performs sequential prediction without prior knowledge of $(s,p,q)$, and establish minimax-optimal regret bounds against any comparator in $B_{pq}^s$. We further design a locally adaptive extension capable of sequentially adapting to spatially inhomogeneous smoothness. This adaptive mechanism adjusts the resolution of the predictions over both time and space, yielding refined regret bounds in terms of local regularity. Consequently, in heterogeneous environments, our adaptive guarantees can significantly surpass those obtained by standard global methods.
Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities
Taehyun Hwang · Dahngoon Kim · Min-hwan Oh
We study the _multinomial logit_ (MNL) contextual bandit problem for sequential assortment selection. Although most existing research assumes utility functions to be linear in item features, this linearity assumption restricts the modeling of intricate interactions between items and user preferences. A recent work (Zhang & Luo, 2024) has investigated general utility function classes, yet its method faces fundamental trade-offs between computational tractability and statistical efficiency. To address this limitation, we propose a computationally efficient algorithm for MNL contextual bandits leveraging the upper confidence bound principle, specifically designed for non-linear parametric utility functions, including those modeled by neural networks. Under a realizability assumption and a mild geometric condition on the utility function class, our algorithm achieves a regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ denotes the total number of rounds. Our result establishes that sharp $\tilde{\mathcal{O}}(\sqrt{T})$-regret is attainable even with neural network-based utilities, without relying on strong assumptions such as neural tangent kernel approximations. To the best of our knowledge, our proposed method is the first computationally tractable algorithm for MNL contextual bandits with non-linear utilities that provably attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret. Comprehensive numerical experiments validate the effectiveness of our approach, showing robust performance not only in realizable settings but also in scenarios with model misspecification.
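The underlying choice model is multinomial logit: offered an assortment, a user picks item $i$ with probability proportional to $\exp(u_i)$, with an outside no-purchase option. A minimal sketch of these choice probabilities (the paper's algorithm adds optimism on top of learned, possibly neural, utilities):

```python
import math

def mnl_choice_probs(utilities):
    # P(choose item i) = exp(u_i) / (1 + sum_j exp(u_j));
    # the leading 1 is the no-purchase (outside) option.
    expu = [math.exp(u) for u in utilities]
    z = 1.0 + sum(expu)
    return [e / z for e in expu], 1.0 / z  # item probs, P(no purchase)

probs, p_none = mnl_choice_probs([0.0, 1.0])
```

In the bandit setting, the learner repeatedly offers an assortment, observes one choice drawn from these probabilities, and updates its utility model.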
Non-Uniform Multiclass Learning with Bandit Feedback
Steve Hanneke · Amirreza Shaeiri · Hongao Wang
We study the problem of multiclass learning with bandit feedback in both the i.i.d. batch and adversarial online models. In the *uniform* learning framework, it is well known that no hypothesis class $\mathcal{H}$ is learnable in either model when the effective number of labels is unbounded. In contrast, within the *universal* learning framework, recent works by (Hanneke et al., 2025b) and (Hanneke et al., 2025a) have established surprising exact equivalences between learnability under bandit feedback and full supervision in both the i.i.d. batch and adversarial online models, respectively. This raises a natural question: What happens in the *non-uniform* learning framework, which lies between the uniform and universal learning frameworks? Our contributions are twofold: (1) We provide a combinatorial characterization of learnable hypothesis classes in both models, in the realizable and agnostic settings, within the non-uniform learning framework. Notably, this includes elementary and natural hypothesis classes, such as a countably infinite collection of constant functions over some domain that is learnable in both models. (2) We construct a hypothesis class that is non-uniformly learnable under full supervision in the adversarial online model (and thus also in the i.i.d. batch model), but not non-uniformly learnable under bandit feedback in the i.i.d. batch model (and thus also not in the adversarial online model). This serves as our main novel technical contribution that reveals a fundamental distinction between the non-uniform and universal learning frameworks.
Generalizing while preserving monotonicity in comparison-based preference learning models
Julien Fageot · Peva Blanchard · Gilles Bareilles · Lê-Nguyên Hoang
If you tell a learning model that you prefer an alternative $a$ over another alternative $b$, then you probably expect the model to be *monotone*, that is, the valuation of $a$ increases, and that of $b$ decreases. Yet, perhaps surprisingly, many widely deployed comparison-based preference learning models, including large language models, fail to have this guarantee. Until now, the only comparison-based preference learning algorithms that were proved to be monotone are the Generalized Bradley-Terry models. Yet, these models are unable to generalize to uncompared data. In this paper, we advance the understanding of the set of models with generalization ability that are *monotone*. Namely, we propose a new class of Linear Generalized Bradley-Terry models with Diffusion Priors, and identify sufficient conditions on alternatives' embeddings that guarantee monotonicity. Our experiments show that this monotonicity is far from being a general guarantee, and that our new class of generalizing models improves accuracy, especially when the dataset is limited.
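Monotonicity is easiest to see in the plain score-level Bradley-Terry model, where a gradient step on the comparison "$a$ over $b$" provably raises $a$'s valuation and lowers $b$'s. A minimal sketch (the generalizing embedding-based models studied in the paper are exactly where this guarantee can fail):

```python
import math

def bt_step(s_a, s_b, lr=0.1):
    # P(a preferred over b) = sigmoid(s_a - s_b).  One gradient-ascent
    # step on log P moves s_a up and s_b down by the same amount,
    # so the update is monotone by construction.
    p = 1.0 / (1.0 + math.exp(-(s_a - s_b)))
    g = 1.0 - p  # d log P / d s_a = -(d log P / d s_b)
    return s_a + lr * g, s_b - lr * g

a1, b1 = bt_step(0.0, 0.0)  # -> (0.05, -0.05)
```

Once scores are parameterized through shared embeddings, a single comparison can move many alternatives at once, and this per-pair monotonicity is no longer automatic.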
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing
Adel Javanmard · Rudrajit Das · Alessandro Epasto · Vahab Mirrokni
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model’s performance. While prior works have demonstrated the benefits of specific heuristic retraining schemes, the question of how to optimally combine the model's predictions and the provided labels remains largely open. This paper addresses this fundamental question for binary classification tasks. We develop a principled framework based on approximate message passing (AMP) to analyze iterative retraining procedures for two ground truth settings: Gaussian mixture model (GMM) and generalized linear model (GLM). Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels, which when used to retrain the same model, minimizes its prediction error. We also quantify the performance of this optimal retraining strategy over multiple rounds. We complement our theoretical results by proposing a practically usable version of the theoretically-optimal aggregator function and demonstrate its superiority over baseline methods under different label noise models.
Robust Minimax Boosting with Performance Guarantees
Santiago Mazuelas · Veronica Alvarez
Boosting methods often achieve excellent classification accuracy, but can experience notable performance degradation in the presence of label noise. Existing robust methods for boosting provide theoretical robustness guarantees for certain types of label noise, and can exhibit only moderate performance degradation. However, previous theoretical results do not account for realistic types of noise and finite training sizes, and existing robust methods can provide unsatisfactory accuracies, even without noise. This paper presents methods for robust minimax boosting (RMBoost) that minimize worst-case error probabilities and are robust to general types of label noise. In addition, we provide finite-sample performance guarantees for RMBoost with respect to the error obtained without noise and with respect to the best possible error (Bayes risk). The experimental results corroborate that RMBoost is not only resilient to label noise but can also provide strong classification accuracy.
Improved Robust Estimation for Erdős-Rényi Graphs: The Sparse Regime and Optimal Breakdown Point
Hongjie Chen · Jingqiu Ding · Yiding Hua · Stefan Tiegel
We study the problem of robustly estimating the edge density of Erdős–Rényi random graphs $\mathbb{G}(n, d^\circ/n)$ when an adversary can arbitrarily add or remove edges incident to an $\eta$-fraction of the nodes. We develop the first polynomial-time algorithm for this problem that estimates $d^\circ$ up to an additive error $O\left({[\sqrt{\log(n) / n} + \eta\sqrt{\log(1/\eta)} ] \cdot \sqrt{d^\circ} + \eta \log(1/\eta)}\right)$. Our error guarantee matches information-theoretic lower bounds up to factors of $\log(1/\eta)$. Moreover, our estimator works for all $d^\circ \geq \Omega(1)$ and achieves optimal breakdown point $\eta = 1/2$. Previous algorithms [Acharya et al., 2022; Chen et al., 2024], including inefficient ones, incur significantly suboptimal errors. Furthermore, even admitting suboptimal error guarantees, only inefficient algorithms achieve optimal breakdown point. Our algorithm is based on the sum-of-squares (SoS) hierarchy. A key ingredient is to construct constant-degree SoS certificates for concentration of the number of edges incident to small sets in $\mathbb{G}(n, d^\circ/n)$. Crucially, we show that these certificates also exist in the sparse regime, when $d^\circ = o(\log n)$, a regime in which the performance of previous algorithms was significantly suboptimal.
Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data
Chen Fan · Mark Schmidt · Christos Thrampoulidis
Different gradient-based methods for optimizing overparameterized models can all achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. We provide the first complete characterization of implicit optimization bias for p-norm normalized steepest descent (NSD) and momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin with respect to the classifier matrix's p-norm, with established convergence rates. These results encompass important special cases including Spectral Descent and Muon, which we show converge to max-margin solutions with respect to the spectral norm. A key insight of our contribution is that the analysis of general entry-wise and Schatten p-norms can be reduced to the analysis of NSD/NMD with max-norm by exploiting a natural ordering property between all p-norms relative to the max-norm and its dual sum-norm. For the specific case of descent with respect to the max-norm, we further extend our analysis to include preconditioning, showing that Adam converges to the matrix's max-norm solution. Our results demonstrate that the multi-class linear setting, which is inherently richer than the binary counterpart, provides the most transparent framework for studying implicit biases of matrix-parameter optimization algorithms.
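For intuition, the steepest-descent direction under an entrywise p-norm has a closed form in the vector case, which the following sketch computes (the matrix setting of Spectral Descent and Muon instead uses SVD-based directions; this is only the special case):

```python
def nsd_direction(grad, p):
    # Direction d maximizing <grad, d> subject to ||d||_p <= 1:
    # d_i = sign(g_i) |g_i|^{q-1} / ||g||_q^{q-1}, with 1/p + 1/q = 1.
    if p == float("inf"):  # max-norm case: sign descent
        return [(g > 0) - (g < 0) for g in grad]
    q = p / (p - 1.0)  # dual exponent
    scale = sum(abs(g) ** q for g in grad) ** ((q - 1.0) / q)
    return [((g > 0) - (g < 0)) * abs(g) ** (q - 1.0) / scale for g in grad]

d2 = nsd_direction([3.0, -4.0], 2.0)            # -> [0.6, -0.8]
dinf = nsd_direction([3.0, -4.0], float("inf"))  # -> [1, -1]
```

Setting p = ∞ recovers sign descent, the max-norm case to which the paper reduces the analysis of general entrywise and Schatten p-norms.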
Finite Sample Analyses for Continuous-time Linear Systems: System Identification and Online Control
Hongyi Zhou · Jingwei Li · Jingzhao Zhang
The real world evolves in continuous time, but computations are done from finite samples. Therefore, we study algorithms using finite observations in continuous-time linear dynamical systems. We first study the system identification problem and propose the first non-asymptotic error analysis with finite observations. Our algorithm identifies system parameters without needing integrated observations over certain time intervals, making it more practical for real-world applications. Further, we prove a lower bound showing that our estimator is optimal up to constant factors. Moreover, we apply the above algorithm to online control regret analysis for continuous-time linear systems. Our system identification method allows us to explore more efficiently, enabling the swift detection of ineffective policies. We achieve a regret of $\mathcal{O}(\sqrt{T})$ over a single $T$-time horizon in a controllable system, requiring only $\mathcal{O}(T)$ observations of the system.
Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Felix Chalumeau · Daniel Rajaonarivonivelomanantsoa · Ruan John de Kock · Juan Formanek · Sasha Abramowitz · Omayma Mahjoub · Wiem Khlifi · Simon Du Toit · Louay Nessir · Refiloe Shabe · Arnol Fokam · Siddarth Singh · Ulrich Armel Mbou Sob · Arnu Pretorius
Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple of seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. We make all of our experimental data and code available.
Dynamic Algorithm for Explainable $k$-medians Clustering under $\ell_p$ Norm
Konstantin Makarychev · Ilias Papanikolaou · Liren Shan
We study the problem of explainable $k$-medians clustering introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (2020). In this problem, the goal is to construct a threshold decision tree that partitions data into $k$ clusters while minimizing the $k$-medians objective. These trees are interpretable because each internal node makes a simple decision by thresholding a single feature, allowing users to trace and understand how each point is assigned to a cluster. We present the first algorithm for explainable $k$-medians under $\ell_p$ norm for every finite $p \geq 1$. Our algorithm achieves an $\tilde{O}\big(p(\log k)^{1 + 1/p - 1/p^2}\big)$ approximation to the optimal $k$-medians cost for any $p \geq 1$. Previously, algorithms were known only for $p = 1$ and $p = 2$. For $p = 2$, our algorithm improves upon the existing bound of $\tilde O(\log^{3/2}k)$, and for $p = 1$, it matches the tight bound of $\log k + O(1)$ up to a multiplicative $O(\log \log k)$ factor. We show how to implement our algorithm in a dynamic setting. The dynamic algorithm maintains an explainable clustering under a sequence of insertions and deletions, with amortized update time $O(d \log^3 k)$ and $O(\log k)$ recourse, making it suitable for large-scale and evolving datasets.
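Routing a point through such a threshold tree is straightforward, which is what makes the clustering interpretable; a hand-built example (the paper's contribution is constructing and dynamically maintaining trees whose $k$-medians cost is near-optimal):

```python
def assign(point, node):
    # Each internal node is (feature, threshold, left, right);
    # each leaf is a cluster id.  A point's cluster assignment is
    # fully explained by the sequence of single-feature tests it passes.
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if point[f] <= t else right
    return node

tree = (0, 0.5, (1, 0.5, 0, 1), 2)  # 3 clusters over 2 features
assign([0.2, 0.9], tree)  # -> cluster 1
```

Under insertions and deletions, the dynamic algorithm keeps such a tree valid while touching only a small part of it per update.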
Learning Preferences without Interaction for Cooperative AI: A Hybrid Offline-Online Approach
Haitong Ma · Haoran Yu · Haobo Fu · Shuai Li
Reinforcement learning (RL) for collaborative agents capable of cooperating with humans to accomplish tasks has long been a central goal in the RL community. While prior approaches have made progress in adapting collaborative agents to diverse human partners, they often focus solely on optimizing task performance and overlook human preferences, despite the fact that such preferences often diverge from the reward-maximization objective of the environment. Addressing this discrepancy poses significant challenges: humans typically provide only a small amount of offline, preference-related feedback and are unable to engage in online interactions, resulting in a distributional mismatch between the agent's online learning process and the offline human data. To tackle this, we formulate the problem as a joint online-offline reinforcement learning problem that integrates online generalization and offline preference learning, entirely under an offline training regime. We propose a simple yet effective training framework built upon existing RL algorithms that alternates between offline preference learning and online generalization recovery, ensuring the stability and alignment of both learning objectives. We evaluate our approach on a benchmark built upon the Overcooked environment, a standard environment for human-agent collaboration, and demonstrate remarkable performance across diverse preference styles and cooperative scenarios.
Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning
Yunpeng Jiang · Jianshu Hu · Paul Weng · Yutong Ban
Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries—such as reflection, rotation, and translation—while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in robotics tasks such as door opening and closing. We propose Time Reversal symmetry enhanced Deep Reinforcement Learning (TR-DRL), a framework that combines trajectory reversal augmentation and time reversal guided reward shaping to efficiently solve temporally symmetric tasks. Our method generates reversed transitions from fully reversible transitions, identified by a proposed dynamics-consistent filter, to augment the training data. For partially reversible transitions, we apply reward shaping to guide learning, according to successful trajectories from the reversed task. Extensive experiments on the Robosuite and MetaWorld benchmarks demonstrate that TR-DRL is effective in both single-task and multi-task settings, achieving higher sample efficiency and stronger final performance compared to baseline methods.
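The trajectory-reversal augmentation can be sketched in a few lines (hypothetical names; the paper's dynamics-consistent filter is abstracted into a predicate, and the action-reversal map is task-specific):

```python
def reversed_transitions(buffer, reverse_action, is_reversible):
    # For each stored transition s --a--> s', if the dynamics step is
    # judged fully reversible, emit the time-reversed transition
    # s' --reverse(a)--> s as extra training data for the reversed task.
    return [(s2, reverse_action(a), s1)
            for (s1, a, s2) in buffer if is_reversible(s1, a, s2)]

aug = reversed_transitions([(0, +1, 1), (1, +1, 2)],
                           reverse_action=lambda a: -a,
                           is_reversible=lambda *_: True)
# -> [(1, -1, 0), (2, -1, 1)]
```

Transitions rejected by the predicate are only partially reversible; for those, the paper falls back to reward shaping guided by successful trajectories from the reversed task.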
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Jin Zhou · Kaiwen Wang · Jonathan Chang · Zhaolin Gao · Nathan Kallus · Kilian Weinberger · Kianté Brantley · Wen Sun
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at \url{https://github.com/jinpz/q_sharp}.
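The closed form behind such guidance is standard for KL-regularized RL: the optimal policy reweights the reference policy by exponentiated regularized $Q$-values. A single-state sketch (the paper's contribution is provably learning the optimal regularized $Q$ via distributional RL, not this formula itself):

```python
import math

def kl_regularized_policy(pi_ref, q, eta):
    # pi*(a) proportional to pi_ref(a) * exp(Q(a) / eta); as eta grows
    # the policy stays close to pi_ref, and as eta shrinks it follows
    # Q greedily.
    w = [p * math.exp(qa / eta) for p, qa in zip(pi_ref, q)]
    z = sum(w)
    return [wi / z for wi in w]

pi = kl_regularized_policy([0.5, 0.5], [1.0, 0.0], eta=1.0)
```

In the LLM setting, `pi_ref` is the pre-trained model's next-token distribution, so the guided policy trades off reward against divergence from pre-training token by token.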
Training a Scientific Reasoning Model for Chemistry
Siddharth Narayanan · James Braza · Ryan-Rhys Griffiths · Albert Bou · Geemi Wellawatte · Mayk Caldas Ramos · Ludovico Mitchener · Michael Pieler · Sam Rodriques · Andrew White
Reasoning models are large language models that use extra "thought tokens" before answering, providing both higher accuracy and explicit reasoning for their response. A major question has been whether language model reasoning generalizes beyond mathematics, programming, and logic, where most previous work has focused. We demonstrate that reasoning models can be post-trained in scientific domains without additional domain pretraining, requiring substantially less data than contemporary domain-specific models. We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 577,790 experimentally-grounded chemistry tasks involving synthesized organic molecules. Our model outperforms all previous general-purpose chemistry models, frontier models, and humans, and is more data-efficient than specialized models. We anticipate that this method can be applied to train highly data-efficient language models specialized for predictive and generative tasks across a wide variety of scientific domains.
A Bayesian Fast-Slow Framework to Mitigate Interference in Non-Stationary Reinforcement Learning
Yihuan Mao · Chongjie Zhang
Given the ever-changing nature of the world and its inhabitants, agents must possess the ability to adapt and evolve over time. Recent research in non-stationary MDPs has focused on addressing this challenge, providing algorithms inspired by task inference techniques. However, these methods ignore the detrimental effects of interference, which particularly harm performance in contradictory tasks, leading to low efficiency in some environments. To address this issue, we propose a Bayesian Fast-Slow Framework (BFSF) that tackles both cross-task generalization and resistance to cross-task interference. Our framework consists of two components: a 'fast' policy, learned from recent data, and a 'slow' policy, learned through meta-reinforcement learning (meta-RL) using data from all previous tasks. A Bayesian estimation mechanism determines the current choice of 'fast' or 'slow' policy, balancing exploration and exploitation. Additionally, in the 'fast' policy, we introduce a dual-reset mechanism and a data relabeling technique to further accelerate convergence when encountering new tasks. Experiments demonstrate that our algorithm effectively mitigates interference and outperforms baseline approaches.
The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements
Youssef Chaabouni · David Gamarnik
We consider the problem of recovering the support of a sparse signal using noisy projections. While extensive work has been done on the dense measurement matrix setting, the sparse setting remains less explored. In this work, we establish sufficient conditions on the sample size for successful sparse recovery using sparse measurement matrices. Combining our result with previously known necessary conditions, we find that, in the regime where $ds/p \rightarrow +\infty$, sparse recovery using a sparse design exhibits a phase transition at an information-theoretic threshold of $n_{\text{INF}}^{\text{SP}} = \Theta\left(s\log\left(p/s\right)/\log\left(ds/p\right)\right)$ for the number of measurements, where $p$ denotes the signal dimension, $s$ the number of non-zero components of the signal, and $d$ the expected number of non-zero components per row of the measurement matrix. This expression makes the price of sparsity explicit: restricting each measurement to $d$ non-zeros inflates the required sample size by a factor of $\log{s}/\log\left(ds/p\right)$, revealing a precise trade-off between sampling complexity and measurement sparsity. Additionally, we examine the effect of sparsifying an originally dense measurement matrix on sparse signal recovery. We prove, in the regime of $s = \alpha p$ and $d = \psi p$ with $\alpha, \psi \in \left(0,1\right)$ and $\psi$ small, that a sample of size $n^{\text{Sp-ified}}_{\text{INF}} = \Theta\left(p / \psi^2\right)$ is sufficient for recovery, subject to a certain uniform integrability conjecture whose proof is work in progress.
Kernel Regression in Structured Non-IID Settings: Theory and Implications for Denoising Score Learning
Dechen Zhang · Zhenmei Shi · Zhang · Yingyu Liang · Difan Zou
Kernel ridge regression (KRR) is a foundational tool in machine learning, with recent work emphasizing its connections to neural networks. However, existing theory primarily addresses the i.i.d. setting, while real-world data often exhibits structured dependencies, particularly in applications like denoising score learning where multiple noisy observations derive from shared underlying signals. We present the first systematic study of KRR generalization for non-i.i.d. data with signal-noise causal structure, where observations represent different noisy views of common signals. Under standard spectral decay assumptions, we develop a novel blockwise decomposition method that enables precise concentration analysis for dependent data. Using this technique, we derive excess risk bounds for KRR that explicitly depend on: (1) the kernel spectrum, (2) causal structure parameters, and (3) sampling mechanisms (including relative sample sizes for signals and noises). We further apply our results to denoising score learning, establishing generalization guarantees and providing principled guidance for sampling noisy data points. This work advances KRR theory while providing practical tools for analyzing dependent data in modern machine learning applications.
Tradeoffs between Mistakes and ERM Oracle Calls in Online and Transductive Online Learning
Idan Attias · Steve Hanneke · Arvind Ramaswami
We study online and transductive online learning in settings where the learner can interact with the concept class only via Empirical Risk Minimization (ERM) or weak consistency oracles on arbitrary subsets of the instance domain. This contrasts with standard online models, where the learner has full knowledge of the concept class. The ERM oracle returns a hypothesis that minimizes the loss on a given subset, while the weak consistency oracle returns only a binary signal indicating whether the subset is realizable by a concept in the class. The learner's performance is measured by the number of mistakes and oracle calls. In the standard online setting with ERM access, we establish tight lower bounds in both the realizable and agnostic cases: $\Omega(2^{d_\mathrm{LD}})$ mistakes and $\Omega(\sqrt{T 2^{d_\mathrm{LD}}})$ regret, respectively, where $T$ is the number of timesteps and $d_\mathrm{LD}$ is the Littlestone dimension of the class. We further show how existing results for online learning with ERM access translate to the setting with a weak consistency oracle, at the cost of increasing the number of oracle calls by $O(T)$. We then consider the transductive online model, where the instance sequence is known in advance but labels are revealed sequentially. For general Littlestone classes, we show that the optimal mistake bound in the realizable case and in the agnostic case can be achieved using $O(T^{d_\mathrm{VC}+1})$ weak consistency oracle calls, where $d_\mathrm{VC}$ is the VC dimension of the class. On the negative side, we show that $\Omega(T)$ weak consistency queries are necessary for transductive online learnability, and that $\Omega(T)$ ERM queries are necessary to avoid exponential dependence on the Littlestone dimension. Finally, for special families of concept classes, we demonstrate how to reduce the number of oracle calls using randomized algorithms while maintaining similar mistake bounds.
In particular, for Thresholds on an unknown ordering, $O(\log T)$ ERM queries suffice, and for $k$-Intervals, $O(T^3 2^{2k})$ weak consistency queries suffice.
Benign Overfitting in Single-Head Attention
Roey Magen · Shuning Shang · Zhiwei Xu · Spencer Frei · Wei Hu · Gal Vardi
The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.
Optimal Best Arm Identification under Differential Privacy
Marc Jourdan · Achraf Azize
Best Arm Identification (BAI) algorithms are deployed in data-sensitive applications, such as adaptive clinical trials or user studies. Driven by the privacy concerns of these applications, we study the problem of fixed-confidence BAI under global Differential Privacy (DP) for Bernoulli distributions. While numerous asymptotically optimal BAI algorithms exist in the non-private setting, a significant gap remains between the best lower and upper bounds in the global DP setting. This work reduces this gap to a small multiplicative constant, for any privacy budget $\epsilon$. First, we provide a tighter lower bound on the expected sample complexity of any $\delta$-correct and $\epsilon$-global DP strategy. Our lower bound replaces the Kullback–Leibler (KL) divergence in the transportation cost used by the non-private characteristic time with a new information-theoretic quantity that optimally trades off between the KL divergence and the Total Variation distance scaled by $\epsilon$. Second, we introduce a stopping rule based on these transportation costs and a private estimator of the means computed using an arm-dependent geometric batching. En route to proving the correctness of our stopping rule, we derive concentration results of independent interest for the Laplace distribution and for the sum of Bernoulli and Laplace distributions. Third, we propose a Top Two sampling rule based on these transportation costs. For any budget $\epsilon$, we show an asymptotic upper bound on its expected sample complexity that matches our lower bound up to a multiplicative constant smaller than $8$. Our algorithm outperforms existing $\delta$-correct and $\epsilon$-global DP BAI algorithms for different values of $\epsilon$.
Asymptotic theory of SGD with a general learning-rate
Or Goldreich · Ziyang Wei · Soham Bonnerjee · Jiaqi Li · Wei Biao Wu
Stochastic gradient descent (SGD) with polynomially decaying step-sizes has long underpinned theoretical analyses, yielding a broad spectrum of statistically attractive guarantees. Yet in practice, such schedules find rare use due to their prohibitively slow convergence, revealing a persistent gap between theory and empirical performance. In this paper, we introduce a unified framework that quantifies the uncertainty of online SGD under arbitrary learning-rate choices. In particular, we provide the first comprehensive convergence characterizations for two widely used but theoretically under-examined schemes: cyclical learning rates and linear decay to zero. Our results not only explain the observed behavior of these schedules but also provide principled tools for statistical inference and algorithm design. All theoretical findings are corroborated by extensive simulations across diverse settings.
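For concreteness, the two schedules singled out above can be written as simple step-indexed functions (illustrative forms only; the framework applies to general learning-rate choices, and these function names are hypothetical):

```python
def cyclical_lr(t, base_lr, max_lr, period):
    """Triangular cyclical learning rate: rises linearly from base_lr to
    max_lr over the first half of each cycle, then falls back."""
    cycle_pos = (t % period) / period          # position in [0, 1)
    tri = 1.0 - abs(2.0 * cycle_pos - 1.0)     # 0 -> 1 -> 0 over a cycle
    return base_lr + (max_lr - base_lr) * tri

def linear_decay_lr(t, total_steps, max_lr):
    """Linear decay to zero over the training horizon."""
    return max_lr * max(0.0, 1.0 - t / total_steps)
```

For example, with `period=100`, `cyclical_lr` peaks at `max_lr` at step 50 of each cycle, while `linear_decay_lr` reaches exactly zero at `total_steps`.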
Accelerating Feature Conformal Prediction via Taylor Approximation
Zihao Tang · Boyuan Wang · Chuan Wen · Jiaye Teng
Conformal prediction is widely adopted in uncertainty quantification, due to its post-hoc, distribution-free, and model-agnostic properties. In the realm of modern deep learning, researchers have proposed Feature Conformal Prediction (FCP), which deploys conformal prediction in a feature space, yielding reduced band lengths. However, the practical utility of FCP is limited due to the time-consuming non-linear operations required to transform confidence bands from feature space to output space. In this paper, we present Fast Feature Conformal Prediction (FFCP), a method that accelerates FCP by leveraging a first-order Taylor expansion to approximate these non-linear operations. The proposed FFCP introduces a novel non-conformity score that is both effective and efficient for real-world applications. Empirical validations showcase that FFCP performs comparably with FCP (both outperforming the vanilla version) while achieving a significant reduction in computational time by approximately 50x in both regression and classification tasks.
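As an illustration of the core idea, propagating a feature-space band through the prediction head with a first-order Taylor expansion can be sketched as follows (a minimal NumPy sketch using a finite-difference Jacobian; `taylor_band` and its interface are hypothetical, not the paper's implementation):

```python
import numpy as np

def taylor_band(g, v_hat, radius, eps=1e-6):
    """Map a feature-space confidence band to output space via a
    first-order Taylor expansion: g(v_hat + delta) ~ g(v_hat) + J delta.

    g      : prediction head mapping a feature vector to an output vector
    v_hat  : estimated feature vector
    radius : per-coordinate half-width of the feature-space band

    The Jacobian J is estimated by finite differences and the band is
    propagated coordinate-wise, which bounds |J @ delta| over the
    infinity-norm ball of the given radius.
    """
    y_hat = np.asarray(g(v_hat), dtype=float)
    d = len(v_hat)
    J = np.zeros((len(y_hat), d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (np.asarray(g(v_hat + e)) - y_hat) / eps
    spread = np.abs(J) @ (radius * np.ones(d))
    return y_hat - spread, y_hat + spread
```

Because the expansion is linear, the output-space band costs one Jacobian evaluation instead of the non-linear inversion that makes the original FCP transform expensive.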
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Emile Anand · Ishani Karmarkar · Guannan Qu
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.
Scalable Policy-Based RL Algorithms for POMDPs
Ameya Anjarlekar · S. Rasoul Etesami · R. Srikant
The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP by a finite-state Markov Decision Process (MDP), called the Superstate MDP. We first derive theoretical guarantees, improving upon prior work, that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by policy optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback
Ofir Schlisselberg · Tal Lancewicki · Peter Auer · Yishay Mansour
We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic settings. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound that scales as $\sum_{i:\Delta_i>0}(\log T/\Delta_i) + \frac{1}{K}\sum \Delta_i \sigma_{\max}$, where $\Delta_i$ is the suboptimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both stochastic and adversarial regimes. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code
Augusto B. Corrêa · André G. Pereira · Jendrik Seipp
In recent years, large language models (LLMs) have shown remarkable performance in many problems. However, they fail to plan reliably. Specialized attempts to improve their planning capabilities still produce incorrect plans and fail to generalize to larger tasks. Furthermore, LLMs designed for explicit "reasoning" fail to compete with automated planners while increasing computational costs, which reduces one of the advantages of using LLMs. In this paper, we show how to use LLMs to always generate correct plans, even for out-of-distribution tasks of increasing size. For a given planning domain, we ask an LLM to generate several domain-dependent heuristic functions in the form of Python code, evaluate them on a set of training tasks with a greedy best-first search, and choose the best one. The resulting LLM-generated heuristic functions solve substantially more unseen out-of-distribution test tasks than end-to-end LLM planning, particularly for non-reasoning LLMs. Moreover, they also solve many more tasks than state-of-the-art domain-independent heuristics for classical planning, and are competitive with the strongest learning algorithm for domain-dependent planning. These results are impressive given that our implementation is based on a Python planner and the baselines all build upon highly optimized C++ code. In some domains, the LLM-generated heuristics expand fewer states than the baselines, showing that they are not only efficiently computable but also more informative than the state-of-the-art heuristics. Overall, our results show that sampling a set of planning heuristic functions can significantly improve the planning capabilities of LLMs.
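The evaluation pipeline described above pairs each candidate heuristic with greedy best-first search. A minimal sketch of such a search, where `h` stands in for an LLM-generated Python heuristic (all names here are illustrative, not the paper's planner):

```python
import heapq

def gbfs(start, is_goal, successors, h):
    """Greedy best-first search: always expand the frontier node with the
    lowest heuristic value h(state). Returns the path from start to the
    first goal found, or None if the frontier empties."""
    counter = 0                                # tie-breaker so states are never compared
    frontier = [(h(start), counter, start, [start])]
    seen = {start}
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(frontier, (h(nxt), counter, nxt, path + [nxt]))
    return None
```

With a good heuristic the search expands few states, which is exactly the quantity the paper uses to compare LLM-generated heuristics against hand-crafted ones.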
FedEL: Federated Elastic Learning for Heterogeneous Devices
Letian Zhang · Bo Chen · Jieming Bian · Lei Wang · Jie Xu
Federated learning (FL) enables distributed devices to collaboratively train machine learning (ML) models while maintaining data privacy. However, the heterogeneous hardware capabilities of participating devices often result in significant training delays, as straggler clients with limited resources prolong the aggregation process. Existing solutions such as client selection, asynchronous FL, and partial training partially address these challenges but encounter issues such as reduced accuracy, stale updates, and compromised model performance due to inconsistent training contributions. To overcome these limitations, we propose FedEL, a federated elastic learning framework that enhances training efficiency while maintaining model accuracy. FedEL introduces a novel window-based training process, sliding the window to locate the training part of the model and dynamically selecting important tensors for training within a coordinated runtime budget. This approach ensures progressive and balanced training across all clients, including stragglers. Additionally, FedEL employs a tensor importance adjustment module, harmonizing local and global tensor importance to mitigate biases caused by data heterogeneity. Experimental results show that FedEL achieves up to a 3.87× improvement in time-to-accuracy compared to baselines while maintaining or exceeding final test accuracy.
Accelerating Block Coordinate Descent for LLM Finetuning via Landscape Expansion
Qijun Luo · Yifei Shen · Liangzu Peng · Dongsheng Li · Xiao Li
Finetuning large language models (LLMs) is a resource-intensive task for researchers in academia, with memory constraints posing a key bottleneck. A classic optimization method, block coordinate descent (BCD), significantly reduces memory cost by segmenting the trainable parameters into multiple blocks and optimizing one active block at a time while freezing the others. However, we identify that blindly applying BCD to train LLMs can be inefficient for two reasons. First, optimizing only the active block requires backpropagating through multiple deeper yet inactive blocks, resulting in wasteful computations. Second, the frozen blocks, when they are not quite close to optimality, can narrow the optimization landscape, potentially misguiding the training of the active block. To address these issues simultaneously, we propose integrating BCD with landscape expansion, which unfreezes the inactive blocks and updates them in a cost-efficient manner during the same backpropagation as the update to the active block. Experiments on 8B and 70B models demonstrate that our proposed method surpasses memory-efficient baselines and matches Adam's downstream performance while requiring only 24 GB of memory for the 8B model and 300 GB for the 70B model.
TADA: Improved Diffusion Sampling with Training-free Augmented DynAmics
Tianrong Chen · Huangjie Zheng · David Berthelot · Jiatao Gu · Joshua Susskind · Shuangfei Zhai
Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images but typically suffer from inefficient sampling. Many solver designs and noise scheduling strategies have been proposed to dramatically improve sampling speeds. In this paper, we introduce a new sampling method that is up to 186\% faster than the current state-of-the-art solver at comparable FID on ImageNet512. This new sampling method is training-free and uses an ordinary differential equation (ODE) solver. The key to our method resides in using higher-dimensional initial noise, allowing it to produce more detailed samples with fewer function evaluations from existing pretrained diffusion models. In addition, by design our solver allows the level of detail to be controlled through a simple hyper-parameter at no extra computational cost. We show how our approach leverages momentum dynamics by establishing a fundamental equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms. Moreover, we observe that the use of higher-dimensional noise naturally exhibits characteristics similar to stochastic differential equations (SDEs). Finally, we demonstrate strong performance on a set of representative pretrained diffusion models, including EDM, EDM2, and Stable Diffusion 3, which cover models in both pixel and latent spaces, as well as class- and text-conditional settings. The code is available at https://github.com/apple/ml-tada.
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Haoran Li · Yingjie Qin · Baoyuan Ou · Lai Xu · Ruiwen Xu
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness.
Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Erfan Baghaei Potraghloo · Seyedarmin Azizi · Souvik Kundu · Massoud Pedram
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-p (nucleus) sampling, and min-p sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in how effectively they incorporate the model's confidence into the sampling strategy. For example, min-p sampling relies on a single top token as a heuristic for confidence, ultimately underutilizing the information in the probability distribution. To effectively incorporate the model confidence, this paper presents top-H decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an entropy-constrained minimum divergence problem. We then prove this minimization problem to be equivalent to an entropy-constrained mass maximization (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-p sampling by up to 25.63% on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an LLM-as-judge evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be easily integrated into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
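To illustrate the flavor of an entropy-constrained greedy truncation, here is a minimal sketch (the function `top_h_sketch` and its entropy budget are hypothetical simplifications, not the paper's exact ECMM algorithm):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a normalized distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def top_h_sketch(probs, entropy_budget):
    """Greedy entropy-bounded truncation sketch.

    Adds tokens in decreasing-probability order as long as the entropy of
    the renormalized kept set stays within the budget, then renormalizes.
    Keeping high-mass tokens first mirrors the mass-maximization goal.
    """
    order = np.argsort(probs)[::-1]
    kept = []
    for idx in order:
        candidate = kept + [int(idx)]
        q = probs[candidate] / probs[candidate].sum()
        if kept and entropy(q) > entropy_budget:
            break
        kept = candidate
    mask = np.zeros_like(probs)
    mask[kept] = probs[kept]
    return mask / mask.sum()
```

A tight budget collapses sampling toward the argmax (coherence); a loose budget keeps the full distribution (creativity), making the trade-off explicit in a single parameter.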
Improving Formal Reasoning of Transformer with State Stack
Kechi Zhang · Ge Li · Jia Li · Huangzhao Zhang · Yihong Dong · Jia Li · Jingjing Xu · Zhi Jin
The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively recognize regular expressions or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we equip layers with a differentiable stack and propose StackTrans to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations -- such as pushing and popping hidden states -- that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both the Chomsky hierarchy and large-scale natural language. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2–3x more parameters, showcasing its superior efficiency and reasoning capability.
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Xiang Liu · Zhenheng Tang · Peijie Dong · Zeyu Li · Liuyue · Bo Li · Xuming Hu · Xiaowen Chu
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key-value (KV) cache consuming up to 70\% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks, rather than isolated tokens, as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and improving throughput by 26.5\%. Comprehensive evaluations on challenging benchmarks (LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV) demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7\% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem. The code is available at https://github.com/NVIDIA/kvpress.
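A minimal sketch of chunk-level selection, assuming fixed-size chunks scored by their mean per-token importance (a hypothetical simplification of the paper's semantic chunks; the function name and interface are illustrative):

```python
import numpy as np

def chunk_keep_indices(attn_scores, chunk_size, keep_ratio):
    """Chunk-level KV-cache selection sketch.

    attn_scores : 1-D per-token importance (e.g. accumulated attention)
    chunk_size  : tokens per chunk (fixed-size here for simplicity)
    keep_ratio  : fraction of chunks to retain

    Returns sorted token indices to keep, whole chunks at a time, so
    contiguous context survives rather than isolated high-score tokens.
    """
    n = len(attn_scores)
    starts = np.arange(0, n, chunk_size)
    chunk_score = np.array([attn_scores[s:s + chunk_size].mean() for s in starts])
    n_keep = max(1, int(round(keep_ratio * len(starts))))
    best = np.argsort(chunk_score)[::-1][:n_keep]
    keep = np.concatenate(
        [np.arange(starts[b], min(starts[b] + chunk_size, n)) for b in best]
    )
    return np.sort(keep)
```

Keeping whole chunks rather than the top-scoring individual tokens is what preserves the linguistic structure the token-level methods fragment.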
Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits
Shogo Iwazaki
We study the noise-free Gaussian Process (GP) bandit problem, in which a learner seeks to minimize regret through noise-free observations of a black-box objective function that lies in a known reproducing kernel Hilbert space (RKHS). The Gaussian Process Upper Confidence Bound (GP-UCB) algorithm is a well-known approach for GP bandits, where query points are adaptively selected based on the GP-based upper confidence bound score. While several existing works have reported the practical success of GP-UCB, its theoretical performance remains suboptimal. However, GP-UCB often empirically outperforms other nearly-optimal noise-free algorithms that use non-adaptive sampling schemes. This paper resolves the gap between theoretical and empirical performance by establishing a nearly-optimal regret upper bound for noise-free GP-UCB. Specifically, our analysis provides the first constant cumulative regret bounds in the noise-free setting for both the squared exponential kernel and the Matérn kernel with some degree of smoothness.
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui · Cong Ma · Zongming Ma
Multi-modal contrastive learning as a self-supervised representation learning technique has achieved great success in foundation model training, such as CLIP (Radford et al., 2021). In this paper, we study the theoretical properties of the learned representations from multi-modal contrastive learning beyond linear representations and specific data distributions. Our analysis reveals that, enabled by temperature optimization, multi-modal contrastive learning not only maximizes mutual information between modalities but also adapts to intrinsic dimensions of data, which can be much lower than user-specified dimensions for representation vectors. Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance.
Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis
Bochen Lyu · Xiaojing Zhang · Fangyi Zheng · He Wang · Zheng Wang · Zhanxing Zhu
This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error, providing additional theoretical tools for this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order in the learning rate. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.
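For context, the discrete Heavy-Ball iteration and its classical first-order continuous limit (a standard textbook approximation, not the paper's counter-term construction, which refines it) can be written as:

```latex
% Heavy-Ball update with step size \eta and momentum parameter \beta:
x_{k+1} = x_k - \eta \nabla f(x_k) + \beta \, (x_k - x_{k-1})
% Naive first-order continuous-time limit, with t \approx k\eta:
\dot{X}(t) = -\frac{1}{1-\beta} \, \nabla f\big(X(t)\big)
```

The gap between the iterates $x_k$ and $X(k\eta)$ is precisely the discretization error that the paper's counter terms are designed to cancel to arbitrary order in the learning rate.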
Generalized Top-k Mallows Model for Ranked Choices
Shahrzad Haddadan · Sara Ahmadian
The classic Mallows model is a foundational tool for modeling user preferences. However, it has limitations in capturing real-world scenarios, where users often focus only on a limited set of preferred items and are indifferent to the rest. To address this, extensions such as the top-$k$ Mallows model have been proposed, aligning better with practical applications. In this paper, we address several challenges related to the generalized top-$k$ Mallows model, with a focus on analyzing buyer choices. Our key contributions are: (1) a novel sampling scheme tailored to generalized top-$k$ Mallows models, (2) an efficient algorithm for computing choice probabilities under this model, and (3) an active learning algorithm for estimating the model parameters from observed choice data. These contributions provide new tools for analysis and prediction in critical decision-making scenarios. We present a rigorous mathematical analysis for the performance of our algorithms. Furthermore, through extensive experiments on synthetic and real-world data, we demonstrate the scalability and accuracy of our proposed methods, and we compare the predictive power of the Mallows model for top-$k$ lists with that of the simpler Multinomial Logit model.
Exponential Convergence Guarantees for Iterative Markovian Fitting
Marta Gentiloni Silveri · Giovanni Conforti · Alain Durmus
The Schrödinger Bridge (SB) problem has become a fundamental tool in computational optimal transport and generative modeling. To address this problem, idealized methods such as Iterative Proportional Fitting and Iterative Markovian Fitting (IMF) have been proposed, alongside practical approximations like the Diffusion Schrödinger Bridge and its Matching (DSBM) variant. While previous work has established asymptotic convergence guarantees for IMF, a quantitative, non-asymptotic understanding has remained open. In this paper, we provide the first non-asymptotic exponential convergence guarantees for IMF under mild structural assumptions on the reference measure and marginal distributions, assuming a sufficiently large time horizon. Our results encompass two key regimes: one where the marginals are log-concave, and another where they are weakly log-concave. The analysis relies on new contraction results for the Markovian projection operator and paves the way to theoretical guarantees for DSBM.
On the Hardness of Conditional Independence Testing In Practice
Zheng He · Roman Pogodin · Yazhe Li · Namrata Deka · Arthur Gretton · Danica J. Sutherland
Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample valid test can ever achieve nontrivial power. While informative, this result (based on “hiding” dependence) does not seem to explain the frequent practical failures observed with popular CI tests. We investigate the Kernel-based Conditional Independence (KCI) test, of which we show the Generalized Covariance Measure underlying many recent tests is nearly a special case, and identify the major factors underlying its practical behavior. We highlight the key role of errors in the conditional mean embedding estimate for the Type I error, and show that selecting an appropriate conditioning kernel (not recognized in previous work) is necessary for good test power but also tends to inflate the Type I error.
Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents
Jane Lee · Baturay Saglam · Spyridon Pougkakiotis · Amin Karbasi · Dionysis Kalogerias
Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed through the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework establishes an exact equivalence with the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm and verify the risk-aware properties of our approach through several numerical experiments.
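For context, the OCE of a loss $X$ with utility $u$ is $\inf_{\lambda}\{\lambda + \mathbb{E}[u(X-\lambda)]\}$; the choice $u(t) = t_+ / (1-\alpha)$ recovers CVaR. A minimal empirical evaluation of that instance (background only, not the paper's RL algorithm) might look like:

```python
import numpy as np

def oce_cvar(samples, alpha):
    """Empirical optimized certainty equivalent with the CVaR utility
    u(t) = max(t, 0) / (1 - alpha), i.e.
        CVaR_alpha(X) = min_t  t + E[(X - t)_+] / (1 - alpha).
    For an empirical distribution the objective is piecewise linear with
    breakpoints at the samples, so it suffices to evaluate it there.
    Convention: samples are losses, so larger values are worse."""
    x = np.sort(np.asarray(samples, dtype=float))
    objs = [t + np.maximum(x - t, 0.0).mean() / (1.0 - alpha) for t in x]
    return min(objs)
```

For samples `[0, 1, 2, 3]` and `alpha = 0.5`, this returns the mean of the worst half, 2.5, matching the usual CVaR definition.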
Offline imitation learning in $Q^\pi$-realizable MDPs without expert realizability
Antoine Moulin · Gergely Neu · Luca Viano
We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes that the expert policy belongs to a tractable class of known policies, we approach this problem from a new angle and leverage another type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called primal-dual offline imitation learning (PDOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of PDOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning
Yichen Li · Chicheng Zhang
Imitation learning (IL) is a paradigm for learning sequential decision-making policies from experts, leveraging offline demonstrations, interactive annotations, or both. Recent advances show that when annotation cost is tallied per trajectory, Behavior Cloning (BC), which relies solely on offline demonstrations, cannot be improved in general, leaving limited conditions for interactive methods such as DAgger to help. We revisit this conclusion and prove that when the annotation cost is measured per state, algorithms using interactive annotations can provably outperform BC. Specifically: (1) we show that Stagger, a one-sample-per-round variant of DAgger, provably beats BC in low-recovery-cost settings; (2) we initiate the study of hybrid IL, where the agent learns from both offline demonstrations and interactive annotations. We propose Warm-Stagger, whose learning guarantee is not much worse than using either data source alone. Furthermore, motivated by the compounding-error and cold-start problems in imitation learning practice, we give an MDP example in which Warm-Stagger achieves significantly lower annotation cost; (3) experiments on MuJoCo continuous-control tasks confirm that, with a modest cost ratio between interactive and offline annotations, interactive and hybrid approaches consistently outperform BC. To the best of our knowledge, our work is the first to highlight the benefit of state-wise interactive annotation and hybrid feedback in imitation learning.
Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation
Toshinori Kitamura · Arnob Ghosh · Tadashi Kozuno · Wataru Kumagai · Kazumi Kasaura · Kenta Hoshino · Yohei Hosoe · Yutaka Matsuo
We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\widetilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
A Near-optimal, Scalable and Parallelizable Framework for Stochastic Bandits Robust to Adversarial Corruptions and Beyond
Zicheng Hu · Cheng Chen
We investigate various stochastic bandit problems in the presence of adversarial corruptions. A seminal work for this problem is the BARBAR~\cite{gupta2019better} algorithm, which achieves both robustness and efficiency. However, it suffers from a regret of $O(KC)$, which does not match the lower bound of $\Omega(C)$, where $K$ denotes the number of arms and $C$ denotes the corruption level. In this paper, we first improve the BARBAR algorithm by proposing a novel framework called BARBAT, which eliminates the factor of $K$ to achieve an optimal regret bound up to a logarithmic factor. We also extend BARBAT to various settings, including multi-agent bandits, graph bandits, combinatorial semi-bandits and batched bandits. Compared with the Follow-the-Regularized-Leader framework, our methods are more amenable to parallelization, making them suitable for multi-agent and batched bandit settings, and they incur lower computational costs, particularly in semi-bandit problems. Numerical experiments verify the efficiency of the proposed methods.
Projection-based Lyapunov method for fully heterogeneous weakly-coupled MDPs
Xiangcheng Zhang · Yige Hong · Weina Wang
Heterogeneity poses a fundamental challenge for many real-world large-scale decision-making problems but remains largely understudied. In this paper, we study the _fully heterogeneous_ setting of a prominent class of such problems, known as weakly-coupled Markov decision processes (WCMDPs). Each WCMDP consists of $N$ arms (or subproblems), which have distinct model parameters in the fully heterogeneous setting, leading to the curse of dimensionality when $N$ is large. We show that, under mild assumptions, an efficiently computable policy achieves an $O(1/\sqrt{N})$ optimality gap in the long-run average reward per arm for fully heterogeneous WCMDPs as $N$ becomes large. This is the _first asymptotic optimality result_ for fully heterogeneous average-reward WCMDPs. Our main technical innovation is the construction of projection-based Lyapunov functions that certify the convergence of rewards and costs to an optimal region, even under full heterogeneity.
Gains: Fine-grained Federated Domain Adaptation in Open Set
Zhengyi Zhong · Wenzheng Jiang · Weidong Bao · Ji Wang · Qi Wang · Guanbo Wang · Yongheng Deng · Ju Ren
Conventional federated learning (FL) assumes a closed world with a fixed total number of clients. In contrast, new clients continuously join the FL process in real-world scenarios, introducing new knowledge. This raises two critical demands: detecting new knowledge, i.e., knowledge discovery, and integrating it into the global model, i.e., knowledge adaptation. Existing research focuses on coarse-grained knowledge discovery, and often sacrifices source domain performance and adaptation efficiency. To this end, we propose a fine-grained federated domain adaptation approach in open set (Gains). Gains splits the model into an encoder and a classifier, empirically revealing that features extracted by the encoder are sensitive to domain shifts while classifier parameters are sensitive to class increments. Based on this, we develop fine-grained knowledge discovery and contribution-driven aggregation techniques to identify and incorporate new knowledge. Additionally, an anti-forgetting mechanism is designed to preserve source domain performance, ensuring balanced adaptation. Experimental results on multi-domain datasets across three typical data-shift scenarios demonstrate that Gains significantly outperforms other baselines for both source-domain and target-domain clients. Code is available at: https://github.com/Zhong-Zhengyi/Gains.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun · Yitong Li · Yuchen Zhuang · Niao He · Hanjun Dai · Bo Dai
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.
Fixed-Point RNNs: Interpolating from Diagonal to Dense
Sajad Movahedi · Felix Sarnthein · Nicola Muca Cirone · Antonio Orvieto
Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs. The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters and achieve state-of-the-art results on the state-tracking benchmarks $A_5$ and $S_5$, while matching performance on copying and other tasks.
Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error
Panagiotis Giannoulis · Yorgos Pantis · Christos Tzamos
Despite their proficiency in various language tasks, Large Language Models (LLMs) struggle with combinatorial problems like Satisfiability, Traveling Salesman Problem, or even basic arithmetic. We address this gap through a novel approach for solving problems in the class NP. We focus on the paradigmatic task of Sudoku and achieve state-of-the-art accuracy (99\%) compared to prior neuro-symbolic approaches. Unlike prior work that used custom architectures, our method employs a vanilla decoder-only Transformer (GPT-2) without external tools or function calling. Our method integrates imitation learning of simple Sudoku rules with an explicit Depth-First Search (DFS) exploration strategy involving informed guessing and backtracking. Moving beyond imitation learning, we seek to minimize the number of guesses until reaching a solution. We provide a rigorous analysis of this setup by formalizing its connection to a contextual variant of $\textit{Min-Sum Set Cover}$, a well-studied problem in algorithms and stochastic optimization.
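The DFS-with-backtracking target behavior can be pictured with a plain solver. The snippet below uses a most-constrained-cell heuristic as a stand-in for the paper's informed guessing; it is an illustration of the search procedure being imitated, not the paper's training pipeline.

```python
def solve_sudoku(grid):
    """Minimal DFS Sudoku solver with backtracking. grid is a 9x9 list of
    lists with 0 marking empty cells; on success, returns True and fills
    the grid in place."""
    def candidates(r, c):
        used = set(grid[r]) | {grid[i][c] for i in range(9)}
        br, bc = 3 * (r // 3), 3 * (c // 3)
        used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
        return [v for v in range(1, 10) if v not in used]

    empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return True
    # "Informed guessing" stand-in: branch on the most constrained cell.
    r, c = min(empties, key=lambda rc: len(candidates(*rc)))
    for v in candidates(r, c):
        grid[r][c] = v
        if solve_sudoku(grid):
            return True
        grid[r][c] = 0  # backtrack on a dead end
    return False
```

Each guess/backtrack pair in this search corresponds to a step in the trial-and-error trace the abstract describes; minimizing the number of guesses is exactly the objective the paper formalizes.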
Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables
Zhongnan Cai · Yingying Wang · Hui Zheng · Panwang Pan · Zixu Lin · Ge Meng · Chenxin Li · Chunming He · Jiaxin Xie · Yunlong Lin · Junbin Lu · Yue Huang · Xinghao Ding
Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, deep learning-based methods incur substantial computational overhead during inference, especially with large images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedicated computing devices such as GPUs and TPUs. To address these challenges, we propose Pan-LUT, a novel learnable look-up table (LUT) framework for pan-sharpening that strikes a balance between performance and computational efficiency for large remote sensing images. Our method makes it possible to process 15K$\times$15K remote sensing images on a 24GB GPU. To finely control the spectral transformation, we devise the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping. To effectively capture fine-grained spatial details, we introduce the spatial details look-up table (SDLUT). Furthermore, to adaptively aggregate channel information for generating high-resolution multispectral images, we design an adaptive output look-up table (AOLUT). Our model contains fewer than 700K parameters and processes a 9K$\times$9K image in under 1 ms using one RTX 2080 Ti GPU, demonstrating significantly faster performance compared to other methods. Experiments reveal that Pan-LUT efficiently processes large remote sensing images in a lightweight manner, bridging the gap to real-world applications. Furthermore, our model surpasses SOTA methods in full-resolution scenes under real-world conditions, highlighting its effectiveness and efficiency.
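The core inference primitive behind LUT-based methods is cheap: a couple of table reads plus a linear interpolation per value. A generic 1D sketch of that primitive (illustrative only; the paper's PGLUT/SDLUT/AOLUT are channel-wise and spatial variants of this idea with learned tables):

```python
import numpy as np

def apply_lut_1d(x, lut):
    """Apply a 1D look-up table with linear interpolation to inputs in
    [0, 1]. At inference a learned LUT replaces a heavy network with two
    table reads and a lerp per value."""
    lut = np.asarray(lut, dtype=float)
    n = len(lut) - 1
    pos = np.clip(np.asarray(x, dtype=float), 0.0, 1.0) * n
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n)
    frac = pos - lo
    return lut[lo] * (1.0 - frac) + lut[hi] * frac
```

Because the cost per pixel is constant and tiny, this style of inference scales to very large images without GPU-class hardware, which is the efficiency argument the abstract makes.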
Alleviating Hallucinations in Large Language Models through Multi-Model Contrastive Decoding and Dynamic Hallucination Detection
Chenyu Zhu · Yefeng Liu · Hao Zhang · Aowen Wang · Yangxue · Guanhua Chen · Longyue Wang · Weihua Luo · Kaifu Zhang
Despite their outstanding performance in numerous applications, large language models (LLMs) remain prone to hallucinations, generating content inconsistent with their pretraining corpora. Currently, almost all contrastive decoding approaches alleviate hallucinations by introducing a model susceptible to hallucinations and appropriately widening the contrastive logits gap between hallucinatory tokens and target tokens. However, although existing contrastive decoding methods mitigate hallucinations, they lack enough confidence in the factual accuracy of the generated content. In this work, we propose Multi-Model Contrastive Decoding (MCD), which integrates a pretrained language model with an evil model and a truthful model for contrastive decoding. Intuitively, a token is assigned a high probability only when deemed potentially hallucinatory by the evil model while being considered factual by the truthful model. This decoding strategy significantly enhances the model’s confidence in its generated responses and reduces potential hallucinations. Furthermore, we introduce a dynamic hallucination detection mechanism that facilitates token-by-token identification of hallucinations during generation and a tree-based revision mechanism to diminish hallucinations further. Extensive experimental evaluations demonstrate that our MCD strategy effectively reduces hallucinations in LLMs and outperforms state-of-the-art methods across various benchmarks.
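The general shape of such multi-model contrastive scoring can be sketched as below; the combination rule and coefficients are assumptions for illustration, not the paper's exact MCD formula.

```python
import numpy as np

def log_softmax(x):
    x = np.asarray(x, dtype=float)
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def mcd_scores(base_logits, truthful_logits, evil_logits, alpha=1.0, beta=1.0):
    """Illustrative multi-model contrastive scoring: reward tokens the
    truthful model favors and penalize tokens the hallucination-prone
    ("evil") model favors, relative to the base model."""
    return (log_softmax(base_logits)
            + alpha * log_softmax(truthful_logits)
            - beta * log_softmax(evil_logits))
```

In this sketch, a token that the base model is indifferent about but that the truthful model prefers and the evil model dislikes ends up with the highest contrastive score.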
Quartet: Native FP4 Training Can Be Optimal for Large Language Models
Roberto Castro · Andrei Panferov · Rush Tabesh · Oliver Sieberling · Jiale Chen · Mahdi Nikdan · Saleh Ashkboos · Dan Alistarh
Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet .
InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Yanggan Gu · Yuanyi Wang · Zhaoyi Yan · Yiming Zhang · Qi Zhou · Fei Wu · Hongxia Yang
Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA), a critical phase for enhancing LLM performance, largely unexplored. The few existing fusion methods for the PA phase, such as WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works while maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
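As background, the standard DPO loss the abstract refers to can be written from sequence-level log-probabilities, with the single reference replaced by a fused quantity. The fusion rules below are illustrative stand-ins for InfiFPO's actual probability clipping and max-margin strategies.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss from sequence log-probs under the policy (pi_*)
    and the reference (ref_*): -log sigmoid(beta * margin)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def fused_reference_logprob(source_logprobs, mode="max"):
    """Illustrative sequence-level fusion of source-model log-probs to
    stand in for the DPO reference: 'max' mimics a max-margin pick of
    the strongest source, anything else a soft log-mean-exp."""
    if mode == "max":
        return max(source_logprobs)
    m = max(source_logprobs)
    return m + math.log(sum(math.exp(x - m) for x in source_logprobs)
                        / len(source_logprobs))
```

Fusing at the sequence level is what lets the method sidestep token-level vocabulary alignment across heterogeneous source models: only one scalar log-probability per response is needed from each source.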
GoRA: Gradient-driven Adaptive Low Rank Adaptation
Haonan He · Peng Ye · Yuchen Ren · Yuan Yuan · Luyang Zhou · Shucun Ju · Lei Chen
Low-Rank Adaptation (LoRA) is a crucial method for efficiently fine-tuning large language models (LLMs), with its effectiveness influenced by two key factors: rank selection and weight initialization. While numerous LoRA variants have been proposed to improve performance by addressing one of these aspects, they often compromise usability or computational efficiency. In this paper, we analyze and identify the core limitations of existing approaches and propose a novel framework—GoRA (Gradient-driven Adaptive Low Rank Adaptation)—that simultaneously adapts both the rank and initialization strategy within a unified framework. GoRA leverages gradient information during training to dynamically assign optimal ranks and initialize low-rank adapter weights in an adaptive manner. To our knowledge, GoRA is the first method that not only addresses the limitations of prior approaches—which often focus on either rank selection or initialization in isolation—but also unifies both aspects within a single framework, enabling more effective and efficient adaptation. Extensive experiments across various architectures and modalities show that GoRA consistently outperforms existing LoRA-based methods while preserving the efficiency of vanilla LoRA. For example, when fine-tuning Llama3.1-8B-Base for mathematical reasoning, GoRA achieves a 5.13-point improvement over standard LoRA and even outperforms full fine-tuning by 2.05 points under high-rank settings. Code is available at: https://github.com/hhnqqq/MyTransformers.
Learning in Compact Spaces with Approximately Normalized Transformer
Jörg Franke · Urs Spiegelhalter · Marianna Nezhurina · Jenia Jitsev · Frank Hutter · Michael Hefenbrock
The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
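The norm-concentration argument behind the approximate normalization is easy to see concretely: for i.i.d. entries, $\|x\|$ concentrates tightly around $\sigma\sqrt{d}$, so dividing by that constant approximates exact normalization with a single scalar multiplication. A sketch of this motivating idea only, not the paper's full method:

```python
import numpy as np

def approx_normalize(x, sigma=1.0):
    """Normalize by a fixed scalar instead of the per-vector norm: for a
    d-dimensional vector with i.i.d. entries of standard deviation sigma,
    ||x|| concentrates around sigma * sqrt(d), so dividing by that constant
    approximates x / ||x|| without computing any norm."""
    d = x.shape[-1]
    return x / (sigma * np.sqrt(d))
```

At transformer widths (thousands of dimensions) the relative fluctuation of the norm is on the order of $1/\sqrt{2d}$, which is why a fixed scalar can replace a per-vector normalization layer with little error.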
FP4 All the Way: Fully Quantized Training of Large Language Models
Brian Chmiel · Maxim Fishman · Ron Banner · Daniel Soudry
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for the backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied at https://github.com/Anonymous1252022/fp4-all-the-way.
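The block layout described above (groups of FP4 E2M1 values sharing a scale) can be mimicked in a few lines. This sketch keeps the shared scale in full precision rather than E4M3 and supports both rounding modes mentioned in the abstract; it is illustrative, not the paper's kernel.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block, rounding="nearest", rng=None):
    """Illustrative NVFP4-style quantization of one block (e.g. 16 values):
    a shared scale maps the block's max magnitude onto the largest E2M1
    value, then each value snaps to the signed E2M1 grid by round-to-nearest
    or stochastic rounding. (The real format also quantizes the scale to
    E4M3; that second step is omitted here.)"""
    block = np.asarray(block, dtype=float)
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    scale = amax / E2M1_GRID[-1]
    scaled = np.abs(block) / scale
    idx = np.searchsorted(E2M1_GRID, scaled).clip(1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[idx - 1], E2M1_GRID[idx]
    if rounding == "stochastic":
        rng = rng or np.random.default_rng()
        snapped = np.where(rng.random(block.shape) < (scaled - lo) / (hi - lo), hi, lo)
    else:
        snapped = np.where(scaled - lo <= hi - scaled, lo, hi)
    return np.sign(block) * snapped * scale
```

Stochastic rounding makes the quantizer unbiased in expectation, which is why it is used for the gradient-carrying backward and update passes, while deterministic round-to-nearest minimizes per-value error in the forward pass.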
Adaptive Latent-Space Constraints in Personalized Federated Learning
Sana Ayromlou · David B. Emerson
Federated learning (FL) is an effective and widely used approach to training deep learning models on decentralized datasets held by distinct clients. FL also strengthens both security and privacy protections for training data. Common challenges associated with statistical heterogeneity between distributed datasets have spurred significant interest in personalized FL (pFL) methods, where models combine aspects of global learning with local modeling specific to each client’s unique characteristics. This work investigates the efficacy of theoretically supported, adaptive MMD measures in pFL, primarily focusing on the Ditto framework, a state-of-the-art technique for distributed data heterogeneity. The use of such measures significantly improves model performance across a variety of tasks, especially those with pronounced feature heterogeneity. Additional experiments demonstrate that such measures are directly applicable to other pFL techniques and yield similar improvements across a number of datasets. Finally, the results motivate the use of constraints tailored to the various kinds of heterogeneity expected in FL systems.
Language Ranker: A Lightweight Ranking framework for LLM Decoding
Chenheng Zhang · Tianqi Du · Jizhe Zhang · Mingqing Xiao · Yifei Wang · Yisen Wang · Zhouchen Lin
Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling the computation of inference time with reward models, have underscored the importance of decoding, but these methods often suffer from high computational costs and limited applicability. In this paper, we revisit LLM generation through the lens of recommender systems, conceptualizing the decoding process as analogous to the ranking stage in recommendation pipelines. From this perspective, we observe that both traditional decoding methods and reward models exhibit clear limitations such as redundancy. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses using features extracted by the base model. Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only <0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. This highlights the efficiency and effectiveness of our method, showcasing its potential to fully unlock the capabilities of LLMs.
The Primacy of Magnitude in Low-Rank Adaptation
Zicheng Zhang · Haoran Li · Yifeng Zhang · Guoqiang Gong · Jiaxing Wang · Pengzhang Liu · Qixia Jiang · Junxing Hu
Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive “Noise \& Zeros” scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven “Basis \& Basis” initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) The magnitude of weight updates determines convergence. We prove that low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We show that the presumed knowledge-driven benefit of spectral components essentially arises from the boost in weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.
Scaling Laws for Optimal Data Mixtures
Mustafa Shukor · Louis Bethune · Dan Busbridge · David Grangier · Enrico Fini · Alaaeldin El-Nouby · Pierre Ablin
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.
DISCO: DISCrete nOise for Conditional Control in Text-to-Image Diffusion Models
Longquan Dai · Wu Ming · Dejiao Xue · Wang He · Jinhui Tang
A major challenge in using diffusion models is aligning outputs with user-defined conditions. Existing conditional generation methods fall into two major categories: classifier-based guidance, which requires differentiable target models and gradient-based correction; and classifier-free guidance, which embeds conditions directly into the diffusion model but demands expensive joint training and architectural coupling. In this work, we introduce a third paradigm: DISCrete nOise (DISCO) guidance, which replaces the continuous conditional correction term with a finite codebook of discrete noise vectors sampled from a Gaussian prior. Conditional generation is reformulated as a code selection task, and we train a prediction network to choose the optimal code given the intermediate diffusion state and the conditioning input. Our approach is differentiability-free and training-efficient, avoiding the gradient computation and architectural redundancy of prior methods. Empirical results demonstrate that DISCO achieves competitive controllability while substantially reducing resource demands, positioning it as a scalable and effective alternative for conditional diffusion generation.
From Softmax to Score: Transformers Can Effectively Implement In-Context Denoising Steps
Paul Rosu · Lawrence Carin · Xiang Cheng
Transformers have emerged as powerful meta-learners, with growing evidence that they implement learning algorithms within their forward pass. We study this phenomenon in the context of denoising, presenting a unified framework that shows Transformers can implement (a) manifold denoising via Laplacian flows, (b) score-based denoising from diffusion models, and (c) a generalized form of anisotropic diffusion denoising. Our theory establishes exact equivalence between Transformer attention updates and these algorithms. Empirically, we validate these findings on image denoising tasks, showing that even simple Transformers can perform robust denoising both with and without context. These results illustrate the Transformer’s flexibility as a denoising meta-learner. Code available at https://github.com/paulrosu11/TransformersareDiffusion_Denoisers.
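The score-based case (b) can be checked in a few lines: for a context of keys smoothed by an isotropic Gaussian, one distance-kernel softmax-attention update is exactly the Tweedie denoising step $x + \sigma^2 \nabla \log p_\sigma(x)$. A minimal numpy check (illustrative setup, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 4))       # "clean" context tokens
x = rng.normal(size=4)                # noisy query token
sigma = 0.7

# (a) one softmax-attention update with squared-distance scores
w = np.exp(-np.sum((x - keys) ** 2, axis=1) / (2 * sigma**2))
w /= w.sum()
attn_out = w @ keys                   # attention-weighted mean of the keys

# (b) the Tweedie denoiser for p_sigma = empirical keys smoothed by N(0, sigma^2 I):
# E[key | x] = x + sigma^2 * grad log p_sigma(x), gradient taken numerically
def log_p(y):
    return np.log(np.mean(np.exp(-np.sum((y - keys) ** 2, axis=1) / (2 * sigma**2))))

eps = 1e-5
grad = np.array([(log_p(x + eps * e) - log_p(x - eps * e)) / (2 * eps)
                 for e in np.eye(4)])
tweedie_out = x + sigma**2 * grad

print(np.max(np.abs(attn_out - tweedie_out)))   # ~0: the two updates coincide
```

The agreement is exact up to finite-difference error, since the score of the smoothed empirical measure is precisely the softmax-weighted displacement toward the keys.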
FlashBias: Fast Computation of Attention with Bias
Haixu Wu · Minghao Guo · Yuezhou Ma · Yuanxu Sun · Jianmin Wang · Wojciech Matusik · Mingsheng Long
Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast, exact computation for many widely used attention biases and a fast, accurate approximation for biases in general formulations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for Pairformer in AlphaFold 3, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
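A simple identity underlies fast exact computation for low-rank biases: if the bias factors as $B = UV^\top$, then $QK^\top + UV^\top = [Q\,|\,U][K\,|\,V]^\top$, so the bias can be folded into an augmented head dimension and fed to a standard fused attention kernel without ever materializing $B$. A hedged numpy sketch (this shows the low-rank folding idea only, not the FlashBias kernel itself):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 6, 8, 2
Q, K = rng.normal(size=(N, d)), rng.normal(size=(N, d))
U, V = rng.normal(size=(N, r)), rng.normal(size=(N, r))   # rank-r factors of the bias
B = U @ V.T                                               # the additive bias matrix

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# biased attention scores, computed directly with the materialized bias
direct = softmax(Q @ K.T + B)

# the same scores via augmented queries/keys: Q K^T + U V^T = [Q|U][K|V]^T,
# which a fused kernel such as FlashAttention can consume as ordinary Q and K
fused = softmax(np.concatenate([Q, U], -1) @ np.concatenate([K, V], -1).T)

print(np.allclose(direct, fused))   # True
```

The augmented head dimension is $d + r$, so the cost of an exact rank-$r$ bias is a small additive overhead rather than an $N \times N$ materialization.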
Inference-Time Hyper-Scaling with KV Cache Compression
Adrian Łańcucki · Konrad Staniszewski · Piotr Nawrot · Edoardo Maria Ponti
Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key–value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference latency and memory load. For instance, we enhance Qwen-R1 32B by 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench on average for an equivalent number of memory reads.
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Qianhui Wu · Kanzhi Cheng · Rui Yang · Chaoyun Zhang · Jianwei Yang · Huiqiang Jiang · Jian Mu · Baolin Peng · Bo Qiao · Reuben Tan · Si Qin · Lars Liden · Qingwei Lin · Huan Zhang · Tong Zhang · Jianbing Zhang · Dongmei Zhang · Jianfeng Gao
One of the principal challenges in building VLM-powered GUI agents is visual grounding—localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated `` token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B achieves scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones, outperforming UI-TARS-72B (38.1) on ScreenSpot-Pro, with significantly fewer parameters and training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths. 
Project page: https://aka.ms/GUI-Actor
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen · Zichen Wen · Yichao Du · Yiyang Zhou · Chenhang Cui · Siwei Han · Jen Weng · Chaoqi Wang · Zhengwei Tong · Leria HUANG · Canyu Chen · Haoqin Tu · Qinghao Ye · Zhihong Zhu · Yuqing Zhang · Jiawei Zhou · Zhuokai Zhao · Rafael Rafailov · Chelsea Finn · Huaxiu Yao
While text-to-image models like GPT-4o-Image and FLUX are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across six key perspectives: alignment, safety, image quality, bias, composition, and visualization. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs, and closed-source VLMs on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies of feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language than in numerical scales. Notably, human evaluations on end-to-end and fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench.
LIFEBENCH: Evaluating Length Instruction Following in Large Language Models
Wei Zhang · Zhenhong Zhou · Kun Wang · Junfeng Fang · Rongwu Xu · Yuanhe Zhang · Rui Wang · Ge Zhang · Xinfeng Li · Li Sun · Lingjuan Lyu · Yang Liu · Sen Su
While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions—e.g., write a 10,000-word novel. Moreover, models often generate outputs that are far too short, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce the Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction-following ability, offering critical insights for future progress.
Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs—few-shot learning, chain-of-thought, constitutional AI—was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of organism—complex, opaque, and trained rather than programmed—then prompting is not a workaround. It is behavioral science. Mechanistic interpretability peers into the neural substrate, prompting probes the model in its native interface: language. We argue that prompting is not inferior, but rather a key component in the science of LLMs.
Large Language Models Miss the Multi-agent Mark
Emanuele La Malfa · Gabriele La Malfa · Samuele Marro · Jie Zhang · Elizabeth Black · Michael Luck · Philip Torr · Michael Wooldridge
Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLM implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field risks slowing down and losing traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.
Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis
Neeraj Kumar · Chad Vanderbilt
Pathology foundation models (PFMs) have emerged as powerful tools for analyzing whole slide images (WSIs). However, adapting these pretrained PFMs for specific clinical tasks presents considerable challenges, primarily due to the availability of only weak (WSI-level) labels for gigapixel images, necessitating the multiple instance learning (MIL) paradigm for effective WSI analysis. This paper proposes a novel approach for single-GPU \textbf{T}ask \textbf{A}daptation of \textbf{PFM}s (TAPFM) that uses vision transformer (ViT) attention for MIL aggregation while optimizing both for feature representations and attention weights. The proposed approach maintains separate computational graphs for the MIL aggregator and the PFM to create stable training dynamics that align with downstream task objectives during end-to-end adaptation. Evaluated on mutation prediction tasks for bladder cancer and lung adenocarcinoma across institutional and The Cancer Genome Atlas (TCGA) cohorts, TAPFM consistently outperforms conventional approaches, with H-Optimus-0 (TAPFM) outperforming the benchmarks. TAPFM effectively handles multi-label classification of actionable mutations as well. Thus, TAPFM makes adaptation of powerful pre-trained PFMs practical on standard hardware for various clinical applications.
DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
Xiangyu Hong · Che Jiang · Kai Tian · Biqing Qi · Youbang Sun · Ning Ding · Bowen Zhou
Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
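The key property can be sketched directly: once attention scores are frozen, an attention block is linear in its hidden-state input, so any additive decomposition of the hidden states propagates exactly and the component attributions sum to the full output. A minimal numpy illustration (single head, no MLP, all variable names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8
X = rng.normal(size=(N, d))                       # hidden states
Wv, Wo = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

A = softmax(X @ X.T / np.sqrt(d))                 # attention scores, computed once and frozen

def propagate(H):
    # with A fixed, the attention block is linear in its hidden-state input H
    return A @ (H @ Wv) @ Wo

# decompose hidden states into two additive components (e.g., token vs. context parts)
X1 = rng.normal(size=(N, d))
X2 = X - X1

out_full = propagate(X)
out_sum = propagate(X1) + propagate(X2)
print(np.allclose(out_full, out_sum))             # True: attributions sum to the full output
```

The same argument extends to MLPs with frozen activations, which is what makes a single decomposed forward pass sufficient.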
ZeCO: Zero-Communication Overhead Sequence Parallelism for Linear Attention
Yuhong CHOU · Zehao Liu · Rui-Jie Zhu · Xinyi Wan · Tianjian Li · Congying Chu · Qian Liu · Jibin Wu · Zejun MA
Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary performance bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve practical, near-linear end-to-end scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a novel collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60\% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
Understanding Differential Transformer Unchains Pretrained Self-Attentions
Chaerin Kong · Jiho Jang · Nojun Kwak
Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise-canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering utilization of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX effectively incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%).
Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels
Maximilian Beck · Korbinian Pöppel · Phillip Lippe · Sepp Hochreiter
Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels (Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs that enables arbitrarily large chunk sizes and high arithmetic intensity by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives. Our code is available at: https://github.com/NX-AI/mlstm_kernels
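The chunkwise-parallel formulation that TFLA builds on can be sketched for plain (ungated) linear attention: each chunk combines an inter-chunk contribution from a carried state with a causally masked intra-chunk attention, and the result matches the token-by-token recurrence exactly. A minimal numpy sketch (no gating, no tiling, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, C = 8, 4, 4                       # sequence length, head dim, chunk size
Q, K, V = rng.normal(size=(3, T, d))

# reference: plain (ungated) linear attention, token by token
S = np.zeros((d, d))
ref = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])        # recurrent state update
    ref[t] = Q[t] @ S

# chunkwise-parallel: carry an inter-chunk state, do intra-chunk work with a causal mask
out = np.zeros((T, d))
S = np.zeros((d, d))
mask = np.tril(np.ones((C, C)))
for s in range(0, T, C):
    q, k, v = Q[s:s+C], K[s:s+C], V[s:s+C]
    out[s:s+C] = q @ S + ((q @ k.T) * mask) @ v   # inter-chunk + masked intra-chunk
    S = S + k.T @ v                               # advance the state one full chunk
print(np.allclose(out, ref))   # True
```

The chunk loop trades the sequential per-token recurrence for matrix multiplications; TFLA's contribution is to additionally tile within each chunk so the chunk size (and hence arithmetic intensity) is no longer limited by on-chip memory.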
One SPACE to Rule Them All: Jointly Mitigating Factuality and Faithfulness Hallucinations in LLMs
Pengbo Wang · Chaozhuo Li · Chenxu Wang · Liwen Zheng · Litian Zhang · Xi Zhang
LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers
Woomin Song · Sai Muralidhar Jayanthi · Srikanth Ronanki · Kanthashree Mysore Sathyendra · Jinwoo Shin · Aram Galstyan · Shubham Katiyar · Sravan Babu Bodapati
As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model’s pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes an early-exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on ∞-Bench, RepoEval, and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
Accurate KV Cache Eviction via Anchor Direction Projection for Efficient LLM Inference
Zijie Geng · Jie Wang · Ziqi Liu · Feng Ju · Yiming Li · Xing Li · Mingxuan Yuan · Jianye Hao · Defu Lian · Enhong Chen · Feng Wu
Key-Value (KV) cache eviction---which retains the KV pairs of the most important tokens while discarding less important ones---is a critical technique for optimizing both memory usage and inference latency in large language models (LLMs). However, existing approaches often rely on simple heuristics---such as attention weights---to measure token importance, overlooking the spatial relationships between token value states in the vector space. This often leads to suboptimal token selections and thus performance degradation. To tackle this problem, we propose a novel method, namely **AnDPro** (**An**chor **D**irection **Pro**jection), which introduces a projection-based scoring function to more accurately measure token importance. Specifically, AnDPro operates in the space of value vectors and leverages the projections of these vectors onto an *``Anchor Direction''*---the direction of the pre-eviction output---to measure token importance and guide more accurate token selection. Experiments on $16$ datasets from the LongBench benchmark demonstrate that AnDPro can maintain $96.07\%$ of the full cache accuracy using only $3.44\%$ KV cache budget, reducing KV cache budget size by $46.0\%$ without compromising quality compared to previous state-of-the-art methods.
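A rough sketch of the projection idea for a single query (the scoring function here is illustrative and simplified relative to the paper): compute the pre-eviction output as the anchor, score each token by its weighted value's projection onto that direction, and keep only the top-scoring KV pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, budget = 16, 8, 4
q = rng.normal(size=d)
K, V = rng.normal(size=(2, N, d))

w = np.exp(q @ K.T / np.sqrt(d))
w /= w.sum()                               # attention weights of the current query
anchor = w @ V                             # pre-eviction output: the "anchor direction"
a_hat = anchor / np.linalg.norm(anchor)

# projection-style importance (illustrative; the paper's exact score may differ):
# each token's weighted value contribution along the anchor direction
scores = w * (V @ a_hat)
keep = np.argsort(scores)[-budget:]        # evict everything outside the top-budget set

approx = (w[keep] / w[keep].sum()) @ V[keep]
print(np.linalg.norm(approx - anchor))     # reconstruction error of the pruned cache
```

Compared with ranking by attention weight alone, this keeps tokens whose value vectors actually move the output along its dominant direction, which is the geometric intuition the method exploits.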
TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup
Fanxu Meng · Pingzhi Tang · Zengwei Yao · Xing Sun · Muhan Zhang
Modern large language models often face communication bottlenecks on current hardware rather than computational limitations. Multi-head latent attention (MLA) addresses this by compressing the key-value cache using low-rank matrices, while the Absorb operation prevents the KV cache from reverting to its original size, significantly boosting both training and inference speed. Despite the success of DeepSeek V2/V3/R1, most model providers have heavily invested in optimizing GQA-based models and, therefore, lack strong incentives to retrain MLA-based models from scratch. This paper demonstrates that MLA provides superior expressive power compared to GQA with the same KV cache overhead, thereby offering a rationale for transitioning from GQA to MLA. In addition, we introduce TransMLA, a framework that seamlessly converts any GQA-based pre-trained model (e.g., LLaMA, Qwen, Gemma, Mistral/Mixtral) into an MLA-based model. For the first time, our method enables direct conversion of these models into a format compatible with DeepSeek's codebase, allowing them to fully leverage the existing, highly-optimized support for the DeepSeek architecture within inference engines like vLLM and SGLang. By compressing 93\% of the KV cache in LLaMA-2-7B, we achieve a 10x speedup with an 8K context length while maintaining meaningful output. Moreover, the model requires only 6B tokens for fine-tuning to recover comparable performance across multiple benchmarks. TransMLA provides a practical path for migrating GQA-based models to the MLA structure, and when combined with DeepSeek’s advanced optimizations—such as FP8 quantization and Multi-Token Prediction—further inference acceleration can be achieved.
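The core of such a conversion can be sketched with plain linear algebra: a GQA key projection replicated across grouped heads has rank at most (number of KV heads) × (head dim), so it factors exactly into a down-projection (the cached MLA latent) and an up-projection that can be absorbed. The sketch below ignores RoPE and the value path, both of which a full conversion must also handle:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, n_kv, d_h = 32, 8, 2, 4
Wk = rng.normal(size=(d_model, n_kv * d_h))        # GQA key projection (shared KV heads)

# GQA replicates each KV head across n_heads // n_kv query heads; the replicated
# projection therefore has rank <= n_kv * d_h
blocks = np.split(Wk, n_kv, axis=1)                # one (d_model, d_h) block per KV head
W_rep = np.concatenate(
    [b for b in blocks for _ in range(n_heads // n_kv)], axis=1)

U, s, Vt = np.linalg.svd(W_rep, full_matrices=False)
r = int(np.sum(s > 1e-10))                         # numerical rank
W_down = U[:, :r] * s[:r]                          # d_model -> latent (this is cached)
W_up = Vt[:r]                                      # latent -> all heads (absorbed)

print(r, n_kv * d_h)                               # rank equals n_kv * d_h
X = rng.normal(size=(5, d_model))
print(np.allclose(X @ W_rep, (X @ W_down) @ W_up)) # True: the conversion is exact
```

Because the factorization is exact at rank $n_{kv} \cdot d_h$, the MLA form caches a latent no larger than the original GQA cache, and any extra latent capacity is what gives MLA strictly more expressive power at equal cache size.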
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
Julien Siems · Timur Carstensen · Arber Zela · Frank Hutter · Massimiliano Pontil · Riccardo Grazzi
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. Diagonal matrices, used in models such as Mamba, GLA, or mLSTM, yield fast runtime but have limited expressivity. To address this, recent architectures such as DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, which allows simultaneous token and channel mixing, improving associative recall and, as recently shown, state-tracking when allowing negative eigenvalues in the state-transition matrices. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency. We provide a detailed theoretical characterization of the state-tracking capability of DeltaProduct in finite precision and how it improves by increasing $n_h$. Our extensive experiments demonstrate that DeltaProduct outperforms DeltaNet in both state-tracking and language modeling, while also showing significantly improved length extrapolation capabilities.
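The structure of these transition matrices is easy to verify numerically: a product of $n_h$ generalized Householder factors $(I - \beta_i k_i k_i^\top)$ is "identity plus rank-$n_h$", and with $\beta \in (0, 2)$ and unit $k_i$ its spectral norm stays at most 1, so the recurrence remains stable. A small numpy check (parameter ranges chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_h = 8, 3

# build A as a product of generalized Householder factors (I - beta_i k_i k_i^T)
A = np.eye(d)
for _ in range(n_h):
    k = rng.normal(size=d)
    k /= np.linalg.norm(k)              # unit direction
    beta = rng.uniform(0, 2)            # beta in (0, 2) keeps factor eigenvalues in (-1, 1]
    A = A @ (np.eye(d) - beta * np.outer(k, k))

# the product is "identity plus rank-n_h": I - A has rank at most n_h
print(np.linalg.matrix_rank(np.eye(d) - A))   # <= 3
# spectral norm <= 1, so the recurrence S_t = A_t S_{t-1} + k_t v_t^T stays stable
print(np.linalg.norm(A, 2) <= 1 + 1e-9)       # True
```

Raising $n_h$ enlarges the reachable set of state-transition matrices (toward general orthogonal-like maps), which is the tunable expressivity-efficiency knob the abstract describes.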
Learning Linear Attention in Polynomial Time
Morris Yau · Ekin Akyürek · Jiayuan Mao · Josh Tenenbaum · Stefanie Jegelka · Jacob Andreas
Previous research has explored the expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the efficient learnability of Transformers from data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that learning the optimal multi-head linear attention can be recast as finding the optimal kernel predictor in a suitably defined RKHS. Moving to generalization, we construct an algorithm that, given a dataset, checks in polynomial time whether the set of best-fit multi-head linear attention networks on this data all perform an identical computation--a powerful notion for out-of-distribution generalization. We empirically validate our theoretical findings on several canonical tasks: learning random linear attention networks, key--value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformer models.
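The reduction to a kernel predictor can be made concrete for a single softmax-free head: the attention output is linear in the parameter tensor $M[a,b] \cdot W_v[c,:]$ under the fixed polynomial feature map $f(X)_t = x_t \otimes (\sum_s x_s x_s^\top)$, so an agnostic learner can simply run linear regression in feature space. A toy numpy recovery of a random linear-attention teacher (non-causal, single head; an illustration of the idea, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
Wq, Wk, Wv = rng.normal(size=(3, d, d))
M = Wq @ Wk.T                          # only the product matters for the scores

def linear_attn(X):                    # single-head, softmax-free attention
    return (X @ M @ X.T) @ (X @ Wv)

def features(X):
    # per-token feature: x_t (outer) sum_s x_s x_s^T, flattened; the output is
    # linear in the tensor G[a,b,c] = M[a,b] * Wv[c,:] over these features
    S = X.T @ X
    return np.stack([np.einsum('a,bc->abc', x, S).ravel() for x in X])

Xs = [rng.normal(size=(6, d)) for _ in range(200)]
F = np.concatenate([features(X) for X in Xs])          # (1200, 27)
Y = np.concatenate([linear_attn(X) for X in Xs])       # (1200, 3)
G, *_ = np.linalg.lstsq(F, Y, rcond=None)              # plain linear regression
print(np.max(np.abs(F @ G - Y)))       # ~0: the teacher is recovered exactly
```

The feature dimension is polynomial in $d$, which is why the agnostic relaxation (dropping the rank constraint on the tensor) yields a polynomial-time learner.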
What are you sinking? A geometric approach on attention sink
Valeria Ruscio · Umberto Nanni · Fabrizio Silvestri
Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact but the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types (centralized, distributed, and bidirectional) that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We also show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
Jeffrey Willette · Heejun Lee · Sung Ju Hwang
The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36\%pt performance increase, recovering 88\% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5\% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
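The delta-correction idea can be caricatured in a few lines: compute the full attention output only on a strided subset of queries, measure its deviation from the sparse (sliding-window) output there, and carry that delta forward to correct the remaining sparse outputs. The sketch below simplifies the paper's actual procedure considerably (single head, toy sizes, no sink tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, win, stride = 64, 8, 16, 8
Q, K, V = rng.normal(size=(3, T, d))

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn_row(t, lo):
    # causal attention output for query t over keys lo..t
    w = softmax(Q[t] @ K[lo:t+1].T / np.sqrt(d))
    return w @ V[lo:t+1]

full = np.stack([attn_row(t, 0) for t in range(T)])                    # quadratic
sparse = np.stack([attn_row(t, max(0, t - win + 1)) for t in range(T)])  # windowed

# delta correction (sketch): evaluate the full output on a strided subset of queries
# and carry the latest measured output delta forward between measurements
corrected = sparse.copy()
delta = np.zeros(d)
for t in range(T):
    if t % stride == 0:
        delta = full[t] - sparse[t]    # cheap: only T/stride full rows are computed
    corrected[t] = sparse[t] + delta

print(np.linalg.norm(full - sparse), np.linalg.norm(full - corrected))
```

At the measured positions the correction is exact by construction; between them, the correction helps to the extent that the output distribution shift varies slowly, which is the empirical observation the method relies on.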
Polyline Path Masked Attention for Vision Transformer
Zhongchen Zhao · Chaodong Xiao · Hui LIN · Qi Xie · Lei Zhang · Deyu Meng
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Yifan Pu · Jixuan Ying · Qixiu Li · Tianzhu Ye · Dongchen Han · Xiaochen Wang · Ziyi Wang · shao xinyu · Gao Huang · Xiu Li
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi–Head Self–Attention (MHSA) layer still performs a quadratic query–key interaction for \emph{every} token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce \emph{Visual–Contrast Attention} (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from $\mathcal{O}(N^{2}C)$ to $\mathcal{O}(N n C)$ with $n\!\ll\!N$. VCA first distils each head’s dense query field into a handful of spatially pooled \emph{visual–contrast tokens}, then splits them into a learnable \emph{positive} and \emph{negative} stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than $0.3$\,M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from $72.2\%$ to \textbf{$75.6\%$} (+$3.4$) and improves three strong hierarchical ViTs by up to $3.1$\%, while in class-conditional ImageNet generation it lowers FID-50K by $2.1$ to $5.2$ points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at \href{https://github.com/LeapLabTHU/LinearDiff}{https://github.com/LeapLabTHU/LinearDiff}.
A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
Lianghe Shi · Meng Wu · Huijie Zhang · Zekai Zhang · Molei Tao · Qing Qu
The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse---a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
Zinan Lin · Enshu Liu · Xuefei Ning · Junyi Zhu · Wenyu Wang · Sergey Yekhanin
Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59—without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latentzoningnetworks.html.
Traversal Verification for Speculative Tree Decoding
Yepeng Weng · Qiao Hu · Xujie Chen · Li Liu · Dianwen Mei · Huishi Qiu · Jiang Tian · zhongchao shi
Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates in each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations: First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance length. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner. Once a parent node is rejected, all its child nodes should be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results on various models and multiple tasks demonstrate that our method consistently improves acceptance length and throughput over token-level verification.
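For context, the token-level verification rule that Traversal Verification generalizes is the standard speculative-sampling test: a drafted token is accepted with probability $\min(1, p(x)/q(x))$, which preserves the target distribution token by token. A toy sketch (the paper's contribution is the sequence-level, leaf-to-root replacement for this rule, not this baseline):

```python
import random

def verify_token(p_target, q_draft, token, rng):
    # Token-level speculative verification: accept a drafted token with
    # probability min(1, p(token)/q(token)). Traversal Verification instead
    # accepts whole node-to-root subsequences, so rejecting a parent no
    # longer wastes all of its children.
    return rng.random() < min(1.0, p_target[token] / q_draft[token])

p = {"a": 0.8, "b": 0.2}   # toy target-model distribution
q = {"a": 0.4, "b": 0.6}   # toy draft-model distribution
rng = random.Random(0)
accepts = sum(verify_token(p, q, "b", rng) for _ in range(10_000))
rate = accepts / 10_000    # expected acceptance prob: min(1, 0.2/0.6) = 1/3
```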
Information-Theoretic Discrete Diffusion
Moongyu Jeon · Sangwoo Shin · Dongjae Jeon · Albert No
We present an information-theoretic framework for discrete diffusion models that yields principled estimators of log-likelihood using score-matching losses. Inspired by the I-MMSE identity for the Gaussian setup, we derive analogous results for the discrete setting. Specifically, we introduce the Information–Minimum Denoising Score Entropy (I-MDSE) relation, which links mutual information between data and its diffused version to the minimum denoising score entropy (DSE) loss. We extend this theory to masked diffusion and establish the Information–Minimum Denoising Cross-Entropy (I-MDCE) relation, connecting cross-entropy losses to mutual information in discrete masked processes. These results provide a time-integral decomposition of the log-likelihood of the data in terms of optimal score-based losses, showing that commonly used losses such as DSE and DCE are not merely variational bounds but tight and principled estimators of log-likelihood. The I-MDCE decomposition further enables practical extensions, including a time-free formula, conditional likelihood estimation in prompt–response tasks, and coupled Monte Carlo estimation of likelihood ratios. Experiments on synthetic and real-world data confirm the accuracy, variance stability, and utility of our estimators. The code is publicly available at https://github.com/Dongjae0324/infodis.
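The Gaussian I-MMSE identity (Guo, Shamai, Verdú) that motivates these discrete relations states that mutual information is the time-integral (over signal-to-noise ratio) of the minimum mean-squared error:

```latex
I\bigl(X;\sqrt{\mathrm{snr}}\,X+N\bigr)
  = \frac{1}{2}\int_{0}^{\mathrm{snr}} \mathrm{mmse}(\gamma)\,\mathrm{d}\gamma,
\qquad
\frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,
I\bigl(X;\sqrt{\mathrm{snr}}\,X+N\bigr)
  = \frac{1}{2}\,\mathrm{mmse}(\mathrm{snr}),
\qquad N\sim\mathcal{N}(0,1).
```

The I-MDSE and I-MDCE relations play the analogous role in the discrete setting, with the optimal DSE and DCE losses standing in for the MMSE under the integral.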
Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
Haozhen Zhang · Tao Feng · Jiaxuan You
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (\textit{i.e.}, assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present \textbf{Router-R1}, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
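The abstract names the three reward components (format, final outcome, cost) but not their weighting; the one-liner below is a hypothetical composition purely to make the trade-off concrete, with illustrative coefficients not taken from the paper.

```python
def routing_reward(format_ok, answer_correct, cost,
                   w_fmt=0.1, w_out=1.0, w_cost=0.2):
    # Rule-based reward (sketch): reward well-formed interleaving of
    # "think"/"route" actions and correct final answers, penalize the
    # accumulated invocation cost. Weights here are illustrative only.
    return w_fmt * float(format_ok) + w_out * float(answer_correct) - w_cost * cost

r = routing_reward(format_ok=True, answer_correct=True, cost=0.5)
```

Raising `w_cost` pushes the learned router toward cheaper models, which is the performance-cost trade-off the RL objective is optimizing.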
Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
Luca Eyring · Shyamgopal Karthik · Alexey Dosovitskiy · Nataniel Ruiz · Zeynep Akata
The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost.
Quantifying Elicitation of Latent Capabilities in Language Models
Elizabeth Donoway · Hailey Joren · Arushi Somani · Henry Sleight · Julian Michael · Michael Deweese · John Schulman · Ethan Perez · Fabien Roger · Jan Leike
Large language models often possess latent capabilities that lie dormant unless explicitly elicited, or surfaced, through fine-tuning or prompt engineering. Predicting, assessing, and understanding these latent capabilities pose significant challenges in the development of effective, safe AI systems. In this work, we recast elicitation as an information-constrained fine-tuning problem and empirically characterize upper bounds on the minimal number of parameters needed to achieve specific task performances. We find that training as few as 10–100 randomly chosen parameters—several orders of magnitude fewer than state-of-the-art parameter-efficient methods—can recover up to 50\% of the performance gap between pretrained-only and fully fine-tuned models, and 1,000s to 10,000s of parameters can recover 95\% of this performance gap. We show that a logistic curve fits the relationship between the number of trained parameters and model performance gap recovery. This scaling generalizes across task formats and domains, as well as model sizes and families, extending to reasoning models and remaining robust to increases in inference compute. To help explain this behavior, we consider a simplified picture of elicitation via fine-tuning where each trainable parameter serves as an encoding mechanism for accessing task-specific knowledge. We observe a relationship between the number of trained parameters and how efficiently relevant model capabilities can be accessed and elicited, offering a potential route to distinguish elicitation from teaching.
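Mechanically, "training 10–100 randomly chosen parameters" amounts to masking the gradient so only a fixed random subset of coordinates ever moves. A minimal sketch of one such masked update (a stand-in with dense weights, not the authors' training code):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=100_000)        # stand-in for pretrained weights
k = 100                                 # information budget: k trainable params
idx = rng.choice(theta.size, size=k, replace=False)
mask = np.zeros_like(theta)
mask[idx] = 1.0

def masked_sgd_step(theta, grad, lr=0.1):
    # Information-constrained fine-tuning: the gradient flows only through
    # the k randomly chosen coordinates; everything else stays frozen.
    return theta - lr * mask * grad

theta_ft = masked_sgd_step(theta, rng.normal(size=theta.size))
changed = int((theta_ft != theta).sum())   # no more than k entries move
```

Sweeping `k` and measuring recovered task performance traces out the logistic elicitation curve the paper reports.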
Learning to Generate Human-Human-Object Interactions from Textual Descriptions
Jeonghyeon Na · Sangwon Baik · Inhee Lee · Junyoung Lee · Hanbyul Joo
The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train a text-to-HOI and a text-to-HHI model using score-based diffusion models. Finally, we present a unified generative framework that integrates the two individual models, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin VIDEAU · Badr Youbi Idrissi · Alessandro Leite · Marc Schoenauer · Olivier Teytaud · David Lopez-Paz
Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When pretraining compute is carefully tuned and controlled, shallow hierarchies match strong BPE baselines, and deeper hierarchies show a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
Scaling Speculative Decoding with Lookahead Reasoning
Yichao Fu · Rui Ge · Zelei Shao · Zhijie Deng · Hao Zhang
Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling -- making the speedup modest and hardware-agnostic. We raise this ceiling with lookahead reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not an exact token match. In lookahead reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show that lookahead reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, lookahead reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models
Haoyi Song · Ruihan Ji · Naichen Shi · Fan Lai · Raed AL Kontar
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a fully probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input–output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods.
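Stripped of the dual Markov-chain machinery and semantic-similarity kernels, the core intuition of Inv-Entropy is simple: fix an output, look at the distribution of perturbed inputs the inverse model associates with it, and measure that distribution's entropy. The toy sketch below uses exact string matches in place of the paper's semantic similarities, so it is a deliberate simplification.

```python
import math
from collections import Counter

def inv_entropy(recovered_inputs):
    # Shannon entropy of the empirical input distribution tied to one fixed
    # output: few distinct inputs explain the output => low entropy => low
    # uncertainty; many diverse inputs => high entropy => high uncertainty.
    n = len(recovered_inputs)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(recovered_inputs).values())

low = inv_entropy(["Paris"] * 8)                           # concentrated
high = inv_entropy(["Paris", "Lyon", "Nice", "Lille"] * 2)  # diffuse
```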
BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning
Xuechen Zhang · Zijian Huang · Yingcong Li · Chenshun Ni · Jiasi Chen · Samet Oymak
Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. A typical approach for training such models combines a supervised fine-tuning (SFT) stage, often to distill reasoning capabilities from a larger model, followed by a reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Using a toy student-expert model over Markov chains, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization achieves exponentially sparse rewards as task complexity grows. To address these, we introduce BREAD, a GRPO variant that bridges SFT and RL via partial expert guidance and branch rollouts. When self-generated traces fail, BREAD adaptively inserts short expert prefixes/hints, allowing the small model to complete the rest of the reasoning path, and ensuring that each update includes at least one successful trace. This mechanism both densifies the reward signal and induces a natural learning curriculum. BREAD requires fewer than 40\% of ground-truth traces, consistently outperforming standard GRPO while speeding up the training by about 3$\times$. Importantly, we find that BREAD helps the model solve problems that are otherwise unsolvable by the SFT + RL strategy, highlighting how branch rollouts and expert guidance can aid SLM reasoning.
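The branch-rollout fallback can be sketched as a loop over expert-prefix lengths: attempt purely self-generated traces, and only when all of them fail, retry from a progressively longer expert prefix until at least one rollout succeeds. The interface below (`Trace`, `rollout`) is invented for illustration and is not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    steps: list
    success: bool

def branch_rollouts(problem, expert_trace, rollout, n=8):
    # BREAD's fallback (sketch): prefix length k = 0 means ordinary GRPO
    # rollouts; growing k injects just enough expert guidance so that each
    # update batch contains at least one successful trace, densifying the
    # reward and forming a natural curriculum.
    for k in range(len(expert_trace) + 1):
        traces = [rollout(problem, expert_trace[:k]) for _ in range(n)]
        if any(t.success for t in traces):
            return k, traces
    return len(expert_trace), traces

# Toy student: only completes the solution once given >= 2 expert steps.
def rollout(problem, prefix):
    return Trace(list(prefix) + ["..."], len(prefix) >= 2)

expert = ["step1", "step2", "step3"]
k_used, traces = branch_rollouts("2+2?", expert, rollout, n=4)
```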
From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning
Eric Zhao · Pranjal Awasthi · Nika Haghtalab
Finetuning provides a scalable and cost-effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in-context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tuning) and "knowledge injection" (e.g., teaching new facts) is a distinction without a difference. We instead identify concrete factors that explain the heterogeneous effectiveness observed with finetuning. To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. Our findings indicate that question-answer training data formats provide much stronger knowledge generalization than document/article-style training data, numerical information can be harder for finetuning to retain than categorical information, and models struggle to apply finetuned knowledge during multi-step reasoning even when trained on similar examples---all factors that render ``knowledge injection'' especially difficult, even after controlling for considerations like data augmentation and information volume. On the other hand, our findings also indicate that it is not fundamentally more difficult to finetune information about a real-world event than information about writing style.
SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models
Yibo Wang · Guangda Huzhang · Qingguo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · Lijun Zhang
Self-play fine-tuning has demonstrated promising abilities in adapting large language models (LLMs) to downstream tasks with limited real-world data. The basic principle is to iteratively refine the model with real samples and synthetic ones generated from itself. However, the existing methods primarily focus on the relative gaps between the rewards for two types of data, neglecting their absolute values. Through theoretical analysis, we identify that the gap-based methods suffer from unstable evolution, due to the potentially degenerated objectives. To address this limitation, we introduce a novel self-play fine-tuning method, namely \underline{S}elf-\underline{P}l\underline{A}y via Noise \underline{C}ontrastive \underline{E}stimation (SPACE), which leverages noise contrastive estimation to capture the real-world data distribution. Specifically, SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. As a result, SPACE independently optimizes the absolute reward values for each type of data, ensuring a consistently meaningful objective and thereby avoiding the instability issue. Theoretically, we show that the optimal solution of the objective in SPACE aligns with the underlying distribution of real-world data, and SPACE guarantees a provably stable convergence to the optimal distribution. Empirically, we show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples. Compared to gap-based self-play fine-tuning methods, SPACE exhibits remarkable superiority and stable evolution.
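The binary-classification view at the heart of SPACE can be sketched as a standard noise-contrastive estimation loss: score real samples up and self-generated samples down in absolute terms, rather than only optimizing the gap between them. The scalar-score toy below is illustrative only; in the actual method the scores come from the LLM's implicit reward.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nce_loss(real_scores, synth_scores):
    # NCE objective (sketch): treat synthetic samples as noise and real
    # ones as data. Because each side's absolute score is optimized
    # independently, the objective cannot degenerate the way gap-based
    # self-play objectives can.
    loss = -sum(math.log(sigmoid(s)) for s in real_scores)
    loss -= sum(math.log(1.0 - sigmoid(s)) for s in synth_scores)
    return loss / (len(real_scores) + len(synth_scores))

separated = nce_loss(real_scores=[2.0, 3.0], synth_scores=[-2.0, -3.0])
confused = nce_loss(real_scores=[0.0, 0.0], synth_scores=[0.0, 0.0])
```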
When and how can inexact generative models still sample from the data manifold?
Nisha Chandramoorthy · Adriaan de Clercq
A curious phenomenon observed in some dynamical generative models is the following: despite learning errors in the score function or the drift vector field, the generated samples appear to shift \emph{along} the support of the data distribution but not \emph{away} from it. In this work, we investigate this phenomenon of \emph{robustness of the support} by taking a dynamical systems approach on the generating stochastic/deterministic process. Our perturbation analysis of the probability flow reveals that infinitesimal learning errors cause the predicted density to be different from the target density only on the data manifold for a wide class of generative models. Further, what is the dynamical mechanism that leads to the robustness of the support? We show that the alignment of the top Lyapunov vectors (most sensitive infinitesimal perturbation directions) with the tangent spaces along the boundary of the data manifold leads to robustness and prove a sufficient condition on the dynamics of the generating process to achieve this alignment. Moreover, the alignment condition is efficient to compute and, in practice, for robust generative models, automatically leads to accurate estimates of the tangent bundle of the data manifold. Using a finite-time linear perturbation analysis on sample paths as well as probability flows, our work complements and extends existing works on obtaining theoretical guarantees for generative models from a stochastic analysis, statistical learning and uncertainty quantification points of view. Our results apply across different dynamical generative models, such as conditional flow-matching and score-based generative models, and for different target distributions that may or may not satisfy the manifold hypothesis.
One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling
Nimrod Berman · Ilan Naiman · Moshe Eliasof · Hedi Zisling · Omri Azencot
Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory - a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. Empirically, KDM achieves state-of-the-art performance across standard offline distillation benchmarks - improving FID scores by up to 40% in a single generation step.
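The core Koopman step is simple enough to sketch: once trajectories are encoded, propagation is a single learned linear operator, fit here by least squares on a toy linear system (the real KDM learns the encoder, operator, and decoder jointly; this illustration assumes the embedding is given and the dynamics exactly linear).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
A_true = 0.9 * np.eye(d)            # hidden linear dynamics in embedded space
Z0 = rng.normal(size=(d, 200))      # encoded noisy inputs (columns = samples)
Z1 = A_true @ Z0                    # encoded states one step ahead

# Koopman distillation (sketch): fit a single linear operator K so that
# generation collapses to encode -> one matrix product -> decode, i.e.
# single-step sampling instead of iterative denoising.
K = Z1 @ np.linalg.pinv(Z0)
```

In the full model, `K` propagates the embedding of a noisy input to the embedding of its clean endpoint, and a decoder maps that result back to data space in one step.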
Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity
Zichen Liu · Wei Zhang · Tiejun Li
Euclidean diffusion models have achieved remarkable success in generative modeling across diverse domains, and they have been extended to manifold cases in recent advances. Instead of explicitly utilizing the structure of special manifolds as studied in previous works, in this paper we investigate direct sampling of the Euclidean diffusion models for general manifold-structured data. We reveal the multiscale singularity of the score function in the ambient space, which hinders the accuracy of diffusion-generated samples. We then present an elaborate theoretical analysis of the singularity structure of the score function by decomposing it along the tangential and normal directions of the manifold. To mitigate the singularity and improve the sampling accuracy, we propose two novel methods: (1) Niso-DM, which reduces the scale discrepancies in the score function by utilizing a non-isotropic noise, and (2) Tango-DM, which trains only the tangential component of the score function using a tangential-only loss function. Numerical experiments demonstrate that our methods achieve superior performance on distributions over various manifolds with complex geometries.
Training-Free Constrained Generation With Stable Diffusion Models
Stefano Zampini · Jacob K Christopher · Luca Oneto · Davide Anguita · Ferdinando Fioretto
Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. While there is increasing effort to incorporate physics-based constraints into generative models, existing techniques are either limited in their applicability to latent diffusion frameworks or lack the capability to strictly enforce domain-specific constraints. To address this limitation, this paper proposes a novel integration of stable diffusion models with constrained optimization frameworks, enabling the generation of outputs satisfying stringent physical and functional requirements. The effectiveness of this approach is demonstrated through material design experiments requiring adherence to precise morphometric properties, challenging inverse design tasks involving the generation of materials inducing specific stress-strain responses, and copyright-constrained content generation tasks. All code has been released at https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion.
Continuous Diffusion Model for Language Modeling
Jaehyeong Jo · Sung Ju Hwang
Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. However, diffusion models that directly work on discrete data space fail to fully exploit the power of iterative refinement, as the signals are lost during transitions between discrete states. Existing continuous diffusion models for discrete data underperform compared to discrete methods, and the lack of a clear connection between the two approaches hinders the development of effective diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on this analogy, introduce a simple diffusion process that generalizes existing discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry, along with a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models.
CORAL: Disentangling Latent Representations in Long-Tailed Diffusion
Esther Rodriguez · Monica Welfert · Samuel McDowell · Nathan Stromberg · Julian Camarena · Lalitha Sankar
Diffusion models have achieved impressive performance in generating high-quality and diverse synthetic data. However, their success typically assumes a class-balanced training distribution. In real-world settings, multi-class data often follow a long-tailed distribution, where standard diffusion models struggle—producing low-diversity and lower-quality samples for underrepresented (tail) classes. While this degradation is well-documented, its underlying cause remains poorly understood. In this work, we investigate the behavior of diffusion models trained on long-tailed datasets and identify a key issue: the latent representations (from the bottleneck layer of the U-Net) for tail class subspaces exhibit significant overlap with those of head classes, leading to feature borrowing and poor generation quality. Importantly, we show that this is not merely due to limited data per class, but that the relative class imbalance significantly contributes to this phenomenon. To address this, we propose COntrastive Regularization for Aligning Latents (CORAL), a contrastive latent alignment framework that leverages supervised contrastive losses to encourage well-separated latent class representations. Experiments demonstrate that CORAL significantly improves both the diversity and visual quality of samples generated for tail classes relative to state-of-the-art methods.
MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees
Herbert Woisetschläger · Ryan Zhang · Shiqiang Wang · Hans Arno Jacobsen
Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
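The combination of virtual queues and per-request optimization described above can be illustrated with a generic drift-plus-penalty router. Everything below is an illustrative assumption, not the paper's specification: the model zoo, the costs, the satisfaction estimates `p_sat`, the trade-off weight `v`, and the exact scoring rule are all hypothetical.

```python
# Hypothetical model zoo: per-request cost and a learned satisfaction estimate.
zoo = [
    {"name": "small", "cost": 1.0, "p_sat": 0.70},
    {"name": "large", "cost": 5.0, "p_sat": 0.95},
]

def route(queue, v=0.1):
    """Drift-plus-penalty choice: trade operating cost against SLA pressure.

    The virtual queue acts as accumulated SLA debt; the larger it grows,
    the more the router favors models with high predicted satisfaction.
    (Scoring rule is an illustrative assumption.)
    """
    return min(zoo, key=lambda m: v * m["cost"] - queue * m["p_sat"])

def update_queue(queue, satisfied, target=0.9):
    """Virtual queue grows whenever responses miss the satisfaction target."""
    return max(queue + target - satisfied, 0.0)

# While SLA debt is low, cost dominates and the cheap model is chosen;
# after repeated unsatisfying responses, the router switches to the
# stronger model to restore the guarantee.
q = 0.0
picks = []
for satisfied in [0, 0, 0, 0, 0, 1, 1]:
    picks.append(route(q)["name"])
    q = update_queue(q, satisfied)
print(picks)
```

With these made-up numbers the first two requests go to the cheap model; once the queue exceeds the switchover point, all later requests are routed to the stronger one.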
Support Vector Generation: Kernelizing Large Language Models for Efficient Zero‑Shot NLP
Shohei Ohsawa
We introduce Support Vector Generation (SVG), a kernel-based framework that converts a frozen language model into an interpretable, training-free classifier for zero- and few-shot learning. SVG operates by combining Metropolis–Hastings sampling with support vector machine optimization in the reproducing kernel Hilbert space (RKHS) induced by the language model's embedding. Each classification decision is based on a weighted combination of at most 32 natural-language sentences, which serve as explicit support vectors and provide faithful rationales. Our theoretical analysis proves that SVG minimizes the empirical hinge loss over the span of the supports and admits a generalization bound independent of the language model size. Experiments on the GLUE benchmark show that SVG matches or surpasses prompting-based zero-shot baselines in accuracy across multiple tasks—without any fine-tuning or GPU acceleration. Notably, our CPU-only implementation completes training in under three minutes per task, and maintains competitive inference speed. These results suggest that SVG offers a viable path toward efficient, interpretable NLP systems under compute constraints.
Disentangled Cross-Modal Representation Learning with Enhanced Mutual Supervision
Lu Gao · Wenlan Chen · Daoyuan Wang · Fei Guo · Cheng Liang
Cross-modal representation learning aims to extract semantically aligned representations from heterogeneous modalities such as images and text. Existing multimodal VAE-based models often suffer from limited capability to align heterogeneous modalities or lack sufficient structural constraints to clearly separate the modality-specific and shared factors. In this work, we propose a novel framework, termed Disentangled Cross-Modal Representation Learning with Enhanced Mutual Supervision (DCMEM). Specifically, our model disentangles the common and distinct information across modalities and regularizes the shared representation learned from each modality in a mutually supervised manner. Moreover, we incorporate the information bottleneck principle into our model to ensure that the shared and modality-specific factors encode exclusive yet complementary information. Notably, our model is designed to be trainable on both complete and partial multimodal datasets with a valid Evidence Lower Bound. Extensive experimental results demonstrate significant improvements of our model over existing methods on various tasks including cross-modal generation, clustering, and classification.
On the Edge of Memorization in Diffusion Models
Sam Buchanan · Druv Pai · Yi Ma · Valentin De Bortoli
When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this paper, we propose a scientific and mathematical “laboratory” for investigating memorization and generalization in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully-designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, as our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at https://github.com/DruvPai/diffusionmemgen.
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
Xinyu Yang · Yuwei An · Hongyi Liu · Tianqi Chen · Beidi Chen
Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model enabling natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. For data creation, we develop Multiverse Curator, an automated LLM-assisted pipeline that transforms sequential reasoning chains into structured training data, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. At the system level, we implement Multiverse Engine to support parallel inference. It features a dedicated interpreter that dynamically switches between sequential and parallel generation, triggered directly by the model. After 3 hours of fine-tuning on 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gains, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, serving system, supporting tools, as well as data curation prompts and detailed training and evaluation recipes.
InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion
Yuanyi Wang · Zhaoyi Yan · Yiming Zhang · Qi Zhou · Yanggan Gu · Fei Wu · Hongxia Yang
Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose \textbf{InfiGFusion}, the first structure-aware fusion framework with a novel \textit{Graph-on-Logits Distillation} (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
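The co-activation graph at the heart of the GLD loss can be sketched as follows. This is a minimal, unbatched interpretation of the abstract's description: restricting each position's logits to its top-$k$ entries (zeroing the rest is an assumption about how the outer product is restricted) and summing the outer products across sequence positions.

```python
import numpy as np

def coactivation_graph(logits, k=4):
    """Illustrative sketch of a global co-activation graph from logits.

    Nodes are vocabulary channels; the edge weight between two channels
    accumulates the product of their logits whenever both appear in a
    position's top-k, summed over sequence positions.
    """
    seq_len, vocab = logits.shape
    graph = np.zeros((vocab, vocab))
    for pos in range(seq_len):
        row = logits[pos]
        topk = np.argpartition(row, -k)[-k:]  # indices of the top-k logits
        vals = np.zeros(vocab)
        vals[topk] = row[topk]                # keep only top-k entries
        graph += np.outer(vals, vals)         # accumulate joint activations
    return graph

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 10))  # toy: 6 positions, vocab of 10
G = coactivation_graph(logits)
print(G.shape)
```

By construction the graph is symmetric, and each position contributes at most $k^2$ nonzero edge updates, which is what makes a structure-aware comparison between teacher and student graphs tractable.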
LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities
Florian Sestak · Artur Toshev · Andreas Fürst · Günter Klambauer · Andreas Mayr · Johannes Brandstetter
Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where a pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE.
Asymmetric Dual Self-Distillation for 3D Self-Supervised Representation Learning
Remco Leijenaar · Hamidreza Kasaei
Learning semantically meaningful representations from unstructured 3D point clouds remains a central challenge in computer vision, especially in the absence of large-scale labeled datasets. While masked point modeling (MPM) is widely used in self-supervised 3D learning, its reconstruction-based objective can limit its ability to capture high-level semantics. We propose AsymDSD, an Asymmetric Dual Self-Distillation framework that unifies masked modeling and invariance learning through prediction in the latent space rather than the input space. AsymDSD builds on a joint embedding architecture and introduces several key design choices: an efficient asymmetric setup, disabling attention between masked queries to prevent shape leakage, multi-mask sampling, and a point cloud adaptation of multi-crop. AsymDSD achieves state-of-the-art results on ScanObjectNN (90.53\%) and further improves to 93.72\% when pretrained on 930k shapes, surpassing prior methods.
ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Lingfeng Wang · Hualing Lin · Senda Chen · Tao Wang · Changxu Cheng · Yangyang Zhong · Dong Zheng · Wuyue Zhao
While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences over the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models will be released.
Gemstones: A Model Suite for Multi-Faceted Scaling Laws
Sean McLeish · John Kirchenbauer · David Miller · Siddharth Singh · Abhinav Bhatele · Micah Goldblum · Ashwinee Panda · Tom Goldstein
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes, including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
Online Feedback Efficient Active Target Discovery in Partially Observable Environments
Anindya Sarkar · Binglin Ji · Yevgeniy Vorobeychik
In various scientific and engineering domains, where data acquisition is costly—such as in medical imaging, environmental monitoring, or remote sensing—strategic sampling from unobserved regions, guided by prior observations, is essential to maximize target discovery within a limited sampling budget. In this work, we introduce Diffusion-guided Active Target Discovery (DiffATD), a novel method that leverages diffusion dynamics for active target discovery. DiffATD maintains a belief distribution over each unobserved state in the environment, using this distribution to dynamically balance exploration-exploitation. Exploration reduces uncertainty by sampling regions with the highest expected entropy, while exploitation targets areas with the highest likelihood of discovering the target, indicated by the belief distribution and an incrementally trained reward model designed to learn the characteristics of the target. DiffATD enables efficient target discovery in a partially observable environment within a fixed sampling budget, all without relying on any prior supervised training. Furthermore, DiffATD offers interpretability, unlike existing black-box policies that require extensive supervised training. Through extensive experiments and ablation studies across diverse domains, including medical imaging, species discovery and remote sensing, we show that DiffATD performs significantly better than baselines and competitively with supervised methods that operate under full environmental observability.
Learnable Sampler Distillation for Discrete Diffusion Models
Feiyang Fu · Tongxian Guo · Zhaoqiang Liu
Discrete diffusion models (DDMs) have shown powerful generation ability for discrete data modalities like text and molecules. However, their practical application is hindered by inefficient sampling, which requires a large number of sampling steps. Accelerating DDMs with larger step sizes typically degrades generation quality, as it amplifies both the compounding decoding error from factorized predictions and the discretization error from numerical approximations. To address these challenges, we propose learnable sampler distillation (LSD), a novel approach to train fast and high-fidelity samplers for DDMs. LSD employs a distillation approach where a student sampler with a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with numerous steps. This alignment is achieved by optimizing learnable sampler coefficients that adaptively adjust the sampling dynamics. We further propose LSD+, which also learns time schedules that allocate steps non-uniformly. Experiments across text generation, image generation, and synthetic tasks demonstrate that our proposed approaches outperform existing samplers for DDMs, achieving substantially higher sampling quality with significantly fewer sampling steps.
Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
Byeonghu Na · Mina Kang · Jiseok Kwak · Minsang Park · Jiwoo Shin · SeJoon Jun · Gayoung Lee · Jin-Hwa Kim · Il-chul Moon
Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called \textit{neural entropy}, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.
Weaver: Shrinking the Generation-Verification Gap by Scaling Compute for Verification
Jon Saad-Falcon · Estefany Kelly Buchanan · Mayee Chen · Tzu-Heng (Brian) Huang · Brendan McLaughlin · Tanvir Bhathal · Shang Zhu · Ben Athiwaratkun · Frederic Sala · Scott Linderman · Azalia Mirhoseini · Christopher Ré
Verifiers can improve language model (LM) capabilities by providing feedback or selecting the best response from a pool of generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean for formal proofs). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers. To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. First, we find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in the verifiers. To reduce the dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses several challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these challenges by using dataset statistics to normalize outputs and filter specific verifiers. We study the effectiveness of Weaver in repeated sampling settings, where a model generates multiple candidate responses at test time and a verifier is used to select the correct one. Our evaluations demonstrate that Weaver significantly improves the pass@1 performance across several reasoning and math tasks, achieving o3-mini level accuracy with Llama 3.3 70B Instruct (a much cheaper non-reasoning model) as the generator, and an ensemble of smaller judge and reward models as the verifiers (86.2% average). This gain mirrors the jump achieved between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training interventions.
To make Weaver more efficient, we train a compact 400M cross-encoder using Weaver's combined output scores. This distilled model retains 98.7% of Weaver's full accuracy while reducing verification compute by up to 99.97%.
TreeGen: A Bayesian Generative Model for Hierarchies
Marcel Kollovieh · Nils Fleischmann · Filippo Guerranti · Bertrand Charpentier · Stephan Günnemann
In this work, we introduce TreeGen, a novel generative framework modeling distributions over hierarchies. We extend Bayesian Flow Networks (BFNs) to enable transitions between probabilistic and discrete hierarchies parametrized via categorical distributions. Our proposed scheduler provides smooth and consistent entropy decay across varying numbers of categories. We empirically evaluate TreeGen on the jet-clustering task in high-energy physics, demonstrating that it consistently generates valid trees that adhere to physical constraints and closely align with ground-truth log-likelihoods. Finally, by comparing TreeGen’s samples to the exact posterior distribution and performing likelihood maximization via rejection sampling, we demonstrate that TreeGen outperforms various baselines.
System-Embedded Diffusion Bridge Models
Bartlomiej Sobieski · Matthew Tivnan · Yuang Wang · Siyeop yoon · Pengfei Jin · Dufan Wu · Quanzheng Li · Przemyslaw Biecek
Solving inverse problems—recovering signals from incomplete or noisy measurements—is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System-embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.
GPO: Learning from Critical Steps to Improve LLM Reasoning
Jiahao Yu · Zelei Cheng · Xian Wu · Xinyu Xing
Large language models (LLMs) are increasingly used across domains, showing impressive potential on a wide range of tasks. Recently, reasoning LLMs have been proposed to improve the \textit{reasoning} or \textit{thinking} capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing their multi-step reasoning capabilities remains a significant challenge. While existing optimization methods have advanced LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce \textbf{G}uided \textbf{P}ivotal \textbf{O}ptimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the `critical step' within a reasoning trajectory - a point at which the model must proceed carefully in order to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples new rollouts from that point, and prioritizes the learning process on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process and thereby improve reasoning performance. We demonstrate that GPO is not a standalone method, but rather a general strategy that can be integrated with various optimization methods to improve reasoning performance. Beyond theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO consistently and significantly enhances the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
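The critical-step locator could be instantiated in many ways; one hypothetical sketch, treating the step with the steepest drop in estimated advantage as the point where the trajectory's prospects deteriorate most (the criterion and the numbers below are illustrative assumptions, not the paper's rule):

```python
def find_critical_step(advantages):
    """Hypothetical locator: return the index i at which the estimated
    advantage drops most sharply between step i and step i + 1."""
    drops = [advantages[i] - advantages[i + 1] for i in range(len(advantages) - 1)]
    return max(range(len(drops)), key=drops.__getitem__)

# Made-up advantage estimates along a 6-step reasoning trajectory:
adv = [0.8, 0.75, 0.7, 0.2, 0.15, 0.1]
step = find_critical_step(adv)
print(step)  # the 0.7 -> 0.2 transition is the steepest drop
```

The policy would then be reset to this step and new rollouts sampled from there, concentrating learning on the pivotal moment.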
Optimal Neural Compressors for the Rate-Distortion-Perception Tradeoff
Eric Lei · Hamed Hassani · Shirin Saeedi Bidokhti
Recent efforts in neural compression have focused on the rate-distortion-perception (RDP) tradeoff, where the perception constraint ensures the source and reconstruction distributions are close in terms of a statistical divergence. Theoretical work on RDP describes properties of RDP-optimal compressors without providing constructive and low complexity solutions. While classical rate-distortion theory shows that optimal compressors should efficiently pack space, RDP theory additionally shows that infinite randomness shared between the encoder and decoder may be necessary for RDP optimality. In this paper, we propose neural compressors that are low complexity and benefit from high packing efficiency through lattice coding and shared randomness through shared dithering over the lattice cells. For two important settings, namely infinite shared and zero shared randomness, we analyze the RDP tradeoff achieved by our proposed neural compressors and show optimality in both cases. Experimentally, we investigate the roles that these two components of our design, lattice coding and randomness, play in the performance of neural compressors on synthetic and real-world data. We observe that performance improves with more shared randomness and better lattice packing.
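The shared-randomness component above builds on classical subtractive dithered quantization. Below is a minimal one-dimensional sketch on the integer lattice (step size and dither range chosen for illustration); the encoder and decoder share the same dither, and the reconstruction error becomes uniform over a lattice cell, independent of the source.

```python
import numpy as np

def dithered_quantize(x, delta, dither):
    """Subtractive dithered quantization on the scaled integer lattice.

    Encoder: quantize the dithered source to the nearest lattice point.
    Decoder: subtract the shared dither from the quantized value.
    """
    q = np.round((x + dither) / delta)
    return delta * q - dither

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
# Shared dither, uniform over one lattice cell of width delta = 0.5.
u = rng.uniform(-0.25, 0.25, size=x.shape)
xhat = dithered_quantize(x, 0.5, u)
err = xhat - x
print(err.var())  # close to delta**2 / 12 ~= 0.0208
```

This illustrates why shared dithering is attractive for the RDP setting: the quantization noise is statistically independent of the source, which is the property the proposed lattice-based neural compressors exploit (the choice of a 1-D lattice here is purely for exposition).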
Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model
Kotaro Ikeda · Masanori Koyama · Jinzhe Zhang · Kohei Hayashi · Kenji Fukumizu
In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport. The proposed method addresses the challenge of handling the case of continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to the pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets in which continuous physical properties are defined as conditions.
Diffusion on Demand: Selective Caching and Modulation for Efficient Generation
Hee Min Choi · Hyoa Kang · Dokwan Oh · Nam Ik Cho
Diffusion transformers demonstrate significant potential for various generation tasks but are challenged by high computational cost. Recently, feature caching methods have been introduced to improve inference efficiency by storing features at certain timesteps and reusing them at subsequent timesteps. However, their effectiveness is limited as they rely only on choosing between cached features and performing model inference. Motivated by high cosine similarity between features across consecutive timesteps, we propose a cache-based framework that reuses features and selectively adapts them through linear modulation. In our framework, the selection is performed via a modulation gate, and both the gate and modulation parameters are learned. Extensive experiments show that our method achieves similar generation performance to the original sampler while requiring significantly less computation. For example, FLOPs and inference latency are reduced by $2.93\times$ and $2.15\times$ for DiT-XL/2 and by $2.83\times$ and $1.50\times$ for PixArt-$\alpha$, respectively. We find that modulation is effective when applied to as little as 2\% of layers, resulting in negligible computation overhead.
PID-controlled Langevin Dynamics for Faster Sampling on Generative Models
Hongyi Chen · Jianhai Shu · Jingtao Ding · Yong Li · Xiao-Ping (Steven) Zhang
Langevin dynamics sampling suffers from extremely low generation speed, fundamentally limited by numerous fine-grained iterations to converge to the target distribution. We introduce PID-controlled Langevin Dynamics (PIDLD), a novel sampling acceleration algorithm that reinterprets the sampling process using control-theoretic principles. By treating energy gradients as feedback signals, PIDLD combines historical gradients (the integral term) and gradient trends (the derivative term) to efficiently traverse energy landscapes and adaptively stabilize, thereby significantly reducing the number of iterations required to produce high-quality samples. Our approach requires no additional training, datasets, or prior information, making it immediately integrable with any Langevin-based method. Extensive experiments across image generation and reasoning tasks demonstrate that PIDLD achieves higher quality with fewer steps, making Langevin-based generative models more practical for efficiency-critical applications. The implementation can be found at \href{https://github.com/tsinghua-fib-lab/PIDLD}{https://github.com/tsinghua-fib-lab/PIDLD}.
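The control-theoretic reading of Langevin dynamics can be sketched as follows: the usual gradient step is the proportional (P) term, a running sum of gradients supplies the integral (I) term, and the change between consecutive gradients supplies the derivative (D) term. The gains, step size, noise scale, and target energy below are all hypothetical; this is not the paper's exact update rule.

```python
import numpy as np

def pid_langevin_sample(grad_energy, x0, steps=500, lr=0.1,
                        kp=1.0, ki=0.05, kd=0.2, noise=0.1, seed=0):
    """Illustrative PID-augmented Langevin update (gains are hypothetical)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    integral = np.zeros_like(x)   # accumulated past gradients (I term)
    prev_grad = np.zeros_like(x)
    for _ in range(steps):
        g = grad_energy(x)        # current gradient (P term)
        integral += g
        deriv = g - prev_grad     # gradient trend (D term)
        prev_grad = g
        control = kp * g + ki * integral + kd * deriv
        x = x - lr * control + np.sqrt(2 * lr) * noise * rng.standard_normal(x.shape)
    return x

# Toy target: standard Gaussian energy E(x) = ||x||^2 / 2, so grad E(x) = x.
samples = np.array([pid_langevin_sample(lambda x: x, [5.0, 5.0], seed=s)
                    for s in range(200)])
print(samples.mean(), samples.std())
```

Starting far from the mode (at 5.0), the combined P/I/D drive pulls the chain to the low-energy region; the I and D terms change how aggressively the energy landscape is traversed, which is the mechanism the abstract credits for reducing the required number of iterations.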
Spectral Analysis of Diffusion Models with Application to Schedule Design
Roi Benita · Miki Elad · Joseph Keshet
Diffusion models (DMs) have emerged as powerful tools for modeling complex data distributions and generating realistic new samples. Over the years, advanced architectures and sampling methods have been developed to make these models practically usable. However, certain synthesis process decisions still rely on heuristics without a solid theoretical foundation. In our work, we offer a novel analysis of the DM's inference process, introducing a comprehensive frequency response perspective. Specifically, by relying on a Gaussianity assumption, we present the inference process as a closed-form spectral transfer function, capturing how the generated signal evolves in response to the initial noise. We demonstrate how the proposed analysis can be leveraged to design a noise schedule that aligns effectively with the characteristics of the data. The spectral perspective also provides insights into the underlying dynamics and sheds light on the relationship between spectral properties and noise schedule structure. Our results lead to scheduling curves that are dependent on the spectral content of the data, offering a theoretical justification for some of the heuristics adopted by practitioners.
NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
Lanrui Wang · Mingyu Zheng · Hongyin Tang · Zheng Lin · Yanan Cao · Jingang Wang · Xunliang Cai · Weiping Wang
Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models' underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce \textsc{NeedleInATable} (NIAT), a new long-context tabular benchmark that treats each table cell as a ``needle'' and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that they may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding towards structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both the NIAT task and downstream tabular tasks, validating the importance of the NIAT capability for LLMs' genuine table understanding. Our data, code and models will be released to facilitate future research.
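The cell-as-needle idea can be made concrete with a toy probe generator: serialize a table and ask for one randomly chosen cell. The markdown serialization and the name `make_niat_example` are illustrative assumptions, not the benchmark's actual format.

```python
import random

def make_niat_example(n_rows=8, n_cols=4, rng=None):
    """Build a toy 'needle in a table' probe: a markdown table plus a lookup
    question targeting one random cell, with the ground-truth answer."""
    rng = random.Random(rng)
    header = [f"col{j}" for j in range(n_cols)]
    cells = [[f"v{i}_{j}" for j in range(n_cols)] for i in range(n_rows)]
    i, j = rng.randrange(n_rows), rng.randrange(n_cols)
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * n_cols) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in cells]
    question = f"What is the value in row {i} of column '{header[j]}'?"
    return "\n".join(lines), question, cells[i][j]
```

Scaling `n_rows` and `n_cols` stretches the context length while keeping the task itself trivial, which is exactly what separates perception of table structure from high-level reasoning.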
RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases
Dongwon Choi · Sunwoo Kim · Juyeon Kim · Kyungho Kim · Geon Lee · Shinhwan Kang · Myunghwan Kim · Kijung Shin
Recent advances have demonstrated the effectiveness of graph-based machine learning on relational databases (RDBs) for predictive tasks. Such approaches require transforming RDBs into graphs, a process we refer to as RDB-to-graph modeling, where rows of tables are represented as nodes and foreign-key relationships as edges. Yet, effective modeling of RDBs into graphs remains challenging. Specifically, there exist numerous ways to model RDBs into graphs, and performance on predictive tasks varies significantly depending on the chosen graph model of RDBs. In our analysis, we find that the best-performing graph model can yield up to 10\% higher performance than the common heuristic rule for graph modeling, yet remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph model–performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 10 automatic RDB-to-graph modeling methods on the 12 tasks about 380$\times$ faster than on-the-fly evaluation, which requires repeated GNN training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling. Our datasets and code are available at https://github.com/chlehdwon/RDB2G-Bench.
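A minimal sketch of one RDB-to-graph model of the kind the benchmark searches over: rows become nodes and foreign-key references become edges. The toy schema, the dictionary representation, and the function name are illustrative assumptions; the benchmark's actual graph models are richer.

```python
def rdb_to_graph(tables, foreign_keys):
    """Model a relational database as a graph: each row becomes a node
    identified by (table, primary_key); each foreign-key reference becomes
    an edge. This is one of many possible RDB-to-graph models.

    tables: {table_name: {pk_value: row_dict}}
    foreign_keys: [(child_table, fk_column, parent_table)]
    """
    nodes = {(t, pk) for t, rows in tables.items() for pk in rows}
    edges = []
    for child, fk_col, parent in foreign_keys:
        for pk, row in tables[child].items():
            ref = row.get(fk_col)
            if (parent, ref) in nodes:  # skip dangling references
                edges.append(((child, pk), (parent, ref)))
    return nodes, edges

# Toy two-table database: users and their orders.
tables = {
    "users": {1: {"name": "ada"}, 2: {"name": "bob"}},
    "orders": {10: {"user_id": 1}, 11: {"user_id": 2}, 12: {"user_id": 1}},
}
nodes, edges = rdb_to_graph(tables, [("orders", "user_id", "users")])
```

Even in this toy case, alternative models exist (e.g., collapsing orders into user-node features, or adding reverse edges), which is the design space an automatic RDB-to-graph modeling method must navigate.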
GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights
Shengbo Gong · Juntong Ni · Noveen Sachdeva · Carl Yang · Wei Jin
Graph condensation (GC) is an emerging technique designed to learn a significantly smaller graph that retains the essential information of the original graph. This condensed graph has shown promise in accelerating graph neural networks while preserving performance comparable to those achieved with the original, larger graphs. Additionally, this technique facilitates downstream applications like neural architecture search and deepens our understanding of redundancies in large graphs. Despite the rapid development of GC methods, particularly for node classification, a unified evaluation framework is still lacking to systematically compare different GC methods or clarify key design choices for improving their effectiveness. To bridge these gaps, we introduce GC4NC, a comprehensive framework for evaluating diverse GC methods on node classification across multiple dimensions including performance, efficiency, privacy preservation, denoising ability, NAS effectiveness, and transferability. Our systematic evaluation offers novel insights into how condensed graphs behave and the critical design choices that drive their success. These findings pave the way for future advancements in GC methods, enhancing their performance and expanding their real-world applications. The code is available at https://github.com/Emory-Melody/GraphSlim/tree/main/benchmark.
Random Search Neural Networks for Efficient and Expressive Graph Learning
Michael Ito · Danai Koutra · Jenna Wiens
Random walk neural networks (RWNNs) have emerged as a promising approach for graph representation learning, leveraging recent advances in sequence models to process random walks. However, under realistic sampling constraints, RWNNs often fail to capture global structure even in small graphs due to incomplete node and edge coverage, limiting their expressivity. To address this, we propose \textit{random search neural networks} (RSNNs), which operate on random searches, each of which guarantees full node coverage. Theoretically, we demonstrate that in sparse graphs, only $O(\log |V|)$ searches are needed to achieve full edge coverage, substantially reducing sampling complexity compared to the $O(|V|)$ walks required by RWNNs (assuming walk lengths scale with graph size). Furthermore, when paired with universal sequence models, RSNNs are universal approximators. We lastly show RSNNs are probabilistically invariant to graph isomorphisms, ensuring their expectation is an isomorphism-invariant graph function. Empirically, RSNNs consistently outperform RWNNs on molecular and protein benchmarks, achieving comparable or superior performance with up to 16$\times$ fewer sampled sequences. Our work bridges theoretical and practical advances in random walk based approaches, offering an efficient and expressive framework for learning on sparse graphs.
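The coverage guarantee that distinguishes random searches from random walks can be sketched directly: grow from a start node by repeatedly following a uniformly random edge leaving the visited set, so every node in the (assumed connected) graph is reached by construction. This is one interpretation of such a sampler, not necessarily the paper's exact procedure.

```python
import random

def random_search(adj, start, rng=None):
    """Sample one 'random search' node ordering over a connected graph.

    adj: {node: [neighbors]}. Unlike a fixed-length random walk, the
    returned sequence visits every node exactly once by construction.
    """
    rng = random.Random(rng)
    order = [start]
    seen = {start}
    while len(seen) < len(adj):
        # All edges crossing from the visited set to unvisited nodes.
        frontier = [(u, v) for u in seen for v in adj[u] if v not in seen]
        u, v = rng.choice(frontier)
        order.append(v)
        seen.add(v)
    return order
```

Each search yields a node sequence of length $|V|$ that a downstream sequence model can consume, and the union of the crossing edges over $O(\log |V|)$ independent searches covers all edges of a sparse graph with high probability, per the paper's analysis.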
Effects of Dropout on Performance in Long-range Graph Learning Tasks
Jasraj Singh · Keyue Jiang · Brooks Paige · Laura Toni
Message Passing Neural Networks (MPNNs) are a class of Graph Neural Networks (GNNs) that propagate information across the graph via local neighborhoods. The scheme gives rise to two key challenges: over-smoothing and over-squashing. While several Dropout-style algorithms, such as DropEdge and DropMessage, have successfully addressed over-smoothing, their impact on over-squashing remains largely unexplored. This represents a critical gap in the literature, as failure to mitigate over-squashing would make these methods unsuitable for long-range tasks – the intended use case of deep MPNNs. In this work, we study the aforementioned algorithms, and closely related edge-dropping algorithms – DropNode, DropAgg and DropGNN – in the context of over-squashing. We present theoretical results showing that DropEdge-variants reduce sensitivity between distant nodes, limiting their suitability for long-range tasks. To address this, we introduce DropSens, a sensitivity-aware variant of DropEdge, which is developed following the message-passing scheme of GCN. DropSens explicitly controls the proportion of information lost due to edge-dropping, thereby increasing sensitivity to distant nodes despite dropping the same number of edges. Our experiments on long-range synthetic and real-world datasets confirm the predicted limitations of existing edge-dropping and feature-dropping methods. Moreover, DropSens with GCN consistently outperforms graph rewiring techniques designed to mitigate over-squashing, suggesting that simple, targeted modifications can substantially improve a model’s ability to capture long-range interactions. Our conclusions highlight the need to re-evaluate and re-design existing methods for training deep GNNs, with a renewed focus on modelling long-range interactions. The code for reproducing the results in this work is available at https://github.com/ignasa007/Dropout-Effects-GNNs.
Conditional Diffusion Anomaly Modeling on Graphs
Chunyu Wei · Haozhe Lin · Yueguo Chen · Yunhai Wang
Graph anomaly detection (GAD) has become a critical research area, with successful applications in financial fraud and telecommunications. Traditional Graph Neural Networks (GNNs) face significant challenges: at the topology level, they suffer from over-smoothing that averages out anomalous signals; at the feature level, discriminative models struggle when fraudulent nodes obfuscate their features to evade detection. In this paper, we propose a Conditional Graph Anomaly Diffusion Model (CGADM) that addresses these issues through the iterative refinement and denoising reconstruction properties of diffusion models. Our approach incorporates a prior-guided diffusion process that injects a pre-trained conditional anomaly estimator into both forward and reverse diffusion chains, enabling more accurate anomaly detection. For computational efficiency on large-scale graphs, we introduce a prior confidence-aware mechanism that adaptively determines the number of reverse denoising steps based on prior confidence. Experimental results on benchmark datasets demonstrate that CGADM achieves state-of-the-art performance while maintaining significant computational advantages for large-scale graph applications.
Preference-driven Knowledge Distillation for Few-shot Node Classification
Xing Wei · Chunchun Chen · Rui Fan · Xiaofeng Cao · Sourav Medya · Wei Ye
Graph neural networks (GNNs) can efficiently process text-attributed graphs (TAGs) due to their message-passing mechanisms, but their training heavily relies on the human-annotated labels. Moreover, the complex and diverse local topologies of nodes of real-world TAGs make it challenging for a single mechanism to handle. Large language models (LLMs) perform well in zero-/few-shot learning on TAGs but suffer from a scalability challenge. Therefore, we propose a preference-driven knowledge distillation (PKD) framework to synergize the complementary strengths of LLMs and various GNNs for few-shot node classification. Specifically, we develop a GNN-preference-driven node selector that effectively promotes prediction distillation from LLMs to teacher GNNs. To further tackle nodes' intricate local topologies, we develop a node-preference-driven GNN selector that identifies the most suitable teacher GNN for each node, thereby facilitating tailored knowledge distillation from teacher GNNs to the student GNN. Extensive experiments validate the efficacy of our proposed framework in few-shot node classification on real-world TAGs. Our code is available at .
AMBER: Adaptive Mesh Generation by Iterative Mesh Resolution Prediction
Niklas Freymuth · Tobias Würth · Nicolas Schreiber · Balázs Gyenes · Andreas Boltres · Johannes Mitsch · Aleksandar Taranovic · Tai Hoang · Philipp Dahlinger · Philipp Becker · Luise Kärger · Gerhard Neumann
The cost and accuracy of simulating complex physical systems using the Finite Element Method (FEM) scales with the resolution of the underlying mesh. Adaptive meshes improve computational efficiency by refining resolution in critical regions, but typically require task-specific heuristics or cumbersome manual design by a human expert. We propose Adaptive Meshing By Expert Reconstruction (AMBER), a supervised learning approach to mesh adaptation. Starting from a coarse mesh, AMBER iteratively predicts the sizing field, i.e., a function mapping from the geometry to the local element size of the target mesh, and uses this prediction to produce a new intermediate mesh using an out-of-the-box mesh generator. This process is enabled through a hierarchical graph neural network, and relies on data augmentation by automatically projecting expert labels onto AMBER-generated data during training. We evaluate AMBER on 2D and 3D datasets, including classical physics problems, mechanical components, and real-world industrial designs with human expert meshes. AMBER generalizes to unseen geometries and consistently outperforms multiple recent baselines, including ones using Graph and Convolutional Neural Networks, and Reinforcement Learning-based approaches.
GraphTOP: Graph Topology-Oriented Prompting for Graph Neural Networks
Xingbo Fu · Zhenyu Lei · Zihan Chen · Binchi Zhang · Chuxu Zhang · Jundong Li
Graph Neural Networks (GNNs) have revolutionized the field of graph learning by learning expressive graph representations from massive graph data. As a common pattern to train powerful GNNs, the "pre-training, adaptation" scheme first pre-trains GNNs over unlabeled graph data and subsequently adapts them to specific downstream tasks. In the adaptation phase, graph prompting is an effective strategy that modifies input graph data with learnable prompts while keeping pre-trained GNN models frozen. Typically, existing graph prompting studies mainly focus on feature-oriented methods that apply graph prompts to node features or hidden representations. However, these studies often achieve suboptimal performance, as they consistently overlook the potential of topology-oriented prompting, which adapts pre-trained GNNs by modifying the graph topology. In this study, we conduct a pioneering investigation of graph prompting in terms of graph topology. We propose the first Graph Topology-Oriented Prompting (GraphTOP) framework to effectively adapt pre-trained GNN models for downstream tasks. More specifically, we reformulate topology-oriented prompting as an edge rewiring problem within multi-hop local subgraphs and relax it into the continuous probability space through reparameterization while ensuring tight relaxation and preserving graph sparsity. Extensive experiments on five graph datasets under four pre-training strategies demonstrate that our proposed GraphTOP outshines six baselines on multiple node classification datasets. Our code is available at https://github.com/xbfu/GraphTOP.
HubGT: Fast Graph Transformer with Decoupled Hierarchy Labeling
Ningyi Liao · Zihao Yu · Siqiang Luo · Gao Cong
Graph Transformer (GT) has recently emerged as a promising neural network architecture for learning graph-structured data. However, its global attention mechanism, whose complexity is quadratic in the graph size, prevents wider application to large graphs. Effectively representing graph information while ensuring learning efficiency remains challenging, as our analysis reveals that current GT designs targeting scalability still suffer from the computational bottleneck related to graph-scale operations. In this work, we tackle the GT scalability issue by proposing HubGT, a scalable Graph Transformer boosted by fully decoupled graph processing and simplified learning. HubGT represents the graph by a novel hierarchical scheme exploiting hub labels, which is shown to be more informative than plain adjacency by offering global connections while promoting locality, and is particularly suitable for handling complex graph patterns such as heterophily. We also design algorithms for efficiently constructing and querying the hub label hierarchy tailored for the GT attention training in scalable deployments. Notably, the precomputation and training processes of HubGT achieve complexities linear to the number of graph edges and nodes, respectively, while the training stage completely removes graph-related computations, leading to favorable mini-batch capability and GPU utilization. Extensive experiments demonstrate that HubGT is efficient in terms of computational enhancement and mini-batch capability over existing GT designs on large-scale benchmarks, while achieving top-tier effectiveness on both homophilous and heterophilous graphs.
GAMMA: Gated Multi-hop Message Passing for Homophily-Agnostic Node Representation in GNNs
Amir Ghazizadeh · Rickard Ewetz · Hao Zheng
The success of Graph Neural Networks (GNNs) leverages the homophily principle, where connected nodes share similar features and labels. However, this assumption breaks down in heterophilic graphs, where same-class nodes are often distributed across distant neighborhoods rather than immediate connections. Recent attempts expand the receptive field through multi-hop aggregation schemes that explicitly preserve intermediate representations from each hop distance. While effective at capturing heterophilic patterns, these methods require separate weight matrices per hop and feature concatenation, causing parameters to scale linearly with hop count. This leads to high computational complexity and GPU memory consumption. We propose Gated Multi-hop Message Passing (GAMMA), where nodes assess how relevant the aggregated information is from their k-hop neighbors. This assessment occurs through multiple refinement steps where the node compares each hop's embedding with its current representation, allowing it to focus on the most informative hops. During the forward pass, GAMMA finds the optimal mix of multi-hop information local to each node using a single feature vector without needing separate representations for each hop, thereby maintaining dimensionality comparable to single hop GNNs. In addition, we propose a weight sharing scheme that leverages a unified transformation for aggregated features from multiple hops so the global heterophilic patterns specific to each hop are learned during training. As such, GAMMA captures both global (per-hop) and local (per-node) heterophily patterns without high computation and memory overhead. Experiments show GAMMA matches or exceeds state-of-the-art heterophilic GNN accuracy, achieving up to $\approx20\times$ faster inference. Our code is publicly available at \url{https://github.com/amir-ghz/GAMMA}.
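The hop-gating idea can be sketched on a single feature vector per node: at each hop, a gate compares the hop's aggregated embedding with the node's current representation and mixes them, so no per-hop representations are concatenated and the shared transform `W` stands in for the unified per-hop weights. The gate parameterization below is a simplified assumption, not the paper's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gamma_aggregate(A_hat, X, n_hops=3, seed=0):
    """Gated multi-hop aggregation keeping a single vector per node.

    A_hat: normalized adjacency (n x n); X: node features (n x d).
    Each hop's message is transformed by a shared weight matrix, scored
    against the node's current state, and gated in, so memory stays at
    the dimensionality of a single-hop GNN.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(d, d))  # shared across hops (untrained here)
    h = X.copy()       # current node representation
    hop = X.copy()     # running k-hop aggregation
    for _ in range(n_hops):
        hop = A_hat @ hop                   # advance one hop
        m = hop @ W
        # Per-node relevance of this hop, from agreement with current state.
        gate = sigmoid(np.sum(h * m, axis=1, keepdims=True))
        h = gate * m + (1.0 - gate) * h
    return h
```

Because `h` never grows with the hop count, parameters and activations stay constant in `n_hops`, in contrast to concatenation-based multi-hop schemes whose cost scales linearly with the number of hops.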
Disentangling Hyperedges through the Lens of Category Theory
Yoonho Lee · Junseok Lee · Sangwoo Seo · Sungwon Kim · Yeongmin Kim · Chanyoung Park
Despite the promising results of disentangled representation learning in discovering latent patterns in graph-structured data, few studies have explored disentanglement for hypergraph-structured data. Integrating hyperedge disentanglement into hypergraph neural networks enables models to leverage hidden hyperedge semantics, such as unannotated relations between nodes, that are associated with labels. This paper presents an analysis of hyperedge disentanglement from a category-theoretical perspective and proposes a novel criterion for disentanglement derived from the naturality condition. Our proof-of-concept model experimentally showed the potential of the proposed criterion by successfully capturing functional relations of genes (nodes) in genetic pathways (hyperedges).
Many real-world graphs exhibit diverse and complex topological structures that are not well captured by geometric manifolds with uniform global curvature, such as hyperbolic or spherical spaces. Recently, there has been growing interest in embedding graphs into pseudo-Riemannian manifolds, which generalize both hyperbolic and spherical geometries. However, existing approaches face three significant limitations, including an ineffective pseudo-Riemannian framework, shallow architectures, and the absence of a clear guideline for selecting suitable pseudo-Riemannian manifolds. To address these issues, we introduce a novel diffeomorphic framework for graph embedding that aligns with the nature of pseudo-Riemannian manifolds. Subsequently, we propose the pseudo-Riemannian Graph Transformer for learning representations of complex graph structures. Our diffeomorphic framework in pseudo-Riemannian geometry enables principled definitions of core transformer components, including linear attention, residual connection, and layer normalization. Finally, we develop a lightweight space searching algorithm to automatically identify the most suitable pseudo-Riemannian manifold for an input graph. Extensive experiments on diverse real-world graphs demonstrate that our model consistently outperforms other baselines in both node classification and link prediction tasks.
Where Graph Meets Heterogeneity: Multi-View Collaborative Graph Experts
Zhihao Wu · Jinyu Cai · Yunhe Zhang · Jielong Lu · Zhaoliang Chen · Shuman Zhuang · Haishuai Wang
The convergence of graph learning and multi-view learning has propelled the emergence of multi-view graph neural networks (MGNNs), offering unprecedented capabilities to address complex real-world data characterized by heterogeneous yet interconnected information. While existing MGNNs exploit the potential of multi-view graphs, they often fail to harmonize the dual inductive biases critical to multi-view learning: consistency (inherent inter-view agreement) and complementarity (view-specific distinctiveness). To bridge this gap, we propose Multi-view Collaborative Graph Experts (MvCGE), a novel framework grounded in the Mixture-of-Experts (MoE) paradigm. MvCGE establishes architectural consistency through shared parameters while preserving complementarity via layer-wise collaborative graph experts, which are dynamically activated by a graph-aware routing mechanism that adapts to the structural nuances of each view. This dual-level design is further reinforced by two novel components: a load equilibrium loss to prevent expert collapse and ensure balanced specialization, and a graph discrepancy loss based on distributional divergence to enhance inter-view complementarity. Extensive experiments on diverse datasets demonstrate MvCGE’s superiority.
Memorization in Graph Neural Networks
Adarsh Jamadandi · Jing Xu · Adam Dziedzic · Franziska Boenisch
Deep neural networks (DNNs) have been shown to memorize their training data, but similar analyses for graph neural networks (GNNs) remain under-explored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We establish an inverse relationship between memorization and graph homophily, i.e., the tendency of connected nodes to share labels or features. Lower homophily significantly increases memorization, indicating that GNNs rely on label memorization when learning less homophilic graphs. We then analyze GNN training dynamics and find that increased memorization in low-homophily graphs is tightly coupled to GNNs' implicit bias toward using graph structure. When structure is less informative, models instead memorize node labels to minimize training loss. Finally, we show that nodes with higher label inconsistency in their feature-space neighborhood are more prone to memorization. Based on these insights, we investigate graph rewiring as a mitigation strategy. Our results show that rewiring reduces memorization without harming model performance, while also lowering the privacy risk for previously memorized data points. Thus, our work advances understanding of GNN learning and supports more privacy-preserving GNN deployment.
Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs
Zhehao Li · Kangbo Lyu · Yixuan Li · Tao Du · Ligang Liu
The conjugate gradient solver (CG) is a prevalent method for solving symmetric and positive definite linear systems $\mathbf{Ax} = \mathbf{b}$, where effective preconditioners are crucial for fast convergence. Traditional preconditioners rely on prescribed algorithms to offer rigorous theoretical guarantees, which limits their ability to exploit optimization from data. Existing learning-based methods often utilize Graph Neural Networks (GNNs) to improve the performance and speed up the construction. However, their reliance on incomplete factorization leads to significant challenges: the associated triangular solve hinders GPU parallelization in practice, and introduces long-range dependencies which are difficult for GNNs to model. To address these issues, we propose a learning-based method to generate GPU-friendly preconditioners, particularly using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners, which avoids triangular solves and requires only two matrix-vector products at each CG step. The locality of the matrix-vector product is compatible with the local propagation mechanism of GNNs. The flexibility of GNNs also allows our approach to be applied in a wide range of scenarios. Furthermore, we introduce a statistics-based scale-invariant loss function. Its design matches CG's property that the convergence rate depends on the condition number, rather than the absolute scale of $\mathbf{A}$, leading to improved performance of the learned preconditioner. Evaluations on three PDE-derived datasets and one synthetic dataset demonstrate that our method outperforms standard preconditioners (Diagonal, IC, and traditional SPAI) and previous learning-based preconditioners on GPUs. We reduce solution time on GPUs by 40%-53% (68%-113% faster), along with better condition numbers and superior generalization performance.
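The two-matvec-per-step structure can be seen in a plain preconditioned CG loop where the preconditioner is an explicit matrix `M` approximating $\mathbf{A}^{-1}$ and is applied as a matrix-vector product, with no triangular solve. The sketch below substitutes a simple diagonal (Jacobi) approximate inverse for illustration; a learned SPAI preconditioner would replace `M`.

```python
import numpy as np

def pcg(A, b, M, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient with an approximate-inverse
    preconditioner: M ~ A^{-1} is applied as a plain matvec, so each
    iteration needs only the products A @ p and M @ r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M @ r
    p = z.copy()
    rz = r @ z
    for i in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, i + 1
        z = M @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter
```

Since both products are embarrassingly parallel sparse matvecs, the loop maps cleanly onto GPUs, which is precisely the advantage over incomplete-factorization preconditioners whose triangular solves serialize across rows.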
Enhancing Graph Classification Robustness with Singular Pooling
Sofiane Ennadir · Oleg Smirnov · Yassine ABBAHADDOU · Lele Cao · Johannes Lutzeyer
Graph Neural Networks (GNNs) have achieved strong performance across a range of graph representation learning tasks, yet their adversarial robustness in graph classification remains underexplored compared to node classification. While most existing defenses focus on the message-passing component, this work investigates the overlooked role of pooling operations in shaping robustness. We present a theoretical analysis of standard flat pooling methods (sum, average and max), deriving upper bounds on their adversarial risk and identifying their vulnerabilities under different attack scenarios and graph structures. Motivated by these insights, we propose Robust Singular Pooling (RS-Pool), a novel pooling strategy that leverages the dominant singular vector of the node embedding matrix to construct a robust graph-level representation. We theoretically investigate the robustness of RS-Pool and interpret the resulting bound leading to improved understanding of our proposed pooling operator. While our analysis centers on Graph Convolutional Networks (GCNs), RS-Pool is model-agnostic and can be implemented efficiently via power iteration. Empirical results on real-world benchmarks show that RS-Pool provides better robustness than the considered pooling methods when subject to state-of-the-art adversarial attacks while maintaining competitive clean accuracy.
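The core pooling step, reading a graph-level vector off the dominant singular direction of the node embedding matrix via power iteration, can be sketched as follows. This is an illustrative interpretation of the idea; the paper's exact construction of the graph-level representation may differ.

```python
import numpy as np

def rs_pool(H, n_iter=500, seed=0):
    """Pool node embeddings H (n_nodes x d) into a d-dimensional
    graph-level vector: the dominant right singular vector of H,
    computed by power iteration on H^T H (no full SVD needed)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=H.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = H.T @ (H @ v)       # one step of power iteration on H^T H
        v /= np.linalg.norm(v)  # renormalize to keep the iterate stable
    return v
```

The intuition for robustness is that the dominant singular direction aggregates information across all nodes, so an adversarial perturbation of a few node embeddings moves it far less than it moves a sum or max over nodes.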
Characterizing the Expressivity of Fixed-Precision Transformer Language Models
Jiaoda Li · Ryan Cotterell
Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. In this work, we analyze a restricted idealization of fixed-precision transformers with strict future masking, soft attention, and no positional encodings. We establish that this class of models is exactly as expressive as a specific fragment of linear temporal logic that contains only a single temporal operator: the $\texttt{past}$ operator. We further connect this fragment to established classes in formal language theory, automata theory, and algebra, yielding a unified framework for understanding transformer expressivity under this idealization. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their characterized expressive capacity generalize reliably across sequence lengths, while they consistently fail to generalize on languages beyond it.
Interpretable Global Minima of Deep ReLU Neural Networks on Sequentially Separable Data
Thomas Chen · Patricia Muñoz Ewald
We explicitly construct zero loss neural network classifiers. We write the weight matrices and bias vectors in terms of cumulative parameters, which determine truncation maps acting recursively on input space. The configurations for the training data considered are $(i)$ sufficiently small, well separated clusters corresponding to each class, and $(ii)$ equivalence classes which are sequentially linearly separable. In the best case, for $Q$ classes of data in $\mathbb{R}^{M}$, global minimizers can be described with $Q(M+2)$ parameters.
RepGuard: Adaptive Feature Decoupling for Robust Backdoor Defense in Large Language Models
Chenxu Niu · Jie Zhang · Yanbing Liu · Yunpeng Li · Jinta Weng · Yue Hu
Backdoor attacks pose a significant threat to large language models (LLMs) by embedding malicious triggers that manipulate model behavior. However, existing defenses primarily rely on prior knowledge of backdoor triggers or targets and offer only superficial mitigation strategies, thus struggling to fundamentally address the inherent reliance on unreliable features. To address these limitations, we propose a novel defense strategy, \textit{RepGuard}, that strengthens LLM resilience by adaptively separating abnormal features from useful semantic representations, rendering the defense agnostic to specific trigger patterns. Specifically, we first introduce a dual-perspective feature localization strategy that integrates local consistency and sample-wise deviation metrics to identify suspicious backdoor patterns. Based on this identification, an adaptive mask generation mechanism is applied to isolate backdoor-targeted shortcut features by decomposing hidden representations into independent spaces, while preserving task-relevant semantics. With a multi-objective optimization framework, our method inherently mitigates backdoor attacks. Across \textit{Target Refusal} and \textit{Jailbreak} tasks under four types of attacks, RepGuard consistently reduced the attack success rate on poisoned data by nearly 80\% on average, while maintaining near-original task performance on clean data. Extensive experiments demonstrate that RepGuard provides a scalable and interpretable solution for safeguarding LLMs against sophisticated backdoor threats.
Uncertainty Quantification for Deep Regression using Contextualised Normalizing Flows
Adriel Sosa Marco · John D. Kirwan · Alexia Toumpa · Simos Gerasimou
Quantifying uncertainty in deep regression models is important both for understanding the confidence of the model and for safe decision-making in high-risk domains. Existing approaches that yield prediction intervals overlook distributional information, neglecting the effect of multimodal or asymmetric distributions on decision-making. Similarly, full or approximated Bayesian methods, while yielding the predictive posterior density, demand major modifications to the model architecture and retraining. We introduce MCNF, a novel post hoc uncertainty quantification method that produces both prediction intervals and the full conditioned predictive distribution. MCNF operates on top of the underlying trained predictive model; thus, no predictive model retraining is needed. We provide experimental evidence that the MCNF-based uncertainty estimate is well calibrated, is competitive with state-of-the-art uncertainty quantification methods, and provides richer information for downstream decision-making tasks.
Monitoring Risks in Test-Time Adaptation
Mona Schirmer · Metod Jazbec · Christian Andersson Naesseth · Eric Nalisnick
Encountering shifted data at test time is a ubiquitous challenge when deploying predictive machine learning models. Test-time adaptation (TTA) methods aim to address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can help extend the model's deployment lifespan, there are scenarios where, despite adaptation, the drop in the model's performance remains significant enough to warrant taking the model offline and retraining. To detect such failure cases, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios where the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring in TTA, and we demonstrate the applicability of our proposed TTA monitoring framework across a representative set of TTA methods, datasets, and distribution shift types.
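As a rough illustration of the monitoring idea (not the paper's construction: this sketch assumes a bounded observable loss stream, whereas the paper's extensions handle the label-free TTA setting, and the Hoeffding confidence sequence with a union bound over time is only one simple instantiation):

```python
import math
import random

def monitor(losses, risk_tolerance=0.3, delta=0.05):
    """Anytime-valid risk monitor over a stream of bounded losses in [0, 1].

    A Hoeffding confidence sequence (spending delta / (t * (t + 1)) of the
    error budget at step t, which sums to delta over all t) bounds the
    running risk at every step simultaneously. An alert fires the first
    time even the *lower* confidence bound exceeds the tolerated risk,
    i.e. the performance criterion is provably violated.
    """
    total = 0.0
    for t, loss in enumerate(losses, start=1):
        total += loss
        mean = total / t
        radius = math.sqrt(math.log(2.0 * t * (t + 1) / delta) / (2.0 * t))
        if mean - radius > risk_tolerance:
            return t  # alert: take the model offline and retrain
    return None  # criterion never provably violated

random.seed(0)
# 0/1 losses: a healthy phase (error ~0.1), then a shift adaptation cannot fix (~0.6).
stream = [1.0 if random.random() < 0.1 else 0.0 for _ in range(300)]
stream += [1.0 if random.random() < 0.6 else 0.0 for _ in range(1200)]
alert_step = monitor(stream)
```

Because the confidence sequence is valid at all times simultaneously, the monitor can be checked after every test batch without inflating the false-alarm probability beyond `delta`.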
Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty
Xu Wan · Chao Yang · Cheng Yang · Jie Song · Mingyang Sun
Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose \textbf{Fuz-RL}, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value functions using Choquet integrals. Theoretically, we prove that solving the Fuz-RL problem (in Constrained Markov Decision Process (CMDP) form) is equivalent to solving distributionally robust safe RL problems (in robust CMDP form), effectively avoiding min-max optimization. Empirical analyses on safe-control-gym and safety-gymnasium scenarios demonstrate that Fuz-RL effectively integrates with existing safe RL baselines in a model-free manner, significantly improving both safety and control performance under various types of uncertainties in observation, action, and dynamics.
Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum
Wenquan Lu · Jiaqi Zhang · Hugues Van Assel · Randall Balestriero
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8\% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
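The two ingredients, the denoised-to-noisy curriculum and the teacher-guided anchoring, can be sketched as follows (the hard curriculum switch, the squared-distance anchor, and the `weight` hyperparameter are simplifying assumptions for illustration; the actual DINOv2-based losses are more involved):

```python
def curriculum_batch(denoised, noisy, epoch, switch_epoch):
    """Denoised-to-noisy curriculum: pretrain on denoised samples first,
    then continue pretraining on the raw noisy samples."""
    return denoised if epoch < switch_epoch else noisy

def anchor_loss(noisy_emb, denoised_emb):
    """Teacher-guided regularizer: anchor the embedding of a noisy view to
    the (frozen teacher) embedding of its denoised counterpart."""
    return sum((a - b) ** 2 for a, b in zip(noisy_emb, denoised_emb))

def total_loss(ssl_loss, noisy_emb, denoised_emb, weight=0.5):
    """SSL objective plus the anchoring term that transfers the denoiser's
    knowledge into the backbone, so the denoiser can be discarded later."""
    return ssl_loss + weight * anchor_loss(noisy_emb, denoised_emb)
```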
Consensus-Robust Transfer Attacks via Parameter and Representation Perturbations
Shixin Li · Zewei Li · Xiaojing Ma · Xiaofan Bai · Pingyi Hu · Dongmei Zhang · Bin Zhu
Adversarial examples crafted on one model often exhibit poor transferability to others, hindering their effectiveness in black-box settings. This limitation arises from two key factors: (i) \emph{decision-boundary variation} across models and (ii) \emph{representation drift} in feature space. We address these challenges through a new perspective that frames transferability for \emph{untargeted attacks} as a \emph{consensus-robust optimization} problem: adversarial perturbations should remain effective across a neighborhood of plausible target models. To model this uncertainty, we introduce two complementary perturbation channels: a \emph{parameter channel}, capturing boundary shifts via weight perturbations, and a \emph{representation channel}, addressing feature drift via stochastic blending of clean and adversarial activations. We then propose \emph{CORTA} (COnsensus--Robust Transfer Attack), a lightweight attack instantiated from this robust formulation using two first-order strategies: (i) sensitivity regularization based on the squared Frobenius norm of logits’ Jacobian with respect to weights, and (ii) Monte Carlo sampling for blended feature representations. Our theoretical analysis provides a certified lower bound linking these approximations to the robust objective. Extensive experiments on CIFAR-100 and ImageNet show that CORTA significantly outperforms state-of-the-art transfer-based methods---including ensemble approaches---across CNN and Vision Transformer targets. Notably, CORTA achieves a \emph{19.1 percentage-point gain in transfer success rate over the best prior method} while using only a single surrogate model.
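The two perturbation channels can be sketched as follows (Gaussian weight noise, uniform blending weights, and the Monte Carlo sample count are illustrative assumptions; the paper's certified formulation and first-order approximations are more specific):

```python
import random

def perturb_weights(weights, sigma, rng):
    """Parameter channel: sample a plausible nearby target model by adding
    Gaussian noise to the surrogate's weights, modeling decision-boundary
    variation across unknown targets."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def blend_features(clean, adv, rng):
    """Representation channel: stochastically blend clean and adversarial
    activations to emulate feature drift across unknown targets."""
    a = rng.uniform(0.0, 1.0)
    return [a * c + (1.0 - a) * x for c, x in zip(clean, adv)]

def monte_carlo_loss(loss_fn, clean, adv, rng, n_samples=16):
    """Average the attack loss over sampled blended representations, the
    Monte Carlo strategy used for the representation channel."""
    return sum(loss_fn(blend_features(clean, adv, rng))
               for _ in range(n_samples)) / n_samples

rng = random.Random(0)
avg = monte_carlo_loss(lambda f: sum(f), [0.0, 0.0], [1.0, 1.0], rng)
```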
MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs
Tobias Lorenz · Marta Kwiatkowska · Mario Fritz
Data errors, corruptions, and poisoning attacks during training pose a major threat to the reliability of modern AI systems. While extensive effort has gone into empirical mitigations, the evolving nature of attacks and the complexity of data require a more principled, provable approach to robustly learn on such data—and to understand how perturbations influence the final model. Hence, we introduce MIBP-Cert, a novel certification method based on mixed-integer bilinear programming (MIBP) that computes sound, deterministic bounds to provide provable robustness even under complex threat models. By computing the set of parameters reachable through perturbed or manipulated data, we can predict all possible outcomes and guarantee robustness. To make solving this optimization problem tractable, we propose a novel relaxation scheme that bounds each training step without sacrificing soundness. We demonstrate the applicability of our approach to continuous and discrete data, as well as different threat models—including complex ones that were previously out of reach.
Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning
Zhifang Zhang · Shuo He · Haobo Wang · Bingquan Shen · Lei Feng
Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, while they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we reveal that CLIP's vulnerabilities primarily stem from its tendency to encode features beyond in-dataset predictive patterns, compromising its visual feature resistivity to input perturbations. This makes its encoded features highly susceptible to being reshaped by backdoor triggers. To address this challenge, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs deep visual prompt tuning with a specially designed feature-repelling loss. Specifically, RVPT adversarially repels the encoded features from deeper layers while optimizing the standard cross-entropy loss, ensuring that only predictive features in downstream tasks are encoded, thereby enhancing CLIP’s visual feature resistivity against input perturbations and mitigating its susceptibility to backdoor attacks. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27\% of the parameters in CLIP, yet it significantly outperforms state-of-the-art defense methods, reducing the attack success rate from 89.70\% to 2.76\% against the most advanced multimodal attacks on ImageNet and effectively generalizes its defensive capabilities across multiple datasets. Our code is available at https://anonymous.4open.science/r/rvpt-anonymous.
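The combined objective can be sketched as follows (the squared-distance repelling term and the `repel_weight` coefficient are illustrative stand-ins for the paper's feature-repelling loss):

```python
def repel_term(prompted_feats, frozen_feats):
    """Distance between the prompted model's deep visual features and the
    frozen CLIP backbone's features; RVPT *maximizes* this quantity to shed
    non-predictive feature content that backdoor triggers could exploit."""
    return sum((a - b) ** 2 for a, b in zip(prompted_feats, frozen_feats))

def rvpt_objective(ce_loss, prompted_feats, frozen_feats, repel_weight=1.0):
    """Minimize the downstream cross-entropy while repelling deep features,
    so only features predictive for the downstream task are retained.
    Only the visual prompts are trainable; the backbone stays frozen."""
    return ce_loss - repel_weight * repel_term(prompted_feats, frozen_feats)
```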
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization
Zaiyan Xu · Sushil Vemuri · Kishan Panaganti · Dileep Kalathil · Rahul Jain · Deepak Ramachandran
A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback–Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
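For intuition on the KL-robust variant, the worst-case expected loss over preference distributions within a KL ball admits the standard distributionally robust optimization dual: for any temperature $\tau > 0$, it is upper-bounded by $\tau \log \mathbb{E}[e^{\ell/\tau}] + \tau\rho$. A sketch of that bound (the loss values are hypothetical, and KLDPO's actual tractable approximation may differ):

```python
import math

def kl_robust_loss(per_sample_losses, tau=1.0, rho=0.1):
    """Upper bound on the worst-case expected loss over all distributions
    within KL divergence rho of the empirical one, via the DRO duality:
    tau * log E[exp(loss / tau)] + tau * rho. Smaller tau weights hard
    examples more aggressively."""
    n = len(per_sample_losses)
    log_mean_exp = math.log(sum(math.exp(l / tau) for l in per_sample_losses) / n)
    return tau * log_mean_exp + tau * rho

losses = [0.2, 0.4, 1.5, 0.3]  # hypothetical per-pair DPO losses
robust = kl_robust_loss(losses)
nominal = sum(losses) / len(losses)
```

By Jensen's inequality the robust bound always dominates the nominal average loss, which is what makes minimizing it a conservative hedge against preference distribution shift.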
From Faults to Features: Pretraining to Learn Robust Representations against Sensor Failures
Jens U. Brandt · Noah Christoph Pütz · Marcus Greiff · Thomas Lew · John Subosits · Marc Hilbert · Thomas Bartz-Beielstein
Machine learning models play a key role in safety-critical applications, such as autonomous vehicles and advanced driver assistance systems, where their robustness during inference is essential to ensure reliable operation. Sensor faults, however, can corrupt input signals, potentially leading to severe model failures that compromise reliability. In this context, pretraining emerges as a powerful approach for learning expressive representations applicable to various downstream tasks. Among existing techniques, masking represents a promising direction for learning representations that are robust to corrupted input data. In this work, we extend this concept by specifically targeting robustness to sensor outages during pretraining. We propose a self-supervised masking scheme that simulates common sensor failures and explicitly trains the model to recover the original signal. We demonstrate that the resulting representations significantly improve the robustness of predictions to seen and unseen sensor failures on a vehicle dynamics dataset, maintaining strong downstream performance under both nominal and various fault conditions. As a practical application, we deploy the method on a modified Lexus LC 500 and show that the pretrained model successfully operates as a substitute for a physical sensor in a closed-loop control system. In this autonomous racing application, a supervised baseline trained without sensor failures may cause the vehicle to leave the track. In contrast, a model trained using the proposed masking scheme enables reliable racing performance in the presence of sensor failures.
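The masking scheme can be sketched as a complete dropout of one sensor channel over a contiguous span, with a reconstruction loss on the masked entries (the specific fault model and span length here are assumptions; the paper may simulate other failure types as well, e.g. stuck-at or drift faults):

```python
import random

def simulate_outage(signal, rng, max_len=20):
    """Zero out a contiguous span of one sensor channel, returning the
    corrupted signal and a mask marking which entries the model must
    reconstruct. `signal` is a list of per-timestep channel lists."""
    T, C = len(signal), len(signal[0])
    ch = rng.randrange(C)
    start = rng.randrange(T)
    length = rng.randint(1, max_len)
    corrupted = [row[:] for row in signal]
    mask = [[False] * C for _ in range(T)]
    for t in range(start, min(start + length, T)):
        corrupted[t][ch] = 0.0
        mask[t][ch] = True
    return corrupted, mask

def reconstruction_loss(pred, target, mask):
    """Self-supervised objective: recover the original signal at the
    masked (simulated-outage) positions."""
    return sum((pred[t][c] - target[t][c]) ** 2
               for t in range(len(mask)) for c in range(len(mask[0]))
               if mask[t][c])

rng = random.Random(0)
clean = [[1.0, 2.0, 3.0] for _ in range(50)]
corrupted, mask = simulate_outage(clean, rng)
```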
Dynamical Low-Rank Compression of Neural Networks with Robustness under Adversarial Attacks
Steffen Schotthöfer · Lexie Yang · Stefan Schnake
Deployment of neural networks on resource-constrained devices demands models that are both compact and robust to adversarial inputs. However, compression and adversarial robustness often conflict. In this work, we introduce a dynamical low-rank training scheme enhanced with a novel spectral regularizer that controls the condition number of the low-rank core in each layer. This approach mitigates the sensitivity of compressed models to adversarial perturbations without sacrificing clean accuracy. The method is model- and data-agnostic, computationally efficient, and supports rank adaptivity to automatically compress the network at hand. Extensive experiments across standard architectures, datasets, and adversarial attacks show that the regularized networks can achieve over 94\% compression while recovering or improving adversarial accuracy relative to uncompressed baselines.
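The quantity the regularizer controls can be computed from the singular values of each layer's low-rank core (the penalty form below is illustrative; the paper's regularizer is designed to be compatible with dynamical low-rank training):

```python
import numpy as np

def condition_number(core, tol=1e-12):
    """Condition number of a layer's low-rank core via its singular values.
    The spectral regularizer penalizes large values: an ill-conditioned
    core amplifies adversarial input perturbations along some directions."""
    s = np.linalg.svd(core, compute_uv=False)  # sorted descending
    return s[0] / max(s[-1], tol)

well_conditioned = np.eye(4)                    # condition number 1
ill_conditioned = np.diag([1.0, 1.0, 1.0, 1e-3])  # condition number 1000
```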
Impact of Layer Norm on Memorization and Generalization in Transformers
Rishi Singhal · Jung-Eun Kim
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further identify that early-layer LayerNorm is the most critical, compared to middle and later layers, and that its influence varies between Pre- and Post-LayerNorm models. We validate these findings on 13 models across 6 vision and language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
TransferBench: Benchmarking Ensemble-based Black-box Transfer Attacks
Fabio Brau · Maura Pintor · Antonio Cinà · Raffaele Mura · Luca Scionis · Luca Oneto · Fabio Roli · Battista Biggio
Ensemble-based black-box transfer attacks optimize adversarial examples on a set of surrogate models, claiming to reach high success rates by querying the (unknown) target model only a few times. In this work, we show that prior evaluations are systematically biased, as such methods are tested only under overly optimistic scenarios, without considering (i) how the choice of surrogate models influences transferability, (ii) how they perform against robust target models, and (iii) whether querying the target to refine the attack is really required. To address these gaps, we introduce TransferBench, a framework for evaluating ensemble-based black-box transfer attacks under more realistic and challenging scenarios than prior work. Our framework considers 17 distinct settings on CIFAR-10 and ImageNet, including diverse surrogate-target combinations, robust targets, and comparisons to baseline methods that do not use any query-based refinement mechanism. Our findings reveal that existing methods fail to generalize to more challenging scenarios, and that query-based refinement offers little to no benefit, contradicting prior claims. These results highlight that building reliable and query-efficient black-box transfer attacks remains an open challenge. We release our benchmark and evaluation code at: https://github.com/pralab/transfer-bench.
Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding
Sunwoo Kim · Hyunjin Hwang · Kijung Shin
The performance of a deep learning model on a specific task and dataset depends heavily on its neural architecture, motivating considerable efforts to rapidly and accurately identify architectures suited to the target task and dataset. To achieve this, researchers use machine learning models—typically neural architecture encoders—to predict the performance of a neural architecture. Many state-of-the-art encoders aim to capture information flow within a neural architecture, which reflects how information moves through the forward pass and backpropagation, via a specialized model structure. However, due to their complicated structures, these flow-based encoders are significantly slower to process neural architectures compared to simpler encoders, presenting a notable practical challenge. To address this, we propose FGP, a novel pre-training method for neural architecture encoding that trains an encoder to capture the information flow without requiring specialized model structures. FGP trains an encoder to reconstruct a flow surrogate, our proposed representation of the neural architecture's information flow. Our experiments show that FGP boosts encoder performance by up to 106\% in Precision@1\%, compared to the same encoder trained solely with supervised learning.
Deeper with Riemannian Geometry: Overcoming Oversmoothing and Oversquashing for Graph Foundation Models
Li Sun · Zhenhao Huang · Ming Zhang · Philip S Yu
Message Passing Neural Networks (MPNNs) are the building block of graph foundation models, but fundamentally suffer from oversmoothing and oversquashing. There has recently been a surge of interest in fixing both issues. Existing efforts primarily adopt global approaches, which may be beneficial in some regions but detrimental in others, ultimately leading to suboptimal expressiveness. In this paper, we begin by revisiting oversquashing through a global measure -- spectral gap $\lambda$ -- and prove that the increase of $\lambda$ leads to gradient vanishing with respect to the input features, thereby undermining the effectiveness of message passing. Motivated by such theoretical insights, we propose a local approach that adaptively adjusts message passing based on local structures. To achieve this, we connect local Riemannian geometry with MPNNs, and establish a novel nonhomogeneous boundary condition to address both oversquashing and oversmoothing. Building on the Robin condition, we design a GBN network with local bottleneck adjustment, coupled with theoretical guarantees. Extensive experiments on homophilic and heterophilic graphs show the expressiveness of GBN. Furthermore, GBN does not exhibit performance degradation even when the network depth exceeds $256$ layers.
GD$^2$: Robust Graph Learning under Label Noise via Dual-View Prediction Discrepancy
Kailai Li · Jiong Lou · Jiawei Sun · Honghong Zeng · Wen Li · Chentao Wu · Yuan Luo · Wei Zhao · shouguo du · Jie LI
Graph Neural Networks (GNNs) achieve strong performance in node classification tasks but exhibit substantial performance degradation under label noise. Despite recent advances in noise-robust learning, a principled approach that exploits the node-neighbor interdependencies inherent in graph data for label noise detection remains underexplored. To address this gap, we propose GD$^2$, a noise-aware \underline{G}raph learning framework that detects label noise by leveraging \underline{D}ual-view prediction \underline{D}iscrepancies. The framework contrasts the \textit{ego-view}, constructed from node-specific features, with the \textit{structure-view}, derived through the aggregation of neighboring representations. The resulting discrepancy captures disruptions in semantic coherence between individual node representations and the structural context, enabling effective identification of mislabeled nodes. Building upon this insight, we further introduce a view-specific training strategy that enhances noise detection by amplifying prediction divergence through differentiated view-specific supervision. Extensive experiments on multiple datasets and noise settings demonstrate that GD$^2$ achieves superior performance over state-of-the-art baselines.
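The detection principle can be sketched per node as a divergence between the two views' class distributions (total variation distance and the flagging threshold are illustrative choices; the paper's exact score may differ):

```python
def discrepancy(ego_probs, struct_probs):
    """Dual-view prediction discrepancy for one node: total variation
    distance between the ego-view class distribution (from node-specific
    features) and the structure-view one (from aggregated neighbors)."""
    return 0.5 * sum(abs(p - q) for p, q in zip(ego_probs, struct_probs))

def flag_mislabeled(ego, struct, threshold=0.5):
    """Flag nodes whose two views disagree strongly: a node whose own
    features contradict its structural context is likely mislabeled."""
    return [i for i, (p, q) in enumerate(zip(ego, struct))
            if discrepancy(p, q) > threshold]

# Toy 2-class example: node 1's ego-view contradicts its structure-view.
ego    = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
struct = [[0.85, 0.15], [0.9, 0.1], [0.75, 0.25]]
flags = flag_mislabeled(ego, struct)
```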
MaNGO — Adaptable Graph Network Simulators via Meta-Learning
Philipp Dahlinger · Tai Hoang · Denis Blessing · Niklas Freymuth · Gerhard Neumann
Accurately simulating physics is crucial across scientific domains, with applications spanning from robotics to materials science. While traditional mesh-based simulations are precise, they are often computationally expensive and require knowledge of physical parameters, such as material properties. In contrast, data-driven approaches like Graph Network Simulators (GNSs) offer faster inference but suffer from two key limitations: first, they must be retrained from scratch for even minor variations in physical parameters, and second, they require labor-intensive data collection for each new parameter setting. This is inefficient, as simulations with varying parameters often share a common underlying latent structure. In this work, we address these challenges by learning this shared structure through meta-learning, enabling fast adaptation to new physical parameters without retraining. To this end, we propose a novel architecture that generates a latent representation by encoding graph trajectories using conditional neural processes (CNPs). To mitigate error accumulation over time, we combine CNPs with a novel neural operator architecture. We validate our approach, Meta Neural Graph Operator (MaNGO), on several dynamics prediction tasks with varying material properties, demonstrating superior performance over existing GNS methods. Notably, MaNGO achieves accuracy on unseen material properties close to that of an oracle model.
Rethinking Tokenized Graph Transformers for Node Classification
Jinsong Chen · Chenyang Li · Gaichao Li · John Hopcroft · Kun He
Node tokenized graph Transformers (GTs) have shown promising performance in node classification. The generation of token sequences is the key module in existing tokenized GTs which transforms the input graph into token sequences, facilitating the node representation learning via Transformer. In this paper, we observe that the generations of token sequences in existing GTs only focus on the first-order neighbors on the constructed similarity graphs, which leads to the limited usage of nodes to generate diverse token sequences, further restricting the potential of tokenized GTs for node classification. To this end, we propose a new method termed SwapGT. SwapGT first introduces a novel token swapping operation based on the characteristics of token sequences that fully leverages the semantic relevance of nodes to generate more informative token sequences. Then, SwapGT leverages a Transformer-based backbone to learn node representations from the generated token sequences. Moreover, SwapGT develops a center alignment loss to constrain the representation learning from multiple token sequences, further enhancing the model performance. Extensive empirical results on various datasets showcase the superiority of SwapGT for node classification. Code is available at https://github.com/JHL-HUST/SwapGT.
What Can RL Bring to VLA Generalization? An Empirical Study
Jijia Liu · Feng Gao · Bingwen Wei · Xinlei Chen · Qingmin Liao · YI WU · Chao Yu · Yu Wang
Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io
Unveiling Transformer Perception by Exploring Input Manifolds
Alessandro Benfenati · Alfio Ferrara · Alessio Marta · Davide Riva · Elisabetta Rocchetti
This paper introduces a general method for the exploration of equivalence classes in the input space of Transformer models. The proposed approach is based on sound mathematical theory which describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using eigendecomposition of the pullback of the distance metric defined on the output space through the Jacobian of the model, we are able to reconstruct equivalence classes in the input space and navigate across them. Our method enables two complementary exploration procedures: the first retrieves input instances that produce the same class probability distribution as the original instance—thus identifying elements within the same equivalence class—while the second discovers instances that yield a different class probability distribution, effectively navigating toward distinct equivalence classes. Finally, we demonstrate how the retrieved instances can be meaningfully interpreted by projecting their embeddings back into a human-readable format.
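The core construction can be illustrated on a toy map: the pullback $J^\top G J$ of the output metric $G$ through the Jacobian $J$ is degenerate along input directions that leave the output unchanged to first order, and those null directions span the tangent space of the input's equivalence class (toy numbers, not the paper's Transformer setting):

```python
import numpy as np

# Toy differentiable map R^3 -> R^2 with Jacobian J at some input point;
# G is the distance metric on the output (class-probability) space.
J = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
G = np.eye(2)

# Pullback of G through the Jacobian: input directions in its null space
# do not change the output to first order, i.e. moving along them stays
# inside the input's equivalence class.
M = J.T @ G @ J
eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order

null_direction = eigvecs[:, 0]  # eigenvector of the (near-)zero eigenvalue
```

Here the third input coordinate is invisible to the map, so the smallest eigenvalue is zero and its eigenvector (the third axis) is exactly the direction one can navigate without changing the class probability distribution.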
Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
Peter Súkeník · Christoph Lampert · Marco Mondelli
The empirical emergence of neural collapse---a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks---has spurred a line of theoretical research aimed at its understanding. However, existing work focuses on data-agnostic models or, when data structure is taken into account, it remains limited to multi-layer perceptrons. Our paper fills both these gaps by analyzing modern architectures in the data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use in the literature even beyond data-agnostic settings. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks
Chaoyue Liu · Han Bi · Like Hui · Xiao Liu
Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread implementation. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing networks with the nonlinear activations enabled and disabled, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of neural tangent kernel (NTK). Furthermore, we show that the network depth (i.e., with more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated with a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and the NTK condition number equals that of the Gram matrix, regardless of the network depth. Due to the close connection between NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient based methods.
Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon
Tongtong Liang · Dan Qiao · Yu-Xiang Wang · Rahul Parhi
We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs---a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions compared to low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities
Mayank Jobanputra · Yana Veitsman · Yash Sarrof · Aleksandra Bakalova · Vera Demberg · Ellie Pavlick · Michael Hahn
Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining by studying a family of retrieval and copying tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain transformer capabilities, but does not overcome fundamental length-generalization limits.
Contrastive Learning with Data Misalignment: Feature Purity, Training Dynamics and Theoretical Generalization Guarantees
Jiawei Sun · Shuai Zhang · Hongkang Li · Meng Wang
Contrastive learning is a powerful framework for learning discriminative representations from image-text pairs. Despite its success, its theoretical foundations, especially when image-text pairs exhibit misalignment, remain underexplored. This paper provides the first theoretical analysis of contrastive learning under data misalignment, proving, through an analysis of the training dynamics, how the ground-truth modality-paired features are amplified while spurious features are suppressed. Specifically, we study two nonlinear encoders trained jointly with a contrastive loss and demonstrate that noisy (or misaligned) data pairs result in mixed representations and degrade the model's generalization ability. In contrast, recaptioning and filtering improve data alignment, which in turn purifies the features learned by neurons and subsequently enhances generalization. Our analysis identifies feature purity as a key factor in the success of contrastive learning and offers insights into how data quality and training procedures impact representation learning and downstream generalization. Theoretical insights are supported by experiments on standard benchmarks.
Agnostic Learning under Targeted Poisoning: Optimal Rates and the Role of Randomness
Bogdan Chornomaz · Yonatan Koren · Shay Moran · Tom Waknine
We study the problem of learning in the presence of an adversary that can corrupt an $\eta$ fraction of the training examples with the goal of causing failure on a specific test point. In the realizable setting, prior work established that the optimal error under such instance-targeted poisoning attacks scales as $\Theta(d\eta)$, where $d$ is the VC dimension of the hypothesis class [Hanneke, Karbasi, Mahmoody, Mehalel, and Moran (NeurIPS 2022)]. In this work, we resolve the corresponding question in the agnostic setting. We show that the optimal excess error is $\widetilde\Theta(\sqrt{d\eta})$, answering one of the main open problems left by Hanneke et al. To achieve this rate, it is necessary to use randomized learners: Hanneke et al. showed that deterministic learners can be forced to suffer error close to $1$ even under small amounts of poisoning. Perhaps surprisingly, our upper bound remains valid even when the learner’s random bits are fully visible to the adversary. In the other direction, our lower bound is stronger than standard PAC-style bounds: instead of tailoring a hard distribution separately for each sample size, we exhibit a single fixed distribution under which the adversary can enforce an excess error of $\Omega(\sqrt{d\eta})$ infinitely often.
The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks
Vittorio Erba · Emanuele Troiani · Lenka Zdeborová · Florent Krzakala
We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.
Identifiability of Deep Polynomial Neural Networks
Konstantin Usevich · Ricardo Borsoi · Clara Dérand · Marianne Clausel
Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability, a key property for ensuring interpretability, remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly compared to the activation degrees. Our proofs are constructive and center on a connection between deep PNNs, low-rank tensor decompositions, and Kruskal-type uniqueness theorems. We also settle an open conjecture on the dimension of PNNs' neurovarieties, and provide new bounds on the activation degrees required for them to reach the expected dimension.
Attention-based clustering
Rodrigo Maulen Soto · Pierre Marion · Claire Boyer
Transformers have emerged as a powerful neural network architecture capable of tackling a wide range of learning tasks. In this work, we provide a theoretical analysis of their ability to automatically extract structure from data in an unsupervised setting. In particular, we demonstrate their suitability for clustering when the input data is generated from a Gaussian mixture model. To this end, we study a simplified two-head attention layer and define a population risk whose minimization with unlabeled data drives the head parameters to align with the true mixture centroids.
Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning
Ke Sun · Yingnan Zhao · Enze Shi · Yafei Wang · Xiaodong Yan · Bei Jiang · Linglong Kong
The remarkable empirical performance of distributional reinforcement learning (RL) has garnered increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of the return distribution beyond its expectation alone, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from the categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify that this uncertainty-aware regularization accounts for the empirical benefits of distributional RL over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.
Spectral Conditioning of Attention Improves Transformer Performance
Hemanth Saratchandran · Simon Lucey
We present a theoretical analysis of the Jacobian of an attention block within a transformer, showing that it is governed by the query, key, and value projections that define the attention mechanism. Leveraging this insight, we introduce a method that systematically alters the spectral properties of each attention layer to reduce the Jacobian’s condition number, thereby improving the overall conditioning of the attention layers within a transformer network. We empirically show that this improved Jacobian conditioning translates to enhanced performance in practice. Our approach is simple, broadly applicable, and can be easily integrated as a drop-in replacement for a wide range of existing attention mechanisms. We validate its effectiveness across diverse transformer architectures and tasks, demonstrating consistent improvements in performance.
Is Grokking a Computational Glass Relaxation?
Xiaotian Zhang · Yue Shang · Entao Yang · Ge Zhang
Understanding neural networks' (NNs') generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window into the underlying mechanisms of NNs' generalizability. Here we propose an interpretation of grokking by framing it as a computational glass relaxation: viewing NNs as a physical system whose parameters are the degrees of freedom and whose train loss is the system energy, we find that the memorization process resembles the rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, while the later generalization resembles a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments on transformers trained on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability, but one that is much more significant here. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer, WanD, based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight-norm evolution towards the Goldilocks zone, and also suggests new potential directions for optimizer design.
Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions
Wenyuan Zhao · Adithya Balachandran · Chao Tian · Paul Liang
The study of multimodality has garnered significant interest in fields where analyzing interactions among multiple information sources can enhance predictive modeling, data fusion, and interpretability. Partial information decomposition (PID) has emerged as a useful information-theoretic framework to quantify the degree to which individual modalities independently, redundantly, or synergistically convey information about a target variable. However, existing PID methods depend on optimizing over a joint distribution constrained by estimated pairwise probability distributions, which are costly and inaccurate for continuous and high-dimensional modalities. Our first key insight is that the problem can be solved efficiently when the pairwise distributions are multivariate Gaussians, and we refer to this problem as Gaussian PID (GPID). We propose a new gradient-based algorithm that substantially enhances computational efficiency for GPID based on an alternative formulation of the underlying optimization problem. To generalize the applicability to non-Gaussian data, we learn information-preserving encoders to transform random variables of arbitrary input distributions into pairwise Gaussian random variables. Along the way, we resolved an open problem regarding the optimality of joint Gaussian solutions for GPID. Empirical validation on diverse synthetic examples demonstrates that our proposed method provides more accurate and efficient PID estimates than existing baselines. We further evaluate on a series of large-scale multimodal benchmarks to show its utility in real-world applications of quantifying PID in multimodal datasets and selecting high-performing models.
Depth-Bounds for Neural Networks via the Braid Arrangement
Moritz Grillo · Christoph Hertrich · Georg Loho
We contribute towards resolving the open question of how many hidden layers are required in ReLU networks for exactly representing all continuous and piecewise linear functions on $\mathbb{R}^d$. While the question has been resolved in special cases, the best known lower bound in general is still 2. We focus on neural networks that are compatible with certain polyhedral complexes, more precisely with the braid fan. For such neural networks, we prove a non-constant lower bound of $\Omega(\log\log d)$ hidden layers required to exactly represent the maximum of $d$ numbers. Additionally, we provide a combinatorial proof that neural networks satisfying this assumption require three hidden layers to compute the maximum of 5 numbers; previously, this had been verified only through extensive computation. Finally, we show that a natural generalization of the best known upper bound to maxout networks is not tight, by demonstrating that a rank-3 maxout layer followed by a rank-2 maxout layer is sufficient to represent the maximum of 7 numbers.
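As background for the upper bounds discussed in this abstract: the identity max(a, b) = a + ReLU(b − a) lets a single hidden ReLU layer compute the maximum of two numbers, and a pairwise tournament over d inputs yields the classical logarithmic-depth upper bound. The sketch below illustrates that standard construction only; it is not the paper's braid-fan analysis or its maxout construction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_max_pair(a, b):
    # max(a, b) = a + ReLU(b - a): one hidden ReLU layer suffices for two numbers.
    return a + relu(b - a)

def relu_max(values):
    # Pairwise tournament: ceil(log2 d) rounds of the two-input gadget,
    # matching the classical O(log d) hidden-layer upper bound for max of d numbers.
    vals = list(values)
    while len(vals) > 1:
        nxt = [relu_max_pair(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2 == 1:
            nxt.append(vals[-1])  # odd element advances to the next round
        vals = nxt
    return vals[0]
```

Each round halves the number of candidates, so the composed network has depth logarithmic in d; the paper's lower bound shows depth cannot be reduced below $\Omega(\log\log d)$ for braid-fan-compatible networks.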
Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models
Hector Pasten · Felipe Urrutia · Hector Orellana · Cristian Buc Calderon · Cristobal Rojas · Alexander Kozachinskiy
Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both phenomena prevent Transformers from learning even simple pattern sequences. Isolation means that any learnable sequence must be isolated from other learnable sequences, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling into that basin collapses towards the learned sequence. Here, we mathematically prove that these phenomena emerge in all Transformers that use compact positional encodings, and design rigorous experiments demonstrating that the theoretical limitations we shed light on occur at practical scale.
Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks
Artur Back de Luca · George Giapitzakis · Kimon Fountoulakis
Neural networks are known for their ability to approximate smooth functions, yet they fail to generalize perfectly to unseen inputs when trained on discrete operations. Such operations lie at the heart of algorithmic tasks such as arithmetic, which is often used as a test bed for algorithmic execution in neural networks. In this work, we ask: can neural networks learn to execute binary-encoded algorithmic instructions exactly? We use the Neural Tangent Kernel (NTK) framework to study the training dynamics of two-layer fully connected networks in the infinite-width limit and show how a sufficiently large ensemble of such models can be trained to execute exactly, with high probability, four fundamental tasks: binary permutations, binary addition, binary multiplication, and Subtract and Branch if Negative (SBN) instructions. Since SBN is Turing-complete, our framework extends to computable functions. We show how this can be efficiently achieved using only logarithmically many training examples. Our approach relies on two techniques: structuring the training data to isolate bit-level rules, and controlling correlations in the NTK regime to align model predictions with the target algorithmic executions.
Emergence and scaling laws in SGD learning of shallow neural networks
Yunwei Ren · Eshaan Nichani · Denny Wu · Jason Lee
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v_p}^{\star}\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I_d})$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k>2$ (defined as the lowest degree in the Hermite expansion), $\lbrace\boldsymbol{v_p}^\star\rbrace_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging extensive-width regime $P\gg 1$ and permit diverging condition number in the second-layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
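To make the teacher model in this abstract concrete, here is a small NumPy sketch: orthonormal signal directions, power-law second-layer coefficients normalized so that $\sum_p a_p^2 = 1$, and an even activation with information exponent $k > 2$. The sizes (d = 64, P = 8, beta = 1) and the choice of the fourth Hermite polynomial as the activation are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, beta = 64, 8, 1.0  # hypothetical sizes for illustration only

# Orthonormal signal directions v_p* : columns of Q from a reduced QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((d, P)))
V = Q.T  # shape (P, d); rows are orthonormal

# Power-law second-layer coefficients a_p proportional to p^{-beta},
# normalized so that sum_p a_p^2 = 1 as the abstract requires.
a = np.arange(1, P + 1, dtype=float) ** -beta
a /= np.linalg.norm(a)

def sigma(z):
    # He_4(z) = z^4 - 6 z^2 + 3: an even function whose Hermite expansion
    # starts at degree 4, i.e. information exponent k = 4 > 2.
    return z**4 - 6 * z**2 + 3

def f_star(x):
    # f*(x) = sum_p a_p * sigma(<x, v_p*>) on isotropic Gaussian inputs x ~ N(0, I_d).
    return a @ sigma(V @ x)

x = rng.standard_normal(d)  # one isotropic Gaussian sample
y = f_star(x)
```

Online SGD on the MSE objective would then draw a fresh pair (x, f*(x)) at every step; the abstract's scaling laws describe how the cumulative loss decays as the student recovers the directions v_p* one timescale at a time.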
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
Ting Han · Linara Adilova · Henning Petzka · Jens Kleesiek · Michael Kamp
Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics. We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization, resembling grokking, even in architectures and datasets where it does not typically occur. Furthermore, we show theoretically that neural collapse leads to relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
Compositional Reasoning with Transformers, RNNs, and Chain of Thought
Gilad Yehudai · Noah Amsel · Joan Bruna
It is understood that different neural network architectures are suited to different tasks, but is there always a single best architecture for a given task? We compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of tasks we term Compositional Reasoning Questions (CRQ). This family captures multi-step problems with tree-like compositional structure, such as evaluating Boolean formulas. We prove that under standard hardness assumptions, *none* of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We then provide constructions for solving CRQs with each architecture. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. For transformers with chain of thought, our construction uses $n$ CoT tokens. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.
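To ground the task class from this abstract, a Compositional Reasoning Question over Boolean formulas can be written as a tree and evaluated recursively; the recursion depth mirrors the compositional depth that, per the results above, each architecture must pay for in depth, embedding dimension, or CoT tokens. The nested-tuple encoding below is an illustrative choice, not the paper's; it sketches the task itself rather than any of the three architectures.

```python
# A CRQ instance as a nested tuple: ("AND", x, ("OR", y, z)), with Boolean leaves.
def eval_formula(node):
    if isinstance(node, bool):  # leaf: a Boolean literal
        return node
    op, *children = node
    vals = [eval_formula(c) for c in children]  # recursion depth = tree depth
    if op == "AND":
        return all(vals)
    if op == "OR":
        return any(vals)
    if op == "NOT":
        return not vals[0]
    raise ValueError(f"unknown operator {op!r}")

formula = ("AND", True, ("OR", False, ("NOT", False)))
result = eval_formula(formula)
```

A balanced formula over n leaves has depth about log n, matching the logarithmic depth (transformers) and logarithmic embedding dimension (RNNs) in the constructions described above.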
Analyzing the Power of Chain of Thought through Memorization Capabilities
Lijia Yu · Xiao-Shan Gao · Lijun Zhang
It has been shown that the chain of thought (CoT) can enhance the power of LLMs to simulate a Turing machine or an algorithm, and in particular their mathematical reasoning abilities. The memorization capability of LLMs is an important aspect of their expressive ability, which offers valuable insight into designing models with enhanced generalization potential. Currently, the optimal memorization capacities of transformers have been established both for general datasets and for datasets satisfying a specific separability condition. However, the question of whether CoT can improve the memorization capability of LLMs remains unexamined. To fill this gap, we establish the memorization capability of fixed-precision autoregressive transformers with or without CoT. Precisely, we first give necessary and sufficient conditions for transformers to memorize a finite language, and then provide upper and lower bounds on the number of parameters of the memorization transformers. Our results indicate that the classes of languages that can be memorized by transformers with or without CoT do not contain each other, and that the same number of parameters is needed for transformers with or without CoT to memorize, implying that CoT does not significantly enhance a transformer’s memorization power. We further show that CoT cannot help transformers memorize certain infinite languages.
Learnable Burst-Encodable Time-of-Flight Imaging for High-Fidelity Long-Distance Depth Sensing
Manchao Bao · Shengjiang Fang · Tao Yue · Xuemei Hu
Long-distance depth imaging holds great promise for applications such as autonomous driving and robotics. Direct time-of-flight (dToF) imaging offers high-precision, long-distance depth sensing, yet demands ultra-short pulse light sources and high-resolution time-to-digital converters. In contrast, indirect time-of-flight (iToF) imaging often suffers from phase wrapping and low signal-to-noise ratio (SNR) as the sensing distance increases. In this paper, we introduce a novel ToF imaging paradigm, termed Burst-Encodable Time-of-Flight (BE-ToF), which facilitates high-fidelity, long-distance depth imaging. Specifically, the BE-ToF system emits light pulses in burst mode and estimates the phase delay of the reflected signal over the entire burst period, thereby effectively avoiding the phase wrapping inherent to conventional iToF systems. Moreover, to address the low SNR caused by light attenuation over increasing distances, we propose an end-to-end learnable framework that jointly optimizes the coding functions and the depth reconstruction network. A specialized double-well function and a first-order difference term are incorporated into the framework to ensure the hardware implementability of the coding functions. The proposed approach is rigorously validated through comprehensive simulations and real-world prototype experiments, demonstrating its effectiveness and practical applicability. The code is available at: https://github.com/ComputationalPerceptionLab/BE-ToF.
Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning
Wenchang Duan · Yaoliang Yu · Jiwan He · Yi Shi
Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
HyperMARL: Adaptive Hypernetworks for Multi-Agent RL
Kale-ab Tessera · Muhammad Arrasy Rahman · Amos Storkey · Stefano Albrecht
Adaptive cooperation in multi-agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross-agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates -- raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent-conditioned hypernetwork can generate agent-specific parameters and decouple observation- and agent-conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non-parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi · Kartik Sharma · David Parkes
Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations.
Scalable Neural Incentive Design with Parameterized Mean-Field Approximation
Nathan Corecco · Batuhan Yardim · Vinzenz Thoma · Zebang Shen · Niao He
Designing incentives for a multi-agent system to induce a desirable Nash equilibrium is both a crucial and challenging problem appearing in many decision-making domains, especially for a large number of agents $N$. Under the exchangeability assumption, we formalize this incentive design (ID) problem as a parameterized mean-field game (PMFG), aiming to reduce complexity via an infinite-population limit. We first show that when dynamics and rewards are Lipschitz, the finite-$N$ ID objective is approximated by the PMFG at rate $\mathcal{O}(\frac{1}{\sqrt{N}})$. Moreover, beyond the Lipschitz-continuous setting, we prove the same $\mathcal{O}(\frac{1}{\sqrt{N}})$ decay for the important special case of sequential auctions, despite discontinuities in dynamics, through a tailored auction-specific analysis. Built on our novel approximation results, we further introduce our Adjoint Mean-Field Incentive Design (AMID) algorithm, which uses explicit differentiation of iterated equilibrium operators to compute gradients efficiently. By uniting approximation bounds with optimization guarantees, AMID delivers a powerful, scalable algorithmic tool for many-agent (large $N$) ID. Across diverse auction settings, the proposed AMID method substantially increases revenue over first-price formats and outperforms existing benchmark methods.
Planning under opponent uncertainty is a fundamental challenge in multi-agent environments, where an agent must act while inferring the hidden policies of its opponents. Existing type-based methods rely on manually defined behavior classes and struggle to scale, while model-free approaches are sample-inefficient and lack a principled way to incorporate uncertainty into planning. We propose Quantized Opponent Models (QOM), which learn a compact catalog of opponent types via a quantized autoencoder and maintain a Bayesian belief over these types online. This posterior supports both a belief-weighted meta-policy and a Monte-Carlo planning algorithm that directly integrates uncertainty, enabling real-time belief updates and focused exploration. Experiments show that QOM achieves superior performance with lower search cost, offering a tractable and effective solution for belief-aware planning.
ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
Jiawen Yu · Hairuo Liu · Qiaojun Yu · Jieji Ren · Ce Hao · Haitong Ding · Guangyu Huang · Guofan Huang · Yan Song · Panpan Cai · Wenqiang Zhang · Cewu Lu
Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose \textbf{ForceVLA}, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces \textbf{FVLMoE}, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce \textbf{ForceVLA-Data}, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2\% over strong $\pi_0$-based baselines, achieving up to 80\% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025/.
Spectral Learning for Infinite-Horizon Average-Reward POMDPs
Alessio Russo · Alberto Maria Metelli · Marcello Restelli
We address the learning problem in the context of infinite-horizon average-reward POMDPs. Traditionally, this problem has been approached using $\textit{Spectral Decomposition}$ (SD) methods applied to samples collected under non-adaptive policies, such as uniform or round-robin policies. Recently, SD techniques have been extended to accommodate a restricted class of adaptive policies such as $\textit{memoryless policies}$. However, the use of adaptive policies has introduced challenges related to data inefficiency, as SD methods typically require all samples to be drawn from a single policy. In this work, we propose $\texttt{Mixed Spectral Estimation}$, which generalizes spectral estimation techniques to support a broader class of $\textit{belief-based policies}$. We solve the open question of whether spectral methods can be applied to samples collected from multiple policies, and we provide finite-sample guarantees for our approach under standard observability and ergodicity assumptions. Building on this data-efficient estimation method, we introduce the $\texttt{Mixed Spectral UCRL}$ algorithm. Through a refined theoretical analysis, we demonstrate that it achieves a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$ when compared to the optimal policy, without requiring full knowledge of either the transition or the observation model. Finally, we present numerical simulations that validate the theoretical analysis of both the proposed estimation procedure and the $\texttt{Mixed Spectral UCRL}$ algorithm.
Learning Interactive World Model for Object-Centric Reinforcement Learning
Fan Feng · Phillip Lippe · Sara Magliacane
Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
SEGA: Shaping Semantic Geometry for Robust Hashing under Noisy Supervision
Yiyang Gu · Bohan Wu · Qinghua Ran · Rong-Cheng Tu · Xiao Luo · Zhiping Xiao · Wei Ju · Dacheng Tao · Ming Zhang
This paper studies the problem of learning hash codes from noisy supervision, a practical yet challenging task that is important in many real-world applications such as image retrieval and cross-modal retrieval. However, most existing methods address this problem through label denoising while ignoring the geometric structure of the hash space, which is critical for learning stable hash codes. Towards this end, this paper proposes a novel framework named Semantic Geometry Shaping (SEGA) that explicitly refines the semantic geometry of the hash space. Specifically, we first learn dynamic class prototypes as semantic anchors and cluster hash embeddings around these prototypes to maintain structural stability. We then leverage both the energy of predicted distributions and structure-based divergence to estimate the uncertainty of instances and calibrate the supervision in a soft manner. Moreover, we introduce structure-aware interpolation to improve the class boundaries. To verify the effectiveness of our design, we provide a theoretical analysis of the proposed framework. Experiments on a range of widely-used retrieval datasets demonstrate the superiority of SEGA over a broad set of strong baselines under noisy supervision.
HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models
Yu Zhou · Xingyu Wu · Jibin Wu · Liang Feng · KC Tan
Model merging is a technique that combines multiple large pretrained models into a single model, enhancing performance and broadening task adaptability without original data or additional training. However, most existing model merging methods focus primarily on exploring the parameter space, merging models with identical architectures. Despite its potential, merging in the architecture space remains in its early stages due to the vast search space and challenges related to layer compatibility. This paper designs a hierarchical model merging framework named HM3, formulating a bilevel multi-objective model merging problem across both parameter and architecture spaces. At the parameter level, HM3 integrates existing merging methods to quickly identify optimal parameters. Based on these, an actor-critic strategy with efficient policy discretization is employed at the architecture level to explore inference paths with Markov property in the layer-granularity search space for reconstructing these optimal models. By training reusable policy and value networks, HM3 learns Pareto optimal models to provide customized solutions for various tasks. Experimental results on language and vision tasks demonstrate that HM3 outperforms methods focusing solely on the parameter or architecture space.
Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning
Xueqi Ma · Jun Wang · Yanbei Jiang · Sarah Erfani · Tongliang Liu · James Bailey
Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. Understanding these mechanisms is crucial to improve their reasoning abilities. Drawing inspiration from the interplay between neural processes and human cognition, we propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads, which are key components of LLMs. We introduce CogQA, a dataset that decomposes complex questions into step-by-step subquestions with a chain-of-thought design, each associated with specific cognitive functions such as retrieval or logical reasoning. By applying a multi-label probing method, we identify the attention heads responsible for these functions. Our analysis across multiple LLM families reveals that attention heads exhibit functional specialization; we characterize these specialized heads as cognitive heads. Cognitive heads exhibit several key properties: they are universally sparse, vary in number and distribution across different cognitive functions, and display interactive and hierarchical structures. We further show that cognitive heads play a vital role in reasoning tasks—removing them leads to performance degradation, while augmenting them enhances reasoning accuracy. These insights offer a deeper understanding of LLM reasoning and suggest important implications for model design, training and fine-tuning strategies.
Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens
Xixian Yong · Xiao Zhou · Yingying Zhang · Jinlin Li · Yefeng Zheng · Xian Wu
The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics—InfoBias and InfoGain—to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-efficiency in large language model deployment.
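The entropy-halting idea behind Adaptive Think can be illustrated with a minimal sketch: compute the Shannon entropy of each reasoning step's next-token distribution and stop once it falls below a threshold. The function names and the threshold value below are illustrative assumptions for exposition, not the paper's exact criterion.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_think(step_distributions, threshold=0.5):
    """Halt reasoning once per-step entropy drops below `threshold`.

    step_distributions: one next-token distribution per reasoning step
    (a hypothetical stand-in for model logits after softmax).
    Returns the number of steps actually executed before halting.
    """
    for i, probs in enumerate(step_distributions, start=1):
        if entropy(probs) < threshold:
            return i  # confidence is high enough: stop thinking
    return len(step_distributions)
```

A uniform distribution (maximal uncertainty) keeps the loop running, while a sharply peaked one triggers an early stop, trading a small amount of deliberation for a large reduction in generated tokens.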
On the Loss of Context Awareness in General Instruction Fine-tuning
Yihan Wang · Andrew Bai · Nanyun Peng · Cho-Jui Hsieh
Pre-trained Large Language Models (LLMs) require post-training methods such as supervised fine-tuning (SFT) on instruction-response pairs to enable instruction following. However, this process can cause forgetting in capabilities learned during pre-training. In this paper, we investigate the loss of context awareness after SFT, where context awareness is defined as the ability to extract and understand information from user-provided context. Surprisingly, we discovered that the loss of context awareness occurs in instruction fine-tuned LLMs when the chat template is applied to input prompts. We identify that the performance decline is associated with a bias toward different roles learned during conversational instruction fine-tuning. The bias can be traced to training samples where the assistant response minimally relies on the user-provided instruction. Based on these observations, we propose a metric to identify context-dependent examples from general instruction fine-tuning datasets. We then apply conditional instruction fine-tuning with a context-dependency indicator, enabling the model to preserve context awareness after SFT. Experiments on four context-dependent downstream tasks and three pre-trained LLMs of different sizes show that our method effectively mitigates the loss of context awareness without compromising general instruction-following capabilities.
LLM Layers Immediately Correct Each Other
Arjun Patrawala · Jiahai Feng · Erik Jones · Jacob Steinhardt
Recent methods in language model interpretability employ techniques such as sparse autoencoders to decompose residual stream contributions into linear, semantically-meaningful features. Our work demonstrates that an underlying assumption of these methods—that residual stream contributions build additively upon each other—is insufficient to fully explain model behavior. Specifically, we identify the Transformer Layer Correction Mechanism (TLCM), wherein adjacent transformer layers systematically counteract each other's contributions to the residual stream. TLCM appears in 5 out of 7 major open-source model families and activates across nearly all tokens in diverse texts. To understand TLCM, we show that it emerges during pretraining, operates most strongly on punctuation and numbers, and adaptively calibrates its correction strength based on the preceding layer's output. We further show that TLCM actively corrects a small subspace and promotes other subspaces, different from standard model behavior. We advance the ``propose-and-reject'' hypothesis: layers may propose multiple candidate features, while subsequent layers selectively filter out inappropriate ones. Finally, we discuss how our findings help explain three persistent challenges in feature-based interpretability: why extracted feature descriptions often suffer from low specificity; why feature-based interventions for model steering fail at low magnitudes; and why recent work finds cross-layer transcoders outperform SAEs.
Less is More: Improving LLM Alignment via Preference Data Selection
Xun Deng · Han Zhong · Rui Ai · Fuli Feng · Zheng Wang · Xiangnan He
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.
RAGRouter: Learning to Route Queries to Multiple Retrieval-Augmented Language Models
Jiarui Zhang · Xiangyu Liu · Yong Hu · Chaoyue Niu · Fan Wu · Guihai Chen
Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings, covering open and closed-source LLMs, show that RAGRouter outperforms the best individual LLM and existing routing methods. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints. The code and data are available at https://github.com/OwwO99/RAGRouter.
RidgeLoRA: Matrix Ridge Enhanced Low-Rank Adaptation of Large Language Models
Junda Zhu · Jun Ai · Yujun Li · Yichun Yin · Yasheng Wang · Lifeng Shang · Qun Liu
As one of the state-of-the-art parameter-efficient fine-tuning~(PEFT) methods, Low-Rank Adaptation (LoRA) enables model optimization with reduced computational cost through trainable low-rank matrices. However, its low-rank nature tends to reduce representation ability, leading to suboptimal performance. To overcome this limitation, we propose RidgeLoRA, a lightweight LoRA-style architecture that combines a novel structure with matrix-ridge-enhanced full-rank approximation to match the performance of full-rank training, while eliminating the need for high memory and a large number of parameters to restore the rank of matrices. We provide a rigorous mathematical derivation proving that RidgeLoRA has a better representation upper bound than vanilla LoRA. Furthermore, extensive experiments across multiple domains demonstrate that RidgeLoRA achieves better performance than other LoRA variants, and can even match or surpass full-rank training.
The Leaderboard Illusion
Shivalika Singh · Yiyang Nan · Alex Wang · Daniel Dsouza · Sayash Kapoor · Ahmet Üstün · Sanmi Koyejo · Yuntian Deng · Shayne Longpre · Noah Smith · Beyza Ermis · Marzieh Fadaee · Sara Hooker
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we found one provider testing 27 private variants before making one model public at the second position on the leaderboard. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. The top two providers have individually received an estimated 19.2% and 20.4% of all data on the arena. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. With conservative estimates, we show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on ArenaHard, a test set from the arena distribution. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets
Shaocong Ma · Heng Huang
In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. However, during deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. This mismatch between training and deployment environments can significantly degrade performance. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties, but typically rely on symmetric structures that fail to capture the directional nature of market impact. To address this issue, we develop a novel class of elliptic uncertainty sets. We establish both implicit and explicit closed-form solutions for the worst-case uncertainty under these sets, enabling efficient and tractable robust policy evaluation. Experiments on single-asset and multi-asset trading tasks demonstrate that our method achieves superior Sharpe ratio and remains robust under increasing trade volumes, offering a more faithful and scalable approach to RL in financial markets.
Linearization Explains Fine-Tuning in Large Language Models
Zahra Rahimi Afzal · Tara Esmaeilbeig · Mojtaba Soltanalian · Mesrob I Ohannessian
Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using an $\ell_2$-distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully linear and the linearized fine-tuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning. We empirically validate our theory on Low Rank Adaptation (LoRA) on LLMs. These insights not only characterize fine-tuning but also have the potential to enhance PEFT techniques, paving the way to better informed and more nimble adaptation in LLMs.
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks
Daniel Kunin · Giovanni Luca Marchetti · Feng Chen · Dhruva Karkada · James Simon · Michael Deweese · Surya Ganguli · Nina Miolane
What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant, corresponding to an initialization at the origin. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.
Progress Reward Model for Reinforcement Learning via Large Language Models
Xiuhui Zhang · Ning Gao · Xingyu Jiang · Yihui Chen · Yuheng Pan · Mohan Zhang · Yue Deng
Traditional reinforcement learning (RL) algorithms face significant limitations in handling long-term tasks with sparse rewards. Recent advancements have leveraged large language models (LLMs) to enhance RL by utilizing their world knowledge for task planning and reward generation. However, planning-based approaches often depend on pre-defined skill libraries and fail to optimize low-level control policies, while reward-based methods require extensive human feedback or exhaustive searching due to the complexity of tasks. In this paper, we propose the Progress Reward Model for RL (PRM4RL), a novel framework that integrates task planning and dense reward to enhance RL. For high-level planning, a complex task is decomposed into a series of simple manageable subtasks, with a subtask-oriented, fine-grained progress function designed to monitor task execution progress. For low-level reward generation, inspired by potential-based reward shaping, we use the progress function to construct a Progress Reward Model (PRM), providing theoretically grounded optimality and convergence guarantees, thereby enabling effective policy optimization. Experimental results on robotics control tasks demonstrate that our approach outperforms both LLM-based planning and reward methods, achieving state-of-the-art performance.
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Wei Shen · Guanlin Liu · Yu Yue · Ruofei Zhu · Qingping Yang · Chao Xin · Lin Yan
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences and values. While recent research has primarily focused on algorithmic advancements—such as reducing computational overhead or strengthening reward models to mitigate reward hacking—the critical role of prompt-data construction and its scalability has received comparatively less attention. In this paper, we address this gap by systematically exploring data-driven bottlenecks that currently hinder RLHF performance scaling, focusing specifically on the challenges posed by reward hacking and decreasing response diversity. To mitigate reward hacking, we introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM). This approach not only exhibits enhanced resistance to reward hacking, but also enables accurate assessment of responses against clearly defined ground-truth solutions. Additionally, in order to ensure response diversity and enhance learning effectiveness, we propose a novel prompt-selection method named \textbf{Pre-PPO}, explicitly identifying training prompts that are inherently challenging and thus less prone to reward hacking. Furthermore, we find that \textbf{prioritizing mathematical and coding tasks during the early phases of RLHF training} significantly boosts performance, given that these tasks naturally encode fine-grained response distinctions and possess clearly defined ground truths. Through comprehensive experiments conducted across two model sizes, we validate the effectiveness and scalability of our proposed methods. Results show that RTV exhibits the strongest resistance to reward hacking, followed by GenRM with ground truth, and finally GenRM relying on SFT Best-of-N responses. Moreover, our proposed strategies enable the model to rapidly capture subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. 
This work underscores the importance of careful data construction and provides practical methodologies to overcome critical performance barriers in RLHF.
Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Christian Walder · Deep Tejas Karkhanis
Reinforcement Learning algorithms commonly sample multiple ($n>1$) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes individual sample performance over the diversity and collective utility of a set of samples. Such algorithms under-utilize the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-$k$ Policy Optimization (PKPO), a multivariate transformation on batches of rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that feature a large maximum reward when considered jointly. Our primary contribution is to derive novel low variance unbiased estimators for the pass@k and its gradient, in both the binary and continuous reward settings. We show that optimizing with these estimators reduces to reinforcement learning with (batches of) rewards that have been jointly transformed by a function that is stable and efficient to compute. While previous efforts propose transformations for $k=n$, our transformations are the first to enable robust optimization of the pass@k for any arbitrary $k \leq n$. Rather than simply trading off pass@1 performance for pass@k gains, our method allows annealing $k$ during training, optimizing both metrics and often achieving strong pass@1 performance alongside significant pass@k gains. We validate our transformations on illustrative toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source models Gemma and Llama. We find that our transformation effectively optimizes for the target $k$. Furthermore, higher $k$ values enable solving more and harder problems, while annealing $k$ boosts both the pass@1 and pass@k.
Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely by improving exploration through the prioritization of joint utility over the utility of individual samples.
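For intuition, the binary-reward case admits the standard combinatorial unbiased pass@k estimator from $n$ samples with $c$ successes (well known from code-generation evaluation). A minimal sketch follows; the paper's low-variance gradient estimators and continuous-reward transformations are not reproduced here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator of pass@k from n samples with c successes
    (binary-reward case):

        pass@k = 1 - C(n - c, k) / C(n, k)

    i.e. one minus the probability that a uniformly random size-k
    subset of the n samples contains no success.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with $n=4$ samples of which $c=1$ is correct, the estimated pass@2 is $1 - \binom{3}{2}/\binom{4}{2} = 0.5$.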
Reinforcement Learning with Action Chunking
Qiyang Li · Zhiyuan (Paul) Zhou · Sergey Levine
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a *chunked* action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
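The unbiased $n$-step backup over an executed chunk can be sketched as a discounted sum of the chunk's rewards plus a bootstrapped value at the chunk boundary. The function name and signature below are illustrative assumptions, not the paper's implementation.

```python
def chunked_n_step_target(rewards, gamma, q_next):
    """n-step TD target over an executed action chunk:

        G = sum_t gamma^t * r_t + gamma^n * Q(s_n, next_chunk)

    rewards: the n per-step rewards collected while executing the chunk;
    q_next:  the critic's value at the state reached after the chunk.
    Because the whole chunk was actually executed by the current policy,
    this n-step return introduces no off-policy bias.
    """
    n = len(rewards)
    g = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return g + (gamma ** n) * q_next
```

With a chunk of length 2, rewards `[1.0, 1.0]`, discount `0.9`, and bootstrap value `10.0`, the target is `1 + 0.9 + 0.81 * 10 = 10.0`.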
RF-Agent: Automated Reward Function Design via Language Agent Tree Search
Ning Gao · Xiuhui Zhang · Xingyu Jiang · Mukang You · Mohan Zhang · Yue Deng
Designing efficient reward functions for low-level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF-Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision-making process, enhancing optimization through better contextual reasoning. RF-Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi-stage contextual reasoning ability of LLMs. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Experimental results on 17 diverse low-level control tasks demonstrate the effectiveness of our method.
Flat Channels to Infinity in Neural Loss Landscapes
Flavio Martinelli · Alexander van Meegen · Berfin Simsek · Wulfram Gerstner · Johanni Brea
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow c \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of this quasi-flat region in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
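One way to see the limit stated above is through an illustrative parametrization of the channel (an assumption for exposition, not necessarily the paper's exact curve): set $a_i = c + a$, $\mathbf{w_i} = \mathbf{w} + \mathbf{v}/a$, $a_j = -a$, $\mathbf{w_j} = \mathbf{w}$, and Taylor-expand as $a \to \infty$:

```latex
\begin{aligned}
a_i\,\sigma(\mathbf{w_i}\cdot\mathbf{x}) + a_j\,\sigma(\mathbf{w_j}\cdot\mathbf{x})
&= (c+a)\,\sigma\!\big((\mathbf{w}+\tfrac{\mathbf{v}}{a})\cdot\mathbf{x}\big)
   - a\,\sigma(\mathbf{w}\cdot\mathbf{x})\\
&= (c+a)\Big[\sigma(\mathbf{w}\cdot\mathbf{x})
   + \tfrac{\mathbf{v}\cdot\mathbf{x}}{a}\,\sigma'(\mathbf{w}\cdot\mathbf{x})
   + O(a^{-2})\Big] - a\,\sigma(\mathbf{w}\cdot\mathbf{x})\\
&\xrightarrow{\;a\to\infty\;}\;
   c\,\sigma(\mathbf{w}\cdot\mathbf{x})
   + (\mathbf{v}\cdot\mathbf{x})\,\sigma'(\mathbf{w}\cdot\mathbf{x}).
\end{aligned}
```

Along this curve the pair of diverging neurons converges to the gated linear unit in the abstract, while the residual terms, and hence the loss decrease, vanish only at order $1/a$, which is why the channel looks like a flat minimum to an optimizer.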
FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models
Pukang Ye · Luo Junwei · Jiachen Shen · Saipan Zhou · Shangmin Dou · Zhenfu Cao · Hanzhe Yao · Xiaolei Dong · Yunbo Yang
Data duplication within large-scale corpora often degrades large language models' (LLMs) performance and raises privacy concerns. In privacy-concerned federated learning scenarios, conventional deduplication methods typically rely on trusted third parties to perform uniform deletion, risking loss of informative samples while introducing privacy vulnerabilities. To address these gaps, we propose Federated ReWeighting (FedRW), the first privacy-preserving framework, to the best of our knowledge, that performs soft deduplication via sample reweighting instead of deletion in federated LLM training, without assuming a trusted third party. At its core, FedRW introduces a secure, frequency-aware reweighting protocol built on secure multi-party computation, coupled with a parallel orchestration strategy to ensure efficiency and scalability. During training, FedRW utilizes an adaptive reweighting mechanism with global sample frequencies to adjust individual loss contributions, effectively improving generalization and robustness. Empirical results demonstrate that FedRW outperforms the state-of-the-art method by achieving up to $28.78\times$ speedup in preprocessing and approximately $11.42$\% improvement in perplexity, while offering enhanced security guarantees. FedRW thus establishes a new paradigm for managing duplication in federated LLM training.
Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks
Alper KALLE · Théo Rudkiewicz · Mohamed Ouerfelli · Mohamed Tamaazousti
Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory and compute footprint can be reduced by compression. In this work, we focus on compression through tensorization and low-rank representations. Whereas classical approaches search for a low-rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight space, we use data-informed norms that measure the error in function space. Concretely, we minimize the change in the layer's output distribution, which can be expressed as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$, where $\Sigma^{1/2}$ is the square root of the covariance matrix of the layer's input and $W$, $\widetilde{W}$ are the original and compressed weights. We propose new alternating least squares algorithms for the two most common tensor decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post-compression fine-tuning, our data-informed approach often achieves competitive accuracy without any fine-tuning. We further show that the same covariance-based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50 and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, CIFAR-10, and CIFAR-100) confirm the advantages of the proposed method.
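For a plain (non-tensorized) linear layer, the data-informed objective has a closed-form minimizer, which makes the idea easy to see: whiten the weights by $\Sigma^{1/2}$, truncate the SVD (Eckart-Young), and unwhiten. The sketch below illustrates only the norm from the abstract, not the paper's alternating least squares algorithms for Tucker-2 or CPD, and assumes $\Sigma$ is positive definite:

```python
import numpy as np

def data_aware_lowrank(W, Sigma, r):
    """Best rank-r approximation of W under ||(W - What) Sigma^{1/2}||_F."""
    # symmetric square root of the input covariance, and its inverse
    evals, evecs = np.linalg.eigh(Sigma)
    S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    S_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    # Eckart-Young in the whitened space, then map back
    U, s, Vt = np.linalg.svd(W @ S_half, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r] @ S_half_inv

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
# strongly anisotropic inputs: three directions carry most of the signal
X = rng.normal(size=(500, 6)) * np.array([5, 4, 3, 0.1, 0.1, 0.1])
Sigma = X.T @ X / len(X)

What = data_aware_lowrank(W, Sigma, r=3)

def out_err(A):  # error measured in function space, on the layer's inputs
    return np.linalg.norm((W - A) @ X.T)

# plain Frobenius-norm truncation of W, for comparison
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_frob = (U[:, :3] * s[:3]) @ Vt[:3]

print(out_err(What), out_err(W_frob))  # data-aware error is never larger
```

Since $\lVert (W - \widetilde{W}) X^\top \rVert_F^2 = N \lVert (W - \widetilde{W}) \Sigma^{1/2} \rVert_F^2$ for $\Sigma = X^\top X / N$, minimizing the weighted weight-space norm is exactly minimizing the output error on the data.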
EventMG: Efficient Multilevel Mamba-Graph Learning for Spatiotemporal Event Representation
Sheng Wu · Lin Jin · Hui Feng · Bo Hu
Event cameras offer unique advantages in scenarios involving high speed, low light, and high dynamic range, yet their asynchronous and sparse nature poses significant challenges to efficient spatiotemporal representation learning. Specifically, despite notable progress in the field, effectively modeling the full spatiotemporal context, selectively attending to salient dynamic regions, and robustly adapting to the variable density and dynamic nature of event data remain key challenges. Motivated by these challenges, this paper proposes EventMG, a lightweight, efficient, multilevel Mamba-Graph architecture designed for learning high-quality spatiotemporal event representations. EventMG employs a multilevel approach, jointly modeling information at the micro (single event) and macro (event cluster) levels to comprehensively capture the multi-scale characteristics of event data. At the micro-level, it focuses on spatiotemporal details, employing the State Space Model (SSM)-based Mamba to precisely capture long-range dependencies among numerous event nodes. Concurrently, at the macro-level, Component Graphs are introduced to efficiently encode the local semantics and global topology of dense event regions. Furthermore, to better accommodate the dynamic and sparse characteristics of the data, we propose the Spatiotemporal-aware Event Scanning Technology (SEST), integrating the Adaptive Perturbation Network (APN) and Multidirectional Scanning Module (MSM), which substantially enhances the model's ability to perceive and focus on key spatiotemporal patterns. By employing this novel collaborative paradigm, EventMG demonstrates the ability to effectively capture multi-level spatiotemporal characteristics of event data while maintaining a low parameter count and linear computational complexity, suggesting a promising direction for event representation learning.
Mamba Modulation: On the Length Generalization of Mamba Models
Peng Lu · Jerry Huang · QIUHAO Zeng · Xinyu Wang · Boxing Chen · Philippe Langlais · Yufei CUI
The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behavior of its state-space dynamics, particularly within the parameterization of the state transition matrix $A$. Unlike recent works that attribute this sensitivity to the vanishing accumulation of discretization time steps, $\exp(-\sum_{t=1}^N{\Delta}_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $A$, offering a well-founded explanation of its role in length extension. To overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models, enabling robust long-context generalization by selectively modulating the spectrum of the $A$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating ${\Delta}_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
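A one-dimensional toy recurrence shows why the spectrum of $A$, together with the accumulated time steps, governs how fast the state forgets old inputs, and what scaling the spectrum toward zero does. This is intuition only, not Mamba's actual parameterization; the eigenvalues, step sizes, and context length below are our hypothetical choices:

```python
import numpy as np

def ssm_state_norm(lam, deltas):
    """Surviving magnitude of an input injected at t = 0 into a toy
    diagonal SSM with recurrence h_t = exp(delta_t * lam) * h_{t-1}.

    lam < 0 is one eigenvalue of the (diagonal) transition matrix A.
    The contribution decays as exp(lam * sum(delta_t)), so both the
    spectrum of A and the accumulated discretization steps set the
    effective memory length.  (Toy recurrence, not Mamba itself.)
    """
    h = 1.0
    for d in deltas:
        h = np.exp(d * lam) * h   # zero-input evolution of that contribution
    return h

deltas = np.full(2048, 0.05)      # a long context
print(ssm_state_norm(-1.0, deltas))    # original spectrum: exp(-102.4), forgotten
print(ssm_state_norm(-0.01, deltas))   # eigenvalue scaled toward 0: exp(-1.024), retained
```

Rescaling the eigenvalue by 1/100 stretches the memory horizon by the same factor, which is the lever that spectrum scaling pulls on a per-layer basis.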
Aligning Text-to-Image Diffusion Models to Human Preference by Classification
Longquan Dai · Xiaolu Wei · wang he · Shaomeng Wang · Jinhui Tang
Text-to-image diffusion models are typically trained on large-scale web data, often resulting in outputs that misalign with human preferences. Inspired by preference learning in large language models, we propose ABC (Alignment by Classification), a simple yet effective framework for aligning diffusion models with human preferences. In contrast to prior DPO-based methods that depend on suboptimal supervised fine-tuned (SFT) reference models, ABC assumes access to an ideal reference model perfectly aligned with human intent and reformulates alignment as a classification problem. Under this view, we recognize that preference data naturally forms a semi-supervised classification setting. Building on this observation, we propose a data augmentation strategy that transforms preference comparisons into fully supervised training signals. We then introduce a classification-based ABC loss to guide alignment. Our alignment-by-classification approach effectively steers the diffusion model toward the behavior of the ideal reference. Experiments on various diffusion models show that ABC consistently outperforms existing baselines, offering a scalable and robust solution for preference-based text-to-image fine-tuning.
Small Singular Values Matter: A Random Matrix Analysis of Transformer Models
Max Staats · Matthias Thamm · Bernd Rosenow
This work analyzes the singular-value spectra of weight matrices in pretrained transformer models to understand how information is stored at both ends of the spectrum. Using Random Matrix Theory (RMT) as a zero-information hypothesis, we treat agreement with RMT as evidence of randomness and deviations as evidence of learning. Surprisingly, we observe pronounced departures from RMT not only among the largest singular values (the usual outliers) but also among the smallest ones. A comparison of the associated singular vectors with the eigenvectors of the activation covariance matrices shows that there is considerable overlap wherever RMT is violated. Thus, significant directions in the data are captured by the small singular values and their vectors as well as by the large ones. We confirm this empirically: zeroing out the singular values that deviate from RMT raises language-model perplexity far more than removing values from the bulk, and after fine-tuning the smallest decile can be the third most influential part of the spectrum. To explain how vectors linked to small singular values can carry more information than those linked to larger values, we propose a linear random-matrix model. Our findings highlight the overlooked importance of the low end of the spectrum and provide theoretical and practical guidance for SVD-based pruning and compression of large language models.
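The "RMT as null hypothesis" logic can be seen on a synthetic matrix: pure noise fills a predictable singular-value bulk, and a planted (learned) direction escapes it. The sketch below shows the large-edge case only, with our own toy sizes and thresholds; the paper's finding concerns analogous deviations at the small end of the spectrum as well:

```python
import numpy as np

n = 400
rng = np.random.default_rng(0)

# "Pure noise" weights: i.i.d. entries of variance 1/n.  Random matrix
# theory predicts the singular values of such a square matrix fill [0, 2].
noise = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))

# Plant one learned direction: a rank-1 spike of strength theta = 5.
u = rng.normal(size=n); u /= np.linalg.norm(u)
v = rng.normal(size=n); v /= np.linalg.norm(v)
W = noise + 5.0 * np.outer(u, v)

s = np.linalg.svd(W, compute_uv=False)   # sorted descending
bulk_edge = 2.0                          # theoretical edge for this ensemble
outliers = s[s > bulk_edge * 1.05]       # singular values deviating from RMT
print(s[:3])          # top value sits near theta + 1/theta = 5.2, rest below ~2
print(len(outliers))  # exactly the planted direction stands out
```

Agreement with the bulk prediction is the "no information" baseline; any singular value escaping it, at either edge, is a candidate carrier of learned structure.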
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Pengyue Jia · Seongheon Park · Song Gao · Xiangyu Zhao · Sharon Li
Worldwide image geolocalization—the task of predicting GPS coordinates from images taken anywhere on Earth—poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query–candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods. We also release our code, checkpoint, and dataset online for ease of reproduction.
LLM Safety Alignment is Divergence Estimation in Disguise
Rajdeep Haldar · Ziyi Wang · Guang Lin · Yue XING · Qifan Song
We present a theoretical framework showing that popular LLM alignment methods—including RLHF and its variants—can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less-preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance–refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
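The divergence-estimation view is closely related to the classical density-ratio trick: a well-specified classifier separating two distributions learns logits equal to the log density ratio, from which a KL estimate follows. The numpy toy below is our own minimal illustration with 1-D Gaussians of known KL, not the paper's KLDO method or its LLM setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "aligned" vs "unaligned" distributions with known divergence:
# KL( N(1,1) || N(0,1) ) = 1/2.
p = rng.normal(loc=1.0, size=20000)   # aligned samples
q = rng.normal(loc=0.0, size=20000)   # unaligned samples

# Logistic regression p-vs-q.  With balanced classes the optimal logit is
# log p(x)/q(x) = x - 1/2, linear in x, so the model is well-specified.
X = np.concatenate([p, q])
y = np.concatenate([np.ones_like(p), np.zeros_like(q)])
w, b = 0.0, 0.0
for _ in range(2000):                  # full-batch gradient descent
    z = w * X + b
    g = 1.0 / (1.0 + np.exp(-z)) - y   # sigmoid(z) - y
    w -= 0.5 * np.mean(g * X)
    b -= 0.5 * np.mean(g)

kl_est = np.mean(w * p + b)            # E_p[log p/q] via the learned logits
print(w, b)                            # approaches 1.0 and -0.5
print(kl_est)                          # approaches the analytic KL of 0.5
```

The learned logit also induces exactly the kind of latent-space separation between the two sample populations that the abstract describes for aligned models.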
Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
Hongyuan Tao · Ying Zhang · Zhenhao Tang · Hongen Peng · Xukun Zhu · Bingchang Liu · Yingguang Yang · Ziyin Zhang · Zhaogui Xu · Haipeng Zhang · Linchao Zhu · Rui Wang · Hang Yu · Jianguo Li · Peng Di
Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM's attention mechanism and map node attributes to the LLM's input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model. This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%.
Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems
Ibrahim Alabdulmohsin · Xiaohua Zhai
Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time in language and multimodal systems. RINS is a particular form of recursive depth that significantly outperforms more than 55 other variants, including the recent "repeat-all-over" (RAO) strategy in Mobile LLM (Liu et al., 2024) and latent recurrent thinking (Geiping et al., 2025). Unlike prior works, we carry out our comparisons in a compute-matched regime, and demonstrate that for a fixed model size and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. More importantly, with lightweight (linear) adapters (comprising <1% of model parameters) and stochastic dropout, RINS offers a no-regret strategy, meaning that RINS-enabled pretraining improves language modeling performance even when recursive depth is not applied at inference time. This corresponds to improved performance in a training compute-, parameter-, and inference-matched regime, suggesting its potential as a viable component of LLM pretraining.
Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
Lei Wang · Jieming Bian · Letian Zhang · Jie Xu
Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks, but fine-tuning them for domain-specific applications often requires substantial domain-specific data that may be distributed across multiple organizations. Federated Learning (FL) offers a privacy-preserving solution, but faces challenges with computational constraints when applied to LLMs. Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient fine-tuning approach, though a single LoRA module often struggles with heterogeneous data across diverse domains. This paper addresses two critical challenges in federated LoRA fine-tuning: (1) determining the optimal number and allocation of LoRA experts across heterogeneous clients, and (2) enabling clients to selectively utilize these experts based on their specific data characteristics. We propose FedLEASE (Federated adaptive LoRA Expert Allocation and SElection), a novel framework that adaptively clusters clients based on representation similarity to allocate and train domain-specific LoRA experts. It also introduces an adaptive top-$M$ Mixture-of-Experts mechanism that allows each client to select the optimal number of utilized experts. Our extensive experiments on diverse benchmark datasets demonstrate that FedLEASE significantly outperforms existing federated fine-tuning approaches in heterogeneous client settings while maintaining communication efficiency.
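The top-$M$ selection step can be sketched generically: score every LoRA expert with a gate, keep each client's top $M$, renormalize, and sum the low-rank updates. All names, shapes, and the gating form below are our illustrative assumptions; the adaptive parts of FedLEASE (choosing $M$ per client, clustering clients to allocate experts) are not reproduced:

```python
import numpy as np

def top_m_combine(x, experts, gate_w, m):
    """Route input x through its top-m LoRA experts (illustrative sketch).

    `experts` is a list of (A, B) low-rank pairs; the gate scores each
    expert, keeps the top m, renormalizes their softmax weights, and sums
    the LoRA updates B @ A @ x.
    """
    scores = gate_w @ x                    # one score per expert
    top = np.argsort(scores)[-m:]          # indices of the m best experts
    wts = np.exp(scores[top] - scores[top].max())
    wts /= wts.sum()                       # renormalized gate weights
    chosen = [experts[i] for i in top]
    return sum(wi * (B @ (A @ x)) for wi, (A, B) in zip(wts, chosen))

rng = np.random.default_rng(0)
d, r, n_exp = 16, 4, 6
experts = [(rng.normal(size=(r, d)), rng.normal(size=(d, r)))
           for _ in range(n_exp)]          # six rank-4 LoRA experts
gate_w = rng.normal(size=(n_exp, d))
x = rng.normal(size=d)

delta = top_m_combine(x, experts, gate_w, m=2)   # this client uses 2 experts
print(delta.shape)                               # (16,)
```

Letting each client pick its own `m` is what makes the mechanism adaptive to local data heterogeneity.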
Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation
Yifu Luo · Xinhao Hu · Keyu Fan · Haoyuan Sun · Zeyu Chen · Bo Xia · Tiantian Zhang · Yongzhe Chang · Xueqian Wang
Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the Kullback–Leibler constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches.
EchoShot: Multi-Shot Portrait Video Generation
Jiahao Wang · Hualian Sheng · Sijia Cai · Weizhan Zhang · Caixia Yan · Yachuang Feng · Bing Deng · Jieping Ye
Video diffusion models substantially boost the productivity of artistic workflows with high-quality portrait video generative capacity. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications urge for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. To start with, we propose shot-aware position embedding mechanisms within video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model training within multi-shot scenario, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions such as facial attributes, outfits, and dynamic motions. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts. Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling. Project page: https://johnneywang.github.io/EchoShot-webpage.
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
Zheyuan Liu · Munan Ning · Qihui Zhang · Shuo Yang · Zhongrui Wang · Yiwei Yang · Xianzhe Xu · Yibing Song · Weihua Chen · Fan Wang · Li Yuan
Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.
OmniZoom: A Universal Plug-and-Play Paradigm for Cross-Device Smooth Zoom Interpolation
Xiaoan Zhu · Yue Zhao · Tianyang Hu · Jiaming Guo · Yulan Zeng · Renjing Pei · Fenglong Song · Huajun Feng
Dual-camera smartphones suffer from geometric and photometric inconsistencies during zoom transitions, primarily due to disparities in intrinsic/extrinsic parameters and divergent image processing pipelines between the two cameras. Existing interpolation methods struggle to effectively address this issue, constrained by the lack of ground-truth datasets and motion ambiguity in dynamic scenarios. To overcome these challenges, we propose OmniZoom, a universal plug-and-play paradigm for cross-device smooth zoom interpolation. Specifically, we present a novel cross-device virtual data generation method utilizing 3D Gaussian Splatting. This method tackles data scarcity by decoupling geometric features via spatial transition modeling and correcting photometric variations with dynamic color adaptation. It is further enhanced by cross-domain consistency learning for device-agnostic semantic alignment. Additionally, we introduce a plug-and-play 3D-TPR (3D Trajectory Progress Ratio Mapping) framework that surmounts 2D spatial limitations. As components of our framework, a texture-focus strategy is introduced for high-frequency detail preservation, incorporating mask penalty constraints to suppress interpolation artifacts. Our pipeline exhibits broad compatibility with diverse interpolation methods and achieves good performance across multiple public benchmarks. Real-world evaluations on various smartphone platforms also reveal significant quality improvements after fine-tuning on our synthetic data, which underscores the robustness and practical effectiveness of our approach for cross-device zoom applications.
Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
Zechuan Zhang · Ji Xie · Yu Lu · Zongxin Yang · Yi Yang
Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing \textbf{ICEdit}, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) an in-context editing paradigm without architectural modifications; (2) minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1\% of the training data and 1\% of the trainable parameters of previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing.
MJ-Video: Benchmarking and Rewarding Video Generation with Fine-Grained Video Preference
Haibo Tong · Zhaoyang Wang · Zhaorun Chen · Haonian Ji · Shi Qiu · Siwei Han · Kexin Geng · Zhongkai Xue · Yiyang Zhou · Peng Xia · Mingyu Ding · Rafael Rafailov · Chelsea Finn · Huaxiu Yao
Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and generation bias. To address these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark further incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, MJ-VIDEO is able to improve the alignment performance in video generation via preference fine-tuning.
Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
Yunghee Lee · Byeonghyun Pak · Junwha Hong · Hoseong Kim
In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity (∆ImageReward ≤ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.
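The multirate idea behind THG can be shown in miniature: integrate a fast term on every fine step while re-evaluating a slowly varying term (the analogue of the guidance branch) only on a coarse grid, holding it constant in between. The toy scalar ODE, step counts, and coefficients below are our own illustrative choices, not the CFG ODE or the paper's solver:

```python
import math

def multirate_euler(T=10.0, n_fine=2000, coarse_every=10):
    """Toy multirate Euler for x' = -x + g(t), g(t) = 0.3 sin(0.2 t).

    The fast linear term (analogue of the noise estimate) is evaluated on
    every fine step; the slow forcing g (analogue of the guidance term) is
    re-evaluated only every `coarse_every` steps and held in between,
    mimicking THG's tortoise/hare split.
    """
    h = T / n_fine
    x, t = 1.0, 0.0
    g, g_evals = 0.0, 0
    for i in range(n_fine):
        if i % coarse_every == 0:      # hare: coarse-grid update of the slow term
            g = 0.3 * math.sin(0.2 * t)
            g_evals += 1
        x += h * (-x + g)              # tortoise: fine-grid fast dynamics
        t += h
    return x, g_evals

x_multi, evals_multi = multirate_euler(coarse_every=10)   # 200 slow evaluations
x_full,  evals_full  = multirate_euler(coarse_every=1)    # 2000 slow evaluations
print(abs(x_multi - x_full), evals_multi, evals_full)     # tiny gap, 10x fewer evals
```

Because the held forcing differs from the true one by at most its rate of change times the coarse interval, the accuracy loss is bounded and small, which is the same robustness argument THG makes for the guidance branch.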
CamEdit: Continuous Camera Parameter Control for Photorealistic Image Editing
Xinran Qin · Zhixin Wang · Fan Li · Haoyu Chen · Renjing Pei · Wenbo Li · Xiaochun Cao
Recent advances in diffusion models have substantially improved text-driven image editing. However, existing frameworks based on discrete textual tokens struggle to support continuous control over camera parameters and smooth transitions in visual effects. These limitations hinder their applications to realistic, camera-aware, and fine-grained editing tasks. In this paper, we present CamEdit, a diffusion-based framework for photorealistic image editing that enables continuous and semantically meaningful manipulation of common camera parameters such as aperture and shutter speed. CamEdit incorporates a continuous parameter prompting mechanism and a parameter-aware modulation module that guides the model in smoothly adjusting focal plane, aperture, and shutter speed, reflecting the effects of varying camera settings within the diffusion process. To support supervised learning in this setting, we introduce CamEdit50K, a dataset specifically designed for photorealistic image editing with continuous camera parameter settings. It contains over 50k image pairs combining real and synthetic data with dense camera parameter variations across diverse scenes. Extensive experiments demonstrate that CamEdit enables flexible, consistent, and high-fidelity image editing, achieving state-of-the-art performance in camera-aware visual manipulation and fine-grained photographic control.
UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting
Kai He · Ruofan Liang · Jacob Munkberg · Jon Hasselgren · Nandita Vijaykumar · Alexander Keller · Sanja Fidler · Igor Gilitschenski · Zan Gojcic · Zian Wang
We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency. Our project page is https://research.nvidia.com/labs/toronto-ai/UniRelight/.
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
Wenhao Wang · Yi Yang
Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, VideoUFO also features: (1) minimal (0.29\%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips against the specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) the 16 current text-to-video models we evaluate do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on the worst-performing topics. The dataset and code are publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO and https://github.com/WangWenhao0716/BenchUFO under the CC BY 4.0 License.
FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges
Kevin Hayes · Micah Goldblum · Vikash Sehwag · Gowthami Somepalli · Ashwinee Panda · Tom Goldstein
Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare the prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of the scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large), together with annotations from VLMs (Molmo, InternVL3, Pixtral) that are graded by an LLM (Llama3) to test whether the VLMs correctly identify the failure mode in each generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.
vHector and HeisenVec: Scalable Vector Graphics Generation Through Large Language Models
Leonardo Zini · Elia Frigieri · Sebastiano Aloscari · Lorenzo Baraldi
We introduce HeisenVec, a large-scale dataset designed to advance research in vector graphics generation from natural language descriptions. Unlike conventional image generation datasets that focus on raster images, HeisenVec targets the structured and symbolic domain of Scalable Vector Graphics (SVG), where images are represented as sequences of drawing commands and style attributes. The dataset comprises 2.2 million SVGs collected from different online sources, each paired with four complementary textual descriptions generated by multi-modal models. To ensure structural consistency and efficiency for autoregressive modeling, all SVGs are standardized through a pre-processing pipeline that unifies geometric primitives as paths, applies affine transformations, and compresses syntax via a custom token set. HeisenVec exhibits broad coverage across visual styles and sequence lengths, with a substantial portion of samples exceeding 8,000 tokens, making it particularly well-suited for benchmarking long-context language models. Our benchmark enables rigorous evaluation of text-conditioned SVG generation, encourages progress on sequence modeling with symbolic outputs, and bridges the gap between vision, graphics, and language. We release the dataset, tokenization tools, and evaluation pipeline to foster further research in this emerging domain.
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Jingjing Chang · Yixiao Fang · Peng Xing · Shuhan Wu · Wei Cheng · Rui Wang · Xianfang Zeng · Gang Yu · Hai-Bao Chen
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, lacking comprehensive evaluations, especially for text rendering and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show potential in reasoning-driven image generation, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce $\textbf{OneIG-Bench}$, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.
Sekai: A Video Dataset towards World Exploration
Zhen Li · Chuanhao Li · Xiaofeng Mao · Shaoheng Lin · Ming Li · Shitian Zhao · Zhaopan Xu · Xinyue Li · Yukang Feng · Jianwen Sun · Zizhen Li · Fanrui Zhang · Jiaxin Ai · Zhixiang Wang · Yuwei Wu · Tong He · Yunde Jia · Kaipeng Zhang
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset’s scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.
An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations
Seonghwan Park · Jueun Mun · Donghyun Oh · Namhoon Lee
Concept bottleneck models (CBMs) ensure interpretability by decomposing predictions into human-interpretable concepts. Yet the annotations used for training CBMs that enable this transparency are often noisy, and the impact of such corruption is not well understood. In this study, we present the first systematic study of noise in CBMs and show that even moderate corruption simultaneously impairs prediction performance, interpretability, and intervention effectiveness. Our analysis identifies a susceptible subset of concepts whose accuracy declines far more than the average gap between noisy and clean supervision and whose corruption accounts for most performance loss. To mitigate this vulnerability, we propose a two-stage framework. During training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts. During inference, where clean labels are unavailable, we rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility. Theoretical analysis and extensive ablations elucidate why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts, providing a principled basis that preserves both interpretability and resilience in the presence of noise.
Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression
Xiaohui Wang · Peng Ye · Chenyu Huang · Shenghe Zheng · Bo Zhang · LEI BAI · Wanli Ouyang · Tao Chen
With the rise of the pretrain–fine-tune paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned from LLaMA-2 7B and 13B) with up to 50$\times$ compression, (b) general NLP models (RoBERTa-base, T5-base) with up to 224$\times$ compression, (c) vision models (ViT-B/32, ViT-L/14) with up to 132$\times$ compression, and (d) multi-modal models (BEiT-3) with 18$\times$ compression, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression. Code is available at https://github.com/xiaohuiwang000/UltraDelta.
ConTextTab: A Semantics-Aware Tabular In-Context Learner
Marco Spinaci · Marek Polewczyk · Maximilian Schambach · Sam Thelin
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim of combining the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab.
SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models
Emil Biju · Shayan Talaei · Zhemin Huang · Mohammadreza Pourreza · Azalia Mirhoseini · Amin Saberi
Large reasoning models (LRMs) excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought, resulting in long inference times before arriving at the final answer. To address this challenge, we introduce SPRINT, a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization during their reasoning process. SPRINT incorporates an innovative data curation pipeline that reorganizes natural language reasoning trajectories into structured rounds of long-horizon planning and parallel execution. By fine-tuning LRMs on a small amount of such curated data, the models learn to dynamically identify independent subtasks within extended reasoning processes and effectively execute them in parallel. Through extensive evaluations, we demonstrate that models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to 39% fewer sequential tokens on problems requiring more than 8,000 output tokens. Finally, we observe consistent results transferred to two out-of-distribution tasks, namely GPQA and Countdown, with up to 45% and 65% reduction in average sequential tokens respectively for longer reasoning trajectories, while matching the performance of the fine-tuned reasoning model.
REPA Works Until It Doesn’t: Early-Stopped, Holistic Alignment Supercharges Diffusion Training
Ziqiao Wang · Wangbo Zhao · Yuhao Zhou · Zekai Li · Zhiyuan Liang · Mingjia Shi · Xuanlei Zhao · Pengfei Zhou · Kaipeng Zhang · Zhangyang "Atlas" Wang · Kai Wang · Yang You
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy---representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g., DINO)---dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modeling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs a one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration count, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA’s best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, proving to be a simple yet principled recipe for efficient diffusion training across various tasks.
Think before Recommendation: Autonomous Reasoning-enhanced Recommender
Xiaoyu Kong · Junguang Jiang · Bin Liu · Ziru Xu · Han Zhu · Jian Xu · Bo Zheng · Jiancan Wu · Xiang Wang
The core task of recommender systems is to learn user preferences from historical user-item interactions. With the rapid development of large language models (LLMs), recent research has explored leveraging the reasoning capabilities of LLMs to enhance rating prediction tasks. However, existing distillation-based methods suffer from limitations such as the teacher model's insufficient recommendation capability, costly and static supervision, and superficial transfer of reasoning ability. To address these issues, this paper proposes RecZero, a reinforcement learning (RL)-based recommendation paradigm that abandons the traditional multi-model and multi-stage distillation approach. Instead, RecZero trains a single LLM through pure RL to autonomously develop reasoning capabilities for rating prediction. RecZero consists of two key components: (1) "Think-before-Recommendation" prompt construction, which employs a structured reasoning template to guide the model in step-wise analysis of user interests, item features, and user-item compatibility; and (2) rule-based reward modeling, which adopts group relative policy optimization (GRPO) to compute rewards for reasoning trajectories and optimize the LLM. Additionally, the paper explores a hybrid paradigm, RecOne, which combines supervised fine-tuning with RL, initializing the model with cold-start reasoning samples and further optimizing it with RL. Experimental results demonstrate that RecZero and RecOne significantly outperform existing baseline methods on multiple benchmark datasets, validating the superiority of the RL paradigm in achieving autonomous reasoning-enhanced recommender systems.
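The group relative policy optimization (GRPO) reward modeling above normalizes each sampled reasoning trajectory's reward against its own group. A minimal sketch of that advantage computation (illustrative only, not RecZero's full training loop):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: score each sampled trajectory's reward
    against the mean and std of its own group (GRPO's reward shaping, sketched)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

In training, these advantages would weight the policy-gradient update for each reasoning trajectory, so trajectories that beat their group's average rating prediction are reinforced and the rest are suppressed.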
Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
FreqExit: Enabling Early-Exit Inference for Visual Autoregressive Models via Frequency-Aware Guidance
Ying Li · Chengfei Lyu · Huan Wang
Visual AutoRegressive (VAR) modeling employs a next-scale decoding paradigm that progresses from coarse structures to fine details. While enhancing fidelity and scalability, this approach challenges two fundamental assumptions of conventional dynamic inference: semantic stability (intermediate outputs approximating final results) and monotonic locality (smooth representation evolution across layers), which renders existing dynamic inference methods ineffective for VAR models. To address this challenge, we propose FreqExit, a unified training framework that enables dynamic inference in VAR without altering its architecture or compromising output quality. FreqExit is based on a key insight: high-frequency details are crucial for perceptual quality and tend to emerge only in later decoding stages. Leveraging this insight, we design targeted mechanisms that guide the model to learn more effectively through frequency-aware supervision. The proposed framework consists of three components: (1) a curriculum-based supervision strategy with progressive layer dropout and early exit loss; (2) a wavelet-domain high-frequency consistency loss that aligns spectral content across different generation steps; and (3) a lightweight self-supervised frequency-gated module that guides adaptive learning of both structural and detailed spectral components. On ImageNet 256×256, FreqExit achieves up to 2× speedup with only minor degradation, and delivers 1.3× acceleration without perceptible quality loss. This enables runtime-adaptive acceleration within a unified model, offering a favorable trade-off between efficiency and fidelity for practical and flexible deployment.
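A wavelet-domain high-frequency consistency loss of the kind described above can be illustrated with a one-level 2D Haar transform. This is a hedged sketch (hand-rolled Haar subbands and an L1 penalty on the detail bands only), not the FreqExit training code:

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar transform: low-pass band LL plus detail bands (LH, HL, HH)."""
    a = (x[0::2] + x[1::2]) / 2   # row-pair averages (low-pass over rows)
    d = (x[0::2] - x[1::2]) / 2   # row-pair differences (high-pass over rows)
    ll = (a[:, 0::2] + a[:, 1::2]) / 2
    lh = (a[:, 0::2] - a[:, 1::2]) / 2
    hl = (d[:, 0::2] + d[:, 1::2]) / 2
    hh = (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, (lh, hl, hh)

def hf_consistency_loss(pred, target):
    """L1 loss on the high-frequency Haar bands only; low-frequency content is ignored."""
    _, pred_detail = haar2d(pred)
    _, target_detail = haar2d(target)
    return float(sum(np.abs(p - t).mean() for p, t in zip(pred_detail, target_detail)))
```

Because the loss only touches the detail subbands, supervision is concentrated on exactly the high-frequency content that emerges in later decoding stages.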
REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
Weihan Xu · Yimeng Ma · Jingyue Huang · Yang Li · Wenye Ma · Taylor Berg-Kirkpatrick · Julian McAuley · Paul Liang · Hao-Wen Dong
Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Linear Extrapolation
Jiawei Zhang · Ziyuan Liu · Leon Yan · Gen Li · Yuantao Gu
Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLEinverseproblem.
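The linear-combination idea behind LLE can be sketched in a few lines. Here `coeffs` stands in for the learned combination coefficients, which in LLE would be optimized rather than fixed, and the affine normalization is a simplifying assumption:

```python
import numpy as np

def linear_extrapolation(estimates, coeffs):
    """Refine the current estimate as a linear combination of the trajectory's
    current and previous estimates (most recent first)."""
    coeffs = np.asarray(coeffs, dtype=float)
    coeffs = coeffs / coeffs.sum()  # keep the combination affine (a design choice)
    return sum(c * x for c, x in zip(coeffs, estimates))
```

Each denoising step would then call this on the running history of latent estimates, letting learned weights correct the drift that approximations and heuristics introduce into the inverse trajectory.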
ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
Zixun Fang · Kai Zhu · Zhiheng Liu · Yu Liu · Wei Zhai · Yang Cao · Zheng-Jun Zha
Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.
Seg4Diff: Unveiling Open-Vocabulary Semantic Segmentation in Text-to-Image Diffusion Transformers
Chaehyun Kim · Heeseong Shin · Eunbeen Hong · Heeji Yoon · Anurag Arnab · Paul Hongsuck Seo · Sunghwan Hong · Seungryong Kim
Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.
Video World Models with Long-term Spatial Memory
Tong Wu · Shuai Yang · Ryan Po · Yinghao Xu · Ziwei Liu · Dahua Lin · Gordon Wetzstein
Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
Hao Zhang · Chun-Han Yao · Simon Donné · Narendra Ahuja · Varun Jampani
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts --- structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly support different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE with the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL, each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
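A spatial color encoding of this kind can be illustrated as a palette lookup plus nearest-color decoding; the palette below is an assumed example, not the colors used by SP4D:

```python
import numpy as np

def encode_parts(mask, palette):
    """Map an integer part-ID mask (H, W) to a continuous RGB-like image (H, W, 3)."""
    return palette[mask]

def decode_parts(rgb, palette):
    """Recover part IDs by nearest palette color (the straightforward
    post-processing step): argmin over per-pixel color distances."""
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None], axis=-1)
    return dists.argmin(axis=-1)
```

Because the encoded masks live in the same continuous image space as RGB frames, the segmentation branch can reuse the RGB branch's VAE latents, and small decoding errors from the diffusion process are absorbed by the nearest-color snap.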
DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing
Zixiang Li · Haoyu Wang · Wei Wang · Chuangchuang Tan · Yunchao Wei · Yao Zhao
Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process. Our codes are available at: https://github.com/Lzxhh/Dual-Conditional-Inversion
DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing
Chenxi Xie · Minghan Li · Shuai Li · Yuhui Wu · Qiaosi Yi · Lei Zhang
Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timestep, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noises and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Our code, model, and benchmark will be made publicly available.
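A noise-alignment update of this flavor can be sketched under the RF interpolation convention x_t = (1 - t) * x0 + t * noise, whose expected velocity is noise - x0. Here `velocity_fn` stands in for the pretrained rectified-flow model and `step_size` is an assumed hyperparameter, so this is a toy illustration rather than the DNAEdit procedure:

```python
import numpy as np

def dna_step(noise, x0, t, velocity_fn, step_size=0.5):
    """One illustrative noise-alignment update: nudge the Gaussian noise by
    the gap between the predicted and expected velocity fields."""
    x_t = (1 - t) * x0 + t * noise   # direct interpolation at timestep t
    v_expected = noise - x0          # velocity consistent with this noise
    v_pred = velocity_fn(x_t, t)     # pretrained RF model (stubbed here)
    # Move noise toward the value consistent with the prediction (x0 + v_pred).
    return noise + step_size * (v_pred - v_expected)
```

Iterating this update in the noise domain drives the noise toward a fixed point where the model's predicted velocity matches the interpolation velocity, avoiding the step-to-step approximation drift of trajectory-reversal inversion.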
RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation
Boyuan Cao · Jiaxin Ye · Yujie Wei · Hongming Shan
While latent diffusion models (LDMs), such as Stable Diffusion, are designed for high-resolution image generation, they often struggle with significant structural distortions when generating images at resolutions higher than their training one. Instead of relying on extensive retraining, a more resource-efficient approach is to reprogram the pretrained model for high-resolution (HR) image generation; however, existing methods often result in poor image quality and long inference time. We introduce RepLDM, a novel reprogramming framework for pretrained LDMs that enables high-quality, high-efficiency, high-resolution image generation; see Fig. 1. RepLDM consists of two stages: (i) an attention guidance stage, which generates a latent representation of a higher-quality training-resolution image using a novel parameter-free self-attention mechanism to enhance the structural consistency; and (ii) a progressive upsampling stage, which progressively performs upsampling in pixel space to mitigate the severe artifacts caused by latent space upsampling. The effective initialization from the first stage allows for denoising at higher resolutions with significantly fewer steps, improving the efficiency. Extensive experimental results demonstrate that RepLDM significantly outperforms state-of-the-art methods in both quality and efficiency for HR image generation, underscoring its advantages for real-world applications. Codes: https://github.com/kmittle/RepLDM.
When Are Concepts Erased From Diffusion Models?
Kevin Lu · Nicky Kriplani · Rohit Gandikota · Minh Pham · David Bau · Chinmay Hegde · Niv Cohen
In concept erasure, a model is modified to selectively prevent it from generating a target concept. Despite the rapid development of new methods, it remains unclear how thoroughly these approaches remove the target concept from the model. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) interfering with the model’s internal guidance processes, and (ii) reducing the unconditional likelihood of generating the target concept, potentially removing it entirely. To assess whether a concept has been truly erased from the model, we introduce a comprehensive suite of independent probing techniques: supplying visual context, modifying the diffusion trajectory, applying classifier guidance, and analyzing the model's alternative generations that emerge in place of the erased concept. Our results shed light on the value of exploring concept erasure robustness outside of adversarial text inputs, and emphasize the importance of comprehensive evaluations for erasure in diffusion models.
StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
Nedko Savov · Naser Kazemi · Deheng Zhang · Danda Pani Paudel · Xi Wang · Luc V Gool
World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model that represents the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective, preserving both visual detail and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/
PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement
Haitian Zheng · Yuan Yao · yongsheng yu · Yuqian Zhou · Jiebo Luo · Zhe Lin
Latent Diffusion Models (LDMs) have markedly advanced the quality of image inpainting and local editing. However, the inherent latent compression often introduces pixel-level inconsistencies, such as chromatic shifts, texture mismatches, and visible seams along editing boundaries. Existing remedies, including background-conditioned latent decoding and pixel-space harmonization, usually fail to fully eliminate these artifacts in practice and do not generalize well across different latent representations or tasks. We introduce PixPerfect, a pixel‐level refinement framework that delivers seamless, high-fidelity local edits across diverse LDM architectures and tasks. PixPerfect leverages (i) a differentiable discriminative pixel space that amplifies and suppresses subtle color and texture discrepancies, (ii) a comprehensive artifact simulation pipeline that exposes the refiner to realistic local editing artifacts during training, and (iii) a direct pixel-space refinement scheme that ensures broad applicability across diverse latent representations and tasks. Extensive experiments on inpainting, object removal, and insertion benchmarks demonstrate that PixPerfect substantially enhances perceptual fidelity and downstream editing performance, establishing a new standard for robust and high-fidelity localized image editing.
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Theodoros Kouzelis · Efstathios Karypidis · Ioannis Kakogeorgiou · Spyridon Gidaris · Nikos Komodakis
Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.
OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
Yuanhao Cai · HE Zhang · Xi Chen · Jinbo Xing · Yiwei Hu · Yuqian Zhou · Kai Zhang · Zhifei Zhang · Soo Ye Kim · Tianyu Wang · Yulun Zhang · Xiaokang Yang · Zhe Lin · Alan Yuille
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains largely unexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels, as well as control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training scheme with image editing data to enable instructive editing of the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Project page is at https://caiyuanhao1998.github.io/project/OmniVCus/
Accelerating Parallel Diffusion Model Serving with Residual Compression
Jiajun Luo · Yicheng Xiao · Jianru Xu · Yangxiu You · Rongwei Lu · Chen Tang · Jingyan Jiang · Zhi Wang
Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression, which transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4$\times$L20, it achieves $3.0\times$ speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving $6.7\times$ speedup over prior overlap-based methods. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. A portable implementation demonstrated on xDiT is publicly available at https://github.com/Cobalt-27/CompactFusion
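The residual-compression-with-error-feedback idea can be illustrated with a minimal top-k sketch. The `ResidualLink` class, the top-k compressor, and the buffer layout below are illustrative assumptions, not CompactFusion's actual implementation.

```python
import numpy as np

def topk_compress(x, k):
    """Keep the k largest-magnitude entries of x; zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

class ResidualLink:
    """Minimal sketch of residual compression with error feedback.

    Rather than sending each step's full activation, send a compressed
    residual (difference vs. the previous step's activation), carrying any
    compression error into the next step so nothing is permanently lost.
    """
    def __init__(self, shape, k):
        self.prev_act = np.zeros(shape)  # last activation seen by the sender
        self.recon = np.zeros(shape)     # receiver-side reconstruction
        self.err = np.zeros(shape)       # error-feedback buffer
        self.k = k

    def send(self, activation):
        residual = activation - self.prev_act + self.err  # new info + carry
        compressed = topk_compress(residual, self.k)      # goes on the wire
        self.err = residual - compressed                  # remember the loss
        self.prev_act = activation.copy()
        self.recon = self.recon + compressed              # receiver update
        return self.recon
```

Because adjacent diffusion steps produce near-duplicate activations, the residual is small and compresses well, and the error-feedback buffer prevents discarded components from accumulating into drift.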
Image Stitching in Adverse Condition: A Bidirectional-Consistency Learning Framework and Benchmark
Zengxi Zhang · Junchen Ge · Zhiying Jiang · Miao Zhang · Jinyuan Liu
Deep learning-based image stitching methods have achieved promising performance on conventional stitching datasets. However, real-world scenarios may introduce challenges such as complex weather conditions, illumination variations, and dynamic scene motion, which severely degrade image quality and lead to significant misalignment in stitching results. To solve this problem, we propose an adverse condition-tolerant image stitching network, dubbed ACDIS. We first introduce a bidirectional consistency learning framework, which ensures reliable alignment through an iterative optimization paradigm that integrates differentiable image restoration and Gaussian-distribution-encoded homography estimation. Subsequently, we incorporate motion constraints into the seamless composition network to produce robust stitching results without interference from moving scenes. We further propose the first adverse-scene image stitching dataset, which covers diverse parallax and scenes under low-light, haze, and underwater environments. Extensive experiments show that the proposed method can generate visually pleasing stitched images under adverse conditions, outperforming state-of-the-art methods.
GenColor: Generative and Expressive Color Enhancement with Pixel-Perfect Texture Preservation
Yi Dong · Yuxi Wang · Xianhui Lin · Wenqi Ouyang · Zhiqi Shen · Peiran Ren · Ruoxi Fan · Rynson Lau
Color enhancement is a crucial yet challenging task in digital photography. It demands methods that are (i) expressive enough for fine-grained adjustments, (ii) adaptable to diverse inputs, and (iii) able to preserve texture. Existing approaches typically fall short in at least one of these aspects, yielding unsatisfactory results. We propose GenColor, a novel diffusion-based framework for sophisticated, texture-preserving color enhancement. GenColor reframes the task as conditional image generation. Leveraging ControlNet and a tailored training scheme, it learns advanced color transformations that adapt to diverse lighting and content. We train GenColor on ARTISAN, our newly collected large-scale dataset of 1.2M high-quality photographs specifically curated for enhancement tasks. To overcome texture preservation limitations inherent in diffusion models, we introduce a color-transfer network with a novel degradation scheme that simulates texture–color relationships. This network achieves pixel-perfect texture preservation while enabling fine-grained color matching with the diffusion-generated reference images. Extensive experiments show that GenColor produces visually compelling results comparable to those of expert colorists and surpasses state-of-the-art methods in both subjective and objective evaluations. We have released the code and dataset.
InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention
Qiang Xiang · Shuang Sun · Binglei Li · Dejia Song · Huaxia Li · Yibo Chen · Xu Tang · Yao Hu · Junping Zhang
Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaptation to existing DiT-based T2I models through lightweight LoRA modules. Additionally, we propose Denselayout, a comprehensive layout-to-image benchmark containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules. The code and pretrained models are publicly available at \url{https://github.com/FireRedTeam/InstanceAssemble}.
Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting
Kangjie Chen · Yingji Zhong · Zhihao Li · Jiaqi Lin · Youyu Chen · Minghan Qin · Haoqian Wang
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint, with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random Gaussian dropout; (2) multiplicative noise injection into the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
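The CA metric as described, the pixel-wise variance across renderings of the same viewpoint with different random Gaussian subsets, can be sketched as follows. The `render_fn` interface, the number of subsets, and the keep ratio are hypothetical choices for illustration.

```python
import numpy as np

def co_adaptation_score(render_fn, gaussians, n_subsets=8, keep=0.7, seed=0):
    """Sketch of the Co-Adaptation (CA) score: render one viewpoint several
    times, each time with a different random subset of the Gaussians, then
    average the per-pixel variance across those renderings."""
    rng = np.random.default_rng(seed)
    renders = []
    for _ in range(n_subsets):
        mask = rng.random(len(gaussians)) < keep          # random subset
        renders.append(render_fn([g for g, m in zip(gaussians, mask) if m]))
    stack = np.stack(renders)                             # (n_subsets, H, W)
    return float(stack.var(axis=0).mean())                # pixel-wise variance

# Toy check: Gaussians with large mutually-cancelling contributions
# (entangled) yield a higher CA than proportionally small, independent ones.
render = lambda gs: np.full((4, 4), float(sum(gs)))
ca_entangled = co_adaptation_score(render, [10.0, -10.0] * 4)
ca_independent = co_adaptation_score(render, [0.1, -0.1] * 4)
```

When Gaussians co-adapt, dropping a random subset changes the rendering drastically, so the variance (and thus CA) is high; well-separated Gaussians degrade gracefully.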
Multimodal LiDAR-Camera Novel View Synthesis with Unified Pose-free Neural Fields
Weiyi Xue · Fan Lu · Yunwei Zhu · Zehan Zheng · Haiyun Wei · Sanqing Qu · Jiangtong Li · Ya Wu · Guang Chen
Pose-free Neural Radiance Field (NeRF) aims at novel view synthesis (NVS) without relying on accurate poses, exhibiting significant practical value. Image and LiDAR point cloud are two pivotal modalities in autonomous driving scenarios. While demonstrating impressive performance, single-modality pose-free NeRFs often suffer from local optima due to the limited geometric information provided by dense image textures or the sparse, textureless nature of point clouds. Although prior methods have explored the complementary strengths of both modalities, they have only leveraged inherently sparse point clouds for discrete, non-pixel-wise depth supervision, and are limited to NVS of images. As a result, a Multimodal Unified Pose-free framework remains notably absent. In light of this, we propose MUP, a pose-free framework for LiDAR-Camera joint NVS in large-scale scenes. This unified framework enables continuous depth supervision for image reconstruction using LiDAR-Fields rather than discrete point clouds. By leveraging multimodal inputs, pose optimization receives gradients from the rendering loss of both point cloud geometry and image texture, thereby alleviating the issue of local optima commonly encountered in single-modality pose-free tasks. Moreover, to further guide pose optimization of NeRF, we propose a multimodal geometric optimizer that leverages geometric relations from point clouds and photometric regularization from adjacent image frames. In addition, to alleviate the domain gap between modalities, we propose a multimodal-specific coarse-to-fine training approach for unified, compact reconstruction. Extensive experiments on KITTI-360 and NuScenes datasets demonstrate MUP's superiority in accomplishing geometry-aware, modality-consistent, and pose-free 3D reconstruction.
AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians
Xiyu Zhang · Chong Bao · YiPeng Chen · Hongjia Zhai · Yitong Dong · Hujun Bao · Zhaopeng Cui · Guofeng Zhang
3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.
Flux4D: Flow-based Unsupervised 4D Reconstruction
Jingkang Wang · Henry Che · Yun Chen · Ze Yang · Lily Goli · Sivabalan Manivasagam · Raquel Urtasun
Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data, without requiring pre-trained supervised models or foundational priors, simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
Chaoyang Wang · Ashkan Mirzaei · Vidit Goel · Willi Menapace · Aliaksandr Siarohin · Michael Vasilkovsky · Ivan Skorokhodov · Vladislav Shakhrai · Sergei Korolev · Sergey Tulyakov · Peter Wonka
We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
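The sparse attention pattern described above, where tokens attend within the same frame, at the same timestamp, or from the same viewpoint, can be illustrated by constructing a boolean mask over (view, time) tokens. For simplicity this sketch uses one token per frame; real models have many tokens per frame, and the mask construction here is our illustration, not the paper's implementation.

```python
import numpy as np

def fused_attention_mask(n_views, n_times):
    """Boolean attention mask for the described sparse pattern: a token may
    attend to tokens in the same frame, with the same timestamp, or from the
    same viewpoint (one token per (view, time) frame here)."""
    coords = [(v, t) for v in range(n_views) for t in range(n_times)]
    n = len(coords)
    mask = np.zeros((n, n), dtype=bool)
    for i, (vi, ti) in enumerate(coords):
        for j, (vj, tj) in enumerate(coords):
            mask[i, j] = (vi == vj) or (ti == tj)  # same view or same time
    return mask
```

Each row then allows n_times same-view tokens plus n_views same-time tokens (with the token itself counted once), a small fraction of the full n_views × n_times sequence, which is what makes the fused layer tractable.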
Metropolis-Hastings Sampling for 3D Gaussian Reconstruction
Hyunjin Kim · Haebeom Jung · Jaesik Park
We propose an adaptive sampling framework for 3D Gaussian Splatting (3DGS) that leverages comprehensive multi-view photometric error signals within a unified Metropolis-Hastings approach. Vanilla 3DGS heavily relies on heuristic-based density-control mechanisms (e.g., cloning, splitting, and pruning), which can lead to redundant computations or premature removal of beneficial Gaussians. Our framework overcomes these limitations by reformulating densification and pruning as a probabilistic sampling process, dynamically inserting and relocating Gaussians based on aggregated multi-view errors and opacity scores. Guided by Bayesian acceptance tests derived from these error-based importance scores, our method substantially reduces reliance on heuristics, offers greater flexibility, and adaptively infers Gaussian distributions without requiring predefined scene complexity. Experiments on benchmark datasets, including Mip-NeRF360, Tanks and Temples, and Deep Blending, show that our approach reduces the number of Gaussians needed, achieving faster convergence while matching or modestly surpassing the view-synthesis quality of state-of-the-art models.
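A Metropolis-Hastings-style acceptance test over error-based importance scores might look like the following toy sketch. The proposal distribution and the acceptance ratio min(1, s_new / s_old) are illustrative assumptions; the paper's Bayesian acceptance tests and relocation mechanics are more involved.

```python
import numpy as np

def mh_relocate(scores, n_moves, seed=0):
    """Toy Metropolis-Hastings relocation: propose moving Gaussian `old`
    toward the position of Gaussian `new`, and accept with probability
    min(1, scores[new] / scores[old]), where scores are error-based
    importance scores (assumed form, for illustration only)."""
    rng = np.random.default_rng(seed)
    moves = []
    for _ in range(n_moves):
        old, new = rng.choice(len(scores), size=2, replace=False)  # proposal
        accept = min(1.0, scores[new] / max(scores[old], 1e-12))
        if rng.random() < accept:                 # MH acceptance test
            moves.append((int(old), int(new)))
    return moves
```

Relocations into high-error regions are almost always accepted, while moves toward low-error regions rarely are, so Gaussian density drifts to where the aggregated multi-view photometric error is largest, replacing hand-tuned clone/split/prune heuristics.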
IBGS: Image-Based Gaussian Splatting
Hoang Chuong Nguyen · Wei Mao · Jose M. Alvarez · Miaomiao Liu
3D Gaussian Splatting (3DGS) has recently emerged as a fast, high-quality method for novel view synthesis (NVS). However, its use of low-degree spherical harmonics limits its ability to capture spatially varying color and view-dependent effects such as specular highlights. Existing works augment Gaussians with either a global texture map, which struggles with complex scenes, or per-Gaussian texture maps, which introduces high storage overhead. We propose Image-Based Gaussian Splatting, an efficient alternative that leverages high-resolution source images for fine details and view-specific color modeling. Specifically, we model each pixel color as a combination of a base color from standard 3DGS rendering and a learned residual inferred from neighboring training images. This promotes accurate surface alignment and enables rendering images of high-frequency details and accurate view-dependent effects. Experiments on standard NVS benchmarks show that our method significantly outperforms prior Gaussian Splatting approaches in rendering quality, without increasing the storage footprint.
Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation
Xiang Li · Zirui Wang · Zixuan Huang · James Rehg
Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.
Complete Structure Guided Point Cloud Completion via Cluster- and Instance-Level Contrastive Learning
Yang Chen · Yirun Zhou · Weizhong Zhang · Cheng Jin
Point cloud completion, which aims to reconstruct the missing parts of incomplete point clouds, is a pivotal task in 3D computer vision. Traditional supervised approaches often necessitate complete point clouds for training supervision, which are not readily accessible in real-world applications. Recent studies have attempted to mitigate this dependency by employing self-supervised mechanisms. However, these approaches frequently yield suboptimal results due to the absence of complete structure in the point cloud data during training. To address these issues, in this paper we propose an effective framework that completes the point cloud under the guidance of a self-learned complete structure. A key contribution of our work is the development of a novel self-supervised complete structure reconstruction module, which can learn the complete structure explicitly from incomplete point clouds and thus eliminates the reliance on complete point clouds as training data. Additionally, we introduce a contrastive learning approach at both the cluster and instance level to extract shape features guided by the complete structure and to capture style features, respectively. This dual-level learning design ensures that the generated point clouds are both shape-complete and detail-preserving. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms state-of-the-art self-supervised methods.
LiteReality: Graphic-Ready 3D Scene Reconstruction from RGB-D Scans
Zhening Huang · Xiaoyang Wu · Fangcheng Zhong · Hengshuang Zhao · Matthias Niessner · Joan Lasenby
We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines, such as object individuality, articulation, and high-quality physically based rendering (PBR) materials. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and object set with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar artist-crafted 3D models from a curated asset database. The Material Painting module further enhances realism by recovering high-quality, spatially varying materials from observed images. Finally, the reconstructed scene is integrated into a simulation engine, where basic physical properties are assigned to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital-twin systems. In addition, LiteReality introduces a training-free object-retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD dataset, along with a robust Material Painting module capable of transferring appearances from images of any style to 3D assets—even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-world scans and standard public datasets.
SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction
Chensheng Dai · Shengjun Zhang · Min Chen · Yueqi Duan
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extraction. However, these approaches typically require dense input views and substantial time for per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial-sampling-rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves results comparable to state-of-the-art methods and predicts Gaussian surfels within 1 second, offering a 100× speedup without costly per-scene training.
PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
Yuchen Lin · Chenguo Lin · Panwang Pan · Honglei Yan · Feng Yiqiang · Yadong Mu · Katerina Fragkiadaki
We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e. first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data are released.
InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras
Erich Liang · Roma Bhattacharjee · Sreemanti Dey · Rafael Moschopoulos · Caitlin Wang · Michel Liao · Grace Tan · Andrew Wang · Karhan Kayan · Stamatis Alexandropoulos · Jia Deng
Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which often does not hold for real-world in-the-wild videos. A major obstacle in this field is the lack of dynamic camera intrinsics benchmarks: existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground-truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.
Depth-Supervised Fusion Network for Seamless-Free Image Stitching
Zhiying Jiang · Ruhao Yan · Zengxi Zhang · Bowei Zhang · Jinyuan Liu
Image stitching synthesizes images captured from multiple perspectives into a single image with a broader field of view. The significant variations in object depth often lead to large parallax, resulting in ghosting and misalignment in the stitched results. To address this, we propose a depth-consistency-constrained seamless-free image stitching method. First, to tackle the multi-view alignment difficulties caused by parallax, a multi-stage mechanism combined with global depth regularization constraints is developed to enhance the alignment accuracy of the same apparent target across different depth ranges. Second, during the multi-view image fusion process, an optimal stitching seam is determined through graph-based low-cost computation, and a soft-seam region is diffused to precisely locate transition areas, thereby effectively mitigating alignment errors induced by parallax and achieving natural and seamless stitching results. Furthermore, considering the computational overhead in the shift regression process, a reparameterization strategy is incorporated to optimize the structural design, significantly improving algorithm efficiency while maintaining optimal performance. Extensive experiments demonstrate the superior performance of the proposed method over existing methods. Code is available at https://github.com/DLUT-YRH/DSFN.
VETA-DiT: Variance-Equalized and Temporally Adaptive Quantization for Efficient 4-bit Diffusion Transformers
Qinkai Xu · Yijin Liu · Yang Chen · Lin Yang · Li Li · Yuxiang Fu
Diffusion Transformers (DiTs) have recently demonstrated remarkable performance in visual generation tasks, surpassing traditional U-Net-based diffusion models by significantly improving image and video generation quality and scalability. However, the large model size and iterative denoising process introduce substantial computational and memory overhead, limiting their deployment in real-world applications. Post-training quantization (PTQ) is a promising solution that compresses models and accelerates inference by converting weights and activations to low-bit representations. Despite its potential, PTQ faces significant challenges when applied to DiTs, often resulting in severe degradation of generative quality. To address these issues, we propose VETA-DiT (**V**ariance-**E**qualized and **T**emporal **A**daptation for **Di**ffusion **T**ransformers), a dedicated quantization framework for DiTs. Our method first analyzes the sources of quantization error from the perspective of inter-channel variance and introduces a Karhunen–Loève Transform enhanced alignment to equalize variance across channels, facilitating effective quantization under low bit-widths. Furthermore, to handle the temporal variation of activation distributions inherent in the iterative denoising steps of DiTs, we design an incoherence-aware adaptive method that identifies and properly calibrates timesteps with high quantization difficulty. We validate VETA-DiT on extensive image and video generation tasks, preserving acceptable visual quality under the more aggressive W4A4 configuration. Specifically, VETA-DiT reduces FID by 33.65 on the DiT-XL/2 model and by 45.76 on the PixArt-$\Sigma$ model compared to the baseline under W4A4, demonstrating its strong quantization capability and generative performance. Code is available at: https://github.com/xululi0223/VETA-DiT.
Policy Optimized Text-to-Image Pipeline Design
Uri Gadot · Rinon Gal · Yftah Ziser · Gal Chechik · Shie Mannor
Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines that combine various enhancement tools. While these pipelines significantly improve image quality, their effective design requires substantial expertise. Recent approaches automating this process through large language models (LLMs) have shown promise but suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow prediction training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
Role Bias in Diffusion Models: Diagnosing and Mitigating through Intermediate Decomposition
Sina Malakouti · Adriana Kovashka
Text-to-image (T2I) diffusion models exhibit impressive photorealistic image generation capabilities, yet they struggle with compositional image generation. In this work, we introduce RoleBench, a benchmark focused on evaluating compositional generalization in action-based relations (e.g., "mouse chasing cat"). We show that state-of-the-art T2I models and compositional generation methods consistently default to frequent reversed relations (i.e., "cat chasing mouse"), a phenomenon we call role collapse. Related work attributes this to architectural limitations of the model or underrepresentation in the data. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"), suggesting that this limitation is also due to the presence of frequent counterparts rather than just the absence of rare compositions. Motivated by this, we hypothesize that directional decomposition can gradually mitigate role collapse. We test this via ReBind, a lightweight framework that teaches role bindings using carefully selected active/passive intermediate compositions. Experiments suggest that intermediate compositions through simple fine-tuning can significantly reduce role collapse, with humans preferring ReBind over state-of-the-art methods more than 78% of the time. Our findings highlight the role of distributional asymmetries in compositional failures and offer a simple, effective path for improving generalization.
LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders
Boyu Han · Qianqian Xu · Shilong Bao · Zhiyong Yang · Kangli Zi · Qingming Huang
This paper explores LightFair, a novel lightweight approach that achieves fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction, incurring a heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder’s neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To fine-tune the text embeddings, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.
EditInfinity: Image Editing with Binary-Quantized Generative Models
Jiahuan Wang · Yuxin Chen · Jun Yu · Guangming Lu · Wenjie Pei
Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose \emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our \emph{EditInfinity} to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: \url{https://github.com/yx-chen-ust/EditInfinity}.
PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors
Sepehr Dehdashtian · Mashrur Mahmud Morshed · Jacob Seidman · Gaurav Bharaj · Vishnu Boddeti
Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SIDs' effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID’s failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).
Instant4D: 4D Gaussian Splatting in Minutes
Zhanpeng Luo · Haoxi Ran · Li Lu
Dynamic view synthesis has seen significant advances, yet reconstructing scenes from uncalibrated, casual video remains challenging due to slow optimization and complex parameter estimation. In this work, we present Instant4D, a monocular reconstruction system that leverages native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors. Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30× speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. Our method reconstructs a single video on the DyCheck dataset, typically 200 frames, within 10 minutes. We further apply our model to in-the-wild videos, showcasing its generalizability. Our project website is published at https://instant4d.github.io/.
Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates
Seunghyeok Shin · Dabin Kim · Hongki Lim
Reconstructing high-quality point clouds from images remains challenging in computer vision. Existing generative approaches, particularly those based on diffusion models, that directly learn the posterior may suffer from inflexibility: they require conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. Recent diffusion-based methods have attempted to address this by combining prior models with likelihood updates, but they rely on heuristic fixed step sizes for the likelihood update that lead to slow convergence and suboptimal reconstruction quality. We advance this line of approach by integrating our novel Forward Curvature-Matching (FCM) update method with diffusion sampling. Our method dynamically determines optimal step sizes using only forward automatic differentiation and finite-difference curvature estimates, enabling precise optimization of the likelihood update. This formulation enables high-fidelity reconstruction from both single-view and multi-view inputs, and supports various input modalities through simple operator substitution—all without retraining. Experiments on ShapeNet and CO3D datasets demonstrate that our method achieves superior reconstruction quality at matched or lower NFEs, yielding higher F-score and lower CD and EMD, validating its efficiency and adaptability for practical applications. Code is available at https://github.com/Seunghyeok0715/FCM.
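The curvature-matched step-size idea behind this abstract can be illustrated with a toy gradient step: the directional curvature $g^\top H g$ is estimated from one extra gradient evaluation via a forward finite difference, so no Hessian is ever formed. This is a minimal sketch under assumed notation, not the paper's exact FCM rule; `grad_fn` is a hypothetical gradient oracle for the likelihood term.

```python
import numpy as np

def curvature_matched_step(grad_fn, x, eps=1e-4):
    """One likelihood-update step with a curvature-matched step size.

    The directional curvature g^T H g is approximated with a single extra
    gradient call (finite difference along g), so no Hessian is formed.
    For a quadratic objective this step size is the exact line minimizer.
    """
    g = grad_fn(x)
    g2 = grad_fn(x + eps * g)            # gradient slightly along g
    curv = g @ (g2 - g) / eps            # ~ g^T H g
    alpha = (g @ g) / max(curv, 1e-12)   # curvature-matched step size
    return x - alpha * g
```

On a quadratic with isotropic curvature, a single such step lands exactly at the minimizer, which is what fixed heuristic step sizes cannot guarantee.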
LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering
Jonas Kulhanek · Marie-Julie Rakotosaona · Fabian Manhardt · Christina Tsalicoglou · Michael Niemeyer · Torsten Sattler · Songyou Peng · Federico Tombari
In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
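The distance-driven LOD selection described above can be sketched in a few lines; the data layout (`chunk_lods[i][l]` holding the Gaussian index array of chunk `i` at level `l`, with increasing distance band edges `lod_dists`) is a hypothetical simplification, not the paper's storage format.

```python
import numpy as np

def select_gaussians(cam_pos, chunk_centers, chunk_lods, lod_dists):
    """Pick, per spatial chunk, the LOD whose distance band contains the
    camera, and return the union of the selected Gaussian index arrays."""
    visible = []
    for center, lods in zip(chunk_centers, chunk_lods):
        d = np.linalg.norm(cam_pos - center)
        level = int(np.searchsorted(lod_dists, d))  # coarser LOD when farther
        level = min(level, len(lods) - 1)
        visible.append(lods[level])
    return np.concatenate(visible)
```

Only the returned subset would be resident on the GPU for the current frame, which is where the memory savings come from.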
Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance
Jack Goffinet · Youngjo Min · Carlo Tomasi · David Carlson
Accurate and scalable quantification of animal pose and appearance is crucial for studying behavior. Current 3D pose estimation techniques, such as keypoint- and mesh-based techniques, often face challenges including limited representational detail, labor-intensive annotation requirements, and expensive per-frame optimization. These limitations hinder the study of subtle movements and can make large-scale analyses impractical. We propose Pose Splatter, a novel framework leveraging shape carving and 3D Gaussian splatting to model the complete pose and appearance of laboratory animals without prior knowledge of animal geometry, per-frame optimization, or manual annotations. We also propose a rotation-invariant visual embedding technique for encoding pose and appearance, designed to be a plug-in replacement for 3D keypoint data in downstream behavioral analyses. Experiments on datasets of mice, rats, and zebra finches show Pose Splatter learns accurate 3D animal geometries. Notably, Pose Splatter represents subtle variations in pose, provides better low-dimensional pose embeddings than the state of the art as evaluated by humans, and generalizes to unseen data. By eliminating annotation and per-frame optimization bottlenecks, Pose Splatter enables analysis of large-scale, longitudinal behavior needed to map genotype, neural activity, and behavior at high resolutions.
MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation
Kerui Ren · Jiayang Bai · Linning Xu · Lihan Jiang · Jiangmiao Pang · Mulin Yu · Bo Dai
Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden applicability, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results on standard benchmarks and our dataset; results on casually captured real-world scenes further demonstrate the framework's robustness and wide generalization.
WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild
Morris Alper · David Novotny · Filippos Kokkinos · Hadar Averbuch-Elor · Tom Monnier
Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly fewer data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
PyraMotion: Attentional Pyramid-Structured Motion Integration for Co-Speech 3D Gesture Synthesis
Zhizhuo Yin · Yuk Hang Tsui · Pan Hui
Generating full-body human gestures encompassing face, body, hands, and global movements from audio is crucial yet challenging for virtual avatar creation. Existing systems tokenize gestures frame-wise, predicting tokens of each frame from the input audio. However, expressive human gestures consist of varied patterns with different frame lengths, and different body parts exhibit motion patterns of varying durations. Existing systems fail to capture motion patterns across body parts and temporal scales due to the fixed frame-count setting of their gesture tokens. Inspired by the success of the feature pyramid technique in multi-scale visual information extraction, we propose a novel framework named PyraMotion and an adaptive multi-scale feature capturing model called Attentive Pyramidal VQ-VAE (APVQ-VAE). Objective and subjective experiments demonstrate that PyraMotion outperforms state-of-the-art methods in terms of generating natural and expressive full-body human gestures. Extensive ablation experiments highlight that the self-adaptive integration through attention maps contributes to performance.
Visual Sync: Multi‑Camera Synchronization via Cross‑View Object Motion
Shaowei Liu · David Yao · Saurabh Gupta · Shenlong Wang
Today, people can easily record memorable moments such as concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross‑camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi‑view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co‑visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off‑the‑shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross‑view correspondences. It then jointly minimizes the epipolar error to estimate each camera’s time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an average synchronization error below 130 ms.
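The epipolar constraint underlying this idea can be sketched in a few lines: for a moving point tracked in two cameras, search for the frame offset that minimizes the algebraic epipolar error. This is an illustrative sketch, not the paper's implementation; `F` is an assumed known fundamental matrix, and VisualSync jointly optimizes a continuous offset rather than an integer grid.

```python
import numpy as np

def epipolar_error(F, x1, x2):
    """Mean algebraic epipolar error |x2^T F x1| for homogeneous points (N x 3)."""
    return np.abs(np.einsum("ni,ij,nj->n", x2, F, x1)).mean()

def estimate_offset(track1, track2, F, max_shift=30):
    """Grid-search the frame offset d such that track2[i + d] matches track1[i].

    track1, track2: (T, 2) pixel trajectories of the same moving point in two
    unsynchronized cameras; F: fundamental matrix from camera 1 to camera 2.
    """
    T = min(len(track1), len(track2))
    best = (None, np.inf)
    for d in range(-max_shift, max_shift + 1):
        i1 = np.arange(max(0, -d), min(T, T - d))  # overlap under shift d
        i2 = i1 + d
        if len(i1) < 10:
            continue
        x1 = np.c_[track1[i1], np.ones(len(i1))]   # homogeneous coordinates
        x2 = np.c_[track2[i2], np.ones(len(i2))]
        err = epipolar_error(F, x1, x2)
        if err < best[1]:
            best = (d, err)
    return best[0]
```

For a correctly synchronized pair, every co-visible moving point lies on its epipolar line, so the error drops sharply at the true offset.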
Test3R: Learning to Reconstruct 3D at Test Time
Yuheng Yuan · Qiuhong Shen · Shizun Wang · Xingyi Yang · Xinchao Wang
Dense matching methods like DUSt3R regress pairwise pointmaps for 3D reconstruction. However, the reliance on pairwise prediction and the limited generalization capability inherently restrict the global geometric consistency. In this work, we introduce \textbf{Test3R}, a surprisingly simple test-time learning technique that significantly boosts geometric accuracy. Using image triplets ($I_1,I_2,I_3$), Test3R generates reconstructions from pairs ($I_1,I_2$) and ($I_1,I_3$). The core idea is to optimize the network at test time via a self-supervised objective: maximizing the geometric consistency between these two reconstructions relative to the common image $I_1$. This ensures the model produces cross-pair consistent outputs, regardless of the inputs. Extensive experiments demonstrate that our technique significantly outperforms previous state-of-the-art methods on the 3D reconstruction and multi-view depth estimation tasks. Moreover, it is universally applicable and nearly cost-free, making it easily applied to other models and implemented with minimal test-time training overhead and parameter footprint. Code is available at https://github.com/nopQAQ/Test3R.
Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video
Jixuan He · Chieh Lin · Lu Qi · Ming-Hsuan Yang
Motion is one of the key components in deformable 3D scenes. Generative video models allow users to animate static scenes with text prompts for novel motion, but when it comes to 4D reconstruction, such reanimations often fall apart. The generated videos often suffer from geometric artifacts, implausible motion, and occlusions, which hinder physically consistent 4D reanimation. In this work, we introduce \textbf{Restage4D}, a geometry-preserving pipeline for deformable scene reconstruction from a single edited video. Our key insight is to leverage the unedited original video as an additional source of supervision, allowing the model to propagate accurate structure into occluded and disoccluded regions. To achieve this, we propose a video-rewinding training scheme that temporally bridges the edited and original sequences via a shared motion representation. We further introduce an occlusion-aware ARAP regularization to preserve local rigidity, and a disocclusion backtracing mechanism that supplements missing geometry in the canonical space. Together, these components enable robust reconstruction even when the edited input contains hallucinated content or inconsistent motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, bridging the gap between flexible video synthesis and physically grounded 4D reconstruction.
Non-Line-of-Sight 3D Reconstruction with Radar
Haowen Lai · Zitong Lan · Mingmin Zhao
Seeing hidden structures and objects around corners is critical for robots operating in complex, cluttered environments. Existing methods, however, are limited to detecting and tracking hidden objects rather than reconstructing the occluded full scene. We present HoloRadar, a practical system that reconstructs both line-of-sight (LOS) and non-line-of-sight (NLOS) 3D scenes using a single mmWave radar. HoloRadar uses a two-stage pipeline: the first stage generates high-resolution multi-return range images that capture both LOS and NLOS reflections, and the second stage reconstructs the physical scene by mapping mirrored observations to their true locations using a physics-guided architecture that models ray interactions. We deploy HoloRadar on a mobile robot and evaluate it across diverse real-world environments. Our evaluation results demonstrate accurate and robust reconstruction in both LOS and NLOS regions. Code, dataset and demo videos are available on the project website: https://waves.seas.upenn.edu/projects/holoradar.
Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues
Chinmay Talegaonkar · Nikhil Gandudi Suresh · Zachary Novack · Yash Belhe · Priyanka Nagasamudra · Nicholas Antipa
Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a \textit{pre-trained} diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner. To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.
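The defocus cue exploited here rests on the standard thin-lens relation between blur and depth; a minimal sketch of that forward model and its inversion follows (the paper's actual image-formation loss operates on full images, and the parameter names below are illustrative).

```python
def blur_diameter(z, z_f, f, A):
    """Thin-lens circle-of-confusion diameter for a point at depth z,
    with focus distance z_f, focal length f, and aperture diameter A
    (all in the same metric unit, e.g. meters)."""
    return A * f * abs(z - z_f) / (z * (z_f - f))

def depth_from_blur(c, z_f, f, A, far_side=True):
    """Invert the blur model for metric depth; `far_side` selects the
    z > z_f branch (the model is two-valued around the focus plane)."""
    k = c * (z_f - f) / (A * f)   # normalized blur: |z - z_f| / z
    return z_f / (1 - k) if far_side else z_f / (1 + k)
```

Because blur scales with aperture diameter, comparing the small- and large-aperture captures isolates the defocus signal that anchors the metric scale.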
Towards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery
Ming Hu · Zhengdi Yu · feilong tang · Kaiwen Chen · Yulong Li · Imran Razzak · Junjun He · Tolga Birdal · Kai-Jing Zhou · Zongyuan Ge
Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, a data-driven motion prior with cross-view geometric consistency and biomechanical constraints, and collision-aware constraints for hand–instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks, bimanual hand pose estimation and hand–instrument interaction reconstruction, and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand–two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2 mm in Mean Per Joint Position Error (MPJPE) for hand reconstruction and up to 23% in ADD-S metrics for instrument reconstruction.
Deep Gaussian from Motion: Exploring 3D Geometric Foundation Models for Gaussian Splatting
Yu Chen · Rolandos Alexandros Potamias · Evangelos Ververas · Jifei Song · Jiankang Deng · Gim Hee Lee
Neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) are popular techniques to reconstruct and render photorealistic images. However, the prerequisite of running Structure-from-Motion (SfM) to get camera poses limits their completeness. Although previous methods can reconstruct a few unposed images, they are not applicable when images are unordered or densely captured. In this work, we propose a method to train 3DGS from unposed images. Our method leverages a pre-trained 3D geometric foundation model as the neural scene representation. Since the accuracy of the predicted pointmaps does not suffice for accurate image registration and high-fidelity image rendering, we propose to mitigate the issue by initializing and fine-tuning the pre-trained model from a seed image. The images are then progressively registered and added to the training buffer, which is used to train the model further. We also propose to refine the camera poses and pointmaps by minimizing a point-to-camera ray consistency loss across multiple views. When evaluated on diverse challenging datasets, our method outperforms state-of-the-art pose-free NeRF/3DGS methods in terms of both camera pose accuracy and novel view synthesis, and even renders higher fidelity images than 3DGS trained with COLMAP poses.
Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
Minghao Yin · Yukang Cao · Kai Han
We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (text or images) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Through extensive quantitative and qualitative evaluations, we demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Gorjan Radevski · Teodora Popordanoska · Matthew Blaschko · Tinne Tuytelaars
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- when answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. Dataset: https://huggingface.co/datasets/gorjanradevski/dave Code: https://github.com/gorjanradevski/dave
MMPB: It’s Time for Multi-Modal Personalization
Jaeik Kim · Woojin Kim · Woohyeon Park · Jaeyoung Do
Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI.
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Animesh Gupta · Jay Parmar · Ishan Rajendrakumar Dave · Mubarak Shah
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
Benedetta Liberatori · Alessandro Conti · Lorenzo Vaquero · Yiming Wang · Elisa Ricci · Paolo Rota
What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.
Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation
Tien Nguyen · Dac Nguyen · Duc Nguyen The Minh · Trung Thanh Nguyen · Truong Thao Nguyen · Hieu Pham · Johan Barthelemy · Tran Minh Quan · Quoc Viet Hung Nguyen · Thanh Tam Nguyen · Mai Son · Chau Anh · Thanh Nguyen · Phi Le Nguyen
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. However, despite these advancements, the models still underperform on clinically critical criteria, particularly the diagnosis of lung cancer, indicating substantial room for future improvement. 
We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Zhongxing Xu · Chengzhi Liu · Qingyue Wei · Juncheng Wu · James Zou · Xin Wang · Yuyin Zhou · Sheng Liu
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, we observe that this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more on language priors. Attention analysis reveals that longer reasoning chains reduce focus on visual inputs, contributing to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, enabling evaluation of whether the model preserves visual grounding while reasoning. We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks, designed to jointly assess the balance of reasoning ability and hallucination. We find that (i) larger models generally exhibit a better balance between reasoning and perception; (ii) the balance between reasoning and perception depends more on the types and domains of the training data than on its volume. Our findings highlight the need for evaluation frameworks that account for both reasoning quality and perceptual reliability.
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
Tianle Li · Jihai Zhang · Yongming Rao · Yu Cheng
While large language models (LLMs) demonstrate strong reasoning capabilities utilizing reinforcement learning (RL) with verifiable reward, whether large vision-language models (VLMs) can directly inherit such capabilities through similar post-training strategies remains underexplored. In this work, we conduct a systematic compositional probing study to evaluate whether current VLMs trained with RL or other post-training strategies can compose capabilities across modalities or tasks under out-of-distribution conditions. We design a suite of diagnostic tasks that train models on unimodal tasks or isolated reasoning skills, and evaluate them on multimodal, compositional variants requiring skill integration. Through comparisons between supervised fine-tuning (SFT) and RL-trained models, we identify three key findings: (1) RL-trained models consistently outperform SFT on compositional generalization, demonstrating better integration of learned skills; (2) although VLMs achieve strong performance on individual tasks, they struggle to generalize compositionally under cross-modal and cross-task scenarios, revealing a significant gap in current training strategies; (3) enforcing models to explicitly describe visual content before reasoning (e.g., caption-before-thinking), along with rewarding progressive vision-to-text grounding, yields notable gains. It highlights two essential ingredients for improving compositionality in VLMs: visual-to-text alignment and accurate visual grounding. Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.
With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
Fabian Gröger · Shuo Wen · Huyen Le · Maria Brbic
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples -- less than 1\% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning the last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvements of 51.6\% in classification and 91.8\% in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions
Zhuo Cao · Heming Du · Bingqing Zhang · Xin Yu · Xue Li · Sen Wang
Existing Moment Retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called the QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over the prior SOTA method of 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
Multimodal Causal Reasoning for UAV Object Detection
Nianxin Li · Mao Ye · Lihua Zhou · Shuaifeng Li · Song Tang · Luping Ji · Ce Zhu
Unmanned Aerial Vehicle (UAV) object detection faces significant challenges due to complex environments and varying imaging conditions. These factors introduce significant changes in scale and appearance, particularly for small objects that occupy limited pixels and carry limited information, complicating detection tasks. To address these challenges, we propose a Multimodal Causal Reasoning framework based on a YOLO backbone for UAV Object Detection (MCR-UOD). The key idea is to use backdoor adjustment to discover condition-invariant object representations for easier detection. Specifically, the YOLO backbone is first adjusted to incorporate a pre-trained vision-language model. The original category labels are replaced with semantic text prompts, and the detection head is replaced with text-image contrastive learning. Based on this backbone, our method consists of two parts. The first part, named language-guided region exploration, discovers regions with a high probability of object existence using text embeddings from a vision-language model such as CLIP. The second part is a backdoor-adjustment causal reasoning module, which constructs a confounder dictionary tailored to different imaging conditions to capture global image semantics and derives a prior probability distribution over shooting conditions. During causal inference, we use the confounder dictionary and the prior to intervene on local instance features, disentangling condition variations and obtaining condition-invariant representations. Experimental results on several public datasets confirm the state-of-the-art performance of our approach. The code, data and models will be released upon publication of this paper.
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Zaiquan Yang · Yuhao LIU · Gerhard Hancke · Rynson Lau
Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution to STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as \textit{grounding tokens}, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to an inability to fully integrate the cues in the text query (\textit{e.g.}, attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries to query the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatially and temporally relevant visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy to assemble the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks.
MemEIC: A Step Toward Continual and Compositional Knowledge Editing
Jin Seong · Jiyun Park · Wencke Liermann · Hongseok Choi · Yoonji Nam · Hyun Kim · Soojong Lim · Namhoon Lee
The dynamic nature of information necessitates continuously updating large vision-language models (LVLMs). While recent knowledge editing techniques hint at promising directions, they often focus on editing a single modality (vision or language) in isolation. This prevalent practice neglects the inherent multimodality of LVLMs and the continuous nature of knowledge updates, potentially leading to suboptimal editing outcomes when considering the interplay between modalities and the need for ongoing knowledge refinement. To address these limitations, we propose MemEIC, a novel method for Continual and Compositional Knowledge Editing (CCKE) in LVLMs. MemEIC enables compositional editing of both visual and textual knowledge sequentially. Our approach employs a hybrid external-internal editor featuring a dual external memory for cross-modal evidence retrieval and dual LoRA adapters that facilitate disentangled parameter updates for each modality. A key component is a brain-inspired knowledge connector, activated selectively for compositional reasoning, that integrates information across different modalities. Experiments demonstrate that MemEIC significantly improves performance on complex multimodal questions and effectively preserves prior edits, setting a new benchmark for CCKE in LVLMs. Our project is available at https://github.com/MemEIC/MemEIC.
Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Jitesh Jain · Zhengyuan Yang · Humphrey Shi · Jianfeng Gao · Jianwei Yang
The standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the hidden representations of an MLLM's LLM. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single- and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Xinyu Zhang · Yuxuan Dong · Lingling Zhang · Chengyou Jia · Zhuohang Dang · Basura Fernando · Jun Liu · Mike Zheng Shou
Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to precisely discover and process the required regions during reasoning. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs' visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of the optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjusts visual focus toward the regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8\% with controllable increases in computational overhead.
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Zhaoyi Li · Xiaohan Zhao · Dong-Dong Wu · Jiacheng Cui · Zhiqiang Shen
Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against closed-source commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial black-box LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we propose to refine semantic clarity by encoding explicit semantic details within local regions, thus ensuring the capture of finer-grained features and inter-model transferability, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose *a simple yet highly effective baseline*: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. While the naive source-target matching method has been utilized before in the literature, we are the first to provide a tight analysis, which establishes a close connection between perturbation optimization and semantics. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5/3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90\% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods with lower $\ell_1/\ell_2$ perturbations. 
Our optimized adversarial examples under different configurations are available at https://huggingface.co/datasets/MBZUAI-LLM/M-Attack_AdvSamples and our training code at https://github.com/VILA-Lab/M-Attack.
Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains
Marianne Rakic · Siyu Gai · Etienne Chollet · John Guttag · Adrian Dalca
A single biomedical image can be segmented in multiple valid ways, depending on the application. For instance, a brain MRI may be divided according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology. Existing automatic segmentation models typically either (1) support only a single protocol---the one they were trained on---or (2) require labor-intensive prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. In extensive experiments across seven previously unseen domains, Pancakes consistently outperforms strong baselines, often by a wide margin, demonstrating its ability to produce diverse yet coherent segmentation maps on unseen domains.
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng · Kai Han
Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Chaoyou Fu · Haojia Lin · Xiong Wang · yifan zhang · Yunhang Shen · Xiaoyu Liu · Haoyu Cao · Zuwei Long · Heting Gao · Ke Li · Long MA · Xiawu Zheng · Rongrong Ji · Xing Sun · Caifeng Shan · Ran He
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, enabling omni-modal understanding and interaction.
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi · Wenyao Zhang · Yufei Ding · Runpei Dong · XinQiang Yu · Jingwen Li · Lingyun Xu · Baoyu Li · Xialin He · Guofan Fan · Jiazhao Zhang · Jiawei He · Jiayuan Gu · Xin Jin · Kaisheng Ma · Zhizheng Zhang · He Wang · Li Yi
While spatial reasoning has made progress in localizing objects and their relationships, it often overlooks object orientation—a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the ''plug-in'' direction of a USB or the ''handle'' direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of our SoFar, e.g., a zero-shot 48.7\% success rate on Open6DOR and a zero-shot 74.9\% success rate on SIMPLER-Env.
MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking
Tianhao Li · Tingfa Xu · Ying Wang · Haolin Qin · Xu Lin · Jianan Li
Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based multi-object tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising tracking reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial spectral cues that significantly enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark dataset for drone-based multispectral multi-object tracking. It features three key characteristics: (i) Large Scale — 125 video sequences with over 488.8K annotations across eight object categories; (ii) Comprehensive Challenges — covering diverse real-world challenges such as extremely small targets, high-density scenarios, severe occlusions, and complex platform motion; and (iii) Precise Oriented Annotations — enabling accurate localization and reduced object ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing MOT methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) an orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer architecture. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will benefit the community in advancing drone-based multispectral multi-object tracking research.
Our MMOT, code and benchmarks are publicly available at https://github.com/Annzstbl/MMOT.
DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response
Junjue Wang · Weihao Xuan · Heli Qi · Zhihao Liu · Kunyi Liu · Yuhan Wu · Hongruixuan Chen · JIAN SONG · Junshi Xia · Zhuo Zheng · Naoto YOKOYA
Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate the first remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: **1) Multi-hazard**: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. **2) Multi-sensor**: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. **3) Multi-task**: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLMs' reasoning abilities by progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, the cross-sensor gap, and insensitivity to counting damaged objects. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements (up to 10.4\%$\uparrow$QA, 2.1$\uparrow$Report, 40.8\%$\uparrow$Referring Seg.) with robust cross-sensor and cross-disaster generalization capabilities. Project: https://github.com/Junjue-Wang/DisasterM3.
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Yijun Liang · Ming Li · Chenrui Fan · Ziyue Li · Dang Nguyen · Kwesi Cobbina · Shweta Bhardwaj · Jiuhai Chen · Fuxiao Liu · Tianyi Zhou
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding in multimodal AI.
Towards Understanding Camera Motions in Any Video
Zhiqiu Lin · Siyuan Cen · Daniel Jiang · Jay Karhade · Hewei Wang · Chancharik Mitra · Yu Tong Tiffany Ling · Yuhan Huang · Rushikesh Zawar · Xue Bai · Yilun Du · Chuang Gan · Deva Ramanan
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our core contributions is a taxonomy or "language" of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while generative VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
Tianhao Peng · Haochen Wang · Yuanxing Zhang · Noah Wang · Zili Wang · Ge Zhang · Jian Yang · Shihao Li · Yanghai Wang · Xintao Wang · Houyi Li · Wei Ji · Pengfei Wan · Wenhao Huang · ZHAO-XIANG ZHANG · Jiaheng Liu
The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.
EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
Yuqian Yuan · Ronghao Dang · long li · Wentong Li · Dian Jiao · Xin Li · Deli Zhao · Fan Wang · Wenqiao Zhang · Jun Xiao · Yueting Zhuang
The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions, i.e., the object-level spatiotemporal reasoning capabilities required for real-world interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
Yicheng Xiao · Lin Song · Yukang Chen · Yingmin Luo · Yuxin Chen · Yukang Gan · Wei Huang · Xiu Li · Xiaojuan Qi · Ying Shan
Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, utilizing multimodal feedback to effectively guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks, while showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instructions. All code will be made public.
Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Wei Chen · Xin Yan · Bin Wen · Fan Yang · Tingting Gao · Di ZHANG · Long Chen
Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious "hallucination" issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern obtained from a distorted input. Yet, these handcrafted perturbations (e.g., adding noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize "robust" hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD: it matches DPO's hallucination suppression while preserving general capabilities, and outperforms handcrafted contrastive decoding methods.
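The contrastive-decoding step that DCD builds on can be sketched in a few lines. This is a generic illustration rather than the paper's implementation: the function name, the toy logits, and the mixing weight `alpha` are hypothetical, and DCD's learned negative image projection is replaced here by a hand-specified negative branch.

```python
def contrastive_decode(pos_logits, neg_logits, alpha=1.0):
    """Generic contrastive decoding: amplify the positive (clean) branch
    and subtract the negative (hallucination-modeling) branch.
    Higher alpha pushes harder against the negative pattern."""
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]

# Toy 3-token vocabulary: the positive branch alone would pick token 2,
# but the negative branch peaks on token 2 (a hallucinated word), so the
# adjusted logits demote it and token 0 wins instead.
pos = [2.0, 1.0, 2.5]
neg = [0.5, 0.5, 3.0]   # estimated hallucination pattern
adj = contrastive_decode(pos, neg, alpha=1.0)
best = max(range(len(adj)), key=adj.__getitem__)
```

The same arithmetic applies per decoding step over the full vocabulary; the method's quality hinges entirely on how faithfully the negative branch models real hallucination patterns, which is what DCD's learned negative projection targets.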
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Xiyao Wang · Zhengyuan Yang · Chao Feng · Hongjin Lu · Linjie Li · Chung-Ching Lin · Kevin Lin · Furong Huang · Lijuan Wang
We introduce ThinkLite-VL, a family of visual reasoning models that achieve state-of-the-art (SoTA) performance using an order of magnitude fewer training samples, relying purely on reinforcement fine-tuning (RFT) self-improvement without any knowledge distillation. Our central insight is that sample difficulty critically influences RFT effectiveness: appropriately challenging examples can drive substantial reasoning improvements, even in low-data regimes. However, quantifying sample difficulty in a reliable and scalable manner remains non-trivial. To address this, we repurpose Monte Carlo Tree Search (MCTS) to measure sample difficulty via the number of reasoning iterations a vision-language model (VLM) requires to solve each instance. This MCTS-based selection procedure identifies samples that induce deeper reasoning while remaining solvable, allowing us to filter a high-quality subset from 70k open-source examples spanning math, natural image understanding, and chart comprehension. Using this approach, we select just 11k challenging samples for RFT on Qwen2.5-VL-7B-Instruct and 7.5k samples for Qwen2.5-VL-72B-Instruct. The resulting models, ThinkLite-VL-7B and ThinkLite-VL-72B, significantly outperform their respective base models across eight visual reasoning benchmarks. In particular, ThinkLite-VL-7B improves the average performance of Qwen2.5-VL-7B-Instruct by 7\% and surpasses all existing 7B-level models, as well as much larger models such as GPT-4o, O1 and Qwen2.5-VL-72B, achieving a new SoTA score of 75.1 on MathVista. ThinkLite-VL-72B further advances the SoTA frontier, achieving an accuracy of 79.7 on MathVista and an average benchmark improvement of 4.42 over the open-source SOTA. These results demonstrate that MCTS-guided difficulty filtering provides a scalable and effective path toward data-efficient self-improvement in multimodal reasoning.
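The "challenging yet solvable" selection criterion can be illustrated with a minimal sketch. The function below is hypothetical and stands in for the paper's MCTS-based procedure: `solve_iters` abstracts away the actual tree search and simply reports how many iterations a sample took to solve (or `None` if it was never solved).

```python
def select_by_difficulty(samples, solve_iters, min_iters, budget):
    """Keep samples the model can solve (iterations <= budget) but only
    after substantial search (iterations >= min_iters), i.e. samples that
    are challenging yet solvable. Unsolved samples (None) are discarded."""
    keep = []
    for s in samples:
        it = solve_iters(s)
        if it is not None and min_iters <= it <= budget:
            keep.append(s)
    return keep

# Hypothetical per-sample iteration counts (None = never solved within budget).
iters = {"easy": 1, "hard": 40, "unsolvable": None, "medium": 12}
picked = select_by_difficulty(list(iters), iters.get, min_iters=10, budget=50)
```

Trivially easy samples contribute little gradient signal under RFT, and unsolvable ones contribute none, so filtering to the middle band is what makes the small 11k/7.5k subsets effective.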
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
Pingyi Chen · Yujing Lou · Shen Cao · Jinhui Guo · Lubin Fan · Yue Wu · Lin Yang · Lizhuang Ma · Jieping Ye
While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains underexplored due to the limited spatial representation ability of 2D images. In this paper, we analyze the problems hindering VLMs' spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances the fundamental spatial perception abilities of VLMs through two key contributions: (1) a Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) a simple depth positional encoding method that strengthens VLMs' spatial awareness. The MSMU dataset includes massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM that shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPTBench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao Yang · Junyi Li · Xin Lai · Jinming Wu · Wei Li · Zejun MA · Bei Yu · Hengshuang Zhao · Jiaya Jia
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples at different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model can output a special token to request the higher-resolution image. Compared to existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image-resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. All our code and data are open-sourced.
GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection
Jiaming Li · Zhijia Liang · Weikai Chen · Lin Ma · Guanbin Li
Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings -- leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited for its respective roles. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization.
QuARI: Query Adaptive Retrieval Improvement
Eric Xing · Abby Stylianou · Robert Pless · Nathan Jacobs
Massive-scale pretraining has made vision-language models increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
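A minimal sketch of query-adaptive linear re-ranking, under the simplifying assumption that the query-specific transformation is a diagonal re-weighting (the paper learns a general linear map; `weight_fn`, `emphasize`, the toy query, and the toy embeddings are all hypothetical):

```python
def query_adaptive_rerank(query, db, weight_fn):
    """Re-rank database embeddings under a query-specific linear map.
    Because the map is linear (here a diagonal re-weighting), applying it
    to millions of stored embeddings costs only one elementwise pass,
    which is what makes per-query specialization affordable at scale."""
    w = weight_fn(query)  # query-specific transform, computed once per query
    score = lambda v: sum(wi * qi * vi for wi, qi, vi in zip(w, query, v))
    return sorted(range(len(db)), key=lambda i: score(db[i]), reverse=True)

# Hypothetical weight_fn: emphasize dimensions where the query is strong.
emphasize = lambda q: [abs(x) for x in q]

query = [1.0, 0.1]
db = [[0.2, 3.0],   # strong only on the dim the query barely uses
      [0.9, 0.2]]   # aligned with the query's dominant dim
order = query_adaptive_rerank(query, db, emphasize)
```

The key design point carried over from the abstract is that the expensive part (producing the transformation) happens once per query, while the per-embedding cost stays linear, so the method suits both full retrieval and re-ranking.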
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu · Fangfu Liu · Yi-Hsin Hung · Yueqi Duan
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder—initialized from the backbone of the visual geometry model—to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks.
Efficient Rectified Flow for Image Fusion
Zirui Wang · Jiayi Zhang · Tianwei Guan · Yuhan Zhou · Xingyuan Li · Minjing Dong · Jinyuan Liu
Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities into a single fused image. In recent years, diffusion models have made significant progress in the field of image fusion. However, diffusion models often require complex computations and redundant inference time, which limits their applicability. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality.
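Why a rectified (straightened) flow permits one-step sampling can be seen in a toy setting. The sketch below is illustrative only, not the paper's model: for an ideal straight-line path between a noise sample and a data sample, the velocity field is constant along the path, so a single full-length Euler step lands exactly where many small steps would.

```python
def euler_sample(x0, v, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `steps` Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        vel = v(x, t)
        x = [xi + vel[i] * dt for i, xi in enumerate(x)]
    return x

# Toy rectified flow between a "noise" point x0 and a "data" point x1:
# along a straight path the velocity is the constant x1 - x0, so one
# Euler step is exact and matches many-step integration.
x0, x1 = [0.0, 1.0], [2.0, -1.0]
v = lambda x, t: [b - a for a, b in zip(x0, x1)]
one_step = euler_sample(x0, v, steps=1)
many_step = euler_sample(x0, v, steps=100)
```

A trained diffusion model's paths are curved, so one Euler step overshoots; rectification straightens the learned paths so the one-step shortcut becomes accurate, which is the efficiency argument behind RFfusion.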
Towards General Continuous Memory for Vision-Language Models
Wenyi WU · Zixuan Song · Kun Zhou · Yifei Shao · Zhiting Hu · Biwei Huang
Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings, to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2\% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach, CoMEM, utilizes the VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach. Code and data are publicly released at https://github.com/WenyiWU0111/CoMEM.
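A minimal sketch of the continuous-memory idea, using chunked mean-pooling as a hypothetical stand-in for the trained memory encoder (the paper fine-tunes the VLM itself to produce the compact embeddings; the function and dimensions here are illustrative):

```python
def compress_to_memory(token_embs, k=8):
    """Stand-in for a learned memory encoder: pool a long sequence of
    token embeddings into at most k dense "memory" vectors by chunked
    mean-pooling. The point is the interface, not the pooling: a long
    knowledge sequence enters, a fixed small set of embeddings leaves."""
    n = len(token_embs)
    chunk = max(1, -(-n // k))  # ceil(n / k)
    memory = []
    for i in range(0, n, chunk):
        block = token_embs[i:i + chunk]
        memory.append([sum(col) / len(block) for col in zip(*block)])
    return memory

# 1000 knowledge tokens of dim 4 shrink to 8 memory vectors, so the
# frozen VLM's context grows by 8 slots instead of 1000.
tokens = [[float(i % 5), 1.0, 0.0, 2.0] for i in range(1000)]
memory = compress_to_memory(tokens, k=8)
```

The compression ratio is what makes the memory plug-and-play: since the inference-time VLM is frozen, the 8 embeddings can simply be prepended to (or interleaved with) the normal input without retraining the reader model.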
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
Bryan Wong · Jongwoo Kim · Huazhu Fu · Mun Yi
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent–child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch–text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL.
ControlFusion: A Controllable Image Fusion Network with Language-Vision Degradation Prompts
Linfeng Tang · Yeda Wang · Zhanchuan Cai · Junjun Jiang · Jiayi Ma
Current image fusion methods struggle with real-world composite degradations and lack the flexibility to accommodate user-specific needs. To address this, we propose ControlFusion, a controllable fusion network guided by language-vision prompts that adaptively mitigates composite degradations. On the one hand, we construct a degraded imaging model based on physical mechanisms, such as the Retinex theory and atmospheric scattering principle, to simulate composite degradations and provide a data foundation for addressing realistic degradations. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features according to degradation prompts, enabling adaptability to varying degradation levels. To support user-specific preferences in visual quality, a text encoder is incorporated to embed user-defined degradation types and levels as degradation prompts. Moreover, a spatial-frequency collaborative visual adapter is designed to autonomously perceive degradations from source images, thereby reducing complete reliance on user instructions. Extensive experiments demonstrate that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly under real-world and compound degradations.
One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Yujing Sun · Lingchen Sun · Shuaizheng Liu · Rongyuan Wu · Zhengqiang ZHANG · Lei Zhang
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models will be released.
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
Zhentao He · Can Zhang · Ziheng Wu · Zhenghao Chen · Yufei Zhan · Yifan Li · Zhao Zhang · Xian Wang · Minghui Qiu
Recent advancements in multimodal large language models (MLLMs) have enhanced document understanding by integrating textual and visual information. However, the response paradigm of existing models breaks down in real-world scenarios, particularly under visual degradation (e.g., blur, occlusion, low contrast). In such conditions, models often fail to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in hallucinatory content, especially when a precise answer is not feasible. To demonstrate and analyze this problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. The dataset includes test samples spanning identity cards, invoices, and prescriptions, with simulated real-world degradations and pixel-level annotations for OCR reliability. This setup evaluates a model's capacity, under degraded input, to distinguish reliable visual information and answer accordingly, highlighting the challenge of avoiding hallucination on uncertain data. To achieve vision-faithful reasoning and thereby avoid these issues, we further introduce a Group Relative Policy Optimization (GRPO)-based framework featuring a novel reward mechanism. By incorporating self-awareness of visual uncertainty and a refusal-to-answer mechanism that raises task difficulty within our supervised fine-tuning and reinforcement learning framework, we successfully mitigate hallucinations in ambiguous regions.
Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a ~28% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA, with no significant performance drop on standard tasks, highlighting both effectiveness and robustness. This work advances the development of reliable MLLMs for real-world document analysis by addressing critical challenges in visual-linguistic alignment under degradation.
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models
Ye Sun · Hao Zhang · Henghui Ding · Tiehua Zhang · Xingjun Ma · Yu-Gang Jiang
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck: the lack of high-quality, unified video instruction data and of a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video referring understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities. Finally, we establish SAMA-Bench, a meticulously designed benchmark consisting of 5,067 questions from 522 videos, to comprehensively evaluate the integrated capabilities of Video LMMs in multi-turn, spatio-temporal referring understanding and grounded dialogue. Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Chuofan Ma · Yi Jiang · Junfeng Wu · Jihan Yang · Xin Yu · Zehuan Yuan · BINGYUE PENG · Xiaojuan Qi
Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from the limited representational capacity of the discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6\% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability without compromising understanding performance. Additionally, we show that UniTok favors CFG-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256$\times$256 benchmark. All code and models have been made publicly available.
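The multi-codebook idea can be sketched as follows (a hypothetical NumPy illustration; `multi_codebook_quantize` and all sizes are assumptions, not UniTok's implementation): each latent vector is split into chunks, each chunk is quantized against its own small codebook, and the effective vocabulary becomes the product of the sub-codebook sizes.

```python
import numpy as np

def multi_codebook_quantize(z, codebooks):
    """Quantize latents z: (n, d) by splitting each into len(codebooks) chunks
    and nearest-neighbor matching each chunk against its own codebook
    of shape (vocab, d // len(codebooks))."""
    chunks = np.split(z, len(codebooks), axis=1)
    indices, quantized = [], []
    for chunk, cb in zip(chunks, codebooks):
        # squared L2 distance from every chunk to every code
        d2 = ((chunk[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        indices.append(idx)
        quantized.append(cb[idx])
    return np.stack(indices, axis=1), np.concatenate(quantized, axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))                               # 4 latents of dim 8
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]  # two 16-entry sub-codebooks
idx, zq = multi_codebook_quantize(z, codebooks)
print(idx.shape, zq.shape)  # (4, 2) (4, 8)
```

Two 16-entry sub-codebooks yield 16 * 16 = 256 composite codes, which is how the vocabulary scales without one huge codebook.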
FlowFeat: Pixel-Dense Embedding of Motion Profiles
Nikita Araslanov · Anna Sonnweber · Daniel Cremers
Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step towards reliable and versatile dense image representations.
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Minsoo Kim · Kyuhong Shim · Jungwook Choi · Simyung Chang
Modern multimodal large language models (MLLMs) can reason over hour-long videos, yet their key–value (KV) cache grows linearly with time—quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for \textit{streaming} video understanding. During video encoding, it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94\%, sustains real-time generation, and matches or surpasses full-cache accuracy—even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
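The two-criterion compression pass might look like the following sketch (the similarity and norm scores are speculative readings of the TaR and VaN names in the abstract; `compress_kv` and its parameters are illustrative, not the paper's code): first drop tokens whose key closely matches its temporal predecessor, then keep the survivors with the largest value norms.

```python
import numpy as np

def compress_kv(keys, values, budget, redundancy_drop=0.5):
    """Two-pass KV-cache compression sketch. keys, values: (t, d).
    Pass 1 drops a fraction of the excess tokens that are most similar to
    their temporal predecessor (TaR-style); pass 2 keeps the `budget`
    survivors with the largest value norms (VaN-style). Returns kept indices."""
    t = keys.shape[0]
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    # cosine similarity of each token's key to the previous token's key
    sim = np.concatenate([[0.0], (k[1:] * k[:-1]).sum(axis=1)])
    n_drop = int(redundancy_drop * max(t - budget, 0))
    keep = np.argsort(sim)[: t - n_drop]          # least redundant survive
    vnorm = np.linalg.norm(values[keep], axis=1)
    keep = keep[np.argsort(-vnorm)[:budget]]      # largest value norms survive
    return np.sort(keep)

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
kept = compress_kv(keys, values, budget=20)
print(len(kept))  # 20
```

Because the pass runs whenever the cache hits the threshold, peak memory stays bounded by `budget` regardless of stream length.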
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Hoigi Seo · Dong Un Kang · Hyunjin Cho · Joohoon Lee · Se Young Chun
Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, crucial challenges remain in LVLMs, such as object hallucination: generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis reveals positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with other prior methods.
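Masking uncertain tokens during self-attention can be sketched as follows (a generic key-masking illustration; the function name and how the uncertain indices are obtained are assumptions, not the paper's procedure): masked tokens receive attention logits of negative infinity as keys, so no other token can attend to them.

```python
import numpy as np

def mask_uncertain_tokens(attn_logits, uncertain_idx):
    """Suppress uncertain visual tokens by masking them as attention *keys*
    before softmax. attn_logits: (n_queries, n_keys). Masked columns get
    zero attention weight after normalization."""
    out = attn_logits.copy()
    out[:, uncertain_idx] = -np.inf
    # numerically stable softmax over keys; exp(-inf) = 0
    e = np.exp(out - out.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.default_rng(2).normal(size=(5, 5))
probs = mask_uncertain_tokens(logits, [1, 3])  # tokens 1 and 3 deemed uncertain
```

Each row of `probs` still sums to one, but all attention mass is redistributed away from the masked tokens.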
R1-ShareVL: Incentivizing Reasoning Capabilities of Multimodal Large Language Models via Share-GRPO
Huanjin Yao · Qixiang Yin · Jingyi Zhang · Min Yang · Yibo Wang · Wenhao Wu · Fei Su · Li Shen · Minghui Qiu · Dacheng Tao · Jiaxing Huang
In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations on six widely-used reasoning benchmarks showcase the superior performance of our method. Code is available at https://github.com/HJYao00/R1-ShareVL.
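The hierarchical advantage computation might be sketched like this (a speculative reading of "across and within question variants"; `share_grpo_advantages` and the mixing weight `alpha` are assumptions, not the authors' formulation): each rollout's reward is normalized both against its own variant's group and against the pooled rewards of all variants, then the two signals are combined.

```python
import numpy as np

def share_grpo_advantages(rewards, alpha=0.5):
    """rewards: dict mapping variant id -> array of rollout rewards.
    Returns per-rollout advantages mixing a within-variant baseline with
    a baseline shared across all variants of the question, GRPO-style."""
    all_r = np.concatenate(list(rewards.values()))
    g_mean, g_std = all_r.mean(), all_r.std() + 1e-8
    adv = {}
    for vid, r in rewards.items():
        within = (r - r.mean()) / (r.std() + 1e-8)  # within-variant normalization
        across = (r - g_mean) / g_std               # shared across variants
        adv[vid] = alpha * within + (1 - alpha) * across
    return adv

rewards = {"original": np.array([1.0, 0.0, 1.0]),
           "paraphrase": np.array([0.0, 0.0, 1.0])}
adv = share_grpo_advantages(rewards)
```

Pooling across variants keeps the advantage signal informative even when one variant's rollouts all receive the same sparse reward.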
Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
Xingguang Wei · Haomin Wang · Shenglong Ye · Ruifeng Luo · Zhang · Lixin Gu · Jifeng Dai · Yu Qiao · Wenhai Wang · Hongjie Zhang
We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable \textit{things} and the semantic regions of uncountable \textit{stuff} in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose \textit{VecFormer}, a novel method that addresses these challenges through \textit{line-based representation} of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a \textit{Branch Fusion Refinement} module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively—highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning
Yang Li · Aming WU · Zihao Zhang · Yahong Han
In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only supervision from labeled (base) 3D classes. The key to this task is to set up exact correlations between the point representations and their base class labels, as well as the representation correlations between points from base and novel classes. Coarse or purely statistical correlation learning may lead to confusion in novel class inference. If we impose a causal relationship as a strongly correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through the SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes' causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiority of our method.
Seg-VAR: Image Segmentation with Visual Autoregressive Modeling
Rongkun Zheng · Lu Qi · Xi Chen · Yi Wang · Kun Wang · Hengshuang Zhao
While visual autoregressive modeling (VAR) strategies have shed light on image generation with autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing discriminative learning with a latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of the segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems.
OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
Bo-Wen Yin · Jiao-Long Cao · Xuying Zhang · Yuming Chen · Ming-Ming Cheng · Qibin Hou
Recent research on representation learning has proven the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, also called OmniSegmentor, which contains five popular visual modalities; 2) We provide an efficient pretraining scheme that endows the model with the capacity to encode the information of each modality in OmniSegmentor. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the combination of involved modalities. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360. Data, model checkpoints, and source code will be made publicly available: https://github.com/VCIP-RGBD/DFormer.
UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation
Xiaoqi Zhao · Youwei Pang · Chenyang Yu · Lihe Zhang · Huchuan Lu · Shijian Lu · Georges Fakhri · Xiaofeng Liu
Multi-modal image segmentation faces real-world deployment challenges, as incomplete or corrupted modalities degrade performance. While existing methods address training-inference modality gaps via specialized per-combination models, they introduce high deployment costs by requiring exhaustive model subsets and model-modality matching. In this work, we propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across the input, feature, and output levels. First, we adopt modality reconstruction with a hybrid shuffled-masking augmentation, encouraging the model to learn intrinsic modality characteristics and generate meaningful representations for missing modalities through cross-modal fusion. Next, modality-invariant contrastive learning implicitly compensates for the feature-space distance among incomplete-complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine-tuned under a hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms state-of-the-art methods under diverse missing-modality scenarios on MRI-based brain tumor segmentation, RGB-D semantic segmentation, and RGB-D/T salient object segmentation. The code will be released at \url{https://github.com/Xiaoqi-Zhao-DLUT/UniMRSeg}.
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang · Chen-Wei Xie · Haiyang Wang · Xiaoyi Bao · Tingyu Weng · Pandeng Li · Yun Zheng · Liwei Wang
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that unifies fine-grained visual perception tasks through an open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby achieving superior performance on the challenging reasoning segmentation task. Code and models are available at https://github.com/nnnth/UFO.
OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields
Lisa Weijler · Sebastian Koch · Fabio Poiesi · Timo Ropinski · Pedro Hermosilla
Modeling the inherent hierarchical structure of 3D objects and 3D scenes is highly desirable, as it enables a more holistic understanding of environments for autonomous agents. Accomplishing this with implicit representations, such as Neural Radiance Fields, remains an unexplored challenge. Existing methods that explicitly model hierarchical structures often face significant limitations: they either require multiple rendering passes to capture embeddings at different levels of granularity, significantly increasing inference time, or rely on predefined, closed-set discrete hierarchies that generalize poorly to the diverse and nuanced structures encountered by agents in the real world. To address these challenges, we propose OpenHype, a novel approach that represents scene hierarchies using a continuous hyperbolic latent space. By leveraging the properties of hyperbolic geometry, OpenHype naturally encodes multi-scale relationships and enables smooth traversal of hierarchies through geodesic paths in latent space. Our method outperforms state-of-the-art approaches on standard benchmarks, demonstrating superior efficiency and adaptability in 3D scene understanding.
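The hyperbolic geometry at the core of OpenHype can be illustrated with the standard Poincaré-ball distance (the closed-form formula is textbook; the example points are illustrative assumptions): embeddings near the origin behave like coarse concepts and embeddings near the boundary like fine-grained ones, because geodesic distance from the origin grows much faster than the Euclidean norm.

```python
import numpy as np

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball (standard closed form):
    arccosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    uu, vv = (u * u).sum(), (v * v).sum()
    duv = ((u - v) ** 2).sum()
    return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))

origin = np.zeros(2)
parent = np.array([0.3, 0.0])  # coarse concept, near the origin
child = np.array([0.9, 0.0])   # fine-grained concept, near the boundary
# Norm encodes hierarchy depth: the child is far deeper than its
# Euclidean norm alone would suggest.
assert poincare_dist(origin, child) > poincare_dist(origin, parent)
```

This is why a single continuous hyperbolic latent space can encode an open-ended hierarchy: moving along a geodesic toward the boundary traverses coarse-to-fine levels without discrete hierarchy labels.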
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
Junhao Shen · Haiteng Zhao · Yuzhe Gu · Songyang Gao · Kuikun Liu · Haian Huang · Jianfei Gao · Dahua Lin · Wenwei Zhang · Kai Chen
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow-thinking ability, because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL method for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns the slow-thinking reasoning ability from the obtained reasoning trajectories using the propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B scales show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50\% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08\% and 49.95\% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation
Xiang Fang · Wanlong Fang · Changshuo Wang
Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the \textbf{Hierarchical Semantic-Augmented Navigation (HSAN)} framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations—from objects to regions to zones—enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Ansel Blume · Jeonghwan Kim · Hyeonjeong Ha · Elen Chatikyan · Xiaomeng Jin · Khanh Nguyen · Nanyun Peng · Kai-Wei Chang · Derek Hoiem · Heng Ji
Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning—yet, large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 parts and 5346 objects for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY utilizes highly technical concepts and challenges models to compare objects’ parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that utilizes span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM dominates existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
Yan Shu · Hangui Lin · Yexin Liu · Yan Zhang · Gangyan Zeng · Yan Li · Yu Zhou · Ser Nam Lim · Harry Yang · Nicu Sebe
Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM that attend more strongly to scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of 1,740 samples spanning both semantic and non-semantic cases, with manually curated question–answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Zhicheng Zhang · Weicheng Wang · Yongjie Zhu · Wenyu Qin · Pengfei Wan · Di ZHANG · Jufeng Yang
Understanding and predicting emotions from videos has garnered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions—characterized by their open-set, dynamic, and context-dependent properties—poses challenges in understanding complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
Yandan Yang · Baoxiong Jia · Shujie Zhang · Siyuan Huang
Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Xinyan Chen · Renrui Zhang · Dongzhi JIANG · Aojun Zhou · Shilin Yan · Weifeng Lin · Hongsheng Li
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar textual reasoning to image inputs, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista and +28.78% on GeoQA, respectively.
Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior
Yulin Li · Haokun GUI · Ziyang Fan · Junjie Wang · Bin Kang · BIN CHEN · Zhuotao Tian
Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with the lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, they introduce additional computational cost before feature encoding, and their binary frame-selection paradigm proves suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 2.5x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code and models will be made publicly available.
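The abstract does not specify DyToK's exact allocation rule, but the core idea of attention-weighted per-frame token retention can be sketched in a few lines. All names, the budget scheme, and the per-token saliency scores below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def dynamic_token_retention(frame_tokens, frame_attention, budget_ratio=0.5):
    # frame_tokens: list of per-frame lists of (token_id, saliency) pairs
    # frame_attention: one query-conditioned attention score per frame
    total = sum(len(t) for t in frame_tokens)
    budget = int(total * budget_ratio)          # global token budget
    weights = np.asarray(frame_attention, dtype=float)
    weights = weights / weights.sum()           # normalize over frames
    kept = []
    for tokens, w in zip(frame_tokens, weights):
        # Per-frame retention ratio scales with the frame's attention,
        # so semantically rich frames keep more tokens.
        k = max(1, min(len(tokens), round(budget * w)))
        order = np.argsort([-s for _, s in tokens])[:k]
        kept.append([tokens[i] for i in sorted(order)])
    return kept
```

A frame with four times the attention weight of another would, under this sketch, retain roughly four times as many of its tokens, which is the dynamic (non-binary) alternative to dropping whole frames.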
Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs
Kejia Zhang · Keda TAO · Jiasheng Tang · Huan Wang
Large vision-language models (LVMs) extend large language models (LLMs) with visual perception capabilities, enabling them to process and interpret visual information. A major challenge compromising their reliability is object hallucination, in which LVMs generate plausible but factually inaccurate information. We propose a novel \textit{visual adversarial perturbation (VAP)} method to mitigate this hallucination issue. VAP alleviates LVM hallucination by applying strategically optimized visual noise without altering the base model. Our approach formulates hallucination suppression as an optimization problem, leveraging adversarial strategies to generate beneficial visual perturbations that enhance the model's factual grounding and reduce parametric knowledge bias. Extensive experimental results demonstrate that our method consistently reduces object hallucinations across 8 state-of-the-art LVMs, validating its efficacy across diverse evaluations.
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
Weifeng Lin · Xinyu Wei · Ruichuan An · Tianhe Ren · Tingwei Chen · Renrui Zhang · Ziyu Guo · Wentao Zhang · Lei Zhang · Hongsheng Li
We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2$-$2.4$\times$ faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Huanjin Yao · Jiaxing Huang · Wenhao Wu · Jingyi Zhang · Yibo Wang · Shunyu Liu · Yingjie Wang · YuXin Song · Haocheng Feng · Li Shen · Dacheng Tao
In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved until the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into ``tree search'' for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations: Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code is available at https://github.com/HJYao00/Mulberry.
Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization
Michael Green · Matan Levy · Issar Tzachor · Dvir Samuel · Nir Darshan · Rami Ben-Ari
We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object, in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-Object Attention Optimization, a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will pave the way and inspire further research to enhance retrieval performance for this highly practical task.
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
Zinuo Li · Xian Zhang · Yongxin Guo · Mohammed Bennamoun · Farid Boussaid · Girish Dwivedi · Luqi Gong · Qiuhong Ke
Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like “A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding” requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis.
Statistical Analysis of an Adversarial Bayesian Weak Supervision Method
Steven An
Programmatic Weak Supervision (PWS) aims to reduce the cost of constructing large high quality labeled datasets often used in training modern machine learning models. A major component of the PWS pipeline is the label model, which amalgamates predictions from multiple noisy weak supervision sources, i.e. labeling functions (LFs), to label datapoints. While most label models are either probabilistic or adversarial, a recently proposed label model achieves strong empirical performance without falling into either camp. That label model constructs a polytope of plausible labelings using the LF predictions and outputs the "center" of that polytope as its proposed labeling. In this paper, we attempt to theoretically study that strategy by proposing Bayesian Balsubramani-Freund (BBF), a label model that implicitly constructs a polytope of plausible labelings and selects a labeling in its interior. We show an assortment of statistical results for BBF: log-concavity of its posterior, its form of solution, consistency, and rates of convergence. Extensive experiments compare our proposed method against twelve baseline label models over eleven datasets. BBF compares favorably to other Bayesian label models and label models that don't use datapoint features -- matching or exceeding their performance on eight out of eleven datasets.
Stratify or Die: Rethinking Data Splits in Image Segmentation
Naga Venkata Sai Jitin Jami · Thomas Altstidl · Jonas Mueller · Jindong Li · Dario Zanca · Bjoern Eskofier · Heike Leutheuser
Random splitting of datasets in image segmentation often leads to unrepresentative test sets, resulting in biased evaluations and poor model generalization. While stratified sampling has proven effective for addressing label distribution imbalance in classification tasks, extending these ideas to segmentation remains challenging due to the multi-label structure and class imbalance typically present in such data. Building on existing stratification concepts, we introduce Iterative Pixel Stratification (IPS), a straightforward, label-aware sampling method tailored for segmentation tasks. Additionally, we present Wasserstein-Driven Evolutionary Stratification (WDES), a novel genetic algorithm designed to minimize the Wasserstein distance, thereby optimizing the similarity of label distributions across dataset splits. We prove that WDES is globally optimal given enough generations. Using newly proposed statistical heterogeneity metrics, we evaluate both methods against random sampling and find that WDES consistently produces more representative splits. Applying WDES across diverse segmentation tasks, including street scenes, medical imaging, and satellite imagery, leads to lower performance variance and improved model evaluation. Our results also highlight the particular value of WDES in handling small, imbalanced, and low-diversity datasets, where conventional splitting strategies are most prone to bias.
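The WDES fitness objective — how close each split's label distribution lies to the full dataset's, measured by Wasserstein distance — can be sketched independently of the genetic search. The function names, the 1-D class ordering, and the two-way split below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def label_distribution(masks, n_classes):
    # Fraction of pixels per class across a list of segmentation masks.
    counts = np.zeros(n_classes)
    for m in masks:
        counts += np.bincount(m.ravel(), minlength=n_classes)
    return counts / counts.sum()

def wasserstein_1d(p, q):
    # W1 between two distributions over ordered class indices:
    # the sum of absolute CDF differences.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def split_fitness(masks, split_idx, n_classes):
    # Lower is better: both sides of the split should match the
    # full dataset's label distribution.
    full = label_distribution(masks, n_classes)
    rest = [i for i in range(len(masks)) if i not in set(split_idx)]
    score = 0.0
    for idx in (list(split_idx), rest):
        part = label_distribution([masks[i] for i in idx], n_classes)
        score += wasserstein_1d(part, full)
    return score
```

A genetic algorithm in the WDES spirit would mutate and recombine candidate `split_idx` assignments and keep those minimizing this score.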
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
Xiaolong Wang · Lixiang Ru · Ziyuan Huang · Kaixiang Ji · DanDan Zheng · Jingdong Chen · Jun Zhou
We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
Segment Anything Model Meets Semi-supervised Medical Image Segmentation: A Novel Perspective
Haifeng Zhao · Haiyang Li · Lei-Lei Ma · Dengdi Sun
The scarcity of annotated medical imaging data has driven significant progress in semi-supervised learning to alleviate reliance on expensive expert labeling. While foundational vision models such as the Segment Anything Model (SAM) exhibit robust generalization in generic segmentation tasks, their direct application to medical images often results in suboptimal performance. To address this challenge, in this work, we propose a novel fully SAM-based semi-supervised medical image segmentation framework and develop the corresponding knowledge distillation-based learning strategy. Specifically, we first employ an efficient SAM variant as the backbone network of the semi‑supervised framework and update the default prompt embedding of SAM to unleash its full potential. Then, we utilize an original SAM, which is rich in prior knowledge, as the teacher to optimize our efficient student SAM backbone through hierarchical knowledge distillation and a dynamic loss weighting strategy. Extensive experiments on various medical datasets demonstrate that our method outperforms state-of-the-art semi-supervised segmentation approaches. Notably, our model requires less than 10% of the parameter size of the original SAM, enabling substantially lower deployment and storage overhead in real-world clinical settings.
Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention
Haijing Liu · Zhiyuan Song · Hefeng Wu · Tao Pu · Keze Wang · Liang Lin
Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection
Zhongyu Xia · Jishuo Li · Zhiwei Lin · Xinhao Wang · Yongtao Wang · Ming-Hsuan Yang
Open-world perception aims to develop a model that adapts to novel domains and various sensor configurations, and that understands uncommon objects and corner cases. However, current research lacks sufficiently comprehensive open-world 3D perception benchmarks and robust generalizable methodologies. This paper introduces OpenAD, the first real open-world autonomous driving benchmark for 3D object detection. OpenAD is built upon a corner case discovery and annotation pipeline that integrates with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various open-world and specialized 2D and 3D models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. We host an online challenge on EvalAI. Data, toolkit codes, and evaluation codes are available at https://github.com/VDIGPKU/OpenAD.
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
Huaying Yuan · Jian Ni · Zheng Liu · Yueze Wang · Junjie Zhou · Zhengyang Liang · Bo Zhao · Zhao Cao · Ji-Rong Wen · Zhicheng Dou
Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on the end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1,200 seconds in duration, and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios in three levels: global-level, event-level, and object-level, covering common tasks like action recognition, object localization, causal reasoning, etc. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker to facilitate future research in this area.
ReinAD: Towards Real-world Industrial Anomaly Detection with a Comprehensive Contrastive Dataset
Xu Wang · Jingyuan Zhuo · Zhiyuan You · Zhiyu Tan · Yikuan Yu · Siyu Wang · Xinyi Le
Recent years have witnessed significant advancements in industrial anomaly detection (IAD) thanks to existing anomaly detection datasets. However, the large performance gap between these benchmarks and real industrial practice reveals critical limitations in existing datasets. We argue that the mismatch between current datasets and real industrial scenarios has become the primary barrier to practical IAD deployment. To this end, we propose the ReinAD dataset, a comprehensive contrastive dataset towards Real-world industrial Anomaly Detection. Our dataset prioritizes three critical real-world requirements: 1) contrast-based anomaly definition that is essential for industrial practice, 2) fine-grained unaligned image pairs reflecting real inspections, and 3) large-scale data from active production lines spanning multiple industrial categories. Based on our dataset, we introduce ReinADNet. It takes both normal reference and test images as inputs, achieving anomaly detection through normal-anomaly comparison. To address the fine-grained and unaligned properties of real industrial scenes, our method integrates pyramidal similarity aggregation for comprehensive anomaly characterization and global-local feature fusion for spatial misalignment tolerance. Our method outperforms all baselines on the ReinAD dataset (e.g., 64.5% vs. 59.5% in 1-shot image-level AP) under all settings. Extensive experiments across several datasets demonstrate our dataset's challenging nature and our method's superior generalization. This work provides a solid foundation for practical industrial anomaly detection. Dataset and code are available at https://tocmac.github.io/ReinAD.
SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval
Jiahang Zhang · Lilang Lin · Shuai Yang · Jiaying Liu
3D human motion-text retrieval is essential for accurate motion understanding, targeted at cross-modal alignment learning. Existing methods typically align the global motion-text concepts directly, suffering from sub-optimal generalization due to the uncertainty of correspondence learning between multiple motion concepts coupled in a single motion/text sequence. Therefore, we study explicit fine-grained concept decomposition for alignment learning and present a novel framework, Structural Generative Augmentation for 3D Human Motion Retrieval (SGAR), to enable generation-augmented retrieval. Specifically, relying on the strong priors of existing large language model (LLM) assets, we effectively decompose human motions structurally into subtler semantic units, i.e., body parts, for fine-grained motion modeling. Based on this, we develop part-mixture learning to better decouple the local motion concept learning, boosting part-level alignment. Moreover, a directional relation alignment strategy exploiting the correspondence between full-body and part motions is incorporated to regularize the feature manifold for better consistency. Extensive experiments on three benchmarks, including motion-text retrieval as well as recognition and generation applications, demonstrate the superior performance and promising transferability of our method.
Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning
Xiaomeng Fan · Yuchuan Mao · Zhi Gao · Yuwei Wu · Jin Chen · Yunde Jia
Open-vocabulary learning requires modeling the data distribution in open environments, which consists of both seen-class and unseen-class data. Existing methods estimate the distribution in open environments using seen-class data, where the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for distribution estimation to bound the estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method, which generates unseen-class data for estimating the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The data generation pipeline generates unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on 11 datasets demonstrate that our method outperforms baseline approaches by up to 14%, highlighting its effectiveness and superiority.
Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts
Chen Li · Huiying Xu · Changxin Gao · Zeyu Wang · Yun Liu · Xinzhong Zhu
Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain‑specific knowledge, models tend to fall into the trap of spurious correlations. This manifests as the model's over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9–31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.
Collaborative and Confidential Junction Trees for Hybrid Bayesian Networks
Roberto Gheda · Abele Mălan · Thiago Guzella · Carlo Lancia · Robert Birke · Lydia Chen
Bayesian Network models are a powerful tool to collaboratively optimize production processes in various manufacturing industries. When interacting, collaborating parties must preserve their business secrets by maintaining the confidentiality of their model structures and parameters. While most realistic industry scenarios involve hybrid settings, handling both discrete and continuous data, current state-of-the-art methods for collaborative and confidential inference only support discrete data and have high communication costs. In a centralized setting, Junction Trees enable efficient inference even in hybrid scenarios without discretizing continuous variables, but no extension for collaborative and confidential scenarios exists. To address this research gap, we introduce Hybrid CCJT, the first framework for confidential multiparty inference in hybrid domains with semi-honest, non-colluding adversaries, comprising: (i) a method to construct a strongly-rooted Junction Tree across collaborating parties through a novel construct of interface cliques; and, (ii) a protocol for confidential inference built upon multiparty computation primitives comprising a one-time alignment phase and a belief propagation system for combining the inference results across the Junction Tree cliques. Extensive evaluation on nine datasets shows that Hybrid CCJT improves the predictive accuracy of continuous target variables by 32% on average compared to the state-of-the-art, while reducing communication costs by a median 10.4x under purely discrete scenarios.
DAMamba: Vision State Space Model with Dynamic Adaptive Scan
Tanzhe Li · Caoshuo Li · Jiayi Lyu · Hongjuan Pei · Baochang Zhang · Taisong Jin · Rongrong Ji
State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms popular vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state-of-the-art CNNs and ViTs.
Learning Dense Hand Contact Estimation from Imbalanced Data
Daniel Jung · Kyoung Mu Lee
Hands are essential to human interaction, and exploring contact between hands and the world can promote comprehensive understanding of their function. Recently, a growing number of hand interaction datasets have emerged that cover interaction with objects, the other hand, scenes, and the body. Despite the significance of the task and the increasing availability of high-quality data, how to effectively learn dense hand contact estimation remains largely underexplored. There are two major challenges for learning dense hand contact estimation. First, hand contact datasets suffer from a class imbalance issue, where the majority of regions are not in contact. Second, hand contact datasets exhibit a spatial imbalance issue, with most hand contact occurring at the fingertips, which makes it challenging to generalize to contacts in other hand regions. To tackle these issues, we present a framework that learns dense HAnd COntact estimation (HACO) from imbalanced data. To resolve the class imbalance issue, we introduce balanced contact sampling, which builds and samples from multiple sampling groups that fairly represent diverse contact statistics for both contact and non-contact vertices. Moreover, to address the spatial imbalance issue, we propose a vertex-level class-balanced (VCB) loss, which incorporates the spatially varying contact distribution by separately reweighting the loss contribution of each vertex based on its contact frequency across the dataset. As a result, we effectively learn to predict dense hand contact estimation with large-scale hand contact data without suffering from class and spatial imbalance issues. The codes are available at https://github.com/dqj5182/HACO_RELEASE.
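The abstract describes the VCB loss only as reweighting each vertex's loss by its contact frequency across the dataset; a minimal inverse-frequency sketch of that idea (the weighting form and all names are illustrative assumptions, not the authors' implementation) might look like:

```python
import numpy as np

def vertex_balanced_bce(pred, target, contact_freq, eps=1e-7):
    # pred, target: per-vertex contact probability and ground-truth label
    # contact_freq: fraction of training samples in which each vertex
    # is in contact (a dataset-level statistic)
    pred = np.clip(pred, eps, 1 - eps)
    # Rarely contacted vertices get a larger positive weight, so their
    # scarce positive labels are not drowned out by fingertip contacts.
    w_pos = 1.0 / (contact_freq + eps)
    w_neg = 1.0 / (1.0 - contact_freq + eps)
    loss = -(w_pos * target * np.log(pred)
             + w_neg * (1.0 - target) * np.log(1.0 - pred))
    return loss.mean()
```

Under this sketch, a missed contact on a rarely-contacted vertex contributes far more to the loss than the same error at a fingertip, counteracting the spatial imbalance the paper describes.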
VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Wenhao Li · Qiangchang Wang · Xianjing Meng · Zhibin Wu · Yilong Yin
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information (e.g., class descriptions) or designing complex semantic fusion modules. However, these methods still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment mechanism. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
Xudong Yan · Songhe Feng
Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT.
What We Miss Matters: Learning from the Overlooked in Point Cloud Transformers
Yi Wang · Jiaze Wang · Ziyu Guo · Renrui Zhang · Donghao Zhou · Guangyong Chen · Anfeng Liu · Pheng-Ann Heng
Point Cloud Transformers have become a cornerstone in 3D representation for their ability to model long-range dependencies via self-attention. However, these models tend to overemphasize salient regions while neglecting other informative regions, which limits feature diversity and compromises robustness. To address this challenge, we introduce BlindFormer, a novel contrastive attention learning framework that redefines saliency by explicitly incorporating features typically neglected by the model. The proposed Attentional Blindspot Mining (ABM) suppresses highly attended regions during training, thereby guiding the model to explore its own blind spots. This redirection of attention expands the model’s perceptual field and uncovers richer geometric cues. To consolidate these overlooked features, BlindFormer employs Blindspot-Aware Joint Optimization (BJO), a joint learning objective that integrates blindspot feature alignment with the original pretext task. BJO enhances feature discrimination while preserving performance on the primary task, leading to more robust and generalizable representations. We validate BlindFormer on several challenging benchmarks and demonstrate consistent performance gains across multiple Transformer backbones. Notably, it improves Point-MAE by +13.4% and PointGPT-S by +6.3% on OBJ-BG under Gaussian noise. These results highlight the importance of mitigating attentional biases in 3D representation learning, revealing BlindFormer’s superior ability to handle perturbations and improve feature discrimination.
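The core suppression step of ABM can be illustrated with a minimal sketch (the function name, the top-k selection, and the renormalization scheme are illustrative assumptions, not the authors' implementation): zeroing out the most-attended positions and renormalizing forces training signal toward previously overlooked regions.

```python
import numpy as np

def suppress_blindspots(attn, k):
    # Zero out the k most-attended positions and renormalize, so the
    # model must draw on the regions it previously overlooked.
    attn = np.asarray(attn, dtype=float).copy()
    top = np.argsort(attn)[-k:]      # indices of the k largest weights
    attn[top] = 0.0
    s = attn.sum()
    return attn / s if s > 0 else attn
```

For example, suppressing the single most-attended position of `[0.5, 0.3, 0.1, 0.1]` redistributes mass to the remaining three positions.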
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Tom Burgert · Oliver Stoll · Paolo Rota · Begum Demir
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
Prompt-guided Disentangled Representation for Action Recognition
Tianci Wu · Guangming Zhu · Lu Jiang · Siyuan Wang · Ning Wang · Nuoye Xiong · Liang Zhang
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Extensive experiments on two complex video action datasets, Charades and SportsHHI, demonstrate the effectiveness of our approach against state-of-the-art methods. Our code can be found at https://github.com/iamsnaping/ProDA.git.
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding
Zhao Jin · Rong-Cheng Tu · Jingyi Liao · Wenhao Sun · Xiao Luo · Shunyu Liu · Dacheng Tao
3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER — a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this, anchor-guided candidate screening is conducted to perform a coarse-level localization of potential objects. Furthermore, leveraging retrieved relevant 2D camera images, 3D-2D joint decision-making is efficiently performed to determine the best-matching object. By bridging spatial and semantic reasoning streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving notable gains of $\mathbf{9.0\%}$ and $\mathbf{10.9\%}$ in accuracy.
PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
Zhiwei Yang · Chen Gao · Mike Zheng Shou
Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applied to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handling any scene and any anomaly type without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training or manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
Learning Skill-Attributes for Transferable Assessment in Video
Kumar Ashutosh · Kristen Grauman
Skill assessment from video entails rating the quality of a person’s physical performance and explaining what could be done better. Today’s models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes—such as balance, control, and hand positioning—whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., “lift hands more to generate more power” as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today’s multimodal large language models.
Geometry-Aware Collaborative Multi-Solutions Optimizer for Model Fine-Tuning with Parameter Efficiency
Van-Anh Nguyen · Trung Le · Mehrtash Harandi · Ehsan Abbasnejad · Thanh-Toan Do · Dinh Phung
We propose a framework grounded in gradient flow theory and informed by geometric structure that provides multiple diverse solutions for a given task, ensuring collaborative results that enhance performance and adaptability across different tasks. This framework enables flexibility, allowing for efficient task-specific fine-tuning while preserving the knowledge of the pre-trained foundation models. Extensive experiments across transfer learning, few-shot learning, and domain generalization show that our proposed approach consistently outperforms existing Bayesian methods, delivering strong performance with affordable computational overhead and offering a practical solution by updating only a small subset of parameters.
Epistemic Uncertainty Estimation in Regression Ensemble Models with Pairwise Epistemic Estimators
Lucas Berry · David Meger
This work introduces a novel approach, Pairwise Epistemic Estimators (PairEpEsts), for epistemic uncertainty estimation in ensemble models for regression tasks using pairwise-distance estimators (PaiDEs). By utilizing the pairwise distances between model components, PaiDEs establish bounds on entropy. We leverage this capability to enhance the performance of Bayesian Active Learning by Disagreement (BALD). Notably, unlike sample-based Monte Carlo estimators, PairEpEsts can estimate epistemic uncertainty up to 100 times faster and demonstrate superior performance in higher dimensions. To validate our approach, we conducted a varied series of regression experiments on commonly used benchmarks: 1D sinusoidal data, Pendulum, Hopper, Ant, and Humanoid, demonstrating PairEpEsts’ advantage over baselines in high-dimensional regression active learning.
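The pairwise-distance idea can be sketched for an equally weighted ensemble of 1-D Gaussian predictive heads (the function names and the use of KL divergence as the pairwise distance are illustrative assumptions; the paper's estimators and entropy bounds are more general). The quantity below is zero when all ensemble members agree and grows with their disagreement, which is exactly the epistemic signal used to drive active learning.

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    # KL divergence between two 1-D Gaussians.
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def pairwise_epistemic(mus, varis):
    # Pairwise-distance estimate of epistemic uncertainty for an equally
    # weighted Gaussian ensemble:
    #   -sum_i w_i log sum_j w_j exp(-KL(p_i || p_j)),  w_i = 1/n.
    # No Monte Carlo sampling is needed: only n^2 closed-form divergences.
    n = len(mus)
    w = 1.0 / n
    total = 0.0
    for i in range(n):
        inner = sum(w * np.exp(-kl_gauss(mus[i], varis[i], mus[j], varis[j]))
                    for j in range(n))
        total += w * np.log(inner)
    return -total
```

An ensemble whose members all predict N(0, 1) yields zero epistemic uncertainty, while members predicting N(0, 1) and N(3, 1) yield a strictly positive value.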
Multilevel neural simulation-based inference
Yuga Hikida · Ayush Bharti · Niall Jeffrey · Francois-Xavier Briol
Neural simulation-based inference (SBI) is a popular set of methods for Bayesian inference when models are only available in the form of a simulator. These methods are widely used in the sciences and engineering, where writing down a likelihood can be significantly more challenging than constructing a simulator. However, the performance of neural SBI can suffer when simulators are computationally expensive, thereby limiting the number of simulations that can be performed. In this paper, we propose a novel approach to neural SBI which leverages multilevel Monte Carlo techniques for settings where several simulators of varying cost and fidelity are available. We demonstrate through both theoretical analysis and extensive experiments that our method can significantly enhance the accuracy of SBI methods given a fixed computational budget.
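The multilevel Monte Carlo backbone of this idea can be illustrated with a toy telescoping estimator (the simulator interface here is a made-up assumption: each fidelity level is a function of shared random inputs, which provides the coupling between levels). The high-fidelity expectation is written as a cheap low-fidelity mean plus small corrections, so most samples go where simulation is inexpensive.

```python
import numpy as np

def mlmc_estimate(levels, n_per_level, rng=None):
    # Multilevel Monte Carlo: E[f_L] = E[f_0] + sum_l E[f_l - f_{l-1}].
    # `levels` is a list of simulators (cheapest first), each a function
    # of the same kind of random input; `n_per_level` allocates samples.
    rng = np.random.default_rng(rng)
    # Level 0: plain Monte Carlo with the cheapest simulator.
    u = rng.standard_normal(n_per_level[0])
    est = levels[0](u).mean()
    # Higher levels: estimate the correction E[f_l - f_{l-1}] using the
    # SAME random inputs for both fidelities (the coupling), so the
    # correction has low variance and needs few expensive samples.
    for l in range(1, len(levels)):
        u = rng.standard_normal(n_per_level[l])
        est += (levels[l](u) - levels[l - 1](u)).mean()
    return est
```

With a low-fidelity simulator `u -> u` and a high-fidelity one `u -> u + 1`, 10,000 cheap samples plus only 100 coupled corrections already recover the high-fidelity mean of 1 accurately.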
Distance-informed Neural Processes
Aishwarya Venkataramanan · Joachim Denzler
We propose the Distance-informed Neural Process (DNP), a novel variant of Neural Processes that improves uncertainty estimation by combining global and distance-aware local latent structures. Standard Neural Processes (NPs) often rely on a global latent variable and struggle with uncertainty calibration and capturing local data dependencies. DNP addresses these limitations by introducing a global latent variable to model task-level variations and a local latent variable to capture input similarity within a distance-preserving latent space. This is achieved through bi-Lipschitz regularization, which bounds distortions in input relationships and encourages the preservation of relative distances in the latent space. This modeling approach allows DNP to produce better-calibrated uncertainty estimates and more effectively distinguish in- from out-of-distribution data. Empirical results demonstrate that DNP achieves strong predictive performance and improved uncertainty calibration across regression and classification tasks.
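The bi-Lipschitz property can be sketched as a pairwise penalty on encoder distances (the hinge form, constants, and function name are illustrative assumptions; the paper enforces the property through regularization of the network, not necessarily this exact loss). A mapping is bi-Lipschitz when latent distances neither explode nor collapse relative to input distances.

```python
import numpy as np

def bilipschitz_penalty(x, z, l_low=0.5, l_high=2.0):
    # Penalize pairwise distance ratios d(z_i, z_j) / d(x_i, x_j) that
    # fall outside [l_low, l_high], encouraging the encoder to preserve
    # relative distances in the latent space.
    n = len(x)
    pen, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = np.linalg.norm(x[i] - x[j])
            dz = np.linalg.norm(z[i] - z[j])
            if dx < 1e-12:
                continue  # identical inputs carry no distance constraint
            r = dz / dx
            pen += max(0.0, r - l_high) + max(0.0, l_low - r)
            count += 1
    return pen / max(count, 1)
```

An identity encoding incurs zero penalty, whereas a collapsed latent space (all points mapped to the origin) is penalized, since collapsed latents make distinct inputs indistinguishable and hurt out-of-distribution detection.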
Temporal-Difference Variational Continual Learning
Luckeciano Carvalho Melo · Alessandro Abate · Yarin Gal
Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. In the Bayesian CL literature, variational methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution while constraining it to stay close to its previous estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. Experiments on challenging CL benchmarks show that our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
Optimal Adjustment Sets for Nonparametric Estimation of Weighted Controlled Direct Effect
Ruiyang Lin · Yongyi Guo · Kyra Gan
The weighted controlled direct effect (WCDE) generalizes the standard controlled direct effect (CDE) by averaging over the mediator distribution, providing a robust estimate when treatment effects vary across mediator levels. This makes the WCDE especially relevant in fairness analysis, where it isolates the direct effect of an exposure on an outcome, independent of mediating pathways. In this work, we first establish necessary and sufficient conditions for the unique identifiability of the WCDE, clarifying when it diverges from the CDE. Next, we derive the efficient influence function for the WCDE and consider the class of regular and asymptotically linear estimators. We characterize the optimal covariate adjustment set that minimizes asymptotic variance, demonstrating how mediator-confounder interactions introduce distinct requirements compared to average treatment effect estimation. Our results offer a principled framework for efficient estimation of direct effects in complex causal systems, with practical applications in fairness and mediation analysis.
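Concretely, with treatment $A$, mediator $M$, and outcome $Y$, the WCDE can be read schematically as the mediator-specific CDE averaged over the mediator distribution (this display is a schematic reading of the abstract; the paper's exact weighting may differ):

$$\mathrm{WCDE} = \sum_{m} P(M = m)\,\Big(\mathbb{E}\big[Y \mid \mathrm{do}(A{=}1, M{=}m)\big] - \mathbb{E}\big[Y \mid \mathrm{do}(A{=}0, M{=}m)\big]\Big),$$

whereas the standard CDE fixes a single mediator level $m$ rather than averaging over it.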
Synthetic-powered predictive inference
Meshi Bashari · Roy Maor Lotan · Yonghoon Lee · Edgar Dobriban · Yaniv Romano
Conformal prediction is a framework for predictive inference with a distribution-free, finite-sample guarantee. However, it tends to provide uninformative prediction sets when calibration data are scarce. This paper introduces Synthetic-powered predictive inference (SPI), a novel framework that incorporates synthetic data---e.g., from a generative model---to improve sample efficiency. At the core of our method is a score transporter: an empirical quantile mapping that aligns nonconformity scores from trusted, real data with those from synthetic data. By carefully integrating the score transporter into the calibration process, SPI provably achieves finite-sample coverage guarantees without making any assumptions about the real and synthetic data distributions. When the score distributions are well aligned, SPI yields substantially tighter and more informative prediction sets than standard conformal prediction. Experiments on image classification---augmenting data with synthetic diffusion-model generated images---and on tabular regression demonstrate notable improvements in predictive efficiency in data-scarce settings.
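At heart, the score transporter is an empirical quantile map between the two score samples; a minimal sketch follows (the function names are assumptions, and the finite-sample adjustments that give SPI its coverage guarantee are omitted). A synthetic score is sent to the real-score value at the same empirical quantile.

```python
import numpy as np

def score_transporter(real_scores, synth_scores):
    # Build an empirical quantile map (rank transport): a synthetic
    # nonconformity score is mapped to the real-score value sitting at
    # the same empirical quantile.
    real_sorted = np.sort(real_scores)
    synth_sorted = np.sort(synth_scores)

    def transport(s):
        # Empirical CDF of s under the synthetic score distribution.
        u = np.searchsorted(synth_sorted, s, side="right") / len(synth_sorted)
        # Corresponding order statistic among the real scores.
        idx = min(int(np.ceil(u * len(real_sorted))) - 1, len(real_sorted) - 1)
        return real_sorted[max(idx, 0)]

    return transport
```

For instance, with real scores {1, 2, 3, 4} and synthetic scores {10, 20, 30, 40}, the largest synthetic score maps to 4 and the median synthetic score to 2, so calibration can then proceed on the transported scores.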
Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models
Julius Vetter · Manuel Gloeckler · Daniel Gedon · Jakob H Macke
Simulation-based inference (SBI) offers a flexible and general approach to performing Bayesian inference: In SBI, a neural network is trained on synthetic data simulated from a model and used to rapidly infer posterior distributions for observed data. A key goal for SBI is to achieve accurate inference with as few simulations as possible, especially for expensive simulators. In this work, we address this challenge by repurposing recent probabilistic foundation models for tabular data: We show how tabular foundation models---specifically TabPFN---can be used as pre-trained autoregressive conditional density estimators for SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks (NPE-PFN) and show that it is competitive with current SBI approaches in terms of accuracy for both benchmark tasks and two complex scientific inverse problems. Crucially, it often substantially outperforms them in terms of simulation efficiency, sometimes requiring orders of magnitude fewer simulations. NPE-PFN eliminates the need for selecting and training an inference network and tuning its hyperparameters. We also show that it exhibits superior robustness to model misspecification and can be scaled to simulation budgets that exceed the context size limit of TabPFN. NPE-PFN provides a new direction for SBI, where training-free, general-purpose inference models offer efficient, easy-to-use, and flexible solutions for a wide range of stochastic inverse problems.
Solving and Learning Partial Differential Equations with Variational Q-Exponential Processes
Guangting Yu · Shiwei Lan
Solving and learning partial differential equations (PDEs) lies at the core of physics-informed machine learning. Traditional numerical methods, such as finite difference and finite element approaches, are rooted in domain-specific techniques and often lack scalability. Recent advances have introduced neural networks and Gaussian processes (GPs) as flexible tools for automating PDE solving and incorporating physical knowledge into learning frameworks. While GPs offer tractable predictive distributions and a principled probabilistic foundation, they may be suboptimal in capturing complex behaviors such as sharp transitions or non-smooth dynamics. To address this limitation, we propose the use of the Q-exponential process (Q-EP), a recently developed generalization of GPs designed to better handle data with abrupt changes and to more accurately model derivative information. We advocate for Q-EP as a superior alternative to GPs in solving PDEs and associated inverse problems. Leveraging sparse variational inference, our method enables principled uncertainty quantification -- a capability not naturally afforded by neural network-based approaches. Through a series of experiments, including the Eikonal equation, Burgers’ equation, and an inverse Darcy flow problem, we demonstrate that the variational Q-EP method consistently yields more accurate solutions while providing meaningful uncertainty estimates.
GOATex: Geometry & Occlusion-Aware Texturing
Hyunjin Kim · Kunho Kim · Adam Lee · Wonkwang Lee
We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture’s contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances.
One-Step Diffusion-Based Image Compression with Semantic Distillation
Naifu Xue · Zhaoyang Jia · Jiahao Li · Bin Li · Yuan Zhang · Yan Lu
While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasant latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20× faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/
HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance
Jue Gong · Tingyu Yang · Jingkai Wang · Zheng Chen · Xing Liu · Hong Gu · Yulun Zhang · Xiaokang Yang
Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive–negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: https://github.com/gobunu/HAODiff.
Zero-Shot Blind-Spot Image Denoising via Cross-Scale Non-Local Pixel Refilling
Qilong Guo · Tianjing Zhang · Zhiyuan Ma · Hui Ji
Blind-spot denoising (BSD) is a powerful paradigm for zero-shot image denoising that trains models to predict masked target pixels from their neighbors. However, BSD methods struggle with real-world noise exhibiting strong local correlations, where efforts to suppress noise correlation often weaken pixel-value dependencies, adversely affecting denoising performance. This paper presents a theoretical analysis quantifying the impact of replacing masked pixels with observations exhibiting weaker noise correlation but potentially reduced similarity, revealing a trade-off that impacts the statistical risk of the estimation. Guided by this insight, we propose a computational scheme that replaces masked pixels with distant ones of similar appearance and lower noise correlation. This strategy improves the prediction by balancing noise suppression and structural consistency. Experiments confirm the effectiveness of our method, outperforming existing zero-shot BSD methods.
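The refilling strategy can be sketched as a brute-force patch match (the function name, patch size, and Manhattan-distance threshold are illustrative assumptions; the paper's cross-scale search is more sophisticated): a masked pixel is replaced by the pixel whose surrounding patch looks most similar, searched only beyond a minimum distance so that the donor's noise is weakly correlated with the target's.

```python
import numpy as np

def refill_pixel(img, i, j, patch=1, min_dist=3):
    # Replace pixel (i, j) with the pixel whose surrounding patch is
    # most similar, restricting the search to candidates at Manhattan
    # distance >= min_dist to reduce noise correlation with the target.
    h, w = img.shape
    ref = img[i - patch:i + patch + 1, j - patch:j + patch + 1]
    best, best_cost = img[i, j], np.inf
    for a in range(patch, h - patch):
        for b in range(patch, w - patch):
            if abs(a - i) + abs(b - j) < min_dist:
                continue  # too close: noise likely correlated
            cand = img[a - patch:a + patch + 1, b - patch:b + patch + 1]
            cost = np.sum((cand - ref) ** 2)
            if cost < best_cost:
                best_cost, best = cost, img[a, b]
    return best
```

On a periodic texture (columns repeating with period 3), the masked pixel is refilled from a distant, structurally identical location, recovering the correct value despite the distance constraint.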
Can NeRFs "See" without Cameras?
Chaitanya Amballa · Yu-Lin Wei · Sattwik Basu · Zhijian Yang · Mehmet Ergezer · Romit Roy Choudhury
Neural Radiance Fields (NeRFs) have been remarkably successful at synthesizing novel views of 3D scenes by optimizing a volumetric scene function. This scene function models how optical rays bring color information from a 3D object to the camera pixels. Radio frequency (RF) or audio signals can also be viewed as a vehicle for delivering information about the environment to a sensor. However, unlike camera pixels, an RF/audio sensor receives a mixture of signals that contain many environmental reflections (also called “multipath”). Is it still possible to infer the environment using such multipath signals? We show that with redesign, NeRFs can be taught to learn from multipath signals, and thereby “see” the environment. As a grounding application, we aim to infer the indoor floorplan of a home from sparse WiFi measurements made at multiple locations inside the home. Although this is a difficult inverse problem, our implicitly learnt floorplans look promising and enable forward applications, such as indoor signal prediction and basic ray tracing.
DON’T NEED RETRAINING: A Mixture of DETR and Vision Foundation Models for Cross-Domain Few-Shot Object Detection
Changhan Liu · Xunzhi Xiang · Zixuan Duan · Wenbin Li · Qi Fan · Yang Gao
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to generalize to unseen domains by leveraging a few annotated samples of the target domain, requiring models to exhibit both strong generalization and localization capabilities. However, existing well-trained detectors typically have strong localization capabilities but lack generalization, whereas vision foundation models (VFMs) generally exhibit better generalization but lack accurate localization. In this paper, we propose a novel Mixture-of-Experts (MoE) structure that integrates the detector's localization capability with the VFM's generalization by using VFM features to improve detector features. Specifically, we propose an Expert-wise Router (ER) that selects the most relevant VFM experts for each backbone layer, and a Region-wise Router (RR) that emphasizes foreground and suppresses background. To bridge representation gaps, we further propose a Shared Expert Projection (SEP) module and a Private Expert Projection (PEP) module, which align VFM features to the detector feature space while decoupling shared image features from private image features in the VFM feature map. Finally, we propose an MoE module that transfers the VFM's generalization to the detector without altering the detector's original architecture. Furthermore, our method extends well-trained detectors to detect novel classes in unseen domains without re-training on the base classes. Experimental results on multiple cross-domain datasets validate the effectiveness of our method.
Looking Into the Water by Unsupervised Learning of the Surface Shape
Ori Lifschitz · Tali Treibitz · Dan Rosenbaum
We address the problem of looking into the water from the air, where we seek to remove image distortions caused by refractions at the water surface. Our approach is based on modeling the different water surface structures at various points in time, assuming the underlying image is constant. To this end, we propose a model that consists of two neural-field networks. The first network predicts the height of the water surface at each spatial position and time, and the second network predicts the image color at each position. Using both networks, we reconstruct the observed sequence of images and can therefore use unsupervised training. We show that using implicit neural representations with periodic activation functions (SIREN) leads to effective modeling of the surface height spatio-temporal signal and its derivative, as required for image reconstruction. Using both simulated and real data we show that our method outperforms the latest unsupervised image restoration approach. In addition, it provides an estimate of the water surface.
Autoregressive Motion Generation with Gaussian Mixture-Guided Latent Sampling
Linnan Tu · Lingwei Meng · Zongyi Li · Hefei Ling · Shijuan Huang
Existing efforts in motion synthesis typically utilize either generative transformers with discrete representations or diffusion models with continuous representations. However, the discretization process in generative transformers can introduce motion errors, while the sampling process in diffusion models tends to be slow. In this paper, we propose GMMotion, a novel text-to-motion synthesis method that combines a continuous motion representation with an autoregressive model, using a Gaussian mixture model (GMM) to represent the conditional probability distribution. Unlike autoregressive approaches that rely on residual vector quantization, our model employs continuous motion representations derived from the VAE's latent space. This choice streamlines both the training and the inference processes. Specifically, we utilize a causal transformer to learn the distributions of continuous motion representations, which are modeled with a learnable Gaussian mixture model. Extensive experiments demonstrate that our model surpasses existing state-of-the-art models in the motion synthesis task.
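The GMM output head described above can be sketched as follows. The function and variable names are hypothetical; in the actual model the causal transformer would predict the component logits, means, and log-stds at each step.

```python
import numpy as np

def sample_gmm(logits, means, log_stds, rng):
    """Draw one continuous latent from a Gaussian mixture head:
    softmax over component logits, then a Gaussian sample from the
    chosen component. Names are illustrative, not the paper's API."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)          # pick a mixture component
    eps = rng.standard_normal(means.shape[1])    # reparameterized noise
    return means[k] + np.exp(log_stds[k]) * eps

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.0, -1.0])    # a transformer would predict these
means = np.zeros((3, 8))               # 3 components over an 8-d latent
log_stds = np.full((3, 8), -1.0)
z = sample_gmm(logits, means, log_stds, rng)
print(z.shape)  # (8,)
```

Sampling one latent per autoregressive step avoids both the quantization error of discrete codebooks and the many denoising steps of diffusion sampling.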
Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis
Zeqin Yu · Haotao Xie · Jian Zhang · Jiangqun Ni · Wenkang Su · Jiwu Huang
Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation–parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. The dataset is available at \href{https://github.com/ZeqinYu/FSTS}{Project Page}.
Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale
James Michaelov · Roger Levy · Benjamin Bergen
We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98\% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
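Two of the three heuristics (unigram and $n$-gram probability) can be estimated by simple counting; below is a minimal sketch with a toy corpus standing in for the paper's training data.

```python
from collections import Counter

def ngram_scores(corpus, context, word):
    """Estimate two of the abstract's heuristics by counting: the unigram
    (frequency) probability of `word`, and its bigram probability given
    the last token of `context`. Whitespace tokenization is a toy
    simplification of a real LM tokenizer."""
    tokens = corpus.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_uni = unigrams[word] / len(tokens)
    prev = context.split()[-1]
    p_bi = bigrams[(prev, word)] / max(unigrams[prev], 1)
    return p_uni, p_bi

corpus = "the cat sat on the mat the cat ran"
p_uni, p_bi = ngram_scores(corpus, "the", "cat")
print(p_uni, p_bi)  # 2/9 and 2/3
```

The paper's finding is that, over pretraining, a model's word-level probabilities first track the unigram estimate, then progressively overfit to such $n$-gram estimates for increasing $n$.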
Constrained Feedback Learning for Non-Stationary Multi-Armed Bandits
Shaoang Li · Jian Li
Non-stationary multi-armed bandits (nsMAB) enable agents to adapt to changing environments by incorporating mechanisms to detect and respond to shifts in reward distributions, making them well-suited for dynamic settings. However, existing approaches typically assume that reward feedback is available at every round—an assumption that overlooks many real-world scenarios where feedback is limited. In this paper, we take a significant step forward by introducing a new model of *constrained feedback in non-stationary multi-armed bandits* (ConFee-nsMAB), where the availability of reward feedback is restricted. We propose the first prior-free algorithm—that is, one that does not require prior knowledge of the degree of non-stationarity—that achieves near-optimal dynamic regret in this setting. Specifically, our algorithm attains a dynamic regret of $\tilde {\mathcal{O}}({K^{1/3} V_T^{1/3} T }/{ B^{1/3}})$, where $T$ is the number of rounds, $K$ is the number of arms, $B$ is the query budget, and $V_T$ is the variation budget capturing the degree of non-stationarity.
Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
Jingmin Zhu · Anqi Zhu · Hossein Rahmani · Jun Liu · Mohammed Bennamoun · Qiuhong Ke
We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
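A minimal sketch of the kind of training-free cache lookup the abstract describes, in the style of cache-based adapters: cosine similarity between a query descriptor and stored keys, an exponential kernel, and per-class importance weights standing in for the LLM-assigned weights. All names, the kernel, and the uniform weights are illustrative assumptions, not Skeleton-Cache's actual implementation.

```python
import numpy as np

def cache_predict(query, cache_keys, cache_labels, class_weights, beta=5.0):
    """Training-free retrieval over a non-parametric cache: similarity
    to stored skeleton descriptors is turned into class logits, then
    reweighted per class (a stand-in for LLM-guided weights)."""
    q = query / np.linalg.norm(query)
    K = cache_keys / np.linalg.norm(cache_keys, axis=1, keepdims=True)
    affinity = np.exp(-beta * (1.0 - K @ q))               # (N,) kernelized sims
    onehot = np.eye(class_weights.shape[0])[cache_labels]  # (N, C)
    logits = affinity @ onehot                             # aggregate per class
    return logits * class_weights                          # semantic reweighting

rng = np.random.default_rng(0)
keys = rng.standard_normal((20, 64))       # cached descriptors
labels = rng.integers(0, 4, size=20)       # their (pseudo-)labels
weights = np.ones(4)                       # uniform stand-in weights
scores = cache_predict(keys[3], keys, labels, weights)
print(scores.argmax())
```

Because nothing here is trained, adapting to a new action set only requires refreshing the cache contents.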
Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng · Ming-Kun Xie · Hang Hao · Tong Jin · Sheng-Jun Huang
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes diverge from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: would the prediction remain the same if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating the intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
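The recombine-then-subtract idea can be sketched at the representation level as follows. The additive recombination of object and context embeddings and the single subtraction coefficient are simplifying assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

def debiased_scores(obj_feat, ctx_feats, bg_feat, text_feats, alpha=1.0):
    """Counterfactual calibration sketch: pair an object embedding with
    several alternative context embeddings, average the resulting class
    scores, then subtract the background-only activation (a crude
    stand-in for the Total Direct Effect). All names are illustrative."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    cf = norm(obj_feat[None, :] + ctx_feats)         # counterfactual embeddings
    cf_scores = (cf @ norm(text_feats).T).mean(0)    # average over contexts
    bg_scores = norm(bg_feat) @ norm(text_feats).T   # background-only effect
    return cf_scores - alpha * bg_scores

rng = np.random.default_rng(0)
obj = rng.standard_normal(512)            # object expectation (e.g., from CLIP)
ctxs = rng.standard_normal((5, 512))      # alternative contexts
bg = rng.standard_normal(512)             # background expectation
texts = rng.standard_normal((10, 512))    # class text embeddings
scores = debiased_scores(obj, ctxs, bg, texts)
print(scores.shape)  # (10,)
```

Everything here runs at inference time on fixed embeddings, which is what makes the approach retraining-free.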
State Entropy Regularization for Robust Reinforcement Learning
Yonatan Ashlag · Uri Koren · Mirco Mutti · Esther Derman · Pierre-Luc Bacon · Shie Mannor
State entropy regularization has empirically been shown to improve exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning
Zechen Wu · Amy Greenwald · Ronald Parr
In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference learning (TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving the same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and a characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand TD's convergence conditions, and prove, for the first time, that when a large learning rate doesn't work, trying a smaller one may help (for batch TD). Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.
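The unifying view can be illustrated on a toy linear system: the same stationary iteration solves A x = b, with a constant (identity) preconditioner playing the role of TD-style updates and P = A playing the role of a full fitted solve. The SPD matrix below is merely a stand-in, not the paper's OPE system.

```python
import numpy as np

def preconditioned_iteration(A, b, P, steps=400, alpha=1.0):
    """Stationary iteration x_{k+1} = x_k + alpha * P^{-1} (b - A x_k).
    Different choices of the preconditioner P recover, per the abstract,
    TD-like (constant P) versus (P)FQI-like (adaptive P) updates."""
    x = np.zeros_like(b)
    P_inv = np.linalg.inv(P)
    for _ in range(steps):
        x = x + alpha * P_inv @ (b - A @ x)
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)       # a small SPD system (illustrative only)
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)

alpha = 1.0 / np.linalg.eigvalsh(A).max()   # stable step size for constant P
x_td = preconditioned_iteration(A, b, np.eye(5), alpha=alpha)   # "TD-like"
x_fqi = preconditioned_iteration(A, b, A, steps=1)              # "FQI-like": one exact solve
print(np.allclose(x_td, x_star, atol=1e-8), np.allclose(x_fqi, x_star))  # True True
```

In the OPE setting A need not be symmetric or positive definite, which is exactly where the two preconditioners can disagree about convergence.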
Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning
Bastien Dubail · Stefan Stojanovic · Alexandre Proutiere
Low-rank structure is a common implicit assumption in many modern reinforcement learning (RL) algorithms. For instance, reward-free and goal-conditioned RL methods often presume that the successor measure admits a low-rank representation. In this work, we challenge this assumption by first remarking that the successor measure itself is not approximately low-rank. Instead, we demonstrate that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions. We provide finite-sample performance guarantees for the entry-wise estimation of a low-rank approximation of the shifted successor measure from sampled entries. Our analysis reveals that both the approximation and estimation errors are primarily governed by a newly introduced quantity: the spectral recoverability of the corresponding matrix. To bound this parameter, we derive a new class of functional inequalities for Markov chains that we call Type II Poincaré inequalities and from which we can quantify the amount of shift needed for effective low-rank approximation and estimation. This analysis shows in particular that the required shift depends on the decay of the high-order singular values of the shifted successor measure and is hence typically small in practice. Additionally, we establish a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift. Finally, we validate our theoretical findings with experiments, and demonstrate that shifting the successor measure indeed leads to improved performance in goal-conditioned RL.
When Can Model-Free Reinforcement Learning be Enough for Thinking?
Josiah Hanna · Nicholas Corrado
Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to such "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.
Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior
Ruoyu Feng · Yunpeng Qi · Jinming Liu · Yixin Gao · Xin Li · Xin Jin · Zhibo Chen
Image compression methods are usually optimized in isolation for either human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework that harmonizes machine and human vision. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of a Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model's generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving a high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH's superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception.
Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era
Feng Lu · Tong Jin · Canming Ye · Xiangyuan Lan · Yunpeng Liu · Chun Yuan
Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
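The implicit-aggregation idea (prepend learnable tokens, let self-attention pool information into them, read them back out) can be sketched with a single attention step. Identity Q/K/V projections and the token counts below are illustrative simplifications of a real transformer block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def implicit_aggregation(patch_tokens, agg_tokens):
    """Single-head self-attention over [aggregation tokens; patch tokens]:
    the prepended learnable tokens attend to every patch token, implicitly
    pooling useful information without a dedicated aggregator module.
    Identity Q/K/V projections are used for brevity."""
    x = np.concatenate([agg_tokens, patch_tokens], axis=0)   # (A+P, D)
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))            # (A+P, A+P)
    out = attn @ x
    a = agg_tokens.shape[0]
    return out[:a].reshape(-1)    # concatenate agg outputs -> global descriptor

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 32))   # e.g. 14x14 patch tokens
agg = rng.standard_normal((4, 32))         # 4 learnable aggregation tokens
desc = implicit_aggregation(patches, agg)
print(desc.shape)  # (128,)
```

The descriptor dimension is simply (number of aggregation tokens) x (token width), so capacity is tuned by how many tokens are prepended and where they are inserted.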
Unified Algorithms for RL with Decision-Estimation Coefficients: PAC, Reward-Free, Preference-Based Learning, and Beyond
Fan Chen · Song Mei · Yu Bai
Modern Reinforcement Learning (RL) is more than just learning the optimal policy; alternative learning goals such as exploring the environment, estimating the underlying model, and learning from preference feedback are all of practical importance. While provably sample-efficient algorithms for each specific goal have been proposed, these algorithms often depend strongly on the particular learning goal and thus admit correspondingly different structures. It is a pressing open question whether these learning goals can instead be tackled by a single unified algorithm. We make progress on this question by developing a unified algorithm framework for a large class of learning goals, building on the Decision-Estimation Coefficient (DEC) framework. Our framework handles many learning goals such as no-regret RL, PAC RL, reward-free learning, model estimation, and preference-based learning, all by simply instantiating the same generic complexity measure called the "Generalized DEC", and a corresponding generic algorithm. The generalized DEC also yields a sample complexity lower bound for each specific learning goal. As applications, we propose "decouplable representation" as a natural sufficient condition for bounding generalized DECs, and use it to obtain many new sample-efficient results (and recover existing results) for a wide range of learning goals and problem classes as direct corollaries. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy sample complexity bounds under similar structural conditions as the DEC.
Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition
Fan LIU · Jindong Han · Tengfei Lyu · Weijia Zhang · Zherui Yang · Lu Dai · Cancheng Liu · Hao Liu
Foundation models (FMs), such as GPT-4 and AlphaFold, are reshaping the landscape of scientific research. Beyond accelerating tasks such as hypothesis generation, experimental design, and result interpretation, they prompt a more fundamental question: Are FMs merely enhancing existing scientific methodologies, or are they redefining the way science is conducted? In this paper, we argue that FMs are catalyzing a transition toward a new scientific paradigm. We introduce a three-stage framework to describe this evolution: (1) Meta-Scientific Integration, where FMs enhance workflows within traditional paradigms; (2) Hybrid Human-AI Co-Creation, where FMs become active collaborators in problem formulation, reasoning, and discovery; and (3) Autonomous Scientific Discovery, where FMs operate as independent agents capable of generating new scientific knowledge with minimal human intervention. Through this lens, we review current applications and emerging capabilities of FMs across existing scientific paradigms. We further identify risks and future directions for FM-enabled scientific discovery. This position paper aims to support the scientific community in understanding the transformative role of FMs and to foster reflection on the future of scientific discovery.
Position: Require Frontier AI Labs To Release Small "Analog" Models
Shriyash Upadhyay · Philip Quirke · Narmeen Oozeer · Chaithanya Bandi
Recent proposals for regulating frontier AI models have sparked concerns about the cost of safety regulation, and most such regulations have been shelved due to the safety-innovation tradeoff. This paper argues for an alternative regulatory approach that ensures AI safety while actively \textit{promoting} innovation: mandating that large AI laboratories release small, openly accessible "analog models"—scaled-down versions trained similarly to and distilled from their largest proprietary models. Analog models serve as public proxies, allowing broad participation in safety verification, interpretability research, and algorithmic transparency without forcing labs to disclose their full-scale models. Recent research demonstrates that safety and interpretability methods developed using these smaller models generalize effectively to frontier-scale systems. By enabling the wider research community to directly investigate and innovate upon accessible analogs, our policy substantially reduces the regulatory burden and accelerates safety advancements. This mandate promises minimal additional costs, leveraging reusable resources like data and infrastructure, while significantly contributing to the public good. Our hope is not only that this policy be adopted, but that it illustrates a broader principle supporting fundamental research in machine learning: deeper understanding of models relaxes the safety-innovation tradeoff and lets us have more of both.
IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios
Yifan Li · Yuhang Chen · Anh Dao · Lichi Li · Zhongyi Cai · Zhen Tan · Tianlong Chen · Yu Kong
Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical industrial warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. In addition, it provides an extra reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouse scenarios and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.
DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding
Zixuan Liu · Siavash H. Khajavi · Guangkai Jiang
Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce $\textbf{DetectiumFire}$, a large-scale, multi-modal dataset comprising 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community.
DyFlow: Dynamic Workflow Framework for Agentic Reasoning
Yanbo Wang · Zixiang Xu · Yue Huang · Xiangqi Wang · Zirui Song · Lang Gao · Chenxi Wang · Robert Tang · Yue Zhao · Arman Cohan · Xiangliang Zhang · Xiuying Chen
Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed processes, which limits their adaptability across different tasks. While a few methods attempt automated workflow generation, they are often tied to specific datasets or query types and make limited use of intermediate feedback, reducing system robustness and reasoning depth. Moreover, their operations are typically predefined and inflexible. To address these limitations, we propose DyFlow, a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback, thereby enhancing cross-task generalization. DyFlow consists of two core components: a designer and an executor. The designer decomposes complex problems into a sequence of sub-goals defined by high-level objectives and dynamically plans the next steps based on intermediate outputs and feedback. These plans are then carried out by the executor, which executes each operation using dynamic operators with context-aware parameterization, enabling flexible and semantically grounded reasoning. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains.
CF-VLM:CounterFactual Vision-Language Fine-tuning
jusheng zhang · Kaitong Cai · Yijia Fan · Jian Wang · Keze Wang
Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose the CounterFactual Vision-Language Fine-tuning Model (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model's sensitivity to minimal but critical causal edits. Extensive experiments demonstrate that CF-VLM consistently outperforms strong baselines and state-of-the-art methods on compositional reasoning and generalization benchmarks. Furthermore, it shows promise in mitigating visual hallucinations, indicating improved factual consistency. Our CF-VLM provides a robust foundation for deploying VLMs in high-stakes, real-world scenarios requiring reliable reasoning and interpretability.
Adaptive Quantization in Generative Flow Networks for Probabilistic Sequential Prediction
Nadhir Hassen · Zhen Zhang · Johan Verjans
Probabilistic time series forecasting, essential in domains like healthcare and neuroscience, requires models capable of capturing uncertainty and intricate temporal dependencies. While deep learning has advanced forecasting, generating calibrated probability distributions over continuous future values remains challenging. We introduce Temporal Generative Flow Networks (Temporal GFNs), adapting Generative Flow Networks (GFNs) – a powerful framework for generating compositional objects – to this sequential prediction task. GFNs learn policies to construct objects (e.g., forecast trajectories) step by step, sampling final objects proportionally to a reward signal. However, applying GFNs directly to continuous time series necessitates addressing their inherently discrete action spaces and ensuring differentiability. Our framework tackles this by representing time series segments as states and sequentially generating future values via quantized actions chosen by a forward policy. We introduce two key innovations: (1) An adaptive, curriculum-based quantization strategy that dynamically adjusts the number of discretization bins based on reward improvement and policy entropy, balancing precision and exploration throughout training. (2) A straight-through estimator mechanism enabling the forward policy to output both discrete (hard) samples for trajectory construction and continuous (soft) samples for stable gradient propagation. Training utilizes a trajectory balance loss objective, ensuring flow consistency, augmented by an entropy regularizer. We provide rigorous theoretical bounds on the quantization error's impact and the adaptive factor's range. We demonstrate how Temporal GFNs offer a principled way to leverage the structured generation capabilities of GFNs for probabilistic forecasting in continuous domains.
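The hard/soft sampling machinery can be sketched as follows. The value range and bin layout are illustrative, and the NumPy rendering of the straight-through trick only reproduces the forward value; a real implementation would rely on an autograd framework's detach/stop-gradient.

```python
import numpy as np

def quantize(soft, n_bins, lo=-1.0, hi=1.0):
    """Map a continuous value in [lo, hi] to the center of one of
    n_bins uniform bins: the 'hard' sample used to build trajectories.
    The adaptive scheme in the abstract would adjust n_bins during
    training based on reward improvement and policy entropy."""
    soft = np.clip(soft, lo, hi)
    width = (hi - lo) / n_bins
    idx = np.minimum(((soft - lo) / width).astype(int), n_bins - 1)
    return lo + (idx + 0.5) * width

def straight_through(soft, n_bins):
    """Straight-through estimator: forward pass emits the hard value,
    gradient flows through `soft` unchanged. With autograd this reads
    `soft + (quantize(soft, n) - soft).detach()`; here only the forward
    value is reproduced."""
    hard = quantize(soft, n_bins)
    return soft + (hard - soft)   # equals `hard` in value

x = np.array([-0.93, 0.0, 0.41, 0.99])
print(quantize(x, 4))  # [-0.75  0.25  0.25  0.75]
```

Doubling `n_bins` as training progresses is one way to realize the curriculum: coarse, exploratory actions early, precise forecasts later.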
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
Zidi Xiong · Shan Chen · Zhenting Qi · Himabindu Lakkaraju
Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multi-path Chain-of-Thought explorations before producing final answers. Ensuring the faithfulness of these intermediate reasoning processes is crucial for reliable monitoring, interpretation, and effective control. In this paper, we propose a systematic counterfactual intervention framework to rigorously evaluate thinking draft faithfulness. Our approach focuses on two complementary dimensions: (1) Intra-Draft Faithfulness, which assesses whether individual reasoning steps causally influence subsequent steps and the final draft conclusion through counterfactual step insertions; and (2) Draft-to-Answer Faithfulness, which evaluates whether final answers are logically consistent with and dependent on the thinking draft, by perturbing the draft’s concluding logic. We conduct extensive experiments across six state-of-the-art LRMs. Our findings show that current LRMs demonstrate selective faithfulness to intermediate reasoning steps and frequently fail to faithfully align with the draft conclusions. These results underscore the need for more faithful and interpretable reasoning in advanced LRMs.
Beyond Least Squares: Uniform Approximation and the Hidden Cost of Misspecification
Davide Maran · Csaba Szepesvari
We study the problem of controlling worst-case errors in misspecified linear regression under the random design setting, where the regression function is estimated via (penalized) least-squares. This setting arises naturally in value function approximation for bandit algorithms and reinforcement learning (RL). Our first main contribution is the observation that the amplification of the misspecification error when using least-squares is governed by the \emph{Lebesgue constant}, a classical quantity from approximation theory that depends on the choice of the feature subspace and the covariate distribution. We also show that this dependence on the misspecification error is tight for least-squares regression: in general, no method minimizing the empirical squared loss, including regularized least-squares, can improve it substantially. We argue this explains the empirical observation that some feature-maps (e.g., those derived from the Fourier bases) ``work better in RL'' than others (e.g., polynomials): given some covariate distribution, the Lebesgue constant is known to be highly sensitive to the choice of feature-map. As a second contribution, we propose a method that augments the original feature set with auxiliary features designed to reduce the error amplification. We then prove that the method successfully competes with an ``oracle'' that knows the best way of using the auxiliary features to reduce this amplification. For example, when the domain is a real interval and the features are monomials, our method reduces the amplification factor to $O(1)$ as $d\to\infty$, while without our method, least-squares with the monomials (and in fact polynomials) will suffer a worst-case error amplification of order $\Omega(d)$. It follows that there are functions and feature maps for which our method is consistent, while least-squares is inconsistent.
VERA: Variational Inference Framework for Jailbreaking Large Language Models
Anamika Lochab · Lu Yan · Patrick Pynadath · Xiangyu Zhang · Ruqi Zhang
The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM’s posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
Unveiling Extraneous Sampling Bias with Data Missing-Not-At-Random
Chunyuan Zheng · Haocheng Yang · Haoxuan Li · Mengyue Yang
Selection bias poses a widely recognized challenge for unbiased evaluation and learning in many industrial scenarios. For example, in recommender systems, it arises from the users' selective interactions with items. Recently, the doubly robust (DR) estimator and its variants have been widely studied to achieve debiased learning of prediction models; however, all of them consider a simple exact matching scenario, i.e., the units (such as user-item pairs in a recommender system) are the same between the training and test sets. In practice, there may be limited or even no overlap in units between the training and test sets. In this paper, we consider a more practical scenario: the joint distribution of the feature and rating is the same in the training and test sets. Theoretical analysis shows that the previous DR estimator is biased in this scenario even if the imputed errors and learned propensities are correct. In addition, we propose a novel super-population doubly robust estimator (SuperDR), which achieves more accurate estimation and a more desirable generalization error bound than existing DR estimators, and we extend the joint learning algorithm for training the prediction and imputation models. We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to show the effectiveness of our method. The code is available at https://github.com/ChunyuanZheng/neurips-25-SuperDR.
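The classical DR estimator this abstract builds on combines an imputation model with inverse-propensity correction on observed entries. A minimal numpy sketch of that baseline (notation mine; this is the standard estimator, not the paper's SuperDR extension):

```python
import numpy as np

def dr_estimate(e, e_hat, o, p_hat):
    """Classical doubly robust estimate of the mean prediction error.

    e     : true errors (only observable where o == 1)
    e_hat : imputed errors from an imputation model
    o     : 0/1 observation indicators
    p_hat : learned propensities, p_hat[i] ~ P(o[i] == 1)
    """
    return np.mean(e_hat + o * (e - e_hat) / p_hat)

rng = np.random.default_rng(0)
n = 1000
e = rng.random(n)            # true errors
e_hat = e + 0.1              # deliberately biased imputation
p = np.full(n, 0.3)          # true propensities
o = (rng.random(n) < p).astype(float)

# With correct propensities, DR stays close to the target mean even
# though the imputation model is biased.
print(dr_estimate(e, e_hat, o, p), e.mean())
```

The paper's point is that this estimator is analyzed under exact unit matching between training and test; SuperDR targets the super-population setting where only the joint feature-rating distribution is shared.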
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Guo Chen · Zhiqi Li · Shihao Wang · Jindong Jiang · Yicheng Liu · Lidong Lu · De-An Huang · Wonmin Byeon · Matthieu Le · Max Ehrlich · Tong Lu · Limin Wang · Bryan Catanzaro · Jan Kautz · Andrew Tao · Zhiding Yu · Guilin Liu
We introduce Eagle2.5, a frontier vision-language model (VLM) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle2.5-8B achieves 72.4\% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Heyang Zhao · Chenlu Ye · Quanquan Gu · Tong Zhang
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness of KL-regularization has been empirically demonstrated in various practical scenarios, current theoretical analyses of KL-regularized RLHF still yield the same $\mathcal{O}(1 / \epsilon^2)$ sample complexity as ones without KL-regularization. To understand the fundamental distinction between objectives with and without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an $\mathcal{O}(1 / \epsilon)$ sample complexity when $\epsilon$ is sufficiently small. We also prove matching lower bounds for both settings. More specifically, we study how the coverage of the reference policy affects the sample complexity of KL-regularized online contextual bandits and RLHF. We show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling algorithm can achieve an $\mathcal{O}(1 / \epsilon)$ sample complexity with only an additive dependence on the coverage coefficient, thus proving the benefits of online data even without explicit exploration. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in online decision making, shedding light on the design of more efficient algorithms.
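For reference, the reverse-KL-regularized objective analyzed in this line of work typically takes the following standard form (notation is generic, not taken from the paper): policy $\pi$, reward $r$, reference policy $\pi_{\mathrm{ref}}$, and regularization strength $\eta$:

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi(\cdot \mid x)}\big[ r(x, a) \big]
\;-\;
\eta\, \mathbb{E}_{x \sim \mathcal{D}}\big[
\mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\big]
```

The KL term is the "reverse" direction, $\mathrm{KL}(\pi \| \pi_{\mathrm{ref}})$, which penalizes the learned policy for placing mass where the reference policy has little.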
Time-o1: Time-Series Forecasting Needs Transformed Label Alignment
Hao Wang · Licheng Pan · Zhichao Chen · Xu Chen · Qingyang Dai · Lei Wang · Haoxuan Li · Zhouchen Lin
Training time-series forecast models presents unique challenges in designing effective learning objectives. Existing methods predominantly utilize the temporal mean squared error, which faces two critical challenges: (1) label autocorrelation, which leads to bias from the label sequence likelihood; (2) an excessive number of tasks, which grows with the forecast horizon and complicates optimization. To address these challenges, we propose Time-o1, a transformation-augmented learning objective for training time-series forecasting models. The central idea is to transform the label sequence into decorrelated components with discriminated significance. Models are then trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing the number of tasks. Extensive experiments demonstrate that Time-o1 achieves state-of-the-art performance and is compatible with various forecast models. Code is available at https://github.com/Master-PLC/Time-o1.
AdvPrefix: An Objective for Nuanced LLM Jailbreaks
Sicheng Zhu · Brandon Amos · Yuandong Tian · Chuan Guo · Ivan Evtimov
Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix ``Sure, here is (harmful request)''. While straightforward, this objective has two limitations: limited control over model behaviors, yielding incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks to mitigate the previous limitations for free. For example, replacing GCG's default prefixes on Llama-3 improves nuanced attack success rates from 14\% to 80\%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes are released.
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Ruixiang Zhang · Shuangfei Zhai · Jiatao Gu · Yizhe Zhang · Huangjie Zheng · Tianrong Chen · Miguel Angel Bautista · Joshua Susskind · Navdeep Jaitly
Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework that employs transformer-based autoregressive normalizing flows to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.
Scalable Exploration via Ensemble++
Yingru Li · Jiawei Xu · Baoxiang Wang · Zhiquan Luo
Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with only $\Theta(d \log T)$ ensemble sizes--significantly outperforming prior methods. Crucially, this efficiency holds across both compact and finite action sets with either time-invariant or time-varying contexts without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods.
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
Yonggan Fu · Xin Dong · Shizhe Diao · Matthijs Van keirsbilck · Hanrong Ye · Wonmin Byeon · Yashaswi Karnati · Lucas Liebenwein · Maksim Khadkevich · Alexander Keller · Jan Kautz · Yingyan (Celine) Lin · Pavlo Molchanov
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth–width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth–width ratios, with the key finding that although deep–thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy–latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy–latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. This technique can serve as a generalizable component for future SLMs.
Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy–efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5\% average accuracy, 1.3$\times$/1.9$\times$ lower latency, and 18.7$\times$/45.6$\times$ higher throughput compared to Qwen3-1.7B/0.6B, respectively.
On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective
Ning Zhang · Henry Kenlay · Li Zhang · Mihai Cucuringu · Xiaowen Dong
Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, has so far been established only in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how small perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing
Yilmazcan Ozyurt · Tunaberk Almaci · Stefan Feuerriegel · Mrinmaya Sachan
We introduce ExRec, a general framework for personalized exercise recommendation with semantically-grounded knowledge tracing. Our method builds on the observation that existing exercise recommendation approaches simulate student performance via knowledge tracing (KT) but often overlook two key aspects: (a) the semantic content of questions and (b) the sequential, structured progression of student learning. To address this, ExRec presents an end-to-end pipeline, from annotating the KCs of questions and learning their semantic representations to training KT models and optimizing several reinforcement learning (RL) methods. Moreover, we improve standard Q-learning-based continuous RL methods via a tailored model-based value estimation (MVE) approach that directly leverages the components of the KT model to estimate cumulative knowledge improvement. We validate the effectiveness of ExRec using various RL methods across four real-world tasks with different educational goals in online math learning. We further show that ExRec generalizes robustly to new, unseen questions and that it produces interpretable student learning trajectories. Together, our findings highlight the promise of KT-guided RL for effective personalization in education.
DeltaFormer: Unlock the state space of Transformer
Mingyu Xu · Tenglong Ao · Jiaao He · Jianqiao Lu · Guang Shi · Mingwu Zheng
In recent years, large language models built on the Transformer architecture have made breakthrough progress in many fields. At the same time, these models exhibit weaknesses that have prompted reflection, the most fundamental of which concerns the Transformer architecture itself. The Transformer is highly parallel and can fully utilize the computing power of GPUs, which is why it displaced models such as LSTMs over the past few years. However, high parallelism is not a free lunch, as it fundamentally limits model expressivity. In particular, the problems that a logarithmic-precision Transformer can solve are strictly contained in $TC^0$, while many important problems, such as Python code evaluation, entity tracking, chess, and other state tracking tasks, are generally believed to lie outside $TC^0$. Meanwhile, some recent state space methods based on the Delta Rule can break through the $TC^0$ barrier, but they are limited by fixed-size state spaces and perform poorly on many tasks. To this end, we re-examine the Transformer from the perspective of a state space with kernel functions and propose an improved Transformer called DeltaFormer. We demonstrate theoretically and empirically that the proposed architecture can break through the inherent $TC^0$ expressivity limitation of Transformers, and verify that it is no weaker than the standard Transformer on language modeling tasks. We hope our work provides inspiration for designing more expressive models.
Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach
Chenbei Lu · Zaiwei Chen · Tongxin Li · Chenye Wu · Adam Wierman
Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman–Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.
LABridge: Text–Image Latent Alignment Framework via Mean-Conditioned OU Process
Huiyang Shao · Xin Xia · Yuxi Ren · XING WANG · Xuefeng Xiao
Diffusion models have emerged as the state of the art in image synthesis. However, they often suffer from semantic instability and slow iterative denoising. We introduce the Latent Alignment Framework (LABridge), a novel text-image latent alignment framework based on an Ornstein–Uhlenbeck (OU) process, which explicitly preserves and aligns textual and visual semantics in a shared latent space. LABridge employs a Text-Image Alignment Encoder (TIAE) to encode text prompts into structured priors that are directly aligned with image latents. Instead of a homogeneous Gaussian, a mean-conditioned OU process smoothly interpolates between these text-conditioned priors and image latents, improving stability and reducing the number of denoising steps. Extensive experiments on standard text-to-image benchmarks show that LABridge achieves better text-image alignment metrics and competitive FID scores compared to leading diffusion baselines. By unifying text and image representations through principled latent alignment, LABridge paves the way for more efficient, semantically consistent, and high-fidelity text-to-image generation.
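The OU process underlying this framework is a standard mean-reverting diffusion, $dx_t = \theta(\mu - x_t)\,dt + \sigma\,dW_t$. A one-dimensional Euler-Maruyama toy simulation illustrates the mean-reversion that the paper conditions on a text-derived mean; here `mu` merely stands in for the text-conditioned prior, and all parameter values are assumptions, not the paper's settings:

```python
import numpy as np

def simulate_ou(x0, mu, theta=2.0, sigma=0.1, dt=0.01, steps=1000, seed=0):
    """Euler-Maruyama simulation of dx = theta*(mu - x)*dt + sigma*dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(steps + 1)
    x[0] = x0
    for t in range(steps):
        dw = rng.normal(0.0, np.sqrt(dt))   # Brownian increment
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dw
    return x

path = simulate_ou(x0=0.0, mu=1.0)
print(path[-1])  # hovers near mu = 1.0 once the process has mixed
```

Unlike a plain Gaussian diffusion (which forgets its start), the drift term continually pulls the state toward `mu`, which is what lets a mean-conditioned prior shape the trajectory throughout denoising.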
Understanding Parametric and Contextual Knowledge Reconciliation within Large Language Models
Jun Zhao · Yongzhuo Yang · Xiang Hu · Jingqi Tong · Yi Lu · Wei Wu · Tao Gui · Qi Zhang · Xuanjing Huang
Retrieval-Augmented Generation (RAG) provides additional contextual knowledge to complement the parametric knowledge stored in Large Language Models (LLMs). These two knowledge sources interweave to enhance the accuracy and timeliness of LLM responses. However, the internal mechanisms by which LLMs utilize these sources remain unclear. We propose modeling the forward propagation of knowledge as an entity flow, employing this framework to trace LLMs' internal behaviors when processing mixed-source knowledge. Linear probing uses a trainable linear classifier to detect specific attributes in hidden layers; however, once trained, a probe cannot adapt to dynamically specified entities. To address this challenge, we construct an entity-aware probe, which introduces special tokens to mark probing targets and employs a small trainable rank-8 LoRA update to process these special markers. We first validate this approach through an attribution experiment, demonstrating that it can accurately detect information about ad-hoc entities from complex hidden states. Next, we trace entity flows across layers to understand how LLMs reconcile conflicting knowledge internally. Our probing results reveal that contextual and parametric knowledge are routed between tokens through distinct sets of attention heads, with attention competition occurring only within knowledge types. While conflicting knowledge maintains a residual presence across layers, aligned knowledge from multiple sources gradually accumulates, with the magnitude of this accumulation directly determining its influence on final outputs.
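The rank-8 LoRA update attached to the probe follows the common low-rank adaptation pattern: freeze the pretrained weight and train two thin factors. A minimal numpy sketch of that pattern (shapes and initialization follow standard LoRA practice; this is not the paper's code, and the dimensions are assumptions):

```python
import numpy as np

d_in, d_out, r = 768, 768, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # zero-init: update starts at zero

def forward(x):
    # Effective weight is W + B @ A, but only A and B would receive
    # gradients; the low-rank product keeps trainable params tiny.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer initially matches W exactly.
assert np.allclose(forward(x), W @ x)
```

At rank 8 the trainable parameter count is $r(d_{in} + d_{out}) = 12{,}288$ versus $d_{in} \cdot d_{out} = 589{,}824$ for the full matrix, which is why such a probe stays cheap to train per marker.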
VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models
Xinan He · Yue Zhou · Bing Fan · Bin Li · Guopu Zhu · Feng Ding
Faces synthesized by diffusion models (DMs) with high-quality and controllable attributes pose a significant challenge for Deepfake detection. Most state-of-the-art detectors only yield a binary decision, incapable of forgery localization, attribution of forgery methods, and providing analysis on the cause of forgeries. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics, and propose a fine-grained analysis triad framework called VLForgery, that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis to specific generators. To achieve the above goals, we introduce VLF (Visual Language Forensics), a novel and diverse synthesized-face dataset designed to facilitate rich interactions between the `Visual' and `Language' modalities in MLLMs. Additionally, we propose an extrinsic knowledge-guided description method, termed EkCot, which leverages knowledge from the image generation pipeline to enable MLLMs to quickly capture image content. Furthermore, we introduce a low-level vision comparison pipeline designed to identify differential features between real and fake images that MLLMs can inherently understand. These features are then incorporated into EkCot, enhancing its ability to analyze forgeries in a structured manner, following the sequence of detection, localization, and attribution. Extensive experiments demonstrate that VLForgery outperforms other state-of-the-art forensic approaches in detection accuracy, with additional potential for falsified region localization and attribution analysis.
FedRAM: Federated Reweighting and Aggregation for Multi-Task Learning
Fan Wu · Xinyu Yan · Jiabei Liu · Wei Yang Bryan Lim
Federated Multi-Task Learning (FL-MTL) enables clients with heterogeneous data to collaboratively train models capable of handling multiple downstream tasks. However, FL-MTL faces key challenges, including statistical heterogeneity, task interference, and the need to balance local learning with global knowledge sharing. Traditional methods like FedAvg struggle in such settings due to the lack of explicit mechanisms to address these issues. In this paper, we propose FedRAM, a three-step framework that progressively updates two scalar hyperparameters: the task importance weight and the client aggregation coefficient. FedRAM introduces a reference-proxy-agent strategy, where the proxy model serves as an intermediary between the local reference model and the global agent model. This design reduces the need for repeated local training while preserving local performance. Extensive experiments on six real-world FL-MTL benchmarks show that FedRAM improves performance by at least 3$\%$ over the strongest baseline on both in-domain and out-of-domain tasks, while reducing computational cost by 15$\times$. These results make FedRAM a robust and practical solution for large-scale FL-MTL applications. The code is available at \url{https://github.com/wwffvv/FedRAM}.
Correlated Low-Rank Adaptation for ConvNets
Wu Ran · Weijia Zhang · ShuYang Pang · Zhu · Jinfan Liu · JingSheng Liu · Xin Cao · Qiang Li · Yichao Yan · Chao Ma
Low-Rank Adaptation (LoRA) methods have demonstrated considerable success in achieving parameter-efficient fine-tuning (PEFT) for Transformer-based foundation models. These methods typically fine-tune individual Transformer layers using independent LoRA adaptations. However, directly applying existing LoRA techniques to convolutional networks (ConvNets) yields unsatisfactory results due to the high correlation between the stacked sequential layers of ConvNets. To overcome this challenge, we introduce a novel framework called Correlated Low-Rank Adaptation (CoLoRA), which explicitly utilizes correlated low-rank matrices to model the inter-layer dependencies among convolutional layers. Additionally, to enhance tuning efficiency, we propose a parameter-free filtering method that enlarges the receptive field of LoRA, thus minimizing interference from non-informative local regions. Comprehensive experiments conducted across various mainstream vision tasks, including image classification, semantic segmentation, and object detection, illustrate that CoLoRA significantly advances the state-of-the-art PEFT approaches. Notably, our CoLoRA achieves superior performance with only 5\% of trainable parameters, surpassing full fine-tuning in the image classification task on the VTAB-1k dataset using ConvNeXt-S. Code is available at https://github.com/VISION-SJTU/CoLoRA.
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Jiatao Gu · Tianrong Chen · David Berthelot · Huangjie Zheng · Yuyang Wang · Ruixiang Zhang · Laurent Dinh · Miguel Angel Bautista · Joshua Susskind · Shuangfei Zhai
We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance on high-resolution image synthesis. STARFlow's main building block is Transformer Autoregressive Flow (TARFlow), which combines normalizing flows with Autoregressive Transformer architectures and has recently achieved impressive results in image modeling. In this work, we first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce a set of architectural and algorithmic innovations that significantly enhance the scalability: (1) a deep-shallow design where a deep Transformer block captures most of the model’s capacity, followed by a few shallow Transformer blocks that are computationally cheap yet contribute non-negligibly, (2) learning in the latent space of pretrained autoencoders, which proves far more effective than modeling pixels directly, and (3) a novel guidance algorithm that substantially improves sample quality. Crucially, our model remains a single, end-to-end normalizing flow, allowing exact maximum likelihood training in continuous space without discretization. STARFlow achieves competitive results in both class- and text-conditional image generation, with sample quality approaching that of state-of-the-art diffusion models. To our knowledge, this is the first successful demonstration of normalizing flows at this scale and resolution. Code and weights available at https://github.com/apple/ml-starflow.
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
Yi Hu · Shijia Kang · Haotong Yang · Haotian Xu · Muhan Zhang
Length generalization—the ability to solve problems longer than those seen during training—remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to special tasks, often degrading general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks and can cause performance drops when applied during the post-training stage of practical LLMs with general capabilities. Hu et al. (2024) proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs. Despite its compatibility with practical models and strong performance, RFFT is likewise designed for single tasks, requiring re-training for each individual task with extensive examples. In this paper, we study length generalization in multi-task settings and propose Meta Rule-Following Fine-Tuning (Meta-RFFT), the first framework enabling robust cross-task length generalization. As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, symbolic and logical reasoning tasks, beyond the common addition or multiplication tasks. Secondly, we show that cross-task length generalization is possible with Meta-RFFT—after training on a large number of tasks and instances, the models achieve remarkable length generalization ability on unseen tasks with minimal fine-tuning or one-shot prompting. For example, after fine-tuning on 1 to 5 digit addition, our 32B model achieves 95% accuracy on 30 digit addition, significantly outperforming state-of-the-art reasoning models (DeepSeek-R1-671B: 72%; QwQ-32B: 32%), despite never seeing this task during RF-pretraining.
EvoBrain: Dynamic Multi-Channel EEG Graph Modeling for Time-Evolving Brain Networks
Rikuto Kotoge · Zheng Chen · Tasuku Kimura · Yasuko Matsubara · Takufumi Yanagisawa · Haruhiko Kishima · Yasushi Sakurai
Dynamic GNNs, which integrate temporal and spatial features in Electroencephalography (EEG) data, have shown great potential in automating seizure detection. However, fully capturing the underlying dynamics necessary to represent brain states, such as seizure and non-seizure, remains a non-trivial task and presents two fundamental challenges. First, most existing dynamic GNN methods are built on temporally fixed static graphs, which fail to reflect the evolving nature of brain connectivity during seizure progression. Second, current efforts to jointly model temporal signals and graph structures and, more importantly, their interactions remain nascent, often resulting in inconsistent performance. To address these challenges, we present the first theoretical analysis of these two problems, demonstrating the effectiveness and necessity of explicit dynamic modeling and of the time-then-graph dynamic GNN approach. Building on these insights, we propose EvoBrain, a novel seizure detection model that integrates a two-stream Mamba architecture with a GCN enhanced by Laplacian Positional Encoding, following neurological insights. Moreover, EvoBrain incorporates explicitly dynamic graph structures, allowing both nodes and edges to evolve over time. Our contributions include (a) a theoretical analysis proving the expressivity advantage of explicit dynamic modeling and time-then-graph over other approaches, (b) a novel and efficient model that significantly improves AUROC by 23\% and F1 score by 30\% compared with the dynamic GNN baseline, and (c) a broad evaluation of our method on the challenging early seizure prediction task.
Non-rectangular Robust MDPs with Normed Uncertainty Sets
Navdeep Kumar · Adarsh Gupta · Maxence Mohamed ELFATIHI · Giorgia Ramponi · Kfir Y. Levy · Shie Mannor
Robust policy evaluation for non-rectangular uncertainty sets is generally NP-hard, even to approximate. Consequently, existing approaches suffer from either exponential iteration complexity or significant accuracy gaps. Interestingly, we identify a powerful class of $L_p$-bounded uncertainty sets that avoids these complexity barriers due to its structural simplicity. We further show that this class can be decomposed into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage its structural properties to derive a novel dual formulation for $L_p$ robust Markov Decision Processes (MDPs). This formulation reveals key insights into the adversary’s strategy and leads to the \textbf{first polynomial-time robust policy evaluation algorithm} for $L_1$-normed non-rectangular robust MDPs.
A Geometry-Aware Metric for Mode Collapse in Time Series Generative Models
Yassine ABBAHADDOU · Amine Aboussalah
Generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models often suffer from mode collapse, failing to reproduce the full diversity of their training data. While this problem has been extensively studied in image generation, it remains largely unaddressed for time series. We introduce a formal definition of mode collapse for time series and propose DMD-GEN, a geometry-aware metric that quantifies its severity. DMD-GEN leverages Dynamic Mode Decomposition (DMD) to extract coherent temporal structures and uses Optimal Transport between DMD eigenvectors to measure discrepancies in underlying dynamics. By representing the subspaces spanned by the DMD eigenvectors as point structures on a Grassmann manifold, and comparing them via Wasserstein distances computed from principal angles, DMD-GEN enables a principled geometric comparison between real and generated sequences. The metric is efficient, requiring no additional training, supports mini-batch evaluation, and is easily parallelizable. Beyond quantification, DMD-GEN offers interpretability by revealing which dynamical modes are distorted or missing in the generated data. Experiments on synthetic and real-world datasets using TimeGAN, TimeVAE, and DiffusionTS show that DMD-GEN aligns with existing metrics while providing the first principled framework for detecting and interpreting mode collapse in time series.
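The Grassmannian comparison at the core of DMD-GEN reduces, for one-dimensional mode subspaces, to a principal angle between two lines. The following toy sketch is illustrative only (not the authors' implementation; higher-dimensional subspaces would use the SVD of $U^\top V$ instead):

```python
import math

def principal_angle_1d(u, v):
    """Principal angle between the lines spanned by vectors u and v.

    For one-dimensional subspaces the principal angle reduces to
    arccos(|<u, v>| / (||u|| ||v||)); for k-dimensional subspaces the
    k principal angles come from the singular values of U^T V.
    """
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    c = min(1.0, abs(dot) / (nu * nv))
    return math.acos(c)

# Identical lines (even with opposite orientation) are at distance 0.
print(principal_angle_1d([1.0, 0.0], [-2.0, 0.0]))  # 0.0
# Orthogonal lines are maximally far apart: pi/2.
print(principal_angle_1d([1.0, 0.0], [0.0, 3.0]))   # ~1.5708
```

DMD-GEN aggregates such angles between real and generated mode subspaces into a Wasserstein distance; the toy above only shows the per-pair quantity being compared.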
Cost-Sensitive Freeze-thaw Bayesian Optimization for Efficient Hyperparameter Tuning
Dong Bok Lee · Aoxuan Zhang · Byungjoo Kim · Junhyeon Park · Steven Adriaensen · Juho Lee · Sung Ju Hwang · Hae Beom Lee
In this paper, we address the problem of cost-sensitive hyperparameter optimization (HPO) built upon freeze-thaw Bayesian optimization (BO). Specifically, we assume a scenario where users want to early-stop the HPO process when the expected performance improvement is not satisfactory with respect to the additional computational cost. Motivated by this scenario, we introduce \emph{utility} in the freeze-thaw framework, a function describing the trade-off between the cost and performance that can be estimated from the user's preference data. This utility function, combined with our novel acquisition function and stopping criterion, allows us to dynamically continue training the configuration that we expect to maximally improve the utility in the future, and also automatically stop the HPO process around the maximum utility. Further, we improve the sample efficiency of existing freeze-thaw methods with transfer learning to develop a specialized surrogate model for the cost-sensitive HPO problem. We validate our algorithm on established multi-fidelity HPO benchmarks and show that it outperforms all the previous freeze-thaw BO and transfer-BO baselines we consider, while achieving a significantly better trade-off between the cost and performance.
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng · Kaixiong Gong · Bohao Li · Zonghao Guo · Yibing Wang · Tianshuo Peng · Junfei Wu · Xiaoying Zhang · Benyou Wang · Xiangyu Yue
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 37.1\% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data will be released.
CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding
Yuchen Zhou · Jiamin Wu · Zichen Ren · Zhouheng Yao · Weiheng Lu · Kunyu Peng · Qihao Zheng · Chunfeng Song · Wanli Ouyang · Chao Gou
Understanding and decoding human brain activity from electroencephalography (EEG) signals is a fundamental problem in neuroscience and artificial intelligence, with applications ranging from cognition and emotion recognition to clinical diagnosis and brain–computer interfaces. While recent EEG foundation models have made progress in generalized brain decoding by leveraging unified architectures and large-scale pretraining, they inherit a scale-agnostic dense modeling paradigm from NLP and vision. This design overlooks an intrinsic property of neural activity—cross-scale spatiotemporal structure. Different EEG task patterns span a broad range of temporal and spatial scales, from brief neural activations to slow-varying rhythms, and from localized cortical activations to large-scale distributed interactions. Ignoring this diversity may lead to suboptimal representations and weakened generalization ability. To address these limitations, we propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces two key components: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features within localized temporal windows and anatomical brain regions into compact scale-aware token representations; and (ii) Structured Sparse Attention (SSA), which models cross-window and cross-region dependencies for diverse decoding tasks, further enriching scale diversity while eliminating spurious dependencies. CST and SSA are alternately stacked to progressively integrate cross-scale spatiotemporal dependencies. Extensive experiments across 11 representative EEG tasks and 16 datasets demonstrate that CSBrain consistently outperforms both task-specific models and strong foundation baselines. These results establish cross-scale modeling as a key inductive bias for generalized EEG decoding and highlight CSBrain as a robust backbone for future brain–AI research.
Inference-time Alignment in Continuous Space
Yige Yuan · Teng Xiao · Li Yunfan · Bingbing Xu · Shuchang Tao · Yunqi Qiu · Huawei Shen · Xueqi Cheng
Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation ($\textbf{SEA}$), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to $ \textbf{77.51\%}$ on AdvBench and $\textbf{16.36\%}$ on MATH. Code is publicly available at [this link](https://github.com/yuanyige/sea).
TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation
Yehjin Shin · Jeongwhan Choi · Seojin Kim · Noseong Park
Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention, because convolutional filters alone, being generally fixed, struggle to capture the global interactions necessary for accurate recommendation. We propose \textbf{T}ime-\textbf{V}ariant Convolutional Filters for Sequential \textbf{Rec}ommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49\%.
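The contrast between a fixed kernel and a time-variant filter can be sketched in a few lines (an illustrative toy, not TV-Rec's graph-filter implementation): each output position gets its own causal filter instead of sharing one kernel across positions.

```python
def time_variant_conv(x, filters):
    """Time-variant causal convolution: unlike a fixed kernel shared
    across positions, output position t uses its own filter filters[t].
    filters[t][k] weights input x[t - k]."""
    out = []
    for t, h in enumerate(filters):
        acc = 0.0
        for k, w in enumerate(h):
            if t - k >= 0:  # causal: only look at current/past inputs
                acc += w * x[t - k]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0]
# Position 0 passes through, position 1 averages, position 2 differences.
filters = [[1.0], [0.5, 0.5], [1.0, -1.0]]
print(time_variant_conv(x, filters))  # [1.0, 1.5, 1.0]
```

A fixed convolution would force all three positions to apply the same weights; the position-dependent filters are what let such a layer mimic the position-specific mixing that self-attention otherwise provides.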
When agents trade in a Duality-based Cost Function prediction market, they collectively implement the learning algorithm Follow-The-Regularized-Leader [Abernethy et al., 2013]. We ask whether other learning algorithms could be used to inspire the design of prediction markets. By decomposing and modifying the Duality-based Cost Function Market Maker's (DCFMM) pricing mechanism, we propose a new prediction market, called the Smooth Quadratic Prediction Market, that incentivizes agents to collectively implement general steepest gradient descent. Relative to the DCFMM, the Smooth Quadratic Prediction Market has a better worst-case monetary loss for AD securities while preserving axiom guarantees such as the existence of instantaneous price, information incorporation, expressiveness, no arbitrage, and a form of incentive compatibility. To motivate the application of the Smooth Quadratic Prediction Market, we independently examine agents' trading behavior under two realistic constraints: bounded budgets and buy-only securities. Finally, we provide an introductory analysis of an approach to facilitate adaptive liquidity using the Smooth Quadratic Prediction Market. Our results suggest future designs where the price update rule is separate from the fee structure, yet guarantees are preserved.
C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning
Antonios Valkanas · Soumyasundar Pal · Pavel Rumiantsev · Yingxue Zhang · Mark Coates
Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost. We introduce C3PO (Cost Controlled Cascaded Prediction Optimization), a self-supervised framework for optimizing LLM cascades under probabilistic cost constraints. By focusing on minimizing regret with respect to the most powerful model (MPM), C3PO avoids the need for labeled data by constructing a cascade using only unlabeled model outputs. It leverages conformal prediction to bound the probability that inference cost exceeds a user-specified budget. We provide theoretical guarantees on both cost control and generalization error, and show that our optimization procedure is effective even with small calibration sets. Empirically, C3PO achieves state-of-the-art performance across a diverse set of reasoning benchmarks including GSM8K, MATH-500, BigBench-Hard and AIME, outperforming strong LLM cascading baselines in both accuracy and cost-efficiency. Our results demonstrate that principled, label-free cascade optimization can enable scalable LLM deployment.
Graph Alignment via Birkhoff Relaxation
Sushil Varma · Irène Waldspurger · Laurent Massoulié
We consider the graph alignment problem, wherein the objective is to find a vertex correspondence between two graphs that maximizes the edge overlap. The graph alignment problem is an instance of the quadratic assignment problem (QAP), known to be NP-hard in the worst case even to approximately solve. In this paper, we analyze Birkhoff relaxation, a tight convex relaxation of QAP, and present theoretical guarantees on its performance when the inputs follow the Gaussian Wigner Model. More specifically, the weighted adjacency matrices are correlated Gaussian Orthogonal Ensemble with correlation $1/\sqrt{1+\sigma^2}$. Denote the optimal solutions of the QAP and Birkhoff relaxation by $\Pi^\star$ and $X^\star$ respectively. We show that $\|X^\star-\Pi^\star\|_F^2 = o(n)$ when $\sigma = o(n^{-1})$ and $\|X^\star-\Pi^\star\|_F^2 = \Omega(n)$ when $\sigma = \Omega(n^{-0.5})$. Thus, the optimal solution $X^\star$ transitions from a small perturbation of $\Pi^\star$ for small $\sigma$ to being well separated from $\Pi^\star$ as $\sigma$ becomes larger than $n^{-0.5}$. This result allows us to guarantee that simple rounding procedures on $X^\star$ align $1-o(1)$ fraction of vertices correctly whenever $\sigma = o(n^{-1})$. This condition on $\sigma$ to ensure the success of the Birkhoff relaxation is state-of-the-art.
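The final step above, recovering a permutation from the relaxed solution $X^\star$, can be illustrated with a simple greedy rounding of a (near-)doubly stochastic matrix. This is a sketch of one common rounding heuristic, not necessarily the procedure analyzed in the paper.

```python
def greedy_round(X):
    """Greedily round a (near-)doubly stochastic matrix to a permutation:
    repeatedly pick the largest remaining entry and fix its row and column.
    Returns perm with perm[i] = column assigned to row i."""
    n = len(X)
    rows, cols = set(range(n)), set(range(n))
    perm = [None] * n
    for _ in range(n):
        i, j = max(((r, c) for r in rows for c in cols),
                   key=lambda rc: X[rc[0]][rc[1]])
        perm[i] = j
        rows.remove(i)
        cols.remove(j)
    return perm

# A matrix that is a small perturbation of the permutation 0->1, 1->0, 2->2
# rounds back to that permutation.
X = [[0.1, 0.8, 0.1],
     [0.7, 0.1, 0.2],
     [0.2, 0.05, 0.75]]
print(greedy_round(X))  # [1, 0, 2]
```

The paper's guarantee has exactly this flavor: when $\sigma = o(n^{-1})$, $X^\star$ is a small perturbation of $\Pi^\star$, so simple rounding recovers $1-o(1)$ of the vertex correspondence.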
UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
Ruichuan An · Sihan Yang · Renrui Zhang · zijun shen · Ming Lu · Gaole Dai · Hao Liang · Ziyu Guo · Shilin Yan · Yulin Luo · Bocheng Zou · Chaoqun Yang · Wentao Zhang
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations when generating images from complex prompts. For example, given the concept $\langle bo\rangle$, existing methods struggle to generate "$\langle bo\rangle$ wearing its hat" without an additional textual description of the hat. We call this kind of generation personalized knowledge-driven generation. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released.
GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments
Enjun Du · Xunkai Li · Tian Jin · Zhihan Zhang · Rong-Hua Li · Guoren Wang
The era of foundation models has revolutionized AI research, yet Graph Foundation Models (GFMs) remain constrained by the scarcity of large-scale graph corpora. Traditional graph data synthesis techniques primarily focus on simplistic structural operations, lacking the capacity to generate semantically rich nodes with meaningful textual attributes—a critical limitation for real-world applications. While large language models (LLMs) demonstrate exceptional text generation capabilities, their direct application to graph synthesis is impeded by context window limitations, hallucination phenomena, and structural consistency challenges. To address these issues, we introduce \textbf{GraphMaster}—the first multi-agent framework specifically designed for graph data synthesis in data-limited environments. GraphMaster orchestrates four specialized LLM agents (Manager, Perception, Enhancement, and Evaluation) that collaboratively optimize the synthesis process through iterative refinement, ensuring both semantic coherence and structural integrity. To rigorously evaluate our approach, we create new data-limited “Sub” variants of six standard graph benchmarks, specifically designed to test synthesis capabilities under realistic constraints. Additionally, we develop a novel interpretability assessment framework that combines human evaluation with a principled Grassmannian manifold-based analysis, providing both qualitative and quantitative measures of semantic coherence. Experimental results demonstrate that GraphMaster significantly outperforms traditional synthesis methods across multiple datasets, establishing a strong foundation for advancing GFMs in data-scarce environments.
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Wufei Ma · Yu-Cheng Chou · Qihao Liu · Xingrui Wang · Celso de Melo · Jianwen Xie · Alan Yuille
Despite recent advances on multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-of-thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages--3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multi-step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluating on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.
VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning
Run Luo · Renke Shan · Longze Chen · Ziqiang Liu · Lu Wang · Min Yang · Xiaobo Xia
Large vision-language models (LVLMs) have emerged as foundational tools for real-world AI applications. Despite their remarkable capabilities, current LVLMs process entire images at the token level, leading to significant inefficiencies compared to human cognition, which selectively focuses on high-level vision concepts. This token-level redundancy becomes increasingly problematic for high-resolution images and long video sequences, resulting in large computational costs and limited scalability in practical applications. To address this limitation, we introduce the concept of a vision concept model, a novel paradigm that enables LVLMs to dynamically extract the most relevant vision concepts from complex inputs, based on task-specific instructions. To optimize this vision concept modeling process, we propose VCM, a self-supervised framework that leverages vision-language correlations across diverse instances. VCM is designed to learn meaningful vision concepts without the need for expensive concept-level annotations. At its core, it employs a forward-backward optimization algorithm that enables LVLMs to dynamically adjust concept granularity and spatial alignment. Experiments demonstrate that VCM remarkably reduces computational costs (e.g., achieving up to 85\% fewer FLOPs for LLaVA-1.5-7B), while maintaining strong performance across a series of vision-language tasks. The codebase is available at https://github.com/RainBowLuoCS/VCM.
Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning
Boheng Li · Renjie Gu · Junjie Wang · Leyi Qi · Yiming Li · Run Wang · Zhan Qin · Tianwei Zhang
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are found to be fragile to downstream fine-tuning, as we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety, while effectively preserving benign generation capability. Our code and pretrained models are publicly available at https://github.com/AntigoneRandy/ResAlign.
Non-Asymptotic Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes
Zaiwei Chen
This work presents the first finite-time analysis of average-reward $Q$-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes that act as local clocks for each state-action pair. We show that the mean-square error of this $Q$-learning algorithm, measured in the span seminorm, converges at a rate of $\smash{\tilde{\mathcal{O}}(1/k)}$. To establish this result, we demonstrate that adaptive stepsizes are necessary: without them, the algorithm fails to converge to the correct target. Moreover, adaptive stepsizes can be viewed as a form of implicit importance sampling that counteracts the effect of asynchronous updates. Technically, the use of adaptive stepsizes causes each $Q$-learning update to depend on the full sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates.
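The "local clock" stepsize described above can be made concrete with a small tabular sketch (a simplification of the paper's average-reward setting): with $\alpha = 1/N(s,a)$, each state-action pair's estimate is exactly the running average of the targets it has observed, regardless of how asynchronously the pairs are visited.

```python
from collections import defaultdict

def update(Q, N, sa, target):
    """One tabular update with a local-clock stepsize alpha = 1/N(s,a).

    Each state-action pair keeps its own visit counter, so its estimate
    equals the running average of the targets seen for that pair; a
    single global stepsize would not have this property under
    asynchronous updates.
    """
    N[sa] += 1
    alpha = 1.0 / N[sa]
    Q[sa] += alpha * (target - Q[sa])

Q = defaultdict(float)
N = defaultdict(int)
for target in [2.0, 4.0, 6.0]:
    update(Q, N, ("s0", "a0"), target)
print(Q[("s0", "a0")])  # 4.0, the mean of the three targets
```

In the paper the targets are the full average-reward $Q$-learning updates rather than fixed numbers, but the same per-pair clock is what counteracts the uneven visitation of asynchronous sampling.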
AI-Researcher: Autonomous Scientific Innovation
Jiabin Tang · Lianghao Xia · Zhonghang Li · Chao Huang
The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented opportunities for accelerating scientific innovation. In this paper, we introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline--from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation--with minimal human intervention. To rigorously assess autonomous research capabilities, we develop Scientist-Bench, a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains, featuring both guided innovation and open-ended exploration tasks. Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations.
Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting
Yuxuan Yang · Dalin Zhang · Yuxuan Liang · Hua Lu · Gang Chen · Huan Li
Time Series Forecasting (TSF) is a crucial task in various domains, yet existing TSF models rely heavily on high-quality data and fail to fully exploit the available data. This paper explores a novel self-supervised approach to re-label time series datasets by inherently constructing candidate datasets. During the optimization of a simple reconstruction network, intermediates are used as pseudo labels in a self-supervised paradigm, improving generalization for any predictor. We introduce Self-Correction with Adaptive Mask (SCAM), which discards overfitted components and selectively replaces them with pseudo labels generated from reconstructions. Additionally, we incorporate Spectral Norm Regularization (SNR) to further suppress overfitting from a loss landscape perspective. Our experiments on eleven real-world datasets demonstrate that SCAM consistently improves the performance of various backbone models. This work offers a new perspective on constructing datasets and enhancing the generalization of TSF models through self-supervised learning. The code is available at https://github.com/SuDIS-ZJU/SCAM.
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
Joey Hong · Anca Dragan · Sergey Levine
Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to train them in this manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents, one that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light-weight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
Individual Regret in Cooperative Stochastic Multi-Armed Bandits
Idan Barnea · Tal Lancewicki · Yishay Mansour
We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, $\texttt{Coop-SE}$, and show an individual regret bound of ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $\mathcal{R} = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. When considering communication networks, there are additional considerations beyond regret, such as message size and the number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for a logarithmic number of communication rounds, we obtain a regret bound of ${O}(\mathcal{R} / m+A \log T)$.
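For readers unfamiliar with the base algorithm, a minimal single-agent successive elimination loop is sketched below. This is an illustrative simplification: the paper's $\texttt{Coop-SE}$ additionally coordinates many agents over a communication graph, which is where the $\mathcal{R}/m$ speedup comes from.

```python
import math
import random

def successive_elimination(means, horizon, seed=0):
    """Single-agent successive elimination: sample every surviving arm
    once per round, then drop any arm whose upper confidence bound falls
    below the lower confidence bound of the empirically best arm."""
    rng = random.Random(seed)
    n = len(means)
    active = list(range(n))
    counts = [0] * n
    sums = [0.0] * n

    def radius(a):
        return math.sqrt(2 * math.log(horizon) / counts[a])

    t = 0
    while t < horizon and len(active) > 1:
        for a in list(active):
            sums[a] += rng.gauss(means[a], 0.1)  # noisy reward draw
            counts[a] += 1
            t += 1
        best = max(active, key=lambda a: sums[a] / counts[a])
        lcb = sums[best] / counts[best] - radius(best)
        active = [a for a in active if sums[a] / counts[a] + radius(a) >= lcb]
    return active

print(successive_elimination([0.9, 0.5, 0.1], horizon=3000))  # [0]: only the best arm survives
```

Each arm $i$ is eliminated once its confidence radius drops below roughly $\Delta_i/2$, which is what produces the $\sum_{\Delta_i > 0}\log(T)/\Delta_i$ shape of the single-agent regret $\mathcal{R}$.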
Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
Orin Levy · Liad Erez · Alon Peled-Cohen · Yishay Mansour
We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KTR_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $R_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks
Mingze Wang · Weinan E
Mixture-of-experts networks (MoEs) have demonstrated remarkable efficiency in modern deep learning. Despite their empirical success, the theoretical foundations underlying their ability to model complex tasks remain poorly understood. In this work, we conduct a systematic study of the expressive power of MoEs in modeling complex tasks with two common structural priors: low-dimensionality and sparsity. For shallow MoEs, we prove that they can efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. For deep MoEs, we show that $\mathcal{O}(L)$-layer MoEs with $E$ experts per layer can approximate piecewise functions comprising $E^L$ pieces with compositional sparsity, i.e., they can express an exponential number of structured tasks. Our analysis reveals the roles of critical architectural components and hyperparameters in MoEs, including the gating mechanism, expert networks, the number of experts, and the number of layers, and offers natural suggestions for MoE variants.
Sparc3D: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling
Zhihao Li · Yufei Wang · Heliang Zheng · Yihao Luo · Bihan Wen
High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines—compressing meshes with a VAE (using either 2D or 3D supervision), followed by latent diffusion sampling—often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced by the VAE. We introduce Sparc3D, a unified framework that combines a sparse deformable marching cubes representation, Sparcubes, with a novel encoder, Sparconv-VAE. Sparcubes converts raw meshes into high-resolution ($1024^3$) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. Sparconv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. Sparc3D achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.
Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
Anand Bhattad · Konpat Preechakul · Alexei Efros
This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both a physical and a geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.
ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
Rui Xu · Dakuan Lu · Zicheng Zhao · Xiaoyu Tan · Xintao Wang · Siyu Yuan · Jiangjie Chen · yinghui xu
Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models (MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances, each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we present an initial analysis of the strengths and weaknesses of these models in handling complex spatial reasoning tasks.
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
Pei Yang · Hai Ci · Mike Zheng Shou
Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only and cover web-use or Windows, Linux, and Android environments, not macOS, a major OS with distinctive GUI patterns and exclusive applications. To bridge these gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30\% success rate, while open-source lightweight research models lag at below 5\%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 28.8\% average degradation compared to English. Results from safety benchmarking also highlight that deception attacks are more general and demand immediate attention. Project page: https://macos-world.github.io.
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
Yulei Qin · Gang Li · Zongyi Li · Zihan Xu · Yuchen Shi · Zhekai Lin · Xiao Cui · Ke Li · Xing Sun
Existing large language models (LLMs) struggle to follow complex instructions, especially when multiple constraints are present and organized in parallel, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT) prompting, is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to unpack the composition of constraints to identify their relationships across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions by incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF.
Generalized Category Discovery under Domain Shift: A Frequency Domain Perspective
Wei Feng · Zongyuan Ge
Generalized Category Discovery (GCD) aims to leverage labeled samples from known categories to cluster unlabeled data that may include both known and unknown categories. While existing methods have achieved impressive results under standard conditions, their performance often deteriorates in the presence of distribution shifts. In this paper, we explore a more realistic task: Domain-Shifted Generalized Category Discovery (DS_GCD), where the unlabeled data includes not only unknown categories but also samples from unknown domains. To tackle this challenge, we propose a \textbf{\underline{F}}requency-guided Gene\textbf{\underline{r}}alized Cat\textbf{\underline{e}}gory Discov\textbf{\underline{e}}ry framework (FREE) that enhances the model's ability to discover categories under distributional shift by leveraging frequency-domain information. Specifically, we first propose a frequency-based domain separation strategy that partitions samples into known and unknown domains by measuring their amplitude differences. We then propose two types of frequency-domain perturbation strategies: a cross-domain strategy, which adapts to new distributions by exchanging amplitude components across domains, and an intra-domain strategy, which enhances robustness to intra-domain variations within the unknown domain. Furthermore, we extend the self-supervised contrastive objective and semantic clustering loss to better guide the training process. Finally, we introduce a clustering-difficulty-aware resampling technique to adaptively focus on harder-to-cluster categories, further enhancing model performance. Extensive experiments demonstrate that our method effectively mitigates the impact of distributional shifts across various benchmark datasets and achieves superior performance in discovering both known and unknown categories.
Variational Supervised Contrastive Learning
Ziwen Wang · Jiajun Fan · Thao Nguyen · Heng Ji · Ge Liu
Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) Without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) demonstrates superior few-shot learning performance compared to the supervised baseline and superior robustness across various augmentation strategies.
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park · Dalton Jones · Matthew Morse · Raghavv Goel · Mingu Lee · Christopher Lott
We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity with attention scores. These results imply that KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04\% with an 8K cache budget (~23\% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30\% compared to other token-eviction methods.
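A similarity-based eviction rule of the kind the abstract describes can be sketched as follows. This is our reading of the idea, not the authors' released code: score each cached key by its mean cosine similarity to the other keys and keep the most geometrically distinctive (least similar) ones.

```python
import numpy as np

def keydiff_evict(keys, budget):
    """Hedged sketch of a KeyDiff-style eviction rule.

    keys:   (n_tokens, head_dim) array of cached keys
    budget: number of tokens to retain
    Returns the sorted indices of the retained tokens.
    """
    # cosine similarity between every pair of cached keys
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)          # ignore self-similarity
    # distinctive keys have low mean similarity to the rest of the cache
    distinctiveness = -sim.mean(axis=1)
    keep = np.argsort(distinctiveness)[-budget:]
    return np.sort(keep)
```

Because the rule never touches attention scores, it composes with optimized kernels such as FlashAttention, which is the property the abstract highlights.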
Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?
Yijie Hu · Zihao Zhou · Kaizhu Huang · Xiaowei Huang · Qiufeng Wang
Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years. However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) still lags behind. Since an MLLM typically consists of an LLM and a vision block, we wonder: \textit{Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning?} Recent model-merging approaches may offer insights into this question. However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance. Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space. Based on these empirical insights, we propose \textbf{IP-Merging}, which first \textbf{I}dentifies the reasoning-associated parameters in both the MLLM and the Math LLM, then \textbf{P}rojects them into the subspace of the MLLM to maintain alignment, and finally merges parameters in this subspace. IP-Merging is a tuning-free approach since parameters are adjusted directly. Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities.
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Yiying Yang · Wei Cheng · Sijin Chen · Xianfang Zeng · Fukun Yin · Jiaxu Zhang · Liao Wang · Gang Yu · Xingjun Ma · Yu-Gang Jiang
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
Robust SuperAlignment: Weak-to-Strong Robustness Generalization for Vision-Language Models
Junhao Dong · Cong Zhang · Xinghua Qu · Zejun MA · Piotr Koniusz · Yew Soon Ong
Numerous well-established studies have demonstrated the superhuman capabilities of modern Vision-Language Models (VLMs) across a wide range of tasks. However, doubt is growing about the continued availability of reliable, high-quality labeling (supervision) from human annotators, which risks stagnating model performance. To address this challenge, ``superalignment'' employs the so-called weak-to-strong generalization paradigm, where the supervision from a weak model can provide generalizable knowledge for a strong model. While effective in aligning knowledge for clean samples between the strong and weak models, the standard weak-to-strong approach typically fails to capture adversarial robustness, exposing strong VLMs to adversarial attacks. This inability to transfer adversarial robustness arises because adversarial samples are normally absent from the superalignment stage. To this end, we are the first to propose a weak-to-strong (adversarial) robustness generalization method that elicits zero-shot robustness in large-scale models via an unsupervised scheme, mitigating the unreliable information source for alignment from two perspectives: alignment re-weighting and source guidance refinement. We analyze settings under which robustness generalization is possible. Extensive experiments across various vision-language benchmarks validate the effectiveness of our method in numerous scenarios, demonstrating its plug-and-play applicability to large-scale VLMs.
Balancing Gradient and Hessian Queries in Non-Convex Optimization
Deeksha Adil · Brian Bullins · Aaron Sidford · Chenyi Zhang
We develop optimization methods which offer new trade-offs between the number of gradient and Hessian computations needed to compute a critical point of a non-convex function. We provide a method that, for a twice-differentiable $f\colon \mathbb{R}^d \rightarrow \mathbb{R}$ with $L_2$-Lipschitz Hessian, an initial point with $\Delta$-bounded sub-optimality, and sufficiently small $\epsilon > 0$, outputs an $\epsilon$-critical point, i.e., a point $x$ such that $\|\nabla f(x)\| \leq \epsilon$, using $\tilde{O}(\Delta L_2^{1/4} n_H^{-1/2}\epsilon^{-9/4})$ queries to a gradient oracle and $n_H$ queries to a Hessian oracle. As a consequence, we obtain an improved gradient query complexity of $\tilde{O}(d^{1/3}L_2^{1/2}\Delta\epsilon^{-3/2})$ in the case of bounded dimension and of $\tilde{O}(\Delta^{3/2} L_2^{3/4}\epsilon^{-9/4})$ in the case where we are allowed only a single Hessian query. We obtain these results through a more general algorithm which can handle approximate Hessian computations and recovers known prior state-of-the-art bounds for computing an $\epsilon$-critical point, under the additional assumption that $f$ has an $L_1$-Lipschitz gradient, with $O(\Delta L_2^{1/4}\epsilon^{-7/4})$-gradient queries.
WebDancer: Towards Autonomous Information Seeking Agency
Jialong Wu · Baixuan Li · Runnan Fang · Wenbiao Yin · Liwen Zhang · Zhenglin Wang · Zhengwei Tao · Ding-Chu Zhang · Zekun Xi · Robert Tang · Yong Jiang · Pengjun Xie · Fei Huang · Jingren Zhou
Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information-seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on the ReAct format, WebDancer. Empirical evaluations on the challenging GAIA and WebWalkerQA benchmarks demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models.
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem
Fan LIU · Zherui Yang · Cancheng Liu · Tianrui Song · Xiaofeng Gao · Hao Liu
Mathematical modeling is a cornerstone of scientific discovery and engineering practice, enabling the translation of real-world problems into formal systems across domains such as physics, biology, and economics. Unlike mathematical reasoning, which assumes a predefined formulation, modeling requires open-ended problem analysis, abstraction, and principled formalization. While Large Language Models (LLMs) have shown strong reasoning capabilities, they fall short in rigorous model construction, limiting their utility in real-world problem-solving. To this end, we formalize the task of LLM-powered real-world mathematical modeling, where agents must analyze problems, construct domain-appropriate formulations, and generate complete end-to-end solutions. We introduce MM-Bench, a curated benchmark of 111 problems from the Mathematical Contest in Modeling (MCM/ICM), spanning the years 2000 to 2025 and across ten diverse domains such as physics, biology, and economics. To tackle this task, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88\% improvement over human expert solutions while requiring only 15 minutes and \$0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (\textbf{top 2.0\% among 27,456 teams}) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot.
Generative Pre-trained Autoregressive Diffusion Transformer
Yuan Zhang · Jiacheng Jiang · Guoqing Ma · Zhiying Lu · Bo Wang · Haoyang Huang · Jianlong Yuan · Nan Duan
In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.
Language Models (Mostly) Know When to Stop Reading
Roy Xie · Junlin Wang · Paul Rosu · Chunyuan Deng · Bolun Sun · Zihao Lin · Bhuwan Dhingra
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" -- detectable through lightweight classifiers -- that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 3.4% accuracy improvement while achieving 1.33x token reduction on average. Furthermore, our method demonstrates superior performance compared to other context efficiency methods at equivalent token reduction rates. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
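The stop-early loop behind dynamic context cutoff can be sketched as below. The names and the logistic probe here are our assumptions for illustration; the paper reads its sufficiency signal from specific attention heads inside the model.

```python
import numpy as np

def dynamic_cutoff(chunk_states, probe_w, probe_b, threshold=0.9):
    """Illustrative sketch of a dynamic context cutoff loop.

    Processes context chunks left to right and stops once a lightweight
    classifier (here a logistic probe on a per-chunk hidden state, standing
    in for the paper's attention-head classifiers) judges the accumulated
    information sufficient to answer the query.
    Returns the number of chunks actually processed.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for i, h in enumerate(chunk_states):
        p_sufficient = sigmoid(h @ probe_w + probe_b)
        if p_sufficient >= threshold:
            return i + 1          # self-terminate: enough information seen
    return len(chunk_states)      # never sufficient: read the full context
```

When the needed information appears early in the context, the loop skips the remaining chunks, which is where the reported token reduction comes from.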
VarFlow: Proper Scoring-Rule Diffusion Distillation via Energy Matching
Huiyang Shao · Xin Xia · Yuxi Ren · XING WANG · Xuefeng Xiao
**Diffusion models** achieve remarkable generative performance but are hampered by slow, iterative inference. Model distillation seeks to train a fast student generator. **Variational Score Distillation (VSD)** offers a principled KL-divergence minimization framework for this task. This method cleverly avoids computing the teacher model's Jacobian, but its student gradient relies on the score of the student's own noisy marginal distribution, $\nabla_{\mathbf{x}_t} \log p_{\phi,t}(\mathbf{x}_t)$. VSD thus requires approximations, such as training an auxiliary network to estimate this score. These approximations can introduce biases, cause training instability, or lead to an incomplete match of the target distribution, potentially focusing on conditional means rather than broader distributional features. We introduce **VarFlow**, a method based on a **Score-Rule Variational Distillation (SRVD)** framework. VarFlow trains a one-step generator $g_{\phi}(\mathbf{z})$ by directly minimizing an energy distance (derived from the strictly proper energy score) between the student's induced noisy data distribution $p_{\phi,t}(\mathbf{x}_t)$ and the teacher's target noisy distribution $q_t(\mathbf{x}_t)$. This objective is estimated entirely using samples from these two distributions. Crucially, VarFlow bypasses the need to compute or approximate the intractable student score. By directly matching the full noisy marginal distributions, VarFlow aims for a more comprehensive and robust alignment between student and teacher, offering an efficient and theoretically grounded path to high-fidelity one-step generation.
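The energy distance the abstract builds on has a simple sample-based estimator; this generic sketch is ours, not the paper's training code.

```python
import numpy as np

def energy_distance(x, y):
    """Sample-based energy distance between two batches of samples,
    ED(p, q) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||,
    the strictly proper scoring-rule quantity underlying the energy score.
    x, y: arrays of shape (n_samples, dim).
    """
    def mean_pdist(a, b):
        # mean Euclidean distance over all pairs (simple biased estimator,
        # sufficient for illustration)
        diffs = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()

    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)
```

Because the estimator needs only samples from the two distributions, no student score $\nabla_{\mathbf{x}_t} \log p_{\phi,t}(\mathbf{x}_t)$ ever has to be computed or approximated, which is the point of the method.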
Training-Free Bayesianization for Low-Rank Adapters of Large Language Models
Haizhou Shi · Yibin Wang · Ligong Han · Huan Zhang · Hao Wang
Estimating the uncertainty of responses from Large Language Models (LLMs) remains a critical challenge. While recent Bayesian methods have demonstrated effectiveness in quantifying uncertainty through low-rank weight updates, they typically require complex fine-tuning or post-training procedures. In this paper, we propose Training-Free Bayesianization (TFB), a simple yet theoretically grounded framework that efficiently transforms trained low-rank adapters into Bayesian ones without additional training. TFB systematically searches for the maximally acceptable level of variance in the weight posterior, constrained within a family of low-rank isotropic Gaussian distributions. Our theoretical analysis shows that under mild conditions, this search process is equivalent to KL-regularized variational optimization, a generalized form of variational inference. Through comprehensive experiments, we show that TFB achieves superior uncertainty estimation and generalization compared to existing methods while eliminating the need for complex Bayesianization training procedures.
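TFB's search for the maximally acceptable posterior variance can be caricatured as a bisection. This is a sketch under assumptions we supply: the acceptance test (e.g. checking that the perturbed adapter's loss stays within a tolerance) is a stand-in for the paper's criterion, and we assume it is monotone in the variance.

```python
def maximal_variance_search(acceptable, lo=0.0, hi=1.0, iters=30):
    """Bisection sketch of searching for the largest variance that still
    passes an acceptance test.

    acceptable: callable sigma2 -> bool, assumed True for small variances
                and False beyond some unknown threshold (monotone).
    Returns the largest variance found acceptable, to ~(hi-lo)/2**iters.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptable(mid):
            lo = mid        # still acceptable: try a larger variance
        else:
            hi = mid        # too large: shrink the search interval
    return lo
```

Under the paper's analysis, this kind of constrained variance search is equivalent to a KL-regularized variational optimization, so no extra training loop is needed.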
Rethinking Approximate Gaussian Inference in Classification
Bálint Mucsányi · Nathaël Da Costa · Philipp Hennig
In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose to replace the softmax activation with element-wise normCDF or sigmoid, which allows for the accurate sampling-free approximation of predictives. This also enables the approximation of the Gaussian pushforwards by Dirichlet distributions with moment matching. This approach entirely eliminates the runtime and memory overhead associated with MC sampling. We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at https://github.com/bmucsanyi/probit.
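The identity enabling sampling-free predictives in the binary case is that a Gaussian pushed through the normCDF has a closed-form expectation: if $a \sim \mathcal{N}(\mu, \sigma^2)$, then $\mathbb{E}[\Phi(a)] = \Phi(\mu/\sqrt{1+\sigma^2})$. A minimal sketch (the multi-class Dirichlet moment matching in the paper is more involved):

```python
import math
import random

def probit_predictive(mu, var):
    """Closed-form Gaussian-normCDF pushforward for the binary /
    element-wise case: E[Phi(a)] = Phi(mu / sqrt(1 + var)) exactly,
    where Phi is the standard normal CDF.
    """
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return phi(mu / math.sqrt(1.0 + var))

def mc_predictive(mu, var, n=200000, seed=0):
    """Monte Carlo baseline for comparison: sample logits and average."""
    rng = random.Random(seed)
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sum(phi(rng.gauss(mu, math.sqrt(var))) for _ in range(n)) / n
```

The closed form matches the Monte Carlo estimate while removing the sampling cost entirely, which is the runtime and memory saving the abstract reports.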
Fisher meets Feynman: score-based variational inference with a product of experts
Diana Cai · Robert Gower · David Blei · Lawrence Saul
We introduce a highly expressive yet distinctly tractable family for black-box variational inference (BBVI). Each member of this family is a weighted product of experts (PoE), and each weighted expert in the product is proportional to a multivariate $t$-distribution. These products of experts can model distributions with skew, heavy tails, and multiple modes, but to use them for BBVI, we must be able to sample from their densities. We show how to do this by reformulating these products of experts as latent variable models with auxiliary Dirichlet random variables. These Dirichlet variables emerge from a Feynman identity, originally developed for loop integrals in quantum field theory, that expresses the product of multiple fractions (or in our case, $t$-distributions) as an integral over the simplex. We leverage this simplicial latent space to draw weighted samples from these products of experts---samples which BBVI then uses to find the PoE that best approximates a target density. Given a collection of experts, we derive an iterative procedure to optimize the exponents that determine their geometric weighting in the PoE. At each iteration, this procedure minimizes a regularized Fisher divergence to match the scores of the variational and target densities at a batch of samples drawn from the current approximation. This minimization reduces to a convex quadratic program, and we prove under general conditions that these updates converge exponentially fast to a near-optimal weighting of experts. We conclude by evaluating this approach on a variety of synthetic and real-world target distributions.
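Why matching scores at a batch of samples reduces to a convex problem can be illustrated with a toy least-squares version. This is our simplification, with hypothetical array shapes: the paper optimizes the exponents of a geometric weighting under a regularized Fisher divergence, whereas here the weights enter linearly.

```python
import numpy as np

def fit_expert_weights(expert_scores, target_scores, lam=1e-3):
    """Toy version of the score-matching step: choose expert weights w
    minimizing ||S w - s*||^2 + lam ||w||^2 over a batch of samples, i.e.
    ridge-regularized least squares whose normal equations give the unique
    minimizer of a convex quadratic objective.

    expert_scores: (n_samples * dim, n_experts) matrix S of expert scores
    target_scores: (n_samples * dim,) vector s* of target scores
    """
    S = expert_scores
    A = S.T @ S + lam * np.eye(S.shape[1])   # positive definite for lam > 0
    b = S.T @ target_scores
    return np.linalg.solve(A, b)
```

When the target score truly is a weighted combination of the expert scores, the solver recovers the weights; in the paper's setting, iterating such convex updates converges exponentially fast to a near-optimal weighting.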
Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case
Iskander Azangulov · Andrei Smolensky · Alexander Terenin · Viacheslav (Slava) Borovitskiy
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
Spectral Estimation with Free Decompression
Siavash Ameli · Chris van der Heide · Liam Hodgkinson · Michael Mahoney
Computing eigenvalues of very large matrices is a critical task in many machine learning applications, including the evaluation of log-determinants, the trace of matrix functions, and other important metrics. As datasets continue to grow in scale, the corresponding covariance and kernel matrices become increasingly large, often reaching magnitudes that make their direct formation impractical or impossible. Existing techniques typically rely on matrix-vector products, which can provide efficient approximations if the matrix spectrum behaves well. However, in settings like distributed learning, or when the matrix is defined only indirectly, access to the full data set can be restricted to only very small sub-matrices of the original matrix. In these cases, the matrix of nominal interest is not even available as an implicit operator, meaning that even matrix-vector products may not be available. In such settings, the matrix is "impalpable", in the sense that we have access to only masked snapshots of it. We draw on principles from free probability theory to introduce a novel method of "free decompression" to estimate the spectrum of such matrices. Our method can be used to extrapolate from the empirical spectral densities of small submatrices to infer the eigenspectrum of extremely large (impalpable) matrices (that we cannot form or even evaluate with full matrix-vector products). We demonstrate the effectiveness of this approach through a series of examples, comparing its performance against known limiting distributions from random matrix theory in synthetic settings, as well as applying it to submatrices of real-world datasets, matching them with their full empirical eigenspectra.
Sparse Gaussian Processes: Structured Approximations and Power-EP Revisited
Thang Bui · Michalis Titsias
Inducing-point-based sparse variational Gaussian processes have become the standard workhorse for scaling up GP models. Recent advances show that these methods can be improved by introducing a diagonal scaling matrix to the conditional posterior density given the inducing points. This paper first considers an extension that employs a block-diagonal structure for the scaling matrix, provably tightening the variational lower bound. We then revisit the unifying framework of sparse GPs based on Power Expectation Propagation (PEP) and show that it can leverage and benefit from the new structured approximate posteriors. Through extensive regression experiments, we show that the proposed block-diagonal approximation consistently performs similarly to or better than existing diagonal approximations while maintaining comparable computational costs. Furthermore, the new PEP framework with structured posteriors provides competitive performance across various power hyperparameter settings, offering practitioners flexible alternatives to standard variational approaches.
FEAT: Free energy Estimators with Adaptive Transport
Yuanqi Du · Jiajun He · Francisco Vargas · Yuanqing Wang · Carla Gomes · José Miguel Hernández-Lobato · Eric Vanden-Eijnden
We present Free energy Estimators with Adaptive Transport (FEAT), a novel framework for free energy estimation---a critical challenge across scientific domains. FEAT leverages learned transports implemented via stochastic interpolants and provides consistent, minimum-variance estimators based on escorted Jarzynski equality and controlled Crooks theorem, alongside variational upper and lower bounds on free energy differences. Unifying equilibrium and non-equilibrium methods under a single theoretical framework, FEAT establishes a principled foundation for neural free energy calculations. Experimental validation on toy examples, molecular simulations, and quantum field theory demonstrates promising improvements over existing learning-based methods. Our PyTorch implementation is available at https://github.com/jiajunhe98/FEAT.
Non-equilibrium Annealed Adjoint Sampler
Jaemoo Choi · Yongxin Chen · Molei Tao · Guan-Horng Liu
Recently, there has been significant progress in learning-based diffusion samplers, which aim to sample from a given unnormalized density. Many of these approaches formulate the sampling task as a stochastic optimal control (SOC) problem using a canonical uninformative reference process, which limits their ability to efficiently guide trajectories toward the target distribution. In this work, we propose the Non-Equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based diffusion framework that employs annealed reference dynamics as a non-stationary base SDE. This annealing structure provides a natural progression toward the target distribution and generates informative reference trajectories, thereby enhancing the stability and efficiency of learning the control. Owing to our SOC formulation, our framework can incorporate a variety of SOC solvers, thereby offering high flexibility in algorithmic design. As one instantiation, we employ a lean adjoint system inspired by adjoint matching, enabling efficient and scalable training. We demonstrate the effectiveness of NAAS across a range of tasks, including sampling from classical energy landscapes and molecular Boltzmann distributions.
Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
Kijung Jeon · Michael Muehlebach · Molei Tao
Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, together with additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for a nonconvex constraint set $\Sigma \subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously, as each method specializes in only one case. In addition, rigorous and quantitative convergence guarantees are often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework for designing overdamped Langevin dynamics that accommodate both equality and inequality constraints. The proposed dynamics also deterministically correct trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $\Sigma$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $\rho_\Sigma(x) \propto \exp(-f(x))d\sigma_\Sigma$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack-variable variants, highlighting its favorable computational cost and reasonable empirical mixing.
Variance-Reduced Long-Term Rehearsal Learning with Quadratic Programming Reformulation
Wen-Bo Du · Tian Qin · Tian-Zuo Wang · Zhi-Hua Zhou
In machine learning, a critical class of decision-making problems involves *Avoiding Undesired Future* (AUF): given a predicted undesired outcome, how can one make decisions about actions to prevent it? Recently, the *rehearsal learning* framework has been proposed to address the AUF problem. While existing methods offer reliable decisions for single-round success, this paper considers long-term settings that involve coordinating multiple future outcomes, which is often required in real-world tasks. Specifically, we generalize the AUF objective to characterize a long-term decision target that incorporates cross-temporal relations among variables. As directly optimizing the *AUF probability* $\mathbb{P}_{\operatorname{AUF}}$ over this objective remains challenging, we derive an explicit expression for the objective and further propose a quadratic programming (QP) reformulation that transforms the intractable probabilistic AUF optimization into a tractable one. Under mild assumptions, we show that solutions to the QP reformulation are equivalent to those of the original AUF optimization, based on which we develop two novel rehearsal learning methods for long-term decision-making: (i) a *greedy* method that maximizes the single-round $\mathbb{P}_{\operatorname{AUF}}$ at each step, and (ii) a *far-sighted* method that accounts for future consequences in each decision, yielding a higher overall $\mathbb{P}_{\operatorname{AUF}}$ through an $L/(L+1)$ variance reduction in the AUF objective. We further establish an $\mathcal{O}(1/\sqrt{N})$ excess risk bound for decisions based on estimated parameters, ensuring reliable practical applicability with finite data. Experiments validate the effectiveness of our approach.
Neurosymbolic Diffusion Models
Emile van Krieken · Pasquale Minervini · Edoardo Maria Ponti · Antonio Vergari
Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty --- often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and uncertainty quantification. Across both synthetic and real-world benchmarks — including high-dimensional visual path planning and rule-based autonomous driving — NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.
Anytime-valid, Bayes-assisted, Prediction-Powered Inference
Valentin Kilian · Stefano Cortinovis · Francois Caron
Given a large pool of unlabelled data and a smaller amount of labels, prediction-powered inference (PPI) leverages machine learning predictions to increase the statistical efficiency of confidence interval procedures based solely on labelled data, while preserving fixed-time validity. In this paper, we extend the PPI framework to the sequential setting, where labelled and unlabelled datasets grow over time. Exploiting Ville's inequality and the method of mixtures, we propose prediction-powered confidence sequence procedures that are asymptotically valid uniformly over time and naturally accommodate prior knowledge on the quality of the predictions to further boost efficiency. We carefully illustrate the design choices behind our method and demonstrate its effectiveness in real and synthetic examples.
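For context, the fixed-time PPI point estimate that these confidence sequences build on debiases the model's mean prediction on the unlabelled pool with the average prediction error ("rectifier") on the labelled sample. A minimal sketch with a synthetic, deliberately biased predictor; all data and the predictor here are made-up assumptions for illustration:

```python
import random

def ppi_mean_estimate(labeled_x, labeled_y, unlabeled_x, predict):
    """Prediction-powered estimate of E[Y]: the mean prediction on the
    unlabeled pool, debiased by the average prediction error measured
    on the labeled sample."""
    pred_mean = sum(predict(x) for x in unlabeled_x) / len(unlabeled_x)
    rectifier = sum(y - predict(x)
                    for x, y in zip(labeled_x, labeled_y)) / len(labeled_x)
    return pred_mean + rectifier

# Synthetic setup: Y = 2x + noise, but the model has a +0.3 bias.
rng = random.Random(0)
predict = lambda x: 2.0 * x + 0.3
labeled_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
labeled_y = [2.0 * x + rng.gauss(0.0, 0.1) for x in labeled_x]
unlabeled_x = [rng.uniform(0.0, 1.0) for _ in range(10_000)]

est = ppi_mean_estimate(labeled_x, labeled_y, unlabeled_x, predict)
```

Despite the biased model, the rectifier cancels the bias, so `est` lands near the true mean E[Y] = 1, whereas averaging raw predictions would not.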
Rao-Blackwellised Reparameterisation Gradients
Kevin H. Lam · Thang Bui · George Deligiannidis · Yee Whye Teh
Latent Gaussian variables have been popularised in probabilistic machine learning. In turn, gradient estimators are the machinery that facilitates gradient-based optimisation for models with latent Gaussian variables. The reparameterisation trick is often used as the default estimator as it is simple to implement and yields low-variance gradients for variational inference. In this work, we propose the R2-G2 estimator as the Rao-Blackwellisation of the reparameterisation gradient estimator. Interestingly, we show that the local reparameterisation gradient estimator for Bayesian MLPs is an instance of the R2-G2 estimator and Rao-Blackwellisation. This lets us extend benefits of Rao-Blackwellised gradients to a suite of probabilistic models. We show that initial training with R2-G2 consistently yields better performance in models with multiple applications of the reparameterisation trick.
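As background, the plain reparameterisation estimator that R2-G2 Rao-Blackwellises writes z = mu + sigma * eps with eps ~ N(0, 1) and differentiates through the sample. A minimal one-dimensional sketch; the quadratic test function is an illustrative assumption, not the paper's model:

```python
import random

def reparam_grad_mu(f_prime, mu, sigma, n=100_000, seed=0):
    """Reparameterisation gradient of E[f(z)] w.r.t. mu, where
    z = mu + sigma * eps and eps ~ N(0, 1):
        d/dmu E[f(z)] = E[f'(mu + sigma * eps)]."""
    rng = random.Random(seed)
    return sum(f_prime(mu + sigma * rng.gauss(0.0, 1.0))
               for _ in range(n)) / n

# Check on f(z) = z^2, where E[f] = mu^2 + sigma^2, so d/dmu = 2*mu.
g = reparam_grad_mu(lambda z: 2.0 * z, mu=1.5, sigma=0.7)
```

Averaged over many samples, `g` concentrates around the analytic gradient 2 * 1.5 = 3.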
Uncertainty Estimation by Flexible Evidential Deep Learning
Taeseong Yoon · Heeyoung Kim
Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications, where overconfident predictions can lead to serious consequences. An effective UQ method must balance computational efficiency with the ability to generalize across diverse scenarios. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. However, the restrictive assumption of Dirichlet-distributed class probabilities limits EDL's robustness, particularly in complex or unforeseen situations. To address this, we propose *flexible evidential deep learning* ($\mathcal{F}$-EDL), which extends EDL by predicting a flexible Dirichlet distribution—a generalization of the Dirichlet distribution—over class probabilities. This approach provides a more expressive and adaptive representation of uncertainty, significantly enhancing UQ generalization and reliability under challenging scenarios. We theoretically establish several advantages of $\mathcal{F}$-EDL and empirically demonstrate its state-of-the-art UQ performance across diverse evaluation settings, including classical, long-tailed, and noisy in-distribution scenarios.
Data-Driven Performance Guarantees for Classical and Learned Optimizers
Rajiv Sambharya · Bartolomeo Stellato
We introduce a data-driven approach to analyze the performance of continuous optimization algorithms using generalization guarantees from statistical learning theory. We study classical and learned optimizers to solve families of parametric optimization problems. We build generalization guarantees for classical optimizers, using a sample convergence bound, and for learned optimizers, using the Probably Approximately Correct (PAC)-Bayes framework. To train learned optimizers, we use a gradient-based algorithm to directly minimize the PAC-Bayes upper bound. Numerical experiments in signal processing, control, and meta-learning showcase the ability of our framework to provide strong generalization guarantees for both classical and learned optimizers given a fixed budget of iterations. For classical optimizers, our bounds, which hold with high probability, are much tighter than those provided by worst-case guarantees. For learned optimizers, our bounds outperform the empirical outcomes observed in their non-learned counterparts.
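The sample-convergence idea for classical optimizers can be sketched as follows: run a fixed-budget optimizer on i.i.d. sampled problem instances and bound its expected suboptimality with a Hoeffding confidence term. This is a toy illustration under made-up assumptions (1-D quadratics, a metric clipped to [0, 1]), not the paper's PAC-Bayes machinery:

```python
import math
import random

def gd_suboptimality(a, b, steps=20, lr=0.1):
    """Run fixed-budget gradient descent on f(x) = 0.5*a*x^2 + b*x
    (whose minimum value is -b^2/(2a)) and return the final
    suboptimality gap f(x_T) - f*."""
    x = 0.0
    for _ in range(steps):
        x -= lr * (a * x + b)
    return 0.5 * a * x * x + b * x + b * b / (2.0 * a)

# Sample problem instances, clip the metric to [0, 1], and bound the
# expected gap with a Hoeffding term holding with probability >= 1 - delta.
rng = random.Random(0)
gaps = [min(gd_suboptimality(rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0)), 1.0)
        for _ in range(500)]
delta = 0.05
bound = sum(gaps) / len(gaps) + math.sqrt(math.log(1.0 / delta) / (2.0 * len(gaps)))
```

With 500 sampled instances, the data-driven bound on the expected gap is far below the worst-case initial suboptimality of the sampled family.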
Learning from A Single Markovian Trajectory: Optimality and Variance Reduction
Zhenyu Sun · Ermin Wei
In this paper, we consider the general stochastic non-convex optimization problem when the sampling process follows a Markov chain. This problem captures many real-world applications, ranging from asynchronous distributed learning to reinforcement learning. In particular, we consider the worst case where one has no prior knowledge of or control over the Markov chain, meaning multiple trajectories cannot be simulated and only a single trajectory is available for algorithm design. We first provide algorithm-independent lower bounds of $\Omega(\epsilon^{-3})$ (and $\Omega(\epsilon^{-4})$) samples, when objectives are (mean-squared) smooth, for any first-order methods accessing bounded-variance gradient oracles to achieve $\epsilon$-approximate critical solutions of the original problems. Then, we propose \textbf{Ma}rkov-\textbf{C}hain \textbf{SPIDER} (MaC-SPIDER), which leverages variance-reduction techniques, to achieve a $\mathcal{O}(\epsilon^{-3})$ upper bound for mean-squared smooth objective functions. To the best of our knowledge, MaC-SPIDER is the first to achieve $\mathcal{O}(\epsilon^{-3})$ complexity when sampling from a single Markovian trajectory, and our lower bound establishes its (near) optimality.
Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
Haocheng Luo · Mehrtash Harandi · Dinh Phung · Trung Le
Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigate the phenomenon known as m-sharpness, where the performance of SAM improves monotonically as the micro-batch size for computing perturbations decreases. In practice, the empirical m-sharpness effect underpins the deployment of SAM in distributed training, yet a rigorous theoretical account has remained lacking. To provide a theoretical explanation for m-sharpness, we leverage an extended Stochastic Differential Equation (SDE) framework and analyze the structure of stochastic gradient noise (SGN) to characterize the dynamics of various SAM variants, including n-SAM and m-SAM. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by our theoretical insights, we introduce Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.
Streaming Stochastic Submodular Maximization with On-Demand User Requests
Honglian Wang · Sijing Tu · Lutz Oettershagen · Aristides Gionis
We explore a novel problem in streaming submodular maximization, inspired by the dynamics of news-recommendation platforms. We consider a setting where users can visit a news website at any time, and upon each visit, the website must display up to $k$ news items. User interactions are inherently stochastic: each news item presented to the user is consumed with a certain acceptance probability, and each news item covers certain topics. Our goal is to design a streaming algorithm that maximizes the expected total topic coverage. To address this problem, we establish a connection to submodular maximization subject to a matroid constraint. We show that we can effectively adapt previous methods to address our problem when the number of user visits is known in advance or linear-size memory in the stream length is available. However, in more realistic scenarios where only an upper bound on the visits and sublinear memory is available, these algorithms fail to guarantee any bounded performance. To overcome these limitations, we introduce a new online streaming algorithm that achieves a competitive ratio of $1/8\delta$, where $\delta$ controls the approximation quality. Moreover, it requires only a single pass over the stream, and uses memory independent of the stream length. Empirically, our algorithms consistently outperform the baselines.
Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization
Xingyu Chen · Bokun Wang · Ming Yang · Qihang Lin · Tianbao Yang
Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art results face two key limitations: (1) a high iteration complexity of $O(1/\epsilon^6)$ under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are twofold: (i) We propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) We establish a **new state-of-the-art** iteration complexity of $O(1/\epsilon^5)$. Moreover, we apply our algorithms to multiple inequality constrained non-convex optimization problems involving smooth or weakly convex functional inequality constraints. By optimizing a smoothed hinge penalty based formulation, we achieve a **new state-of-the-art** complexity of $O(1/\epsilon^5)$ for finding a (nearly) $\epsilon$-level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.
Zeroth-Order Optimization Finds Flat Minima
Liang Zhang · Bingcong Li · Kiran Thekumparampil · Sewoong Oh · Michael Muehlebach · Niao He
Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known about the implicit regularization that provides a fine-grained characterization of which particular solutions are finally reached. We show that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, which is widely used in previous work to distinguish between sharp and flat minima. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, where flat minima are defined as the minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support our theoretical findings.
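The standard two-point estimator that the analysis concerns can be sketched in a few lines: probe the function at x ± mu*u along a random direction u and scale the finite difference. A minimal sketch; the quadratic test function used in the check is an illustrative assumption:

```python
import random

def two_point_grad(f, x, mu=1e-4, seed=0):
    """Two-point zeroth-order gradient estimate:
    g = d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u,
    where u is uniform on the unit sphere in R^d."""
    rng = random.Random(seed)
    d = len(x)
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(ui * ui for ui in u) ** 0.5
    u = [ui / norm for ui in u]                      # normalize direction
    xp = [xi + mu * ui for xi, ui in zip(x, u)]
    xm = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = d * (f(xp) - f(xm)) / (2.0 * mu)
    return [scale * ui for ui in u]
```

A single call is noisy but unbiased; averaging over many random directions recovers the true gradient, e.g. [2, 4] for f(x) = x1^2 + x2^2 at x = (1, 2).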
Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality
Yu-Hu Yan · Peng Zhao · Zhi-Hua Zhou
In this work, we study offline convex optimization with smooth objectives, where the classical Nesterov's Accelerated Gradient (NAG) method achieves the optimal accelerated convergence. Extensive research has aimed to understand NAG from various perspectives, and a recent line of work approaches this from the viewpoint of online learning and online-to-batch conversion, emphasizing the role of optimistic online algorithms for acceleration. In this work, we contribute to this perspective by proposing novel optimistic online-to-batch conversions that incorporate optimism into the analysis, thereby significantly simplifying the online algorithm design while preserving the optimal convergence rates. Specifically, we demonstrate the effectiveness of our conversions through the following results: (i) when combined with simple online gradient descent, our optimistic conversion achieves the optimal accelerated convergence; (ii) our conversion also applies to strongly convex objectives, and by leveraging both optimistic online-to-batch conversion and optimistic online algorithms, we achieve the optimal accelerated convergence rate for strongly convex and smooth objectives, for the first time through the lens of online-to-batch conversion; (iii) our optimistic conversion can achieve universality to smoothness~---~applicable to both smooth and non-smooth objectives without requiring knowledge of the smoothness coefficient~---~and remains as efficient as non-universal methods by using only one gradient query in each iteration. Finally, we highlight the effectiveness of our optimistic online-to-batch conversions by a precise correspondence with NAG.
Efficient Quadratic Corrections for Frank-Wolfe Algorithms
Jannis Halbey · Seta Rakotomandimby · Mathieu Besançon · Sébastien Designolle · Sebastian Pokutta
We develop a Frank-Wolfe algorithm with corrective steps, generalizing previous algorithms including Blended Conditional Gradients, Blended Pairwise Conditional Gradients, and Fully-Corrective Frank-Wolfe. For this, we prove tight convergence guarantees together with an optimal face identification property. Furthermore, we propose two highly efficient corrective steps for convex quadratic objectives based on linear optimization or linear system solving, akin to Wolfe's Minimum-Norm Point algorithm, and prove finite-time convergence under suitable conditions. Beyond optimization problems that are directly quadratic, we revisit two algorithms, Split Conditional Gradient and Second-Order Conditional Gradient Sliding, which can leverage quadratic corrections to accelerate the solution of their quadratic subproblems. We show improved convergence rates for the first and prove broader applicability for the second. Finally, we demonstrate substantial computational speedups for Frank-Wolfe-based algorithms with quadratic corrections across the considered problem classes.
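For readers new to the family, the corrective-step and quadratic-correction variants all build on the vanilla Frank-Wolfe iteration, which only ever calls a linear minimization oracle. A minimal sketch on a quadratic over the probability simplex; the objective and the 2/(t+2) step-size rule are textbook defaults, not the paper's corrective steps:

```python
def frank_wolfe_simplex(grad, x0, steps=2000):
    """Vanilla Frank-Wolfe over the probability simplex: the linear
    minimization oracle picks the vertex with the smallest gradient
    coordinate, and the step size is the classic 2 / (t + 2)."""
    x = list(x0)
    for t in range(steps):
        g = grad(x)
        i = min(range(len(x)), key=lambda j: g[j])   # LMO: best vertex e_i
        gamma = 2.0 / (t + 2.0)
        x = [(1.0 - gamma) * xj for xj in x]         # move toward e_i
        x[i] += gamma
    return x

# Quadratic objective 0.5 * ||x - b||^2 with b inside the simplex,
# so the constrained minimizer is b itself.
b = [0.2, 0.3, 0.5]
x = frank_wolfe_simplex(lambda x: [xi - bi for xi, bi in zip(x, b)],
                        [1.0, 0.0, 0.0])
```

Each iterate stays a convex combination of simplex vertices, so feasibility is maintained for free; the corrective steps studied in the paper replace the naive convex-combination update with much stronger progress on the active vertex set.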
Multiclass Loss Geometry Matters for Generalization of Gradient Descent in Separable Classification
Matan Schliserman · Tomer Koren
We study the generalization performance of unregularized gradient methods for separable linear classification. While previous work mostly deals with the binary case, we focus on the multiclass setting with $k$ classes and establish novel population risk bounds for Gradient Descent for loss functions that decay to zero. In this setting, we show risk bounds revealing that convergence rates are crucially influenced by the geometry of the loss template, as formalized by Wang and Scott (2024), rather than by the loss function itself. In particular, we establish risk upper bounds that hold for any decay rate of the loss whose template is smooth with respect to the $p$-norm. In the case of exponentially decaying losses, our results indicate a contrast between the $p=\infty$ case, where the risk exhibits a logarithmic dependence on $k$, and the $p=2$ case, where the risk scales linearly with $k$. To establish this separation formally, we also prove a lower bound in the latter scenario, demonstrating that the polynomial dependence on $k$ is unavoidable. Central to our analysis is a novel bound on the Rademacher complexity of low-noise vector-valued linear predictors with a loss template smooth w.r.t. general $p$-norms.
One of the major bottlenecks for deploying popular first-order differentially private (DP) machine learning algorithms (e.g., DP-SGD) lies in their high computation and memory cost, despite the existence of optimized implementations. Zeroth-order methods show promise in mitigating this overhead, as they leverage function evaluations to approximate the gradients, making them significantly easier to privatize. While recent works have explored zeroth-order approaches in both private and non-private settings, they still suffer from relatively low utility compared with DP-SGD, and have only been evaluated in limited application domains. In this work, we propose to leverage public information to guide and improve gradient approximation of private zeroth-order algorithms. We explore a suite of \underline{p}ublic-data-\underline{a}ssisted \underline{z}eroth-\underline{o}rder optimizers (PAZO) with minimal overhead. We provide theoretical analyses of the PAZO framework under an assumption of the similarity between public and private data. Empirically, we demonstrate that PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in both pre-training and fine-tuning settings, outperforming the best first-order baselines (with public data) especially in highly private regimes, while offering up to $16\times$ runtime speedup.
Solving Discrete (Semi) Unbalanced Optimal Transport with Equivalent Transformation Mechanism and KKT-Multiplier Regularization
Weiming Liu · Xinting Liao · Jun Dan · Fan Wang · Hua Yu · Junhao Dong · Shunjie Dong · Lianyong Qi · Yew Soon Ong
Semi-Unbalanced Optimal Transport (SemiUOT) shows great promise in matching two probability measures by relaxing one of the marginal constraints. Previous solvers often incorporate an entropy regularization term, which can result in inaccurate matching solutions. To address this issue, we focus on determining the marginal probability distribution of SemiUOT with KL divergence using the proposed Equivalent Transformation Mechanism (ETM) approach. Furthermore, we extend the ETM-based method to determine the marginal probability distribution of Unbalanced Optimal Transport (UOT) with KL divergence, validating its generality. Once the marginal probabilities of UOT/SemiUOT are determined, they can be transformed into a classical Optimal Transport (OT) problem. Moreover, we propose a KKT-Multiplier regularization term combined with Multiplier Regularized Optimal Transport (MROT) to achieve more accurate matching results. We conduct several numerical experiments to demonstrate the effectiveness of our proposed methods in addressing UOT/SemiUOT problems.
MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver
Yuepeng Zheng · Fu Luo · Zhenkun Wang · Yaoxin Wu · Yu Zhou
Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach for training a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1,000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities. The code is available at https://github.com/CIAM-Group/MTLKD.
Large Language Models as End-to-end Combinatorial Optimization Solvers
Xia Jiang · Yaoxin Wu · Minshuo Li · Zhiguang Cao · Yingqian Zhang
Combinatorial optimization (CO) problems, central to decision-making scenarios like logistics and manufacturing, are traditionally solved using problem-specific algorithms requiring significant domain expertise. While large language models (LLMs) have shown promise in automating CO problem solving, existing approaches rely on intermediate steps such as code generation or solver invocation, limiting their generality and accessibility. This paper introduces a novel framework that empowers LLMs to serve as end-to-end CO solvers by directly mapping natural language problem descriptions to solutions. We propose a two-stage training strategy: supervised fine-tuning (SFT) imparts LLMs with solution construction patterns from domain-specific solvers, while a feasibility-and-optimality-aware reinforcement learning (FOARL) process explicitly mitigates constraint violations and refines solution quality. Evaluation across seven NP-hard CO problems shows that our method achieves a high feasibility rate and reduces the average optimality gap to 1.03–8.20% by tuning a 7B-parameter LLM, surpassing general-purpose LLMs (e.g., GPT-4o), reasoning models (e.g., DeepSeek-R1), and domain-specific heuristics. Our method establishes a unified language-based pipeline for CO without extensive code execution or manual architectural adjustments for different problems, offering a general and language-driven alternative to traditional solver design while maintaining relative feasibility guarantees.
Bilevel Optimization for Adversarial Learning Problems: Sharpness, Generation, and Beyond
Risheng Liu · Zhu Liu · Weihao Mao · Wei Yao · Jin Zhang
Adversarial learning is a widely used paradigm in machine learning, often formulated as a min-max optimization problem where the inner maximization imposes adversarial constraints to guide the outer learner toward more robust solutions. This framework underlies methods such as Sharpness-Aware Minimization (SAM) and Generative Adversarial Networks (GANs). However, traditional gradient-based approaches to such problems often face challenges in balancing accuracy and efficiency due to second-order complexities. In this paper, we propose a bilevel optimization framework that reformulates these adversarial learning problems by leveraging the tractability of the lower-level problem. The bilevel framework introduces no additional complexity and enables the use of advanced bilevel tools. We further develop a provably convergent single-loop stochastic algorithm that effectively balances learning accuracy and computational cost. Extensive experiments show that our method improves generation quality in terms of FID and JS scores for GANs, and consistently achieves higher accuracy for SAM under label noise and across various backbones, while promoting flatter loss landscapes. Overall, this work provides a practical and theoretically grounded framework for solving adversarial learning tasks through bilevel optimization.
3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs
Mehdi Makni · Xiang Meng · Rahul Mazumder
Sparse plus Low-Rank $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in $\textit{model compression}$, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $\mathbf{W} \approx \mathbf{S} + \mathbf{LR}$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce $\texttt{3BASiL-TM}$, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed $\texttt{3BASiL}$, to minimize the layer-wise reconstruction error with convergence guarantees. We then design a transformer-matching ($\texttt{TM}$) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the $\texttt{TM}$ procedure is universal, as it can enhance any $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition, including pure sparsity. Our numerical experiments show that $\texttt{3BASiL-TM}$ reduces the WikiText2 perplexity gap to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA $(\mathbf{S} + \mathbf{L}\mathbf{R})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
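The layer-wise $(\mathbf{S}+\mathbf{LR})$ objective can be illustrated with a much simpler alternating scheme than the paper's 3-block ADMM: hard-threshold the residual to update $\mathbf{S}$, then take a truncated SVD to update $\mathbf{LR}$. The sparsity level, rank, and iteration count below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sparse_plus_lowrank(W, sparsity=0.5, rank=4, iters=20):
    """Alternate between hard-thresholding S and a truncated SVD for L @ R.
    A generic baseline sketch, not the 3BASiL ADMM updates."""
    LR = np.zeros_like(W)
    for _ in range(iters):
        # Sparse step: keep the largest-magnitude entries of the residual.
        R_ = W - LR
        k = int(sparsity * R_.size)
        thresh = np.partition(np.abs(R_).ravel(), -k)[-k]
        S = np.where(np.abs(R_) >= thresh, R_, 0.0)
        # Low-rank step: best rank-r approximation of the remaining residual.
        U, s, Vt = np.linalg.svd(W - S, full_matrices=False)
        LR = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return S, LR

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
S, LR = sparse_plus_lowrank(W)
err = np.linalg.norm(W - S - LR) / np.linalg.norm(W)
```

On a random matrix this already removes most of the reconstruction error; the paper's TM step would then refine the components jointly across layers.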
Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning
RUIYAO MIAO · Junren Xiao · Shiya Tsang · Hui Xiong · Ying Nian Wu
Existing Bayesian Optimization (BO) methods typically balance exploration and exploitation to optimize costly objective functions. However, these methods often suffer from a significant one-step bias, which may lead to convergence toward local optima and poor performance in complex or high-dimensional tasks. Recently, Black-Box Optimization (BBO) has achieved success across various scientific and engineering domains, particularly when function evaluations are costly and gradients are unavailable. Motivated by this, we propose the Reinforced Energy-Based Model for Bayesian Optimization (REBMBO), which integrates Gaussian Processes (GP) for local guidance with an Energy-Based Model (EBM) to capture global structural information. Notably, we define each Bayesian Optimization iteration as a Markov Decision Process (MDP) and use Proximal Policy Optimization (PPO) for adaptive multi-step lookahead, dynamically adjusting the depth and direction of exploration to effectively overcome the limitations of traditional BO methods. We conduct extensive experiments on synthetic and real-world benchmarks, confirming the superior performance of REBMBO. Additional analyses across various GP configurations further highlight its adaptability and robustness.
Acceleration via silver step-size on Riemannian manifolds with applications to Wasserstein space
Jiyoung Park · Abhishek Roy · Jonathan W. Siegel · Anirban Bhattacharya
There is extensive literature on accelerating first-order optimization methods in the Euclidean setting. Under which conditions such acceleration is feasible in Riemannian optimization problems is an active area of research. Motivated by the recent success of silver stepsize methods in the Euclidean setting, we undertake a study of such algorithms in the Riemannian setting. We provide a new class of algorithms, determined by the choice of vector transport, that enables silver stepsize acceleration on Riemannian manifolds for the function classes associated with the corresponding vector transport. As a core application, we show that our algorithm recovers the standard Wasserstein gradient descent on the 2-Wasserstein space and, as a result, provides the first provable accelerated gradient method for potential functional optimization problems in the Wasserstein space. In addition, we validate the numerical strength of the algorithm for standard benchmark tasks on the space of symmetric positive definite matrices.
NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems
Roman Jacome · Romario Gualdrón-Hurtado · León Suárez-Rodríguez · Henry Arguello
Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinite solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors usually ignore the task-specific structure of this null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix's null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture the signal components to which the sensing process is fundamentally blind. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
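The linear-algebra setup behind null-space priors can be sketched in a few lines. This is only the decomposition NPN builds on, not the learned non-linear projection itself; the problem sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50                      # undersampled: fewer measurements than unknowns
A = rng.standard_normal((m, n))    # sensing matrix
x = rng.standard_normal(n)         # ground-truth signal

# Orthonormal basis of the null-space of A via the SVD.
_, _, Vt = np.linalg.svd(A)
N = Vt[m:].T                       # (n, n - m): columns span null(A)

# Any x splits into a row-space part (seen by A) and a null-space part
# (invisible to A) -- the component a prior such as NPN must supply.
x_null = N @ (N.T @ x)
x_row = x - x_null
```

The measurements $y = Ax$ constrain only `x_row`; everything in `x_null` must come from prior information, which is exactly the part NPN regularizes.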
Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization
Parvin Nazari · Bojian Hou · Davoud Ataee Tarzanagh · Li Shen · George Michailidis
Online bilevel optimization (OBO) is a powerful framework for machine learning problems where both outer and inner objectives evolve over time, requiring dynamic updates. Current OBO approaches rely on deterministic \textit{window-smoothed} regret minimization, which may not accurately reflect system performance when functions change rapidly. In this work, we introduce a novel search direction and show that both first- and zeroth-order (ZO) stochastic OBO algorithms leveraging this direction achieve sublinear {stochastic bilevel regret without window smoothing}. Beyond these guarantees, our framework enhances efficiency by: (i) reducing oracle dependence in hypergradient estimation, (ii) updating inner and outer variables alongside the linear system solution, and (iii) employing ZO-based estimation of Hessians, Jacobians, and gradients. Experiments on online parametric loss tuning and black-box adversarial attacks validate our approach.
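As a hedged illustration of the ZO ingredient, the standard two-point gradient estimator with Gaussian directions looks as follows; the smoothing radius and sample count are illustrative, and this is not the paper's specific search direction.

```python
import numpy as np

def zo_gradient(f, x, delta=1e-4, samples=256, rng=None):
    """Two-point zeroth-order gradient estimate with Gaussian directions:
    average (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u over random u."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + delta * u) - f(x - delta * u)) / (2 * delta) * u
    return g / samples

x = np.array([1.0, -2.0, 0.5])
g = zo_gradient(lambda z: z @ z, x, rng=np.random.default_rng(0))  # true grad: 2x
```

Since `E[u u^T] = I` for Gaussian `u`, the estimator is unbiased for smooth functions up to `O(delta^2)`; the same idea extends to Hessian and Jacobian estimates used in the ZO variant.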
Follow-the-Perturbed-Leader Nearly Achieves Best-of-Both-Worlds for the m-Set Semi-Bandit Problems
Jingxin Zhan · Yuchen Xin · Chenjie Sun · Zhihua Zhang
We consider a common case of the combinatorial semi-bandit problem, the $m$-set semi-bandit, where the learner selects exactly $m$ arms from the total $d$ arms. In the adversarial setting, the best regret bound, known to be $\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. However, this requires explicitly computing the arm-selection probabilities by solving optimization problems at each time step and sampling according to them. This problem can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the $m$ arms that rank among the $m$ smallest (estimated) losses with random perturbation. In this paper, we show that FTPL with a Fr\'echet perturbation also enjoys the near-optimal regret bound $\mathcal{O}(\sqrt{nm}(\sqrt{d\log(d)}+m^{5/6}))$ in the adversarial setting and approaches best-of-both-worlds regret bounds, i.e., achieves a logarithmic regret in the stochastic setting. Moreover, our lower bounds show that the extra factors are unavoidable with our approach; any improvement would require a fundamentally different and more challenging method.
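The FTPL selection rule described above is simple enough to sketch directly; the learning rate and Fréchet shape parameter below are illustrative placeholders, not the tuned values from the paper.

```python
import numpy as np

def ftpl_mset_action(loss_estimates, m, eta=1.0, alpha=2.0, rng=None):
    """Pick the m arms with the smallest perturbed cumulative loss.

    Perturbations are Frechet(alpha)-distributed, drawn by inverse-CDF
    sampling: if U ~ Uniform(0,1), then (-log U)^(-1/alpha) is Frechet."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=loss_estimates.shape)
    z = (-np.log(u)) ** (-1.0 / alpha)          # Frechet(alpha) samples
    return np.argsort(loss_estimates - eta * z)[:m]

rng = np.random.default_rng(0)
d, m = 10, 3
cum_loss = rng.uniform(size=d)                  # cumulative loss estimates
arms = ftpl_mset_action(cum_loss, m, rng=rng)
```

Note that no arm-selection probabilities are ever computed: a single sort per round replaces the per-step optimization that FTRL requires.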
Fairness-Regularized Online Optimization with Switching Costs
Pengfei Li · Yuelin Han · Adam Wierman · Shaolei Ren
Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on the entire sequence of actions, we prove that even without switching costs, no online algorithms can possibly achieve a sublinear regret or finite competitive ratio compared to the offline optimal algorithm as the problem episode length $T$ increases. Then, we propose **FairOBD** (Fairness-regularized Online Balanced Descent), which reconciles the tension between minimizing the hitting cost, switching cost, and fairness cost. Concretely, **FairOBD** decomposes the long-term fairness cost into a sequence of online costs by introducing an auxiliary variable and then leverages the auxiliary variable to regularize the online actions for fair outcomes. Based on a new approach to account for switching costs, we prove that **FairOBD** offers a worst-case asymptotic competitive ratio against a novel benchmark---the optimal offline algorithm with parameterized constraints---by considering $T\to\infty$. Finally, we run trace-driven experiments of dynamic computing resource provisioning for socially responsible AI inference to empirically evaluate **FairOBD**, showing that **FairOBD** can effectively reduce the total fairness-regularized cost and better promote fair outcomes compared to existing baseline solutions.
Purity Law for Neural Routing Problem Solvers with Enhanced Generalizability
Wenzhao Liu · Haoran Li · Congying Han · Zicheng Zhang · Anqi Li · Tiande Guo
Achieving generalization in neural approaches across different scales and distributions remains a significant challenge for routing problems. A key obstacle is that neural networks often fail to learn robust principles for identifying universal patterns and deriving optimal solutions from diverse instances. In this paper, we first uncover the Purity Law, a fundamental structural principle for optimal solutions of routing problems, which states that edge prevalence grows exponentially with the sparsity of surrounding vertices. Statistically and theoretically validated across diverse instances, the Purity Law reveals a consistent bias toward local sparsity in global optima. Building on this insight, we propose Purity Policy Optimization (PUPO), a novel training paradigm that explicitly aligns characteristics of neural solutions with the Purity Law during the solution construction process to enhance generalization. Extensive experiments demonstrate that PUPO can be seamlessly integrated with popular neural solvers, significantly enhancing their generalization performance without incurring additional computational overhead during inference.
OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling
Haoyang Liu · Jie Wang · Yuyang Cai · Xiongwei Han · Yufei Kuang · Jianye Hao
Optimization modeling is one of the most crucial but technical parts of operations research (OR). To automate the modeling process, existing works have leveraged large language models (LLMs), prompting them to break down tasks into steps for generating variables, constraints, and objectives. However, due to the highly complex mathematical structures inherent in OR problems, standard fixed-step decomposition often fails to achieve high performance. To address this challenge, we introduce OptiTree, a novel tree search approach designed to enhance modeling capabilities for complex problems through adaptive problem decomposition into simpler subproblems. Specifically, we develop a modeling tree that organizes a wide range of OR problems based on their hierarchical problem taxonomy and complexity, with each node representing a problem category and containing relevant high-level modeling thoughts. Given a problem to model, we recurrently search the tree to identify a series of simpler subproblems and synthesize the global modeling thoughts by adaptively integrating the hierarchical thoughts. Experiments show that OptiTree significantly improves the modeling accuracy compared to the state-of-the-art, achieving over 10% improvements on the challenging benchmarks.
Improved Algorithms for Overlapping and Robust Clustering of Edge-Colored Hypergraphs: An LP-Based Combinatorial Approach
Changyeol Lee · Yongho Shin · Hyung-Chan An
Clustering is a fundamental task in both machine learning and data mining. Among various methods, edge-colored clustering (ECC) has emerged as a useful approach for handling categorical data. Given a hypergraph with (hyper)edges labeled by colors, ECC aims to assign vertex colors to minimize the number of edges where the vertex color differs from the edge's color. However, traditional ECC has inherent limitations, as it enforces a nonoverlapping and exhaustive clustering. To tackle these limitations, three versions of ECC have been studied: Local ECC and Global ECC, which allow overlapping clusters, and Robust ECC, which accounts for vertex outliers. For these problems, both linear programming (LP) rounding algorithms and greedy combinatorial algorithms have been proposed. While these LP-rounding algorithms provide high-quality solutions, they demand substantial computation time; the greedy algorithms, on the other hand, run very fast but often compromise solution quality. In this paper, we present a family of algorithms that combines the strengths of LP with the computational efficiency of combinatorial algorithms. Both experimental and theoretical analyses show that our algorithms efficiently produce high-quality solutions for all three problems: Local, Global, and Robust ECC. We complement our algorithmic contributions with complexity-theoretic inapproximability results and integrality gap bounds, which suggest that significant theoretical improvements are unlikely. Our results also answer two open questions previously raised in the literature.
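For intuition, a naive combinatorial baseline for ECC (not the paper's LP-based algorithm) assigns each vertex the plurality color among its incident edges and counts the unsatisfied edges:

```python
from collections import Counter, defaultdict

def plurality_ecc(edges):
    """edges: list of (vertex_set, color) pairs. Assign each vertex the
    plurality color among its incident edges, then count unsatisfied edges
    (an edge is satisfied only if every endpoint carries its color)."""
    incident = defaultdict(Counter)
    for verts, color in edges:
        for v in verts:
            incident[v][color] += 1
    assign = {v: c.most_common(1)[0][0] for v, c in incident.items()}
    unsatisfied = sum(
        any(assign[v] != color for v in verts) for verts, color in edges
    )
    return assign, unsatisfied

# Tiny hypothetical instance: three pairwise edges and one 3-edge.
edges = [({1, 2}, "r"), ({2, 3}, "r"), ({3, 4}, "b"), ({1, 2, 3}, "r")]
assign, bad = plurality_ecc(edges)
```

The overlapping (Local/Global) and robust variants relax exactly this rigid one-color-per-vertex assignment, which is where the LP-guided combinatorial algorithms come in.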
Hybrid-Balance GFlowNet for Solving Vehicle Routing Problems
Ni Zhang · Zhiguang Cao
Existing GFlowNet-based methods for vehicle routing problems (VRPs) typically employ Trajectory Balance (TB) to achieve global optimization but often neglect important aspects of local optimization. While Detailed Balance (DB) addresses local optimization more effectively, it alone falls short in solving VRPs, which inherently require holistic trajectory optimization. To address these limitations, we introduce the Hybrid-Balance GFlowNet (HBG) framework, which uniquely integrates TB and DB in a principled and adaptive manner by aligning their intrinsically complementary strengths. Additionally, we propose a specialized inference strategy for depot-centric scenarios like the Capacitated Vehicle Routing Problem (CVRP), leveraging the depot node's greater flexibility in selecting successors. Despite this specialization, HBG maintains broad applicability, extending effectively to problems without explicit depots, such as the Traveling Salesman Problem (TSP). We evaluate HBG by integrating it into two established GFlowNet-based solvers, i.e., AGFN and GFACS, and demonstrate consistent and significant improvements across both CVRP and TSP, underscoring the enhanced solution quality and generalization afforded by our approach.
Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning
Yuanyao Chen · Rongsheng Chen · Fu Luo · Zhenkun Wang
Neural Combinatorial Optimization (NCO) has emerged as a promising learning-based paradigm for addressing Vehicle Routing Problems (VRPs) by minimizing the need for extensive manual engineering. While existing NCO methods, trained on small-scale instances (e.g., 100 nodes), have demonstrated considerable success on problems of similar scale, their performance significantly degrades when applied to large-scale scenarios. This degradation arises from the distributional shift between training and testing data, rendering policies learned on small instances ineffective for larger problems. To overcome this limitation, we introduce a novel learning framework driven by Large Language Models (LLMs). This framework learns a projection between the training and testing distributions, which is then deployed to enhance the scalability of the NCO model. Notably, unlike prevailing techniques that necessitate joint training with the neural network, our approach operates exclusively during the inference phase, obviating the need for model retraining. Extensive experiments demonstrate that our method enables a backbone model (trained on 100-node instances) to achieve superior performance on large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) instances of up to 100K nodes from diverse distributions. The source code can be found at https://github.com/CIAM-Group/TTPL.
Hierarchical Optimization via LLM-Guided Objective Evolution for Mobility-on-Demand Systems
Yi Zhang · Yushen Long · Yun Ni · Liping Huang · Xiaohong Wang · Jun Liu
Online ride-hailing platforms aim to deliver efficient mobility-on-demand services, often facing challenges in balancing dynamic and spatially heterogeneous supply and demand. Existing methods typically fall into two categories: reinforcement learning (RL) approaches, which suffer from data inefficiency, oversimplified modeling of real-world dynamics, and difficulty enforcing operational constraints; or decomposed online optimization methods, which rely on manually designed high-level objectives that lack awareness of low-level routing dynamics. To address this issue, we propose a novel hybrid framework that integrates a large language model (LLM) with mathematical optimization in a dynamic hierarchical system: (1) it is training-free, removing the need for large-scale interaction data as in RL, and (2) it leverages the LLM to bridge cognitive limitations caused by problem decomposition by adaptively generating high-level objectives. Within this framework, the LLM serves as a meta-optimizer, producing semantic heuristics that guide a low-level optimizer responsible for constraint enforcement and real-time decision execution. These heuristics are refined through a closed-loop evolutionary process, driven by harmony search, which iteratively adapts the LLM prompts based on feasibility and performance feedback from the optimization layer. Extensive experiments based on scenarios derived from both the New York and Chicago taxi datasets demonstrate the effectiveness of our approach, achieving an average improvement of 16% compared to state-of-the-art baselines.
KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows
Zaifeng Pan · AJJKUMAR DAHYALAL PATEL · Yipeng Shen · Zhengding Hu · Yue Guan · Wan-Lu Li · Lianhui Qin · Yida Wang · Yufei Ding
Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse key-value (KV) tensors corresponding to agents' fixed prompts, thereby avoiding redundant computation across repeated invocations. However, current systems typically evict KV caches using a Least Recently Used (LRU) policy, which fails to anticipate future agent usage and often discards KV caches shortly before their reuse. This leads to frequent cache misses and substantial recomputation or swapping overhead. We present KVFlow, a workflow-aware KV cache management framework tailored for agentic workloads. KVFlow abstracts the agent execution schedule as an Agent Step Graph and assigns each agent a steps-to-execution value that estimates its temporal proximity to future activation. These values guide a fine-grained eviction policy at the KV node level, allowing KVFlow to preserve entries likely to be reused and efficiently manage shared prefixes in tree-structured caches. Moreover, KVFlow introduces a fully overlapped KV prefetching mechanism, which proactively loads required tensors from CPU to GPU in background threads for agents scheduled in the next step, thereby avoiding cache miss stalls during generation. Compared to SGLang with hierarchical radix cache, KVFlow achieves up to 1.83× speedup for single workflows with large prompts, and up to 2.19× speedup for scenarios with many concurrent workflows.
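The steps-to-execution idea can be sketched at the agent level (KVFlow itself operates at the finer KV-node level within tree-structured caches; the agent names here are hypothetical):

```python
from collections import deque

def steps_to_execution(graph, running):
    """BFS over the Agent Step Graph: distance (in steps) from the currently
    running agents to every other agent; smaller means needed sooner."""
    dist = {a: 0 for a in running}
    q = deque(running)
    while q:
        a = q.popleft()
        for nxt in graph.get(a, []):
            if nxt not in dist:
                dist[nxt] = dist[a] + 1
                q.append(nxt)
    return dist

def pick_eviction_victim(cached_agents, dist):
    """Evict the cached agent whose next activation is farthest away,
    unlike LRU, which only looks at past accesses."""
    return max(cached_agents, key=lambda a: dist.get(a, float("inf")))

graph = {"planner": ["coder", "tester"], "coder": ["reviewer"], "tester": ["reviewer"]}
dist = steps_to_execution(graph, running=["planner"])
victim = pick_eviction_victim(["coder", "reviewer"], dist)
```

Here the reviewer's KV cache is evicted before the coder's, since the coder runs in the very next step, which is precisely the case where LRU would make the wrong call.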
Partial Correlation Network Estimation by Semismooth Newton Methods
DongWon Kim · Sungdong Lee · Joong-Ho (Johann) Won
We develop a scalable second-order algorithm for a recently proposed $\ell_1$-regularized pseudolikelihood-based partial correlation network estimation framework. While the latter method admits statistical guarantees and is inherently scalable compared to likelihood-based methods such as graphical lasso, the currently available implementations rely only on first-order information and require thousands of iterations to obtain reliable estimates even on high-performance supercomputers. In this paper, we further investigate the inherent scalability of the framework and propose locally and globally convergent semismooth Newton methods. Despite the nonsmoothness of the problem, these second-order algorithms converge at a locally quadratic rate and require only a few tens of iterations in practice. Each iteration reduces to solving linear systems of small dimensions or linear complementarity problems of smaller dimensions, making the computation also suitable for less powerful computing environments. Experiments on both simulated and real-world genomic datasets demonstrate the superior convergence behavior and computational efficiency of the proposed algorithm, which position our method as a promising tool for the massive-scale network analysis required in, e.g., modern multi-omics research.
FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training
Yunqi Gao · Bing Hu · Boloursaz Mashhadi · A-Long Jin · Yanfeng Zhang · Pei Xiao · Rahim Tafazolli · Merouane DEBBAH
The parameter size of modern large language models (LLMs) can be scaled up to the trillion level via the sparsely-activated Mixture-of-Experts (MoE) technique to avoid an excessive increase in computational costs. To further improve training efficiency, pipelining computation and communication has become a promising solution for distributed MoE training. However, existing work primarily focuses on scheduling tasks within the MoE layer, such as expert computing and all-to-all (A2A) communication, while neglecting other key operations including multi-head attention (MHA) computing, gating, and all-reduce communication. In this paper, we propose FlowMoE, a scalable framework for scheduling multi-type task pipelines. First, FlowMoE constructs a unified pipeline to consistently schedule MHA computing, gating, expert computing, and A2A communication. Second, FlowMoE introduces a tensor chunk-based priority scheduling mechanism to overlap the all-reduce communication with all computing tasks. We implement FlowMoE as an adaptive and generic framework atop PyTorch. Extensive experiments with 675 typical MoE layers and four real-world MoE models across two GPU clusters demonstrate that our proposed FlowMoE framework outperforms state-of-the-art MoE training frameworks, reducing training time by 14%-57%, energy consumption by 10%-39%, and memory usage by 7%-32%. FlowMoE’s code is anonymously available at https://anonymous.4open.science/r/FlowMoE.
Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
Zachary Charles · Gabriel Teston · Lucio Dery · John Rush · Nova Fallen · Zachary Garrett · Arthur Szlam · Arthur Douillard
As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including the number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
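A toy version of the DiLoCo loop on a quadratic objective, assuming plain SGD inner steps and a generic momentum step on the averaged pseudo-gradient (the actual algorithm uses AdamW inner and Nesterov outer updates; all hyperparameters here are illustrative):

```python
import numpy as np

def diloco(grad, x0, replicas=4, outer_steps=40, inner_steps=10,
           inner_lr=0.05, outer_lr=0.7, momentum=0.6):
    """Toy DiLoCo: each replica runs local SGD between infrequent syncs;
    the outer step applies momentum SGD to the averaged pseudo-gradient."""
    x = x0.astype(float).copy()
    v = np.zeros_like(x)
    for _ in range(outer_steps):
        deltas = []
        for _ in range(replicas):          # replicas communicate only here
            local = x.copy()
            for _ in range(inner_steps):   # cheap local updates, no syncing
                local -= inner_lr * grad(local)
            deltas.append(x - local)       # this replica's pseudo-gradient
        v = momentum * v + np.mean(deltas, axis=0)
        x -= outer_lr * v                  # outer (server-side) update
    return x

x_final = diloco(lambda z: z, np.array([5.0, -3.0]))  # f(x) = 0.5 * ||x||^2
```

Synchronization happens once per `inner_steps` updates rather than every step, which is the communication saving whose scaling behavior the paper characterizes.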
ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
Xinhao Luo · Zihan Liu · Yangjie Zhou · Shihan Fang · Ziyu Huang · Yu Feng · Chen Zhang · Shixuan Sun · Zhenzhe Zheng · Jingwen Leng · Minyi Guo
Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by $1.61\times$ on average in end-to-end latency across different models and configurations.
A Near-Optimal Algorithm for Decentralized Convex-Concave Finite-Sum Minimax Optimization
Hongxu Chen · Ke Wei · Haishan Ye · Luo Luo
In this paper, we study the distributed convex-concave finite-sum minimax optimization over the network, and a decentralized variance-reduced optimistic gradient method with stochastic mini-batch sizes (DIVERSE) is proposed. For the strongly-convex-strongly-concave objective, it is shown that DIVERSE can achieve a linear convergence rate that depends on the global smoothness parameters, yielding sharper computation and communication complexity bounds than existing results. Furthermore, we also establish the lower complexity bounds, which show that our upper bounds are optimal up to a logarithmic factor in terms of the local incremental first-order oracle calls, the computation rounds, and the communication rounds. Numerical experiments demonstrate that our algorithm outperforms existing methods in practice.
Cost-aware LLM-based Online Dataset Annotation
Eray Can Elumar · Cem Tekin · Osman Yagan
Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.
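The aggregation step can be sketched with the standard log-odds weighting rule (this is only the voting rule; CaMVo's LinUCB-based subset selection and Bayesian confidence estimation are omitted, and the accuracies below are hypothetical):

```python
import numpy as np

def weighted_majority(labels, accuracies):
    """Aggregate annotator votes with log-odds weights derived from each
    model's estimated accuracy; CaMVo estimates such bounds online."""
    acc = np.array(accuracies)
    weights = np.log(acc / (1.0 - acc))   # accuracy 0.5 -> weight 0
    score = {}
    for lab, w in zip(labels, weights):
        score[lab] = score.get(lab, 0.0) + w
    return max(score, key=score.get)

# Three hypothetical annotators: one strong model outvotes two weak ones.
label = weighted_majority(["pos", "neg", "neg"], [0.95, 0.6, 0.6])
```

A plain majority vote would return "neg" here; weighting by estimated accuracy lets a single reliable (but expensive) model override two cheap, noisy ones, which is the trade-off CaMVo manages under a cost budget.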
Efficient semantic uncertainty quantification in language models via diversity-steered sampling
Ji Won Park · Kyunghyun Cho
Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model’s proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
$O(\sqrt{T})$ Static Regret and Instance Dependent Constraint Violation for Constrained Online Convex Optimization
Rahul Vaze · Abhishek Sinha
We consider the constrained version of the standard online convex optimization (OCO) framework, called COCO, where on every round a convex cost function and a convex constraint function are revealed to the learner after it chooses the action for that round. The objective is to simultaneously minimize the static regret and the cumulative constraint violation (CCV). An algorithm is proposed that guarantees a static regret of $O(\sqrt{T})$ and a CCV of $\min\{{\cal V}, O(\sqrt{T}\log T) \}$, where ${\cal V}$ depends on the distance between consecutively revealed constraint sets, the shape of the constraint sets, and the dimension and diameter of the action space. When the constraint sets have additional structure, ${\cal V}=O(1)$. Compared to the state-of-the-art results, a static regret of $O(\sqrt{T})$ and a CCV of $O(\sqrt{T}\log T)$, both of which are universal, the new CCV bound is instance-dependent and is derived by exploiting the geometric properties of the constraint sets.
Sign-In to the Lottery: Reparameterizing Sparse Training
Advait Gadhikar · Tom Jacobs · chao zhou · Rebekka Burkholz
The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem-specific parameter initialization. As we show, determining the correct parameter signs is sufficient to this end. Yet, correct signs remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge: closing the gap between PaI and dense-to-sparse training.
Power Lines: Scaling laws for weight decay and batch size in LLM pre-training
Shane Bergsma · Nolan Dey · Gurpreet Gosal · Gavia Gray · Daria Soboleva · Joel Hestness
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including the learning rate $\eta$ and weight decay $\lambda$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size $N$, dataset size $D$, and batch size $B$. Recent work suggests the AdamW timescale, $\tau = B/(\eta\lambda D)$, should remain constant across training settings, and we verify the implication that optimal $\lambda$ scales linearly with $B$, for fixed $N$ and $D$. However, as $N$ and $D$ scale, we show optimal $\tau$ obeys a precise power law in the tokens-per-parameter ratio, $D/N$. This law thus provides a method to accurately predict $\lambda_{\mathrm{opt}}$ in advance of large-scale training. We also study scaling laws for the optimal batch size $B_{\mathrm{opt}}$ (the $B$ enabling lowest loss at a given $N$, $D$) and the critical batch size $B_{\mathrm{crit}}$ (the $B$ beyond which further data parallelism becomes ineffective). In contrast to prior work, we find both $B_{\mathrm{opt}}$ and $B_{\mathrm{crit}}$ scale as power laws in $D$, independent of model size $N$. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal $N$ and $D$ under dual training time and compute objectives.
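The constant-timescale rule can be inverted to read off the implied weight decay from the other hyperparameters; the numbers below are made-up illustrations, not tuned settings:

```python
def lambda_from_timescale(tau, B, eta, D):
    """Invert tau = B / (eta * lambda * D) for the weight decay lambda.
    All values passed in below are illustrative, not recommendations."""
    return B / (tau * eta * D)

# Doubling batch size B doubles the implied optimal lambda, holding
# tau, eta, and D fixed (the linear scaling the abstract verifies).
l1 = lambda_from_timescale(tau=1e-3, B=256, eta=0.01, D=1e9)
l2 = lambda_from_timescale(tau=1e-3, B=512, eta=0.01, D=1e9)
print(l2 / l1)  # 2.0
```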
Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training
Ipsita Ghosh · Ethan Nguyen · Christian Kümmerle
Parameter-efficient training, based on low-rank optimization, has become a highly successful tool for fine-tuning large deep-learning models. However, these methods fail at low-rank pre-training tasks, where jointly maintaining the low-rank structure and optimizing the objective remains challenging. We propose the Quadratic Reweighted Rank Regularizer, dubbed Q3R, which leads to a novel low-rank-inducing training strategy inspired by the iteratively reweighted least squares (IRLS) framework. Q3R is based on a quadratic regularizer term that majorizes a smoothed log-determinant serving as a rank surrogate objective. Unlike other low-rank training techniques, Q3R is able to train weight matrices with prescribed low target ranks that achieve predictive performance comparable to dense models, with small computational overhead, while remaining fully compatible with existing architectures. In experiments, we are able to truncate 60% of the parameters of a ViT-Tiny model with marginal loss in CIFAR-10 performance, and up to 80% with only a 4% accuracy drop. The efficacy of Q3R is confirmed on Transformers across both image and language tasks, including low-rank fine-tuning.
Momentum-SAM: Sharpness Aware Minimization without Computational Overhead
Marlon Becker · Frederick Altrock · Benjamin Risse
Sharpness Aware Minimization (SAM), a recently proposed optimization algorithm for deep neural networks, perturbs parameters before the gradient calculation by a gradient ascent step to guide the optimization into flat regions of the loss landscape. While significant generalization improvements, and thus reduced overfitting, have been demonstrated, the computational cost is doubled by the additionally required gradient calculation, making SAM infeasible when computational capacity is limited. Motivated by Nesterov Accelerated Gradient (NAG), we propose Momentum-SAM (MSAM), which perturbs parameters in the direction of the accumulated momentum vector to achieve low sharpness without significant computational overhead or memory demands over SGD or Adam. We evaluate MSAM in detail and reveal insights on separable mechanisms of NAG, SAM, and MSAM regarding training optimization and generalization.
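A minimal sketch of the idea on a toy quadratic, assuming a SAM-style sign for the perturbation and made-up hyperparameters (the paper's exact conventions may differ):

```python
import numpy as np

def msam_step(w, m, grad_fn, eta=0.3, beta=0.5, rho=0.05):
    """One Momentum-SAM-style step (sketch): perturb the weights along
    the normalized accumulated momentum before the single gradient
    evaluation, so no second backward pass is needed (unlike SAM).
    Hyperparameters and the sign convention are illustrative assumptions."""
    norm = np.linalg.norm(m)
    w_pert = w + rho * m / norm if norm > 0 else w
    g = grad_fn(w_pert)          # the only gradient evaluation this step
    m = beta * m + g
    return w - eta * m, m

# Toy quadratic f(w) = 0.5 * ||w||^2, so the gradient is the identity map.
w, m = np.array([2.0, -1.5]), np.zeros(2)
for _ in range(50):
    w, m = msam_step(w, m, lambda x: x)
print(np.linalg.norm(w))  # far below the starting norm of 2.5
```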
FFN Fusion: Rethinking Sequential Computation in Large Language Models
Akhiad Bercovich · Mohammed Dabbah · Omri Puny · Ido Galil · Amnon Geifman · Yonatan Geifman · Izik Golan · Ehud Karpas · Itay Levy · Zach Moshe · Najeeb Nabwani · Tomer Ronen · Itamar Schen · Ido Shahaf · Oren Tropp · Ran Zilberstein · Ran El-Yaniv
We introduce \textit{FFN Fusion}, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create a 253B model (253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71$\times$ speedup in inference latency and 35$\times$ lower per-token cost while maintaining strong performance across benchmarks. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
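The core observation can be illustrated on toy FFN blocks: when per-layer updates are small relative to the residual stream, computing two residual FFNs in parallel from the same input closely approximates applying them sequentially. The weights and sizes below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16

def make_ffn(scale=0.05):
    """A toy two-layer FFN with small weights (illustrative only)."""
    W1 = rng.normal(0, scale, (d, h))
    W2 = rng.normal(0, scale, (h, d))
    return lambda x: np.maximum(x @ W1, 0) @ W2

f1, f2 = make_ffn(), make_ffn()
x = rng.normal(size=d)

seq = x + f1(x)
seq = seq + f2(seq)              # two dependent (sequential) layers
fused = x + f1(x) + f2(x)        # one parallel layer from the same input

rel = np.linalg.norm(seq - fused) / np.linalg.norm(seq)
print(rel)  # small: the fused output closely tracks the sequential one
```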
Time-Embedded Algorithm Unrolling for Computational MRI
Junno Yun · Yasar Utku Alcalar · Mehmet Akcakaya
Algorithm unrolling methods have proven powerful for solving the regularized least squares problem in computational magnetic resonance imaging (MRI). These approaches unfold an iterative algorithm with a fixed number of iterations, typically alternating between a neural network-based proximal operator for regularization, a data fidelity operation, and auxiliary updates with learnable parameters. While the connection to optimization methods dictates that the proximal operator network should be shared across unrolls, this can introduce artifacts or blurring. Heuristically, practitioners have shown that using distinct networks may be beneficial, but this significantly increases the number of learnable parameters, making it challenging to prevent overfitting. To address these shortcomings, taking inspiration from proximal operators with varying thresholds in approximate message passing (AMP) and the success of time-embedding in diffusion models, we propose a time-embedded algorithm unrolling scheme for inverse problems. Specifically, we introduce a novel perspective on the iteration-dependent proximal operation in vector AMP (VAMP) and the subsequent Onsager correction in the context of algorithm unrolling, framing them as a time-embedded neural network. Similarly, the scalar weights in the data fidelity operation and its associated Onsager correction are cast as time-dependent learnable parameters. Our extensive experiments on the fastMRI dataset, spanning various acceleration rates and datasets, demonstrate that our method effectively reduces aliasing artifacts and mitigates noise amplification, achieving state-of-the-art performance. Furthermore, we show that our time-embedding strategy extends to existing algorithm unrolling approaches, enhancing reconstruction quality without significantly increasing the computational complexity. Code available at https://github.com/JN-Yun/TE-Unrolling-MRI.
MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
Donghyeon Joo · Helya Hosseini · Ramyad Hadidi · Bahar Asgari
We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70\% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to the high memory overhead of large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format compresses the KV cache to as little as 45\% of its dense size, thereby enabling longer context lengths and increased throughput of up to 2.23$\times$ tokens/sec compared to dense inference. Our pruning mechanism and sparse attention kernel are available at https://github.com/dhjoo98/mustafar.
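Per-token magnitude-based pruning with a survivor bitmap can be sketched as follows; the cache values, sparsity level, and function name are made-up illustrations of the general technique, not the paper's kernel:

```python
import numpy as np

def prune_per_token(cache, sparsity=0.7):
    """Per-token magnitude pruning (sketch): within each token's vector,
    zero out the smallest-magnitude entries and record a bitmap marking
    the survivors, as a bitmap-based sparse format would store them."""
    keep = max(1, int(round(cache.shape[1] * (1 - sparsity))))
    out = np.zeros_like(cache)
    bitmap = np.zeros(cache.shape, dtype=bool)
    for t, row in enumerate(cache):
        idx = np.argsort(np.abs(row))[-keep:]   # largest-magnitude entries
        out[t, idx] = row[idx]
        bitmap[t, idx] = True
    return out, bitmap

kv = np.array([[0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -1.1],
               [1.2, 0.0, -0.4, 0.2, 2.5, -0.1, 0.3, -0.6]])
pruned, bitmap = prune_per_token(kv, sparsity=0.75)
print(bitmap.sum(axis=1))  # 2 survivors per token at 75% sparsity
```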
Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization
Chris Kolb · Laetitia Frost · Bernd Bischl · David Rügamer
Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$–regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance–sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful
Martin Marek · Sanae Lotfi · Aditya Somasundaram · Andrew Wilson · Micah Goldblum
Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. In particular, rather than holding the decay rate of the second moment fixed across batch sizes, we propose to hold its half-life fixed in terms of tokens. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas. Finally, we show that a small batch size combined with an optimizer with a small state size can provide the performance benefits of full fine-tuning while maintaining a similar memory footprint to LoRA.
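The proposed rule, holding the second-moment half-life fixed in tokens rather than fixing the decay rate beta2, can be implemented as a one-liner; the half-life value here is an illustrative assumption, not the paper's recommendation:

```python
def beta2_for_token_halflife(batch_size_tokens, halflife_tokens):
    """Set Adam's beta2 so the second-moment EMA has a fixed half-life
    measured in tokens: beta2 ** (H / B) = 1/2, hence beta2 = 0.5**(B/H).
    This is one way to realize the stated rule; H below is made up."""
    return 0.5 ** (batch_size_tokens / halflife_tokens)

H = 10_000_000  # e.g. a 10M-token half-life (illustrative)
# Smaller batches imply beta2 closer to 1: the same token half-life
# spans more optimizer steps.
for B in (4096, 65536, 1048576):
    print(B, round(beta2_for_token_halflife(B, H), 6))
```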
Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning
Xiao Han · ZIMO ZHAO · Wanyu Wang · Maolin Wang · Zitao Liu · Yi Chang · Xiangyu Zhao
Recent advancements in Large Language Models (LLMs) have emphasized the critical role of fine-tuning (FT) techniques in adapting LLMs to specific tasks, especially when retraining from scratch is computationally infeasible. Fine-tuning enables LLMs to leverage task- or domain-specific data, producing models that more effectively meet the requirements of targeted applications. However, conventional FT approaches often suffer from catastrophic forgetting and suboptimal data efficiency, limiting their real-world applicability. To address these challenges, this paper proposes DEAL, a novel framework that integrates Low-Rank Adaptation (LoRA) with a continuous fine-tuning strategy. By incorporating knowledge retention and adaptive parameter update modules, the framework mitigates the limitations of existing FT methods while maintaining efficiency. Experiments on 15 diverse datasets show that DEAL consistently outperforms baseline methods, yielding substantial gains in task accuracy and resource efficiency. These findings demonstrate the potential of our approach to advance continual adaptation in LLMs by enhancing task performance while improving resource efficiency. The source code is publicly available at https://github.com/Applied-Machine-Learning-Lab/DEAL.
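DEAL builds on LoRA; a generic LoRA forward pass (the standard recipe, not DEAL's knowledge-retention or adaptive-update modules) looks like this, with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

W0 = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(0, 0.01, size=(r, d_in))   # trainable low-rank factor
B = np.zeros((d_out, r))                  # B starts at zero: no initial drift

def lora_forward(x, alpha=8.0):
    """Frozen weight plus a scaled rank-r update; only A and B would be
    trained, so the update touches far fewer parameters than W0."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# At initialization the adapter is inert: output equals the frozen model's.
print(np.allclose(lora_forward(x), W0 @ x))  # True
```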
Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
Hossein Rajoli Nowdeh · Jie Ji · Xiaolong Ma · Fatemeh Afghah
In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports both early and late fusion scenarios. In every iteration, M-SAM optimizes learning in three steps. \textbf{First, it identifies the dominant modality} based on each modality's contribution to accuracy, using Shapley values. \textbf{Second, it decomposes the loss landscape}; in other words, it modulates the loss to prioritize the robustness of the model with respect to the dominant modality. \textbf{Third, M-SAM updates the weights} by backpropagating the modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from the others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient-manipulation methods, and significantly balances and improves multimodal learning. The code will be released.
Neural Evolution Strategy for Black-box Pareto Set Learning
Chengyu Lu · Zhenhua Li · Xi Lin · Ji Cheng · Qingfu Zhang
Multi-objective optimization problems (MOPs) are prevalent in numerous real-world applications. Recently, Pareto Set Learning (PSL) has emerged as a powerful paradigm for solving MOPs. PSL can produce a neural network for modeling the set of all Pareto optimal solutions. However, applying PSL to black-box objectives, particularly those exhibiting non-separability, high dimensionality, and/or other complex properties, remains very challenging. To address this issue, we propose leveraging evolution strategies (ESs), a class of specialized black-box optimization algorithms, within the PSL paradigm. Traditional ESs capture complex dimensional dependencies less efficiently, which can significantly hinder their performance in PSL. To tackle this limitation, we suggest encapsulating the dependencies within a neural network, which is then trained using a novel gradient estimation method. The proposed method, termed Neural-ES, is evaluated using a bespoke benchmark suite for black-box PSL. Experimental comparisons with other methods demonstrate the efficiency of Neural-ES, underscoring its ability to learn the Pareto sets of challenging black-box MOPs.
Thompson Sampling in Function Spaces via Neural Operators
Rafael Oliveira · Xuesong Wang · Kian Ming Chai · Edwin Bonilla
We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that queries to the operator (such as running a high-fidelity simulator or physical experiment) are costly, while functional evaluations on the operator's output are inexpensive. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process (GP) posterior. We derive regret bounds and theoretical results connecting neural operators with GPs in infinite-dimensional settings. Experiments benchmark our method against other Bayesian optimization baselines on functional optimization tasks involving partial differential equations of physical systems, demonstrating better sample efficiency and significant performance gains.
Online Portfolio Selection with ML Predictions
Ziliang Zhang · Tianming Zhao · Albert Zomaya
Online portfolio selection seeks to determine a sequence of allocations to maximize capital growth. Classical universal strategies asymptotically match the best constant-rebalanced portfolio but ignore potential forecasts, whereas heuristic methods often collapse when their beliefs fail. We formalize this tension in a learning-augmented setting in which an investor observes (possibly erroneous) predictions prior to each decision moment, and we introduce the Rebalanced Arithmetic Mean portfolio with predictions (RAM). Under arbitrary return sequences, we prove that RAM captures at least a constant fraction of the hindsight-optimal wealth when forecasts are perfect, while still exceeding the geometric mean of the sequence even when the predictions are adversarial. Comprehensive experiments on large-scale equity data, spanning both synthetic prediction streams and production-grade machine-learning models, corroborate our theory. RAM holds advantages over universal-portfolio variants equipped with side information across various regimes. These results demonstrate that modest predictive power can be reliably converted into tangible gains without sacrificing worst-case guarantees.
Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
Nan Bao · Yifan Zhao · Lin Zhu · Jia Li
Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant loss of RGB information, severely damaging segmentation results. Several studies exploit the high-speed, high-dynamic-range event modality as a complement, but event and RGB are naturally heterogeneous, which leads to feature-level mismatch and inferior optimization in existing multi-modality methods. Unlike these studies, we delve into the edge cues shared by both modalities for resilient fusion and propose a novel Edge-awareness Semantic Concordance framework to unify the heterogeneous multi-modality features with latent edge cues. In this framework, we first propose Edge-awareness Latent Re-coding, which obtains uncertainty indicators while realigning event-RGB features into a unified semantic space guided by a re-coded distribution, transferring event-RGB distributions into re-coded features by using a pre-established edge dictionary as clues. We then propose Re-coded Consolidation and Uncertainty Optimization, which utilize the re-coded edge features and uncertainty indicators to resolve heterogeneous event-RGB fusion issues under extreme conditions. We establish two synthetic and one real-world event-RGB semantic segmentation datasets for extreme-scenario comparisons. Experimental results show that our method outperforms the state-of-the-art by 2.55% mIoU on our proposed DERS-XS and possesses superior resilience under spatial occlusion. Our code and datasets are publicly available at https://github.com/iCVTEAM/ESC.
Variational Learning Finds Flatter Solutions at the Edge of Stability
Avrajit Ghosh · Bai Cong · Rio Yokota · Saiprasad Ravishankar · Rongrong Wang · Molei Tao · Mohammad Emtiyaz Khan · Thomas Möllenhoff
Variational Learning (VL) has recently gained popularity for training deep neural networks. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length, and marginal likelihood, but little has been done to unravel the implicit regularization at play. Here, we analyze the implicit regularization of VL through the Edge of Stability (EoS) framework. EoS has previously been used to show that gradient descent can find flat solutions, and we extend this result to show that VL can find even flatter solutions. This result is obtained by controlling the shape of the variational posterior as well as the number of posterior samples used during training. The derivation follows in a similar fashion as in the standard EoS literature for deep learning, by first deriving a result for a quadratic problem and then extending it to deep neural networks. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to analyze the EoS dynamics of VL.
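The quadratic case that anchors such derivations can be checked directly: gradient descent on a quadratic with curvature (sharpness) s contracts iff |1 - eta*s| < 1, i.e. s < 2/eta; the step size and curvatures below are illustrative:

```python
def gd_converges(sharpness, eta, steps=200, w0=1.0):
    """GD on f(w) = sharpness/2 * w**2: the iterate is multiplied by
    (1 - eta*sharpness) each step, so it contracts iff sharpness < 2/eta,
    the classical Edge-of-Stability threshold for the quadratic case."""
    w = w0
    for _ in range(steps):
        w *= (1.0 - eta * sharpness)
    return abs(w) < abs(w0)

eta = 0.1
print(gd_converges(19.0, eta))  # just below 2/eta = 20: stable
print(gd_converges(21.0, eta))  # just above: divergent
```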
Conformal Prediction in The Loop: A Feedback-Based Uncertainty Model for Trajectory Optimization
Han Wang · Chao Ning
Conformal Prediction (CP) is a powerful statistical machine learning tool to construct uncertainty sets with coverage guarantees, which has fueled its extensive adoption in generating prediction regions for decision-making tasks, e.g., Trajectory Optimization (TO) in uncertain environments. However, existing methods predominantly employ a sequential scheme, where decisions rely unidirectionally on the prediction regions, and consequently the information from decision-making fails to be fed back to instruct CP. In this paper, we propose a novel Feedback-Based CP (Fb-CP) framework for shrinking-horizon TO with a joint risk constraint over the entire mission time. Specifically, a CP-based posterior risk calculation method is developed by fully leveraging the realized trajectories to adjust the posterior allowable risk, which is then allocated to future times to update prediction regions. In this way, the information in the realized trajectories is continuously fed back to the CP, enabling attractive feedback-based adjustments of the prediction regions and a provable online improvement in trajectory performance. Furthermore, we theoretically prove that such adjustments consistently maintain the coverage guarantees of the prediction regions, thereby ensuring provable safety. Additionally, we develop a decision-focused iterative risk allocation algorithm with theoretical convergence analysis for allocating the posterior allowable risk, which closely aligns with Fb-CP. Finally, we extend the proposed method to handle distribution shift. The effectiveness and superiority of the proposed method are demonstrated through benchmark experiments.
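For context, the standard split-conformal quantile on which such prediction regions are built (the vanilla recipe, not Fb-CP's feedback adjustment) is computed as:

```python
import numpy as np

def conformal_radius(scores, alpha=0.1):
    """Split-conformal prediction radius: the ceil((n+1)(1-alpha))/n
    empirical quantile of calibration nonconformity scores yields at
    least 1 - alpha marginal coverage on exchangeable data."""
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

# Synthetic calibration errors standing in for trajectory prediction errors.
cal = np.abs(np.random.default_rng(0).normal(size=200))
r = conformal_radius(cal, alpha=0.1)
print(r > 0)  # a positive radius covering >= 90% of calibration scores
```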