San Diego Journal Track Papers
Multivariate Dynamic Mediation Analysis under a Reinforcement Learning Framework
Mediation analysis is an important analytic tool commonly used in a broad range of scientific applications. In this article, we study the problem of mediation analysis when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. The problem is challenging, because the effect of a mediator involves not only the path from the treatment to this mediator itself at the current time point, but also all possible paths pointing to this mediator from its upstream mediators, as well as the carryover effects from all previous time points. We propose a novel multivariate dynamic mediation analysis approach. Drawing inspiration from the Markov decision process model that is frequently employed in reinforcement learning, we introduce a Markov mediation process paired with a system of time-varying linear structural equation models to formulate the problem. We then formally define the individual mediation effect, built upon the idea of simultaneous interventions and intervention calculus. We next derive the closed-form expression, propose an iterative estimation procedure under the Markov mediation process model, and develop a bootstrap method to infer the individual mediation effect. We study both the asymptotic properties and the empirical performance of the proposed methodology, and further illustrate its usefulness with a mobile health application.
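As a deliberately simplified illustration of the bootstrap-inference step, the sketch below estimates a classical single-time-point, single-mediator effect via the product-of-coefficients formula and attaches a nonparametric bootstrap confidence interval. The paper's multivariate dynamic setting with upstream mediators and carryover effects is substantially more general; all variable names and coefficient values here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-time-point mediation model (illustrative only):
# M = a*A + noise,  Y = c*A + b*M + noise, so the mediation effect is a*b.
n = 500
A = rng.binomial(1, 0.5, n).astype(float)
M = 0.8 * A + rng.normal(size=n)
Y = 0.3 * A + 1.2 * M + rng.normal(size=n)

def mediation_effect(A, M, Y):
    # OLS fits of the two structural equations; effect = a_hat * b_hat.
    a_hat = np.polyfit(A, M, 1)[0]
    X = np.column_stack([np.ones_like(A), A, M])
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]
    b_hat = coef[2]
    return a_hat * b_hat

est = mediation_effect(A, M, Y)

# Nonparametric bootstrap for a percentile confidence interval.
boots = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    boots.append(mediation_effect(A[idx], M[idx], Y[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"effect ~ {est:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```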
Distributionally Robust Learning for Multi-source Unsupervised Domain Adaptation
Empirical risk minimization often performs poorly when the distribution of the target domain differs from those of the source domains. To address such potential distributional shifts, we develop an unsupervised domain adaptation approach that leverages labeled data from multiple source domains and unlabeled data from the target domain. We introduce a distributionally robust model that optimizes an adversarial reward based on explained variance across a class of target distributions, ensuring generalization to the target domain. We show that the proposed robust model is a weighted average of conditional outcome models from the source domains. This formulation allows us to compute the robust model through the aggregation of source models, which can be estimated using various machine learning algorithms of the user’s choice such as random forests, boosting, and neural networks. Additionally, we introduce a bias-correction step to obtain a more accurate aggregation weight, which is effective for various machine learning algorithms. Our framework can be interpreted as a distributionally robust federated learning approach that satisfies privacy constraints while providing insights into the importance of each source for prediction on the target domain. The performance of our method is evaluated on both simulated and real data.
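The aggregation step can be illustrated with a small sketch: fit one conditional outcome model per source domain with an off-the-shelf learner, then combine their predictions on the target covariates with domain weights. The uniform weights below are placeholders; the paper derives the weights from the adversarial explained-variance objective and adds a bias-correction step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Toy setup: labeled source domains with shifted distributions, unlabeled target.
def make_source(shift, n=300):
    X = rng.normal(loc=shift, size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + shift + rng.normal(scale=0.5, size=n)
    return X, y

sources = [make_source(s) for s in (-1.0, 0.0, 1.0)]
X_target = rng.normal(loc=0.5, size=(200, 3))   # unlabeled target covariates

# Step 1: fit a conditional outcome model per source with any learner of choice.
models = [RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
          for X, y in sources]

# Step 2: aggregate source predictions on the target with domain weights.
# Placeholder (uniform) weights; the paper obtains them from a distributionally
# robust objective based on explained variance, with a bias-correction step.
weights = np.ones(len(models)) / len(models)
preds = np.stack([m.predict(X_target) for m in models], axis=1)
robust_prediction = preds @ weights
```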
RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control
Deep Reinforcement Learning (RL) can yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Its novel architecture allows RLtools to be used on a wide variety of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems up to 76 times faster than other popular RL frameworks. We also benchmark the inference on a diverse set of microcontrollers and show that in most cases our optimized implementation is by far the fastest. Finally, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of TinyRL. The source code as well as documentation and live demos are available through our project page at https://rl.tools.
Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
Asymptotic Theory of Geometric and Adaptive $k$-Means Clustering
Improved Learning Theory for Kernel Distribution Regression with Two-Stage Sampling
The distribution regression problem encompasses many important statistics and machine learning tasks, and arises in a large range of applications. Among various existing approaches to tackle this problem, kernel methods have become a method of choice. Indeed, kernel distribution regression is both computationally favorable and supported by a recent learning theory. This theory also tackles the two-stage sampling setting, where only samples from the input distributions are available. In this paper, we improve the learning theory of kernel distribution regression. We address kernels based on Hilbertian embeddings, which encompass most, if not all, of the existing approaches. We introduce a novel near-unbiased condition on the Hilbertian embeddings, which enables us to provide new error bounds on the effect of the two-stage sampling, thanks to a new analysis. We show that this near-unbiased condition holds for three important classes of kernels, based on optimal transport and mean embedding. As a consequence, we strictly improve the existing convergence rates for these kernels. Our setting and results are illustrated by numerical experiments.
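A minimal sketch of kernel distribution regression under two-stage sampling: each input is observed only through a bag of samples, bags are compared through their empirical (Hilbertian) mean embeddings via the MMD, and kernel ridge regression is run on top. The Gaussian-on-MMD kernel and all constants below are illustrative choices, not the paper's specific estimators.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each "input" is a bag of samples drawn from an unobserved distribution;
# the label depends on that distribution (here, on its mean).
def make_bag():
    mu = rng.uniform(-2, 2)
    return rng.normal(mu, 1.0, size=50), mu   # (two-stage sample, label)

bags, labels = zip(*[make_bag() for _ in range(80)])
y = np.array(labels)

def mmd2(x, z, gamma=0.5):
    # Squared MMD between empirical measures with a Gaussian base kernel.
    def k(a, b):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(z, z).mean() - 2 * k(x, z).mean()

# Outer Gaussian kernel on the mean embeddings of the bags.
n = len(bags)
D = np.array([[mmd2(bags[i], bags[j]) for j in range(n)] for i in range(n)])
K = np.exp(-1.0 * D)

# Kernel ridge regression on distributions.
lam = 1e-3
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(new_bag):
    k_new = np.exp(-1.0 * np.array([mmd2(new_bag, b) for b in bags]))
    return k_new @ alpha

print(predict(rng.normal(1.0, 1.0, size=50)))  # label is the bag's mean, so near 1.0
```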
Statistical Inference for Decentralized Federated Learning
This paper considers decentralized Federated Learning (FL) under heterogeneous distributions among distributed clients or data blocks for M-estimation. The mean squared error and consensus error across the estimators from different clients via the decentralized stochastic gradient descent algorithm are derived. The asymptotic normality of the Polyak–Ruppert (PR) averaged estimator in the decentralized distributed setting is attained, which shows that its statistical efficiency comes at a cost as it is more restrictive on the number of clients than that in the distributed M-estimation. To overcome the restriction, a one-step estimator is proposed which permits a much larger number of clients while still achieving the same efficiency as the original PR-averaged estimator in the nondistributed setting. The confidence regions based on both the PR-averaged estimator and the proposed one-step estimator are constructed to facilitate statistical inference for decentralized FL.
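The setting can be sketched as follows: clients hold heterogeneous local data, run decentralized SGD by gossiping with neighbours through a doubly stochastic mixing matrix, and maintain a Polyak–Ruppert average of their iterates. The ring topology, step sizes, and toy linear model below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy decentralized M-estimation: K clients, linear model, ring topology.
K, d, n_local, T = 8, 5, 200, 2000
theta_star = rng.normal(size=d)
data = []
for _ in range(K):
    X = rng.normal(size=(n_local, d))
    y = X @ theta_star + rng.normal(scale=0.5, size=n_local)
    data.append((X, y))

# Doubly stochastic mixing matrix for a ring (self + two neighbours).
W = np.zeros((K, K))
for i in range(K):
    W[i, i] = 0.5
    W[i, (i - 1) % K] = 0.25
    W[i, (i + 1) % K] = 0.25

theta = np.zeros((K, d))
pr_avg = np.zeros((K, d))
for t in range(1, T + 1):
    # 1) gossip/consensus step with neighbours,
    theta = W @ theta
    # 2) local stochastic gradient step on one sampled data point,
    for i in range(K):
        X, y = data[i]
        j = rng.integers(n_local)
        grad = (X[j] @ theta[i] - y[j]) * X[j]
        theta[i] -= 0.1 * t ** (-0.75) * grad
    # 3) Polyak-Ruppert running average of the iterates.
    pr_avg += (theta - pr_avg) / t

print(np.linalg.norm(pr_avg.mean(axis=0) - theta_star))
```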
On the Convergence of Projected Policy Gradient for Any Constant Step Sizes
Projected policy gradient (PPG) is a basic policy optimization method in reinforcement learning. Given access to exact policy evaluations, previous studies have established the sublinear convergence of PPG for sufficiently small step sizes, based on the smoothness and the gradient domination properties of the value function. However, as the step size goes to infinity, PPG reduces to the classic policy iteration method, which suggests the convergence of PPG even for large step sizes. In this paper, we fill this gap and show that PPG admits sublinear convergence for any constant step size. Due to the presence of the state-wise visitation measure in the expression of the policy gradient, the existing optimization-based analysis framework for a preconditioned version of PPG (i.e., projected Q-ascent) is not applicable, to the best of our knowledge. Instead, we proceed with the proof by computing a state-wise improvement lower bound for PPG based on its inherent structure. In addition, the finite-iteration convergence of PPG for any constant step size is further established, which is also new.
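To make the object of study concrete, here is a small sketch of projected policy gradient with the direct (simplex) parameterization and exact policy evaluation on a random tabular MDP: evaluate the policy, form the policy gradient through the state visitation measure and the Q-function, take a gradient step, and project each state's action distribution back onto the simplex. The step size and the MDP are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Random tabular MDP with exact policy evaluation.
S, A, gamma, eta = 5, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s']
r = rng.uniform(size=(S, A))
rho = np.ones(S) / S                             # initial-state distribution

def project_simplex(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1 - css[k]) / (k + 1)
    return np.maximum(v + tau, 0)

pi = np.ones((S, A)) / A                         # direct parameterization
for _ in range(200):
    # Exact evaluation: V = (I - gamma * P_pi)^{-1} r_pi, then Q and d_rho.
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                        # Q[s, a]
    d_rho = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    grad = d_rho[:, None] * Q / (1 - gamma)      # policy gradient wrt pi(a|s)
    # Projected gradient ascent step, state by state.
    pi = np.array([project_simplex(pi[s] + eta * grad[s]) for s in range(S)])

print("final policy value (approx.):", rho @ V)
```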
Deep Nonlinear Sufficient Dimension Reduction
A duality framework for analyzing random feature and two-layer neural networks
A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key---provided by the LLM to the verifier---to control the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks---one of which has been internally implemented at OpenAI---and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments.
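The hypothesis-testing recipe can be sketched generically: compute a per-token pivotal statistic from the text and the secret key, assumed i.i.d. Uniform(0,1) under the null of human-written text, and reject when a sum-based statistic exceeds an exact null quantile. The particular pivot transform, the Beta alternative used to mimic watermarked text, and the constants below are illustrative assumptions, not the optimal detection rules derived in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def detect(pivots, alpha=0.01):
    """Sum-based watermark detection rule.

    `pivots` are per-token pivotal statistics computed from the text and the
    secret key; the framework assumes they are i.i.d. Uniform(0,1) when the
    text is human-written (the null hypothesis).
    """
    T = np.sum(-np.log(1.0 - np.asarray(pivots)))        # test statistic
    threshold = stats.gamma.ppf(1 - alpha, a=len(pivots))  # exact null quantile
    return T > threshold                                   # True => flag as LLM-generated

# Human-written text: pivots are uniform, so the false positive rate is ~alpha.
human = rng.uniform(size=(5000, 200))
print("FPR:", np.mean([detect(p) for p in human]))

# Watermarked text: pivots are stochastically larger under the alternative
# (here Beta(3, 1) as a stand-in), so the test has high power.
watermarked = rng.beta(3.0, 1.0, size=(5000, 200))
print("power:", np.mean([detect(p) for p in watermarked]))
```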
Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks
Interpretable Global Minima of Deep ReLU Neural Networks on Sequentially Separable Data
Unified Algorithms for RL with Decision-Estimation Coefficients: PAC, Reward-Free, Preference-Based Learning, and Beyond
Modern Reinforcement Learning (RL) is more than just learning the optimal policy; alternative learning goals such as exploring the environment, estimating the underlying model, and learning from preference feedback are all of practical importance. While provably sample-efficient algorithms for each specific goal have been proposed, these algorithms often depend strongly on the particular learning goal and thus admit correspondingly different structures. It is a pressing open question whether these learning goals can instead be tackled by a single unified algorithm. We make progress on this question by developing a unified algorithmic framework for a large class of learning goals, building on the Decision-Estimation Coefficient (DEC) framework. Our framework handles many learning goals such as no-regret RL, PAC RL, reward-free learning, model estimation, and preference-based learning, all by simply instantiating the same generic complexity measure, called the "Generalized DEC", and a corresponding generic algorithm. The generalized DEC also yields a sample complexity lower bound for each specific learning goal. As applications, we propose "decouplable representation" as a natural sufficient condition for bounding generalized DECs, and use it to obtain many new sample-efficient results (and recover existing results) for a wide range of learning goals and problem classes as direct corollaries. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy sample complexity bounds under structural conditions similar to the DEC.
Data-Driven Performance Guarantees for Classical and Learned Optimizers
We introduce a data-driven approach to analyze the performance of continuous optimization algorithms using generalization guarantees from statistical learning theory. We study classical and learned optimizers to solve families of parametric optimization problems. We build generalization guarantees for classical optimizers, using a sample convergence bound, and for learned optimizers, using the Probably Approximately Correct (PAC)-Bayes framework. To train learned optimizers, we use a gradient-based algorithm to directly minimize the PAC-Bayes upper bound. Numerical experiments in signal processing, control, and meta-learning showcase the ability of our framework to provide strong generalization guarantees for both classical and learned optimizers given a fixed budget of iterations. For classical optimizers, our bounds, which hold with high probability, are much tighter than those provided by worst-case guarantees. For learned optimizers, our bounds outperform the empirical outcomes observed in their non-learned counterparts.
Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
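On the simplest compact example, the circle, the flavour of such constructions can be sketched by building a stationary kernel from a truncated eigenfunction (Fourier) expansion with Matérn-like spectral decay and then sampling a Gaussian process prior from it. The decay exponents and truncation level below are illustrative assumptions; the paper develops the general machinery for compact Lie groups and their homogeneous spaces.

```python
import numpy as np

def circle_matern_kernel(x, y, nu=1.5, kappa=1.0, n_max=50):
    # Stationary kernel on the circle from a truncated spectral expansion:
    # k(x, y) = sum_n a_n cos(n (x - y)),  a_n ~ (kappa^2 + n^2)^(-(nu + 1/2)).
    n = np.arange(n_max + 1)
    a = (kappa ** 2 + n ** 2) ** (-(nu + 0.5))
    a[1:] *= 2.0                      # each nonzero frequency appears twice (+/- n)
    diff = np.subtract.outer(np.atleast_1d(x), np.atleast_1d(y))
    K = np.tensordot(a, np.cos(n[:, None, None] * diff), axes=1)
    return K / K.max()                # normalize so that k(x, x) = 1

theta = np.linspace(0, 2 * np.pi, 100)
K = circle_matern_kernel(theta, theta)

# Sample a stationary Gaussian process prior on the circle.
f = np.random.default_rng(6).multivariate_normal(
    np.zeros(len(theta)), K + 1e-8 * np.eye(len(theta)))
```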
Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality
This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, that is, the propensities of exploring all actions for all individual characteristics must be lower bounded; the performance of the existing methods thus depends on the worst-case propensity in the offline data set. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities. In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs)—instead of point estimates—of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein’s inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization–Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL.
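A stripped-down illustration of the pessimistic principle on a finite policy class: form inverse-propensity-weighted value estimates from logged bandit data, subtract an empirical-Bernstein-style confidence width, and pick the policy with the largest lower confidence bound. The data-generating process, the width constants, and the threshold policy class are toy assumptions; the paper's PPL algorithm additionally handles adaptively collected data and optimizes over policy trees via Majorization–Minimization.

```python
import numpy as np

rng = np.random.default_rng(7)

# Offline logged bandit data: context x, action a ~ behaviour policy, reward r.
n, K = 5000, 3
X = rng.uniform(-1, 1, size=n)
behaviour = np.full(K, 1.0 / K)                    # known logging propensities
A = rng.integers(0, K, size=n)
R = (A == (X > 0).astype(int)).astype(float) + 0.1 * rng.normal(size=n)

def lcb_value(policy, delta=0.05):
    """IPW value estimate minus an empirical-Bernstein-style confidence width."""
    w = (policy(X) == A) / behaviour[A]            # importance weights
    z = w * R
    mean, var = z.mean(), z.var(ddof=1)
    width = (np.sqrt(2 * var * np.log(2 / delta) / n)
             + 7 * w.max() * np.log(2 / delta) / (3 * (n - 1)))
    return mean - width

# A tiny policy class: threshold rules x > c -> action 1, else action 0.
policies = [lambda x, c=c: (x > c).astype(int) for c in np.linspace(-0.5, 0.5, 11)]
best = max(policies, key=lcb_value)
print("chosen policy's lower confidence bound:", lcb_value(best))
```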
Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift
We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets, and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill the missing labels and then select the best candidate accordingly. Our non-asymptotic excess risk bounds demonstrate that our estimator adapts effectively to both the structure of the target distribution and the covariate shift. This adaptation is quantified through a notion of effective sample size that reflects the value of labeled source data for the target regression task. Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance.
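The procedure lends itself to a short sketch: split the labeled data, fit a grid of candidate kernel ridge regressors on one half and an imputation model on the other, pseudo-label the unlabeled target covariates, and select the candidate that best matches the pseudo-labels over the target distribution. The kernel bandwidth, regularization grid, and toy covariate-shift setup below are illustrative choices.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(8)
f = lambda x: np.sin(3 * x).ravel()

# Labeled source data (covariates concentrated near 0) and unlabeled target data.
X_src = rng.normal(0.0, 0.5, size=(400, 1))
y_src = f(X_src) + 0.2 * rng.normal(size=400)
X_tgt = rng.uniform(-2, 2, size=(300, 1))          # unlabeled target covariates

# Step 1: split the labeled data into two halves.
X_a, y_a = X_src[:200], y_src[:200]
X_b, y_b = X_src[200:], y_src[200:]

# Step 2: candidate models on one half (here, a grid of ridge penalties)...
lams = [10.0 ** k for k in range(-4, 1)]
candidates = [KernelRidge(kernel="rbf", gamma=1.0, alpha=lam).fit(X_a, y_a) for lam in lams]
# ...and an imputation model on the other half.
imputer = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(X_b, y_b)

# Step 3: pseudo-label the target covariates and select the candidate that
# best fits the pseudo-labels over the target distribution.
pseudo = imputer.predict(X_tgt)
errors = [np.mean((m.predict(X_tgt) - pseudo) ** 2) for m in candidates]
best = candidates[int(np.argmin(errors))]
print("selected regularization:", lams[int(np.argmin(errors))])
```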
A Geometrical Analysis of Kernel Ridge Regression and its Applications
We obtain upper bounds for the estimation error of Kernel Ridge Regression (KRR) for all non-negative regularization parameters, offering a geometric perspective on various phenomena in KRR. As applications: 1. We address the multiple descent problem, unifying the proofs of arXiv:1908.10292 and arXiv:1904.12191 for polynomial kernels, and we establish multiple descent for the upper bound of the estimation error of KRR under sub-Gaussian design in non-asymptotic regimes. 2. For a sub-Gaussian design vector and in the non-asymptotic regime, we prove a one-sided isomorphic version of the Gaussian Equivalent Conjecture. 3. We offer a novel perspective on the linearization of kernel matrices of non-linear kernels, extending it to the power regime for polynomial kernels. 4. Our theory is applicable to data-dependent kernels, providing a convenient and accurate tool for the feature learning regime in deep learning theory. 5. Our theory extends the results in arXiv:2009.14286 under weak moment assumptions. Our proof is based on three mathematical tools developed in this paper that can be of independent interest: 1. A Dvoretzky-Milman theorem for ellipsoids under (very) weak moment assumptions. 2. A Restricted Isomorphic Property in Reproducing Kernel Hilbert Spaces with embedding index conditions. 3. A concentration inequality for finite-degree polynomial kernel functions.
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Estimation and Inference in Distributional Reinforcement Learning
In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. We use the certainty-equivalence method to construct our estimator $\hat\eta^\pi_n$, based on a generative model. In this setting, a dataset of size $\widetilde O(|\mathcal{S}||\mathcal{A}|\varepsilon^{-2p}(1-\gamma)^{-(2p+2)})$ suffices to guarantee that the supremum $p$-Wasserstein metric between $\hat\eta^\pi_n$ and $\eta^\pi$ is less than $\varepsilon$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency. We also show that, under different mild assumptions, a dataset of size $\widetilde O(|\mathcal{S}||\mathcal{A}|\varepsilon^{-2}(1-\gamma)^{-4})$ suffices to ensure that the supremum Kolmogorov-Smirnov metric and the supremum total variation metric between $\hat\eta^\pi_n$ and $\eta^\pi$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi_n$. We demonstrate that the "empirical process" $\sqrt{n}(\hat\eta^\pi_n-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on a Lipschitz function class $\ell^\infty(\mathcal{F}_{W_1})$, as well as in the space of bounded functionals on an indicator function class $\ell^\infty(\mathcal{F}_{\textup{KS}})$ and a bounded measurable function class $\ell^\infty(\mathcal{F}_{\textup{TV}})$, when some mild conditions hold.
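A hedged sketch of the certainty-equivalence idea: estimate the transition kernel from generative-model samples, then approximate the plug-in return distribution $\hat\eta^\pi_n$ by Monte Carlo rollouts in the estimated MDP and compare it to rollouts in the true model with an empirical 1-Wasserstein distance. The sample sizes and the rollout approximation are illustrative simplifications of the estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(9)

# True tabular MDP and a fixed policy pi.
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)              # pi[s, a]

# Generative model: draw N next-state samples per (s, a) and build the plug-in MDP.
N = 200
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        samples = rng.choice(S, size=N, p=P[s, a])
        P_hat[s, a] = np.bincount(samples, minlength=S) / N

def rollout_returns(P_model, n_paths=1000, horizon=100, s0=0):
    # Monte Carlo approximation of the return distribution eta^pi(s0)
    # under the (estimated) model -- the certainty-equivalence plug-in.
    returns = np.zeros(n_paths)
    for i in range(n_paths):
        s, disc = s0, 1.0
        for _ in range(horizon):
            a = rng.choice(A, p=pi[s])
            returns[i] += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P_model[s, a])
    return returns

eta_hat = rollout_returns(P_hat)   # estimate of the return distribution
eta_ref = rollout_returns(P)       # rollouts in the true model, for comparison
print("approx. W1 distance:", np.mean(np.abs(np.sort(eta_hat) - np.sort(eta_ref))))
```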
Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets
We propose data-dependent uniform generalization bounds by approaching the problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian framework on ``random sets'' in a rigorous way, where the training algorithm is assumed to output a data-dependent hypothesis set after observing the training data. This approach allows us to prove data-dependent bounds, which can be applicable in numerous contexts. To highlight the power of our approach, we consider two main applications. First, we propose a PAC-Bayesian formulation of the recently developed fractal-dimension-based generalization bounds. The derived results are shown to be tighter and they unify the existing results around one simple proof technique. Second, we prove uniform bounds over the trajectories of continuous Langevin dynamics and stochastic gradient Langevin dynamics. These results provide novel information about the generalization properties of noisy algorithms.
Neural Networks Generalize on Low Complexity Data
Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning
Reinforcement learning has emerged as one of the prominent topics attracting attention in modern statistical learning, with policy evaluation being a key component. Unlike the traditional machine learning literature on this topic, our work emphasizes statistical inference for the model parameters and value functions of reinforcement learning algorithms. While most existing analyses assume that random rewards follow standard distributions, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop a fully online robust policy evaluation procedure and establish the Bahadur-type representation of our estimator. Furthermore, we develop an online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper connects robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to online policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in simulations and real-world reinforcement learning experiments.
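One plausible instantiation of the robust idea, purely for illustration: online linear TD(0) policy evaluation in which the temporal-difference error is clipped Huber-style so that heavy-tailed rewards and occasional outliers do not dominate the streaming update. The clipping threshold, step sizes, and contamination model below are assumptions for the sketch, not the estimator whose Bahadur representation the paper establishes.

```python
import numpy as np

rng = np.random.default_rng(10)

S, gamma, tau = 6, 0.9, 2.0                     # tau: Huber-type clipping threshold
P = rng.dirichlet(np.ones(S), size=S)           # transition matrix under the policy
phi = np.eye(S)                                 # tabular features
theta = np.zeros(S)                             # value-function parameters

s = 0
for t in range(1, 50000):
    s_next = rng.choice(S, p=P[s])
    reward = float(s) + rng.standard_t(df=1.5)  # heavy-tailed reward noise
    if rng.uniform() < 0.01:
        reward += 50.0                          # occasional outlier contamination
    td = reward + gamma * phi[s_next] @ theta - phi[s] @ theta
    td = np.clip(td, -tau, tau)                 # robustify the TD error
    theta += t ** (-0.6) * td * phi[s]          # online (streaming) update
    s = s_next

print("robust value estimates:", np.round(theta, 2))
```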
Stochastic-Constrained Stochastic Optimization with Markovian Data
This paper considers stochastic-constrained stochastic optimization, where the stochastic constraint requires the expectation of a random function to be below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose three variants of drift-plus-penalty; two are for the case where the mixing time of the underlying Markov chain is known, while the third is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. The algorithms are adaptive in that the first two work without knowledge of the time horizon, while the third uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.
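The drift-plus-penalty template can be sketched on a toy stochastic-constrained problem: take a primal step on a penalty-weighted stochastic gradient of the objective plus a queue-weighted stochastic gradient of the constraint, then update a virtual queue that accumulates constraint violations. The i.i.d. data stream, the box projection, and the constants V and alpha below are illustrative stand-ins for the Markovian-sampling variants developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy stochastic-constrained stochastic optimization:
#   minimize E[(x - xi)^2]  subject to  E[a^T x] <= b,
# with samples (xi, a) arriving from a data stream.
d, T, V, alpha = 3, 20000, 10.0, 0.05
b = 0.5
x = np.zeros(d)
Q = 0.0                                        # virtual queue for the constraint
x_avg = np.zeros(d)

for t in range(1, T + 1):
    xi = rng.normal(1.0, 1.0, size=d)          # objective sample
    a = rng.normal(1.0, 0.2, size=d)           # constraint sample
    grad_f = 2 * (x - xi)                      # stochastic gradient of the objective
    grad_g = a                                 # stochastic gradient of the constraint
    # Drift-plus-penalty primal step: penalty-weighted objective plus queue-weighted constraint.
    x = x - alpha * (V * grad_f + Q * grad_g)
    x = np.clip(x, -5.0, 5.0)                  # projection onto a bounded feasible box
    # Dual (virtual queue) update enforcing the constraint on average.
    Q = max(Q + (a @ x - b), 0.0)
    x_avg += (x - x_avg) / t

print("averaged iterate:", np.round(x_avg, 3), " virtual queue:", round(Q, 2))
```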
Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
Geometric Learning with Positively Decomposable Kernels
Kernel methods are powerful tools in machine learning. Classical kernel methods are based on positive definite kernels, which enable learning in reproducing kernel Hilbert spaces (RKHS). For non-Euclidean data spaces, positive definite kernels are difficult to come by. In this case, we propose the use of reproducing kernel Krein space (RKKS) based methods, which require only kernels that admit a positive decomposition. We show that one does not need to access this decomposition to learn in RKKS. We then investigate the conditions under which a kernel is positively decomposable. We show that invariant kernels admit a positive decomposition on homogeneous spaces under tractable regularity assumptions. This makes them much easier to construct than positive definite kernels, providing a route for learning with kernels for non-Euclidean data. By the same token, this provides theoretical foundations for RKKS-based methods in general.
Fine-grained Analysis and Faster Algorithms for Iteratively Solving Linear Systems
Online Statistical Inference in Decision Making with Matrix Context
The study of online decision-making problems that leverage contextual information has drawn notable attention due to their significant applications in fields ranging from healthcare to autonomous systems. In modern applications, contextual information can be rich and is often represented as a matrix. Moreover, while existing online decision algorithms mainly focus on reward maximization, less attention has been devoted to statistical inference. To address these gaps, in this work we consider an online decision-making problem with a matrix context where the true model parameters have a low-rank structure. We propose a fully online procedure to conduct statistical inference with adaptively collected data. The low-rank structure of the model parameter and the adaptive nature of the data collection process make this problem challenging: standard low-rank estimators are biased and cannot be obtained in a sequential manner, while existing inference approaches in sequential decision-making algorithms fail to account for the low-rankness and are also biased. To overcome these challenges, we introduce a new online debiasing procedure to simultaneously handle both sources of bias. Our inference framework encompasses both parameter inference and optimal policy value inference. In theory, we establish the asymptotic normality of the proposed online debiased estimators and prove the validity of the constructed confidence intervals for both inference tasks. Our inference results are built upon a newly developed low-rank stochastic gradient descent estimator and its convergence result, which are also of independent interest.
Testing Stationarity and Change Point Detection in Reinforcement Learning
We consider reinforcement learning (RL) in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the state transition and reward functions to be constant over time. However, this assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics, and mobile health. In this paper, we develop a model-free test to assess the stationarity of the optimal Q-function based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a change point detection method that can be naturally coupled with existing state-of-the-art RL methods designed for stationary environments for online policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is publicly available at https://github.com/limengbinggz/CUSUM-RL.
Dropout Regularization Versus l2-Penalization in the Linear Model
Robust Transfer Learning with Unreliable Source Data
This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distributions. We introduce a novel quantity called the "ambiguity level" that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning in terms of risk improvements. Our proposed "Transfer Around Boundary" (TAB) model, with a threshold balancing the performance of target and source data, is shown to be both efficient and robust, improving classification while avoiding negative transfer. Moreover, we demonstrate the effectiveness of the TAB model on non-parametric classification and logistic regression tasks, achieving upper bounds that are optimal up to logarithmic factors. Simulation studies lend further support to the effectiveness of TAB. We also provide simple approaches to bound the excess misclassification error without the need for specialized knowledge in transfer learning.
The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise
Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning. One fundamental challenge in analyzing a stochastic approximation algorithm is to establish its stability, i.e., to show that the stochastic vector iterates are bounded almost surely. In this paper, we extend the celebrated Borkar-Meyn theorem for stability from the martingale difference noise setting to the Markovian noise setting, which greatly improves its applicability in reinforcement learning, especially in off-policy reinforcement learning algorithms with linear function approximation and eligibility traces. Central to our analysis is the diminishing asymptotic rate of change of a few functions, which is implied by both a form of the strong law of large numbers and a form of the law of the iterated logarithm.
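The simplest example of a stochastic approximation iterate driven by Markovian, rather than martingale-difference, noise is estimating a stationary expectation of an ergodic Markov chain with Robbins-Monro step sizes, as sketched below. The chain, step-size schedule, and test function are arbitrary illustrative choices; the paper's extended Borkar-Meyn theorem is what underwrites stability and convergence in far richer settings such as off-policy TD with eligibility traces.

```python
import numpy as np

rng = np.random.default_rng(12)

# An ergodic Markov chain on {0, 1, 2}; the noise sequence driving the
# stochastic approximation is Markovian, not a martingale difference.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
values = np.array([0.0, 1.0, 5.0])

# Stationary distribution (left Perron eigenvector), to check the limit point.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

# Robbins-Monro iteration x_{t+1} = x_t + a_t (h(Y_t) - x_t) with Markovian Y_t;
# the limiting ODE is dx/dt = E_pi[h(Y)] - x, whose equilibrium is the stationary mean.
x, s = 0.0, 0
for t in range(1, 200000):
    s = rng.choice(3, p=P[s])
    x += t ** (-0.7) * (values[s] - x)

print("SA iterate:", round(x, 3), " stationary mean:", round(float(pi @ values), 3))
```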