San Diego Journal Track Papers
Multivariate Dynamic Mediation Analysis under a Reinforcement Learning Framework
Mediation analysis is an important analytic tool commonly used in a broad range of scientific applications. In this article, we study the problem of mediation analysis when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. The problem is challenging, because the effect of a mediator involves not only the path from the treatment to this mediator itself at the current time point, but also all possible paths pointing to this mediator from its upstream mediators, as well as the carryover effects from all previous time points. We propose a novel multivariate dynamic mediation analysis approach. Drawing inspiration from the Markov decision process model that is frequently employed in reinforcement learning, we introduce a Markov mediation process paired with a system of time-varying linear structural equation models to formulate the problem. We then formally define the individual mediation effect, built upon the idea of simultaneous interventions and intervention calculus. We next derive the closed-form expression, propose an iterative estimation procedure under the Markov mediation process model, and develop a bootstrap method to infer the individual mediation effect. We study both the asymptotic properties and the empirical performance of the proposed methodology, and further illustrate its usefulness with a mobile health application.
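As a deliberately simplified illustration of the bootstrap-inference step, the sketch below estimates a classical single-time-point, single-mediator effect via the product-of-coefficients formula and attaches a nonparametric bootstrap confidence interval. The paper's multivariate dynamic setting with upstream mediators and carryover effects is substantially more general; all variable names and coefficient values here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-time-point mediation model (illustrative only):
# M = a*A + noise,  Y = c*A + b*M + noise, so the mediation effect is a*b.
n = 500
A = rng.binomial(1, 0.5, n).astype(float)
M = 0.8 * A + rng.normal(size=n)
Y = 0.3 * A + 1.2 * M + rng.normal(size=n)

def mediation_effect(A, M, Y):
    # OLS fits of the two structural equations; effect = a_hat * b_hat.
    a_hat = np.polyfit(A, M, 1)[0]
    X = np.column_stack([np.ones_like(A), A, M])
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]
    b_hat = coef[2]
    return a_hat * b_hat

est = mediation_effect(A, M, Y)

# Nonparametric bootstrap for a percentile confidence interval.
boots = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    boots.append(mediation_effect(A[idx], M[idx], Y[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"effect ~ {est:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```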
Distributionally Robust Learning for Multi-source Unsupervised Domain Adaptation
Empirical risk minimization often performs poorly when the distribution of the target domain differs from those of the source domains. To address such potential distributional shifts, we develop an unsupervised domain adaptation approach that leverages labeled data from multiple source domains and unlabeled data from the target domain. We introduce a distributionally robust model that optimizes an adversarial reward based on explained variance across a class of target distributions, ensuring generalization to the target domain. We show that the proposed robust model is a weighted average of conditional outcome models from the source domains. This formulation allows us to compute the robust model through the aggregation of source models, which can be estimated using various machine learning algorithms of the user’s choice such as random forests, boosting, and neural networks. Additionally, we introduce a bias-correction step to obtain a more accurate aggregation weight, which is effective for various machine learning algorithms. Our framework can be interpreted as a distributionally robust federated learning approach that satisfies privacy constraints while providing insights into the importance of each source for prediction on the target domain. The performance of our method is evaluated on both simulated and real data.
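The aggregation step can be illustrated with a small sketch: fit one conditional outcome model per source domain with an off-the-shelf learner, then combine their predictions on the target covariates with domain weights. The uniform weights below are placeholders; the paper derives the weights from the adversarial explained-variance objective and adds a bias-correction step.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Toy setup: labeled source domains with shifted distributions, unlabeled target.
def make_source(shift, n=300):
    X = rng.normal(loc=shift, size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + shift + rng.normal(scale=0.5, size=n)
    return X, y

sources = [make_source(s) for s in (-1.0, 0.0, 1.0)]
X_target = rng.normal(loc=0.5, size=(200, 3))   # unlabeled target covariates

# Step 1: fit a conditional outcome model per source with any learner of choice.
models = [RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
          for X, y in sources]

# Step 2: aggregate source predictions on the target with domain weights.
# Placeholder (uniform) weights; the paper obtains them from a distributionally
# robust objective based on explained variance, with a bias-correction step.
weights = np.ones(len(models)) / len(models)
preds = np.stack([m.predict(X_target) for m in models], axis=1)
robust_prediction = preds @ weights
```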
RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control
Deep Reinforcement Learning (RL) can yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Its novel architecture allows RLtools to be used on a wide variety of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems up to 76 times faster than other popular RL frameworks. We also benchmark the inference on a diverse set of microcontrollers and show that in most cases our optimized implementation is by far the fastest. Finally, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of TinyRL. The source code as well as documentation and live demos are available through our project page at https://rl.tools.
Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
Asymptotic Theory of Geometric and Adaptive $k$-Means Clustering
Improved Learning Theory for Kernel Distribution Regression with Two-Stage Sampling
The distribution regression problem encompasses many important statistics and machine learning tasks, and arises in a large range of applications. Among various existing approaches to tackle this problem, kernel methods have become a method of choice. Indeed, kernel distribution regression is both computationally favorable and supported by a recent learning theory. This theory also tackles the two-stage sampling setting, where only samples from the input distributions are available. In this paper, we improve the learning theory of kernel distribution regression. We address kernels based on Hilbertian embeddings, which encompass most, if not all, of the existing approaches. We introduce a novel near-unbiased condition on the Hilbertian embeddings, which enables us to provide new error bounds on the effect of the two-stage sampling, thanks to a new analysis. We show that this near-unbiased condition holds for three important classes of kernels, based on optimal transport and mean embedding. As a consequence, we strictly improve the existing convergence rates for these kernels. Our setting and results are illustrated by numerical experiments.
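A minimal sketch of kernel distribution regression under two-stage sampling: each input is observed only through a bag of samples, bags are compared through their empirical (Hilbertian) mean embeddings via the MMD, and kernel ridge regression is run on top. The Gaussian-on-MMD kernel and all constants below are illustrative choices, not the paper's specific estimators.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each "input" is a bag of samples drawn from an unobserved distribution;
# the label depends on that distribution (here, on its mean).
def make_bag():
    mu = rng.uniform(-2, 2)
    return rng.normal(mu, 1.0, size=50), mu   # (two-stage sample, label)

bags, labels = zip(*[make_bag() for _ in range(80)])
y = np.array(labels)

def mmd2(x, z, gamma=0.5):
    # Squared MMD between empirical measures with a Gaussian base kernel.
    def k(a, b):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(z, z).mean() - 2 * k(x, z).mean()

# Outer Gaussian kernel on the mean embeddings of the bags.
n = len(bags)
D = np.array([[mmd2(bags[i], bags[j]) for j in range(n)] for i in range(n)])
K = np.exp(-1.0 * D)

# Kernel ridge regression on distributions.
lam = 1e-3
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(new_bag):
    k_new = np.exp(-1.0 * np.array([mmd2(new_bag, b) for b in bags]))
    return k_new @ alpha

print(predict(rng.normal(1.0, 1.0, size=50)))  # label is the bag's mean, so near 1.0
```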
Statistical Inference for Decentralized Federated Learning
This paper considers decentralized Federated Learning (FL) under heterogeneous distributions among distributed clients or data blocks for M-estimation. The mean squared error and consensus error across the estimators from different clients via the decentralized stochastic gradient descent algorithm are derived. The asymptotic normality of the Polyak–Ruppert (PR) averaged estimator in the decentralized distributed setting is attained, which shows that its statistical efficiency comes at a cost as it is more restrictive on the number of clients than that in the distributed M-estimation. To overcome the restriction, a one-step estimator is proposed which permits a much larger number of clients while still achieving the same efficiency as the original PR-averaged estimator in the nondistributed setting. The confidence regions based on both the PR-averaged estimator and the proposed one-step estimator are constructed to facilitate statistical inference for decentralized FL.
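The setting can be sketched as follows: clients hold heterogeneous local data, run decentralized SGD by gossiping with neighbours through a doubly stochastic mixing matrix, and maintain a Polyak–Ruppert average of their iterates. The ring topology, step sizes, and toy linear model below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy decentralized M-estimation: K clients, linear model, ring topology.
K, d, n_local, T = 8, 5, 200, 2000
theta_star = rng.normal(size=d)
data = []
for _ in range(K):
    X = rng.normal(size=(n_local, d))
    y = X @ theta_star + rng.normal(scale=0.5, size=n_local)
    data.append((X, y))

# Doubly stochastic mixing matrix for a ring (self + two neighbours).
W = np.zeros((K, K))
for i in range(K):
    W[i, i] = 0.5
    W[i, (i - 1) % K] = 0.25
    W[i, (i + 1) % K] = 0.25

theta = np.zeros((K, d))
pr_avg = np.zeros((K, d))
for t in range(1, T + 1):
    # 1) gossip/consensus step with neighbours,
    theta = W @ theta
    # 2) local stochastic gradient step on one sampled data point,
    for i in range(K):
        X, y = data[i]
        j = rng.integers(n_local)
        grad = (X[j] @ theta[i] - y[j]) * X[j]
        theta[i] -= 0.1 * t ** (-0.75) * grad
    # 3) Polyak-Ruppert running average of the iterates.
    pr_avg += (theta - pr_avg) / t

print(np.linalg.norm(pr_avg.mean(axis=0) - theta_star))
```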
On the Convergence of Projected Policy Gradient for Any Constant Step Sizes
Projected policy gradient (PPG) is a basic policy optimization method in reinforcement learning. Given access to exact policy evaluations, previous studies have established the sublinear convergence of PPG for sufficiently small step sizes, based on the smoothness and the gradient domination properties of the value function. However, as the step size goes to infinity, PPG reduces to the classic policy iteration method, which suggests the convergence of PPG even for large step sizes. In this paper, we fill this gap and show that PPG admits sublinear convergence for any constant step size. Due to the presence of the state-wise visitation measure in the expression of the policy gradient, the existing optimization-based analysis framework for a preconditioned version of PPG (i.e., projected Q-ascent) is not applicable, to the best of our knowledge. Instead, we proceed with the proof by computing a state-wise improvement lower bound for PPG based on its inherent structure. In addition, the finite-iteration convergence of PPG for any constant step size is further established, which is also new.
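To make the object of study concrete, here is a small sketch of projected policy gradient with the direct (simplex) parameterization and exact policy evaluation on a random tabular MDP: evaluate the policy, form the policy gradient through the state visitation measure and the Q-function, take a gradient step, and project each state's action distribution back onto the simplex. The step size and the MDP are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Random tabular MDP with exact policy evaluation.
S, A, gamma, eta = 5, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s']
r = rng.uniform(size=(S, A))
rho = np.ones(S) / S                             # initial-state distribution

def project_simplex(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1 - css[k]) / (k + 1)
    return np.maximum(v + tau, 0)

pi = np.ones((S, A)) / A                         # direct parameterization
for _ in range(200):
    # Exact evaluation: V = (I - gamma * P_pi)^{-1} r_pi, then Q and d_rho.
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                        # Q[s, a]
    d_rho = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    grad = d_rho[:, None] * Q / (1 - gamma)      # policy gradient wrt pi(a|s)
    # Projected gradient ascent step, state by state.
    pi = np.array([project_simplex(pi[s] + eta * grad[s]) for s in range(S)])

print("final policy value (approx.):", rho @ V)
```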
Deep Nonlinear Sufficient Dimension Reduction
A duality framework for analyzing random feature and two-layer neural networks
A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key---provided by the LLM to the verifier---to control the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks---one of which has been internally implemented at OpenAI---and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments.
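The hypothesis-testing recipe can be sketched generically: compute a per-token pivotal statistic from the text and the secret key, assumed i.i.d. Uniform(0,1) under the null of human-written text, and reject when a sum-based statistic exceeds an exact null quantile. The particular pivot transform, the Beta alternative used to mimic watermarked text, and the constants below are illustrative assumptions, not the optimal detection rules derived in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def detect(pivots, alpha=0.01):
    """Sum-based watermark detection rule.

    `pivots` are per-token pivotal statistics computed from the text and the
    secret key; the framework assumes they are i.i.d. Uniform(0,1) when the
    text is human-written (the null hypothesis).
    """
    T = np.sum(-np.log(1.0 - np.asarray(pivots)))        # test statistic
    threshold = stats.gamma.ppf(1 - alpha, a=len(pivots))  # exact null quantile
    return T > threshold                                   # True => flag as LLM-generated

# Human-written text: pivots are uniform, so the false positive rate is ~alpha.
human = rng.uniform(size=(5000, 200))
print("FPR:", np.mean([detect(p) for p in human]))

# Watermarked text: pivots are stochastically larger under the alternative
# (here Beta(3, 1) as a stand-in), so the test has high power.
watermarked = rng.beta(3.0, 1.0, size=(5000, 200))
print("power:", np.mean([detect(p) for p in watermarked]))
```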
Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks
Interpretable Global Minima of Deep ReLU Neural Networks on Sequentially Separable Data
Unified Algorithms for RL with Decision-Estimation Coefficients: PAC, Reward-Free, Preference-Based Learning, and Beyond
Modern Reinforcement Learning (RL) is more than just learning the optimal policy; alternative learning goals such as exploring the environment, estimating the underlying model, and learning from preference feedback are all of practical importance. While provably sample-efficient algorithms for each specific goal have been proposed, these algorithms often depend strongly on the particular learning goal and thus admit correspondingly different structures. It is a pressing open question whether these learning goals can instead be tackled by a single unified algorithm. We make progress on this question by developing a unified algorithmic framework for a large class of learning goals, building on the Decision-Estimation Coefficient (DEC) framework. Our framework handles many learning goals such as no-regret RL, PAC RL, reward-free learning, model estimation, and preference-based learning, all by simply instantiating the same generic complexity measure, called the "Generalized DEC", and a corresponding generic algorithm. The generalized DEC also yields a sample complexity lower bound for each specific learning goal. As applications, we propose "decouplable representation" as a natural sufficient condition for bounding generalized DECs, and use it to obtain many new sample-efficient results (and recover existing results) for a wide range of learning goals and problem classes as direct corollaries. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy sample complexity bounds under structural conditions similar to the DEC.
Data-Driven Performance Guarantees for Classical and Learned Optimizers
We introduce a data-driven approach to analyze the performance of continuous optimization algorithms using generalization guarantees from statistical learning theory. We study classical and learned optimizers to solve families of parametric optimization problems. We build generalization guarantees for classical optimizers, using a sample convergence bound, and for learned optimizers, using the Probably Approximately Correct (PAC)-Bayes framework. To train learned optimizers, we use a gradient-based algorithm to directly minimize the PAC-Bayes upper bound. Numerical experiments in signal processing, control, and meta-learning showcase the ability of our framework to provide strong generalization guarantees for both classical and learned optimizers given a fixed budget of iterations. For classical optimizers, our bounds, which hold with high probability, are much tighter than those provided by worst-case guarantees. For learned optimizers, our bounds outperform the empirical outcomes observed in their non-learned counterparts.
Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
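On the simplest compact example, the circle, the flavour of such constructions can be sketched by building a stationary kernel from a truncated eigenfunction (Fourier) expansion with Matérn-like spectral decay and then sampling a Gaussian process prior from it. The decay exponents and truncation level below are illustrative assumptions; the paper develops the general machinery for compact Lie groups and their homogeneous spaces.

```python
import numpy as np

def circle_matern_kernel(x, y, nu=1.5, kappa=1.0, n_max=50):
    # Stationary kernel on the circle from a truncated spectral expansion:
    # k(x, y) = sum_n a_n cos(n (x - y)),  a_n ~ (kappa^2 + n^2)^(-(nu + 1/2)).
    n = np.arange(n_max + 1)
    a = (kappa ** 2 + n ** 2) ** (-(nu + 0.5))
    a[1:] *= 2.0                      # each nonzero frequency appears twice (+/- n)
    diff = np.subtract.outer(np.atleast_1d(x), np.atleast_1d(y))
    K = np.tensordot(a, np.cos(n[:, None, None] * diff), axes=1)
    return K / K.max()                # normalize so that k(x, x) = 1

theta = np.linspace(0, 2 * np.pi, 100)
K = circle_matern_kernel(theta, theta)

# Sample a stationary Gaussian process prior on the circle.
f = np.random.default_rng(6).multivariate_normal(
    np.zeros(len(theta)), K + 1e-8 * np.eye(len(theta)))
```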
Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality
This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, that is, the propensities of exploring all actions for all individual characteristics must be lower bounded; the performance of the existing methods thus depends on the worst-case propensity in the offline data set. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities. In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs)—instead of point estimates—of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein’s inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization–Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL.
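A stripped-down illustration of the pessimistic principle on a finite policy class: form inverse-propensity-weighted value estimates from logged bandit data, subtract an empirical-Bernstein-style confidence width, and pick the policy with the largest lower confidence bound. The data-generating process, the width constants, and the threshold policy class are toy assumptions; the paper's PPL algorithm additionally handles adaptively collected data and optimizes over policy trees via Majorization–Minimization.

```python
import numpy as np

rng = np.random.default_rng(7)

# Offline logged bandit data: context x, action a ~ behaviour policy, reward r.
n, K = 5000, 3
X = rng.uniform(-1, 1, size=n)
behaviour = np.full(K, 1.0 / K)                    # known logging propensities
A = rng.integers(0, K, size=n)
R = (A == (X > 0).astype(int)).astype(float) + 0.1 * rng.normal(size=n)

def lcb_value(policy, delta=0.05):
    """IPW value estimate minus an empirical-Bernstein-style confidence width."""
    w = (policy(X) == A) / behaviour[A]            # importance weights
    z = w * R
    mean, var = z.mean(), z.var(ddof=1)
    width = (np.sqrt(2 * var * np.log(2 / delta) / n)
             + 7 * w.max() * np.log(2 / delta) / (3 * (n - 1)))
    return mean - width

# A tiny policy class: threshold rules x > c -> action 1, else action 0.
policies = [lambda x, c=c: (x > c).astype(int) for c in np.linspace(-0.5, 0.5, 11)]
best = max(policies, key=lcb_value)
print("chosen policy's lower confidence bound:", lcb_value(best))
```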
Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift
We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets, and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill the missing labels and then select the best candidate accordingly. Our non-asymptotic excess risk bounds demonstrate that our estimator adapts effectively to both the structure of the target distribution and the covariate shift. This adaptation is quantified through a notion of effective sample size that reflects the value of labeled source data for the target regression task. Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance.
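The procedure lends itself to a short sketch: split the labeled data, fit a grid of candidate kernel ridge regressors on one half and an imputation model on the other, pseudo-label the unlabeled target covariates, and select the candidate that best matches the pseudo-labels over the target distribution. The kernel bandwidth, regularization grid, and toy covariate-shift setup below are illustrative choices.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(8)
f = lambda x: np.sin(3 * x).ravel()

# Labeled source data (covariates concentrated near 0) and unlabeled target data.
X_src = rng.normal(0.0, 0.5, size=(400, 1))
y_src = f(X_src) + 0.2 * rng.normal(size=400)
X_tgt = rng.uniform(-2, 2, size=(300, 1))          # unlabeled target covariates

# Step 1: split the labeled data into two halves.
X_a, y_a = X_src[:200], y_src[:200]
X_b, y_b = X_src[200:], y_src[200:]

# Step 2: candidate models on one half (here, a grid of ridge penalties)...
lams = [10.0 ** k for k in range(-4, 1)]
candidates = [KernelRidge(kernel="rbf", gamma=1.0, alpha=lam).fit(X_a, y_a) for lam in lams]
# ...and an imputation model on the other half.
imputer = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(X_b, y_b)

# Step 3: pseudo-label the target covariates and select the candidate that
# best fits the pseudo-labels over the target distribution.
pseudo = imputer.predict(X_tgt)
errors = [np.mean((m.predict(X_tgt) - pseudo) ** 2) for m in candidates]
best = candidates[int(np.argmin(errors))]
print("selected regularization:", lams[int(np.argmin(errors))])
```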
A Geometrical Analysis of Kernel Ridge Regression and its Applications
We obtain upper bounds for the estimation error of Kernel Ridge Regression (KRR) for all non-negative regularization parameters, offering a geometric perspective on various phenomena in KRR. As applications: 1. We address the multiple descent problem, unifying the proofs of arXiv:1908.10292 and arXiv:1904.12191 for polynomial kernels, and we establish multiple descent for the upper bound of the estimation error of KRR under sub-Gaussian design in non-asymptotic regimes. 2. For a sub-Gaussian design vector and in the non-asymptotic regime, we prove a one-sided isomorphic version of the Gaussian Equivalent Conjecture. 3. We offer a novel perspective on the linearization of kernel matrices of non-linear kernels, extending it to the power regime for polynomial kernels. 4. Our theory is applicable to data-dependent kernels, providing a convenient and accurate tool for the feature learning regime in deep learning theory. 5. Our theory extends the results in arXiv:2009.14286 under weak moment assumptions. Our proof is based on three mathematical tools developed in this paper that can be of independent interest: 1. A Dvoretzky-Milman theorem for ellipsoids under (very) weak moment assumptions. 2. A Restricted Isomorphic Property in Reproducing Kernel Hilbert Spaces with embedding index conditions. 3. A concentration inequality for finite-degree polynomial kernel functions.
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Estimation and Inference in Distributional Reinforcement Learning
In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, aiming to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. We use the certainty-equivalence method to construct our estimator $\hat\eta^\pi_n$, based on a generative model. In this setting, a dataset of size $\widetilde O(|\mathcal{S}||\mathcal{A}|\varepsilon^{-2p}(1-\gamma)^{-(2p+2)})$ suffices to guarantee that the supremum $p$-Wasserstein metric between $\hat\eta^\pi_n$ and $\eta^\pi$ is less than $\varepsilon$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency. We also show that, under different mild assumptions, a dataset of size $\widetilde O(|\mathcal{S}||\mathcal{A}|\varepsilon^{-2}(1-\gamma)^{-4})$ suffices to ensure that the supremum Kolmogorov-Smirnov metric and the supremum total variation metric between $\hat\eta^\pi_n$ and $\eta^\pi$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi_n$. We demonstrate that the "empirical process" $\sqrt{n}(\hat\eta^\pi_n-\eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on a Lipschitz function class $\ell^\infty(\mathcal{F}_{W_1})$, as well as in the space of bounded functionals on an indicator function class $\ell^\infty(\mathcal{F}_{\textup{KS}})$ and a bounded measurable function class $\ell^\infty(\mathcal{F}_{\textup{TV}})$, when some mild conditions hold.
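A hedged sketch of the certainty-equivalence idea: estimate the transition kernel from generative-model samples, then approximate the plug-in return distribution $\hat\eta^\pi_n$ by Monte Carlo rollouts in the estimated MDP and compare it to rollouts in the true model with an empirical 1-Wasserstein distance. The sample sizes and the rollout approximation are illustrative simplifications of the estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(9)

# True tabular MDP and a fixed policy pi.
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)              # pi[s, a]

# Generative model: draw N next-state samples per (s, a) and build the plug-in MDP.
N = 200
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        samples = rng.choice(S, size=N, p=P[s, a])
        P_hat[s, a] = np.bincount(samples, minlength=S) / N

def rollout_returns(P_model, n_paths=1000, horizon=100, s0=0):
    # Monte Carlo approximation of the return distribution eta^pi(s0)
    # under the (estimated) model -- the certainty-equivalence plug-in.
    returns = np.zeros(n_paths)
    for i in range(n_paths):
        s, disc = s0, 1.0
        for _ in range(horizon):
            a = rng.choice(A, p=pi[s])
            returns[i] += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P_model[s, a])
    return returns

eta_hat = rollout_returns(P_hat)   # estimate of the return distribution
eta_ref = rollout_returns(P)       # rollouts in the true model, for comparison
print("approx. W1 distance:", np.mean(np.abs(np.sort(eta_hat) - np.sort(eta_ref))))
```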
Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets
We propose data-dependent uniform generalization bounds by approaching the problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian framework on ``random sets'' in a rigorous way, where the training algorithm is assumed to output a data-dependent hypothesis set after observing the training data. This approach allows us to prove data-dependent bounds, which can be applicable in numerous contexts. To highlight the power of our approach, we consider two main applications. First, we propose a PAC-Bayesian formulation of the recently developed fractal-dimension-based generalization bounds. The derived results are shown to be tighter and they unify the existing results around one simple proof technique. Second, we prove uniform bounds over the trajectories of continuous Langevin dynamics and stochastic gradient Langevin dynamics. These results provide novel information about the generalization properties of noisy algorithms.
Neural Networks Generalize on Low Complexity Data
Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning
Reinforcement learning has emerged as one of the prominent topics attracting attention in modern statistical learning, with policy evaluation being a key component. Unlike the traditional machine learning literature on this topic, our work emphasizes statistical inference for the model parameters and value functions of reinforcement learning algorithms. While most existing analyses assume that random rewards follow standard distributions, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop a fully online robust policy evaluation procedure and establish the Bahadur-type representation of our estimator. Furthermore, we develop an online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper connects robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to online policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in simulations and real-world reinforcement learning experiments.
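One plausible instantiation of the robust idea, purely for illustration: online linear TD(0) policy evaluation in which the temporal-difference error is clipped Huber-style so that heavy-tailed rewards and occasional outliers do not dominate the streaming update. The clipping threshold, step sizes, and contamination model below are assumptions for the sketch, not the estimator whose Bahadur representation the paper establishes.

```python
import numpy as np

rng = np.random.default_rng(10)

S, gamma, tau = 6, 0.9, 2.0                     # tau: Huber-type clipping threshold
P = rng.dirichlet(np.ones(S), size=S)           # transition matrix under the policy
phi = np.eye(S)                                 # tabular features
theta = np.zeros(S)                             # value-function parameters

s = 0
for t in range(1, 50000):
    s_next = rng.choice(S, p=P[s])
    reward = float(s) + rng.standard_t(df=1.5)  # heavy-tailed reward noise
    if rng.uniform() < 0.01:
        reward += 50.0                          # occasional outlier contamination
    td = reward + gamma * phi[s_next] @ theta - phi[s] @ theta
    td = np.clip(td, -tau, tau)                 # robustify the TD error
    theta += t ** (-0.6) * td * phi[s]          # online (streaming) update
    s = s_next

print("robust value estimates:", np.round(theta, 2))
```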
Stochastic-Constrained Stochastic Optimization with Markovian Data
This paper considers stochastic-constrained stochastic optimization, where the stochastic constraint requires the expectation of a random function to be below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose three variants of drift-plus-penalty; two are for the case where the mixing time of the underlying Markov chain is known, while the third is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. The algorithms are adaptive in that the first two work without knowledge of the time horizon, while the third uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.
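The drift-plus-penalty template can be sketched on a toy stochastic-constrained problem: take a primal step on a penalty-weighted stochastic gradient of the objective plus a queue-weighted stochastic gradient of the constraint, then update a virtual queue that accumulates constraint violations. The i.i.d. data stream, the box projection, and the constants V and alpha below are illustrative stand-ins for the Markovian-sampling variants developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy stochastic-constrained stochastic optimization:
#   minimize E[(x - xi)^2]  subject to  E[a^T x] <= b,
# with samples (xi, a) arriving from a data stream.
d, T, V, alpha = 3, 20000, 10.0, 0.05
b = 0.5
x = np.zeros(d)
Q = 0.0                                        # virtual queue for the constraint
x_avg = np.zeros(d)

for t in range(1, T + 1):
    xi = rng.normal(1.0, 1.0, size=d)          # objective sample
    a = rng.normal(1.0, 0.2, size=d)           # constraint sample
    grad_f = 2 * (x - xi)                      # stochastic gradient of the objective
    grad_g = a                                 # stochastic gradient of the constraint
    # Drift-plus-penalty primal step: penalty-weighted objective plus queue-weighted constraint.
    x = x - alpha * (V * grad_f + Q * grad_g)
    x = np.clip(x, -5.0, 5.0)                  # projection onto a bounded feasible box
    # Dual (virtual queue) update enforcing the constraint on average.
    Q = max(Q + (a @ x - b), 0.0)
    x_avg += (x - x_avg) / t

print("averaged iterate:", np.round(x_avg, 3), " virtual queue:", round(Q, 2))
```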
Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces
Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
Geometric Learning with Positively Decomposable Kernels
Kernel methods are powerful tools in machine learning. Classical kernel methods are based on positive definite kernels, which enable learning in reproducing kernel Hilbert spaces (RKHS). For non-Euclidean data spaces, positive definite kernels are difficult to come by. In this case, we propose the use of reproducing kernel Krein space (RKKS) based methods, which require only kernels that admit a positive decomposition. We show that one does not need to access this decomposition to learn in RKKS. We then investigate the conditions under which a kernel is positively decomposable. We show that invariant kernels admit a positive decomposition on homogeneous spaces under tractable regularity assumptions. This makes them much easier to construct than positive definite kernels, providing a route for learning with kernels for non-Euclidean data. By the same token, this provides theoretical foundations for RKKS-based methods in general.
Fine-grained Analysis and Faster Algorithms for Iteratively Solving Linear Systems
Online Statistical Inference in Decision Making with Matrix Context
The study of online decision-making problems that leverage contextual information has drawn notable attention due to their significant applications in fields ranging from healthcare to autonomous systems. In modern applications, contextual information can be rich and is often represented as a matrix. Moreover, while existing online decision algorithms mainly focus on reward maximization, less attention has been devoted to statistical inference. To address these gaps, in this work we consider an online decision-making problem with a matrix context where the true model parameters have a low-rank structure. We propose a fully online procedure to conduct statistical inference with adaptively collected data. The low-rank structure of the model parameter and the adaptive nature of the data collection process make this problem challenging: standard low-rank estimators are biased and cannot be obtained in a sequential manner, while existing inference approaches in sequential decision-making algorithms fail to account for the low-rankness and are also biased. To overcome these challenges, we introduce a new online debiasing procedure to simultaneously handle both sources of bias. Our inference framework encompasses both parameter inference and optimal policy value inference. In theory, we establish the asymptotic normality of the proposed online debiased estimators and prove the validity of the constructed confidence intervals for both inference tasks. Our inference results are built upon a newly developed low-rank stochastic gradient descent estimator and its convergence result, which are also of independent interest.
Testing Stationarity and Change Point Detection in Reinforcement Learning
We consider reinforcement learning (RL) in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the state transition and reward functions to be constant over time. However, this assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics, and mobile health. In this paper, we develop a model-free test to assess the stationarity of the optimal Q-function based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a change point detection method that can be naturally coupled with existing state-of-the-art RL methods designed for stationary environments for online policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is publicly available at https://github.com/limengbinggz/CUSUM-RL.
Dropout Regularization Versus l2-Penalization in the Linear Model
Robust Transfer Learning with Unreliable Source Data
This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distributions. We introduce a novel quantity called the "ambiguity level" that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning in terms of risk improvements. Our proposed "Transfer Around Boundary" (TAB) model, with a threshold balancing the performance of target and source data, is shown to be both efficient and robust, improving classification while avoiding negative transfer. Moreover, we demonstrate the effectiveness of the TAB model on non-parametric classification and logistic regression tasks, achieving upper bounds that are optimal up to logarithmic factors. Simulation studies lend further support to the effectiveness of TAB. We also provide simple approaches to bound the excess misclassification error without the need for specialized knowledge in transfer learning.
The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise
Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning. One fundamental challenge in analyzing a stochastic approximation algorithm is to establish its stability, i.e., to show that the stochastic vector iterates are bounded almost surely. In this paper, we extend the celebrated Borkar-Meyn theorem for stability from the martingale difference noise setting to the Markovian noise setting, which greatly improves its applicability in reinforcement learning, especially in off-policy reinforcement learning algorithms with linear function approximation and eligibility traces. Central to our analysis is the diminishing asymptotic rate of change of a few functions, which is implied by both a form of the strong law of large numbers and a form of the law of the iterated logarithm.
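The simplest example of a stochastic approximation iterate driven by Markovian, rather than martingale-difference, noise is estimating a stationary expectation of an ergodic Markov chain with Robbins-Monro step sizes, as sketched below. The chain, step-size schedule, and test function are arbitrary illustrative choices; the paper's extended Borkar-Meyn theorem is what underwrites stability and convergence in far richer settings such as off-policy TD with eligibility traces.

```python
import numpy as np

rng = np.random.default_rng(12)

# An ergodic Markov chain on {0, 1, 2}; the noise sequence driving the
# stochastic approximation is Markovian, not a martingale difference.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
values = np.array([0.0, 1.0, 5.0])

# Stationary distribution (left Perron eigenvector), to check the limit point.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

# Robbins-Monro iteration x_{t+1} = x_t + a_t (h(Y_t) - x_t) with Markovian Y_t;
# the limiting ODE is dx/dt = E_pi[h(Y)] - x, whose equilibrium is the stationary mean.
x, s = 0.0, 0
for t in range(1, 200000):
    s = rng.choice(3, p=P[s])
    x += t ** (-0.7) * (values[s] - x)

print("SA iterate:", round(x, 3), " stationary mean:", round(float(pi @ values), 3))
```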