We study a safe reinforcement learning problem in which the constraints are defined as the expected cost over finite-length trajectories. We propose a constrained cross-entropy-based method to solve this problem. The method explicitly tracks its performance with respect to constraint satisfaction and is thus well suited for safety-critical applications. We show that the asymptotic behavior of the proposed algorithm can almost surely be described by that of an ordinary differential equation, and we give sufficient conditions on the properties of this differential equation that guarantee the convergence of the proposed algorithm. Finally, we show with simulation experiments that the proposed algorithm can effectively learn feasible policies without assumptions on the feasibility of the initial policies, even with non-Markovian objective and constraint functions.
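The general idea of a constrained cross-entropy update can be illustrated with a minimal sketch. This is not the authors' algorithm; it is a simplified illustration under stated assumptions: policy parameters are drawn from a diagonal Gaussian, the expected trajectory cost is replaced by a deterministic toy function, and the names `constrained_cem`, `toy_objective`, `toy_cost`, and `LIMIT` are hypothetical. When too few samples satisfy the constraint, the elite set is chosen by lowest cost; otherwise it is chosen by highest objective among the feasible samples.

```python
import random
import statistics


def constrained_cem(objective, cost, limit, mean, std,
                    iters=50, n=100, rho=0.2, seed=0):
    """Simplified constrained cross-entropy search (illustrative only)."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    n_elite = max(2, int(rho * n))
    for _ in range(iters):
        # Sample candidate parameter vectors from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n)]
        feasible = [x for x in samples if cost(x) <= limit]
        if len(feasible) < n_elite:
            # Too few feasible samples: rank by constraint cost only,
            # so the distribution moves toward the feasible region.
            elite = sorted(samples, key=cost)[:n_elite]
        else:
            # Enough feasible samples: rank feasible ones by objective.
            elite = sorted(feasible, key=objective, reverse=True)[:n_elite]
        # Refit the sampling distribution to the elite set.
        columns = list(zip(*elite))
        mean = [statistics.fmean(c) for c in columns]
        std = [statistics.pstdev(c) + 1e-3 for c in columns]
    return mean


# Toy stand-ins (assumptions): maximize closeness to (2, 2)
# subject to x0 + x1 <= LIMIT.
LIMIT = 2.0


def toy_objective(x):
    return -((x[0] - 2.0) ** 2 + (x[1] - 2.0) ** 2)


def toy_cost(x):
    return x[0] + x[1]


# The initial mean (3, 3) violates the constraint, so the search
# starts from an infeasible distribution.
theta = constrained_cem(toy_objective, toy_cost, LIMIT,
                        mean=[3.0, 3.0], std=[2.0, 2.0])
print(theta, toy_cost(theta))
```

In this toy problem the constrained optimum lies at (1, 1) on the boundary of the feasible set. In a reinforcement learning setting, `objective` and `cost` would instead be estimated from sampled trajectories of the policy defined by each parameter vector.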
Min Wen (University of Pennsylvania)
Ufuk Topcu (The University of Texas at Austin)