Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization
Sanae Amani · Lin Yang
Safety in reinforcement learning (RL) has become increasingly important in recent years. Yet many existing solutions fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems. In this paper, we study offline RL in the presence of safety requirements: from a dataset collected a priori and without direct access to the true environment, learn an optimal policy that is guaranteed to respect the safety constraints. We first address this problem by modeling the safety requirement as an unknown cost function of states and actions, whose expected value with respect to the policy must fall below a certain threshold. We then present an algorithm in the context of finite-horizon Markov decision processes (MDPs), termed Safe-DPVI, that acts in a doubly pessimistic manner: 1) when it constructs a conservative set of safe policies; and 2) when it selects a good policy from that conservative set. Without assuming sufficient coverage of the dataset or any structure for the underlying MDPs, we establish a data-dependent upper bound on the suboptimality gap of the \emph{safe} policy that Safe-DPVI returns. We then specialize our results to linear MDPs under appropriate assumptions on the dataset being well explored. Both the data-dependent and the specialized upper bounds nearly match those of state-of-the-art unsafe offline RL algorithms, with an additional multiplicative factor $\frac{\sum_{h=1}^H\alpha_{h}}{H}$, where $\alpha_h$ characterizes the safety constraint at time-step $h$. We further present numerical simulations that corroborate our theoretical findings.
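
To make the idea of double pessimism concrete, the following is a minimal tabular sketch. It is not the Safe-DPVI algorithm itself, whose details are not given in this abstract. The simplifications are all assumptions made for illustration: a per-step, per-state-action cost threshold `tau[h]`, a count-based uncertainty width `beta / sqrt(n)`, deterministic policies, and a fallback action `safe_action` that is assumed to be known safe. The function name `doubly_pessimistic_vi` and all parameter names are hypothetical.

```python
import numpy as np

def doubly_pessimistic_vi(counts, r_hat, c_hat, p_hat, tau, beta=1.0, safe_action=0):
    """Illustrative sketch only (assumed setup, not the paper's algorithm).

    counts: (H, S, A) visit counts from the offline dataset
    r_hat:  (H, S, A) empirical mean rewards in [0, 1]
    c_hat:  (H, S, A) empirical mean costs in [0, 1]
    p_hat:  (H, S, A, S) empirical transition probabilities
    tau:    (H,) per-step cost thresholds (assumed form of the constraint)
    """
    H, S, A = counts.shape
    # Uncertainty width shrinks with the number of observations.
    bonus = beta / np.sqrt(np.maximum(counts, 1))
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Pessimism 1: an action counts as safe only if an upper
        # confidence bound on its cost is below the threshold.
        safe = (c_hat[h] + bonus[h]) <= tau[h]
        safe[:, safe_action] = True  # assumed known-safe fallback
        # Pessimism 2: rank safe actions by a lower confidence bound
        # on their value.
        q = r_hat[h] - bonus[h] + p_hat[h] @ V[h + 1]
        q = np.clip(q, 0.0, H - h)
        q_safe = np.where(safe, q, -np.inf)
        policy[h] = q_safe.argmax(axis=1)
        V[h] = q_safe.max(axis=1)
    return policy, V[0]
```

In this sketch, an action with a high empirical reward but few observations is excluded twice over: its cost upper bound is likely to exceed the threshold, and its value lower bound is heavily penalized.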