

SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations

Youngsoo Jang · Geon-Hyeong Kim · Jongmin Lee · Sungryull Sohn · Byoungjip Kim · Honglak Lee · Moontae Lee

Great Hall & Hall B1+B2 (level 1) #1412
Thu 14 Dec 3 p.m. PST — 5 p.m. PST


We consider offline safe imitation learning (IL), where the agent aims to learn a safe policy that mimics preferred behavior while avoiding non-preferred behavior, given non-preferred demonstrations and unlabeled demonstrations. This problem setting corresponds to various real-world scenarios where satisfying safety constraints is more important than maximizing the expected return. However, learning a policy that avoids constraint-violating (i.e., non-preferred) behavior is very challenging, in contrast to standard imitation learning, which learns a policy that mimics the given demonstrations. In this paper, we present SafeDICE, a hyperparameter-free offline safe IL algorithm that learns a safe policy by leveraging the non-preferred demonstrations in the space of stationary distributions. Our algorithm directly estimates the stationary distribution corrections of the policy that imitates the demonstrations while excluding the non-preferred behavior. In the experiments, we demonstrate that our algorithm learns a safer policy that satisfies the cost constraint without degrading reward performance, compared to baseline algorithms.
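To make the idea of a "stationary distribution correction" concrete, the following is a minimal tabular sketch, not the SafeDICE algorithm itself. It assumes the unlabeled data is a mixture d_U = (1 - alpha) * d_P + alpha * d_N of preferred and non-preferred state-action distributions, and, unlike the hyperparameter-free method described above, it assumes the mixing ratio `alpha` is known. The function names `correction_weights` and `weighted_bc_policy`, the toy states and actions, and the use of empirical counts in place of a learned estimator are all illustrative assumptions.

```python
from collections import Counter


def correction_weights(unlabeled, non_preferred, alpha):
    """Toy correction w(s, a) = max(0, d_U - alpha * d_N) / ((1 - alpha) * d_U).

    d_U and d_N are empirical state-action frequencies. `alpha` is an
    ASSUMED known fraction of non-preferred data inside the unlabeled set.
    """
    count_u = Counter(unlabeled)
    count_n = Counter(non_preferred)
    n_u, n_n = len(unlabeled), len(non_preferred)
    weights = {}
    for sa, c in count_u.items():
        p_u = c / n_u
        p_n = count_n[sa] / n_n  # Counter returns 0 for unseen pairs
        weights[sa] = max(0.0, p_u - alpha * p_n) / ((1.0 - alpha) * p_u)
    return weights


def weighted_bc_policy(unlabeled, weights):
    """Tabular weighted behavior cloning: pi(a | s) proportional to summed weights."""
    mass = {}
    for s, a in unlabeled:
        per_state = mass.setdefault(s, {})
        per_state[a] = per_state.get(a, 0.0) + weights[(s, a)]
    policy = {}
    for s, acts in mass.items():
        total = sum(acts.values())
        if total > 0:
            policy[s] = {a: m / total for a, m in acts.items()}
        else:
            # Every sample in this state was down-weighted to zero:
            # fall back to uniform over the observed actions.
            policy[s] = {a: 1.0 / len(acts) for a in acts}
    return policy


# Hypothetical toy data: one state, a "safe" and a "risky" action.
unlabeled = [("s0", "safe")] * 6 + [("s0", "risky")] * 4
non_preferred = [("s0", "risky")] * 5

weights = correction_weights(unlabeled, non_preferred, alpha=0.4)
policy = weighted_bc_policy(unlabeled, weights)
print(policy)
```

In this toy data, all of the "risky" samples in the unlabeled set are explained by the non-preferred distribution, so their weight drops to zero and the cloned policy puts its probability on "safe". A practical method has to estimate such ratios from samples in large or continuous spaces, which is where an optimization-based estimator over stationary distributions comes in.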
