Workshop: Offline Reinforcement Learning

What Would the Expert $do(\cdot)$?: Causal Imitation Learning

Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu


We develop algorithms for imitation learning from policy data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the classical instrumental variable regression (IVR) technique, enabling us to recover the causally correct underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. We discuss, from the perspective of performance, the types of confounding under which it is better to use an IVR-based technique instead of behavioral cloning and vice versa. We find both of our algorithms compare favorably to behavioral cloning on a simulated rocket landing task.
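To ground the classical instrumental variable regression (IVR) idea the abstract builds on, here is a minimal two-stage least squares (2SLS) sketch, not the paper's DoubIL or ResiduIL algorithms. The variable names and the synthetic data-generating process are illustrative assumptions: an unobserved confounder `u` biases ordinary regression, while an instrument `z` (correlated with the input but independent of the confounder) recovers the causal coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Unobserved confounder u affects both the input x and the output y,
# inducing the spurious correlation a naive learner would latch on to.
u = rng.normal(size=n)
z = rng.normal(size=n)  # instrument: drives x, independent of u
x = 2.0 * z + u + 0.1 * rng.normal(size=n)
y = 3.0 * x + 2.0 * u + 0.1 * rng.normal(size=n)  # true causal effect: 3

# Ordinary least squares absorbs the confounder and is biased.
ols = (x @ y) / (x @ x)

# Two-stage least squares: first project x onto the instrument z,
# then regress y on the projected (confounder-free) part of x.
x_hat = z * ((z @ x) / (z @ z))
tsls = (x_hat @ y) / (x_hat @ x_hat)

print(f"OLS estimate:  {ols:.3f}")   # biased away from 3
print(f"2SLS estimate: {tsls:.3f}")  # close to 3
```

In the paper's imitation-learning setting, past states play the role of the instrument, which is what lets the causally correct policy be recovered without an interactive expert.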
