The Oversight Game: Learning AI Control and Corrigibility in Markov Games
Abstract
As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal interface in which an agent chooses to act autonomously (play) or to defer (ask), while a human simultaneously chooses to be permissive (trust) or to engage in oversight (oversee), which can trigger a correction. We model this interaction as a two-player Markov Game, focusing on the case where it qualifies as a Markov Potential Game (MPG). We show that the MPG structure yields a powerful alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits its own value cannot harm the human's. This model provides a transparent control layer in which the agent learns to defer in risky situations and act autonomously in safe ones, while its pretrained policy remains untouched. In a gridworld simulation, independent learning produces an emergent collaboration that avoids safety violations, demonstrating a practical method for making misaligned models safer after deployment.
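For concreteness, the standard Markov Potential Game condition referenced above can be sketched as follows; the notation ($V_i$ for player $i$'s value, $\Phi$ for the potential, $\pi_i$ and $\pi_{-i}$ for player $i$'s policy and the other player's policy, $s$ for the state) is assumed here rather than taken from the paper:

\[
V_i^{(\pi_i', \pi_{-i})}(s) \;-\; V_i^{(\pi_i, \pi_{-i})}(s)
\;=\;
\Phi^{(\pi_i', \pi_{-i})}(s) \;-\; \Phi^{(\pi_i, \pi_{-i})}(s)
\qquad \text{for all } i,\ s,\ \pi_i,\ \pi_i',\ \pi_{-i}.
\]

In words, any unilateral policy change shifts a player's value by exactly the change in a shared potential; if this holds and the human's value is suitably tied to the potential (the structural assumption mentioned in the abstract), an autonomy-increasing switch that benefits the agent cannot decrease the human's value.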