Poster
Defining and Characterizing Reward Gaming
Joar Skalse · Nikolaus Howe · Dmitrii Krasheninnikov · David Krueger
We provide the first formal definition of \textbf{reward hacking}, a phenomenon where optimizing an imperfect proxy reward function, $\mathcal{\tilde{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$. We say that a proxy is \textbf{unhackable} if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it ``narrower'') or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
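Over a finite policy set, the unhackability condition can be checked directly: for every ordered pair of policies, a strict increase in expected proxy return must never come with a strict decrease in expected true return. Below is a minimal sketch of that check. The helper name `is_unhackable` and all visit counts and reward values are illustrative assumptions, not taken from the paper; the example only uses the linearity of expected return in (discounted) state-action visit counts, $J(\pi) = \eta_\pi \cdot \mathcal{R}$, noted in the abstract.

```python
import itertools

import numpy as np


def is_unhackable(proxy_returns, true_returns):
    """True iff, over this policy set, strictly increasing the expected
    proxy return never strictly decreases the expected true return."""
    for i, j in itertools.permutations(range(len(proxy_returns)), 2):
        if proxy_returns[i] > proxy_returns[j] and true_returns[i] < true_returns[j]:
            return False
    return True


# Expected return is linear in the discounted state-action visit counts:
# J(pi) = eta_pi . R.  Rows are visit-count vectors of three policies over
# two state-action pairs (illustrative numbers only).
etas = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.5, 0.5]])
true_reward = np.array([1.0, 0.0])

aligned_proxy = np.array([1.0, 0.2])   # induces the same policy ordering as true_reward
hackable_proxy = np.array([0.0, 1.0])  # reverses the ordering

print(is_unhackable(etas @ aligned_proxy, etas @ true_reward))   # True
print(is_unhackable(etas @ hackable_proxy, etas @ true_reward))  # False
```

As the abstract notes, over the set of *all* stochastic policies this condition is so strong that it forces one of the two reward functions to be constant; the check above is only meaningful for finite policy sets.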
Author Information
Joar Skalse (University of Oxford)
Nikolaus Howe (Mila, Université de Montréal)
Dmitrii Krasheninnikov (University of Cambridge)
David Krueger (University of Cambridge)
More from the Same Authors
- 2021 Spotlight: Reinforcement Learning in Newcomblike Environments »
  James Bell · Linda Linsefors · Caspar Oesterheld · Joar Skalse
- 2021: Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models »
  Enoch Tetteh · David Krueger · Joseph Paul Cohen · Yoshua Bengio
- 2022: Domain Generalization for Robust Model-Based Offline Reinforcement Learning »
  Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
- 2022: Mechanistic Lens on Mode Connectivity »
  Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka
- 2022: Domain Generalization for Robust Model-Based Offline RL »
  Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
- 2022: On The Fragility of Learned Reward Functions »
  Lev McKinney · Yawen Duan · Adam Gleave · David Krueger
- 2022: Training Equilibria in Reinforcement Learning »
  Lauro Langosco · David Krueger · Adam Gleave
- 2022: Assistance with large language models »
  Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger
- 2022: Misspecification in Inverse Reinforcement Learning »
  Joar Skalse · Alessandro Abate
- 2022: The Reward Hypothesis is False »
  Joar Skalse · Alessandro Abate
- 2022: A general framework for reward function distances »
  Erik Jenner · Joar Skalse · Adam Gleave
- 2022: All’s Well That Ends Well: Avoiding Side Effects with Distance-Impact Penalties »
  Charlie Griffin · Joar Skalse · Lewis Hammond · Alessandro Abate
- 2022: Unifying Grokking and Double Descent »
  Xander Davies · Lauro Langosco · David Krueger
- 2023 Poster: Thinker: Learning to Plan and Act »
  Stephen Chung · Ivan Anokhin · David Krueger
- 2023 Workshop: Socially Responsible Language Modelling Research (SoLaR) »
  Usman Anwar · David Krueger · Samuel Bowman · Jakob Foerster · Su Lin Blodgett · Roberta Raileanu · Alan Chan · Katherine Lee · Laura Ruis · Robert Kirk · Yawen Duan · Xin Chen · Kawin Ethayarajh
- 2022 Poster: Myriad: a real-world testbed to bridge trajectory optimization and deep learning »
  Nikolaus Howe · Simon Dufort-Labbé · Nitarshan Rajkumar · Pierre-Luc Bacon
- 2021 Poster: Reinforcement Learning in Newcomblike Environments »
  James Bell · Linda Linsefors · Caspar Oesterheld · Joar Skalse
- 2019: Poster Session »
  Rishav Chourasia · Yichong Xu · Corinna Cortes · Chien-Yi Chang · Yoshihiro Nagano · So Yeon Min · Benedikt Boecking · Phi Vu Tran · Kamyar Ghasemipour · Qianggang Ding · Shouvik Mani · Vikram Voleti · Rasool Fakoor · Miao Xu · Kenneth Marino · Lisa Lee · Volker Tresp · Jean-Francois Kagy · Marvin Zhang · Barnabas Poczos · Dinesh Khandelwal · Adrien Bardes · Evan Shelhamer · Jiacheng Zhu · Ziming Li · Xiaoyan Li · Dmitrii Krasheninnikov · Ruohan Wang · Mayoore Jaiswal · Emad Barsoum · Suvansh Sanjeev · Theeraphol Wattanavekin · Qizhe Xie · Sifan Wu · Yuki Yoshida · David Kanaa · Sina Khoshfetrat Pakazad · Mehdi Maasoumy