In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria (policies that are stable under further training) and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances, such as a flexible policy parametrization. We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution over hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher-reward equilibrium, even when there exists a memoryless optimal policy. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and that parameter noise helps policies escape suboptimal equilibria.
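To make the hidden-state point concrete, below is a minimal sketch (not the paper's environments, algorithms, or code; the environment, hyperparameters, and helper names are illustrative assumptions): a two-state POMDP whose hidden state equals the agent's previous action and whose observation is uninformative, trained with tabular REINFORCE. A memoryless policy can only randomize and plateaus around 0.5 reward per step, while a policy given one step of action memory can recover the hidden state from its own past action and approach the optimum. Note that in this toy the memoryless optimum is itself only 0.5, so it illustrates the hidden-state mechanism rather than the paper's stronger claim about cases where a memoryless optimal policy exists.

```python
# Toy illustration only: two-state POMDP where the hidden state is fully
# determined by the previous action and the observation carries no information.
import numpy as np

rng = np.random.default_rng(0)
HORIZON, EPISODES, LR = 10, 3000, 0.1

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def run_episode(logits, use_memory):
    """Roll out one episode and return (trajectory, total reward)."""
    hidden, prev_action = 0, 0           # next hidden state = last action taken
    traj, total = [], 0.0
    for _ in range(HORIZON):
        # Memoryless: a single context (the constant observation).
        # With memory: condition on the previous action as well.
        ctx = prev_action if use_memory else 0
        probs = softmax(logits[ctx])
        action = int(rng.choice(2, p=probs))
        reward = float(action != hidden)  # rewarded for mismatching the hidden state
        traj.append((ctx, action, reward))
        total += reward
        hidden = prev_action = action
    return traj, total

def train(use_memory):
    n_ctx = 2 if use_memory else 1
    logits = np.zeros((n_ctx, 2))        # tabular softmax policy
    for _ in range(EPISODES):
        traj, _ = run_episode(logits, use_memory)
        grad, g = np.zeros_like(logits), 0.0
        for ctx, action, reward in reversed(traj):
            g += reward                  # undiscounted return-to-go
            probs = softmax(logits[ctx])
            one_hot = np.eye(2)[action]
            grad[ctx] += g * (one_hot - probs)   # grad of log pi(action | ctx)
        logits += LR * grad / HORIZON
    # Evaluate the learned policy on fresh rollouts.
    return np.mean([run_episode(logits, use_memory)[1] / HORIZON
                    for _ in range(200)])

print(f"memoryless policy:      {train(use_memory=False):.2f} reward/step")
print(f"one-step-memory policy: {train(use_memory=True):.2f} reward/step")
```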
Author Information
Lauro Langosco (University of Cambridge)
David Krueger (University of Cambridge)
Adam Gleave (UC Berkeley)
More from the Same Authors
- 2021: Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models
  Enoch Tetteh · David Krueger · Joseph Paul Cohen · Yoshua Bengio
- 2021: Preprocessing Reward Functions for Interpretability
  Erik Jenner · Adam Gleave
- 2022: Domain Generalization for Robust Model-Based Offline Reinforcement Learning
  Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
- 2022: Mechanistic Lens on Mode Connectivity
  Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka
- 2022: Domain Generalization for Robust Model-Based Offline RL
  Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
- 2022: On The Fragility of Learned Reward Functions
  Lev McKinney · Yawen Duan · Adam Gleave · David Krueger
- 2022: Adversarial Policies Beat Professional-Level Go AIs
  Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart Russell
- 2022: Assistance with large language models
  Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger
- 2022: Unifying Grokking and Double Descent
  Xander Davies · Lauro Langosco · David Krueger
- 2023 Poster: Thinker: Learning to Plan and Act
  Stephen Chung · Ivan Anokhin · David Krueger
- 2023 Workshop: Socially Responsible Language Modelling Research (SoLaR)
  Usman Anwar · David Krueger · Samuel Bowman · Jakob Foerster · Su Lin Blodgett · Roberta Raileanu · Alan Chan · Katherine Lee · Laura Ruis · Robert Kirk · Yawen Duan · Xin Chen · Kawin Ethayarajh
- 2022 Poster: Defining and Characterizing Reward Gaming
  Joar Skalse · Nikolaus Howe · Dmitrii Krasheninnikov · David Krueger