
Redeeming intrinsic rewards via constrained policy optimization
Eric Chen · Zhang-Wei Hong · Joni Pajarinen · Pulkit Agrawal

Thu Dec 01 02:00 PM -- 04:00 PM (PST) @ Hall J #120
State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward), which has led to excellent results on some hard exploration tasks. However, recent studies show that on many other tasks intrinsic rewards can bias policy optimization, leading to poor performance compared to optimizing only the environment reward. The low performance results from the agent seeking intrinsic rewards and performing unnecessary exploration even when sufficient environment reward is provided. This inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained policy optimization procedure to eliminate the detrimental effects of intrinsic rewards while preserving their merits when applicable. Our method automatically tunes the importance of the intrinsic reward: it suppresses intrinsic rewards when they are not needed and increases them when exploration is required. The end result is a superior exploration algorithm that does not require manual tuning to balance intrinsic rewards against environment rewards. Experimental results across 61 Atari games validate our claim.
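The abstract's core mechanism, automatically tuning the weight on the intrinsic reward, can be sketched as a Lagrangian-style dual update: grow the weight while extrinsic performance lags a reference (exploration is needed), and shrink it once the reference is matched (exploration is no longer needed). The function names, the learning rate, and the use of a simple extrinsic-return baseline below are illustrative assumptions, not the paper's exact constrained objective:

```python
def update_intrinsic_coeff(coeff, ext_return, baseline_return, lr=0.01):
    """Dual-style update on the intrinsic-reward weight (illustrative sketch).

    Treats "match the extrinsic-only baseline return" as the constraint:
    the multiplier grows while the constraint is violated (extrinsic
    return below baseline) and shrinks once it is satisfied.
    """
    # Constraint violation: how far current extrinsic return falls short
    # of the return achievable without intrinsic rewards.
    violation = baseline_return - ext_return
    # Projected ascent keeps the multiplier non-negative.
    return max(0.0, coeff + lr * violation)


def combined_reward(r_env, r_int, coeff):
    # Scalarized reward fed to the policy-optimization step:
    # environment reward plus the adaptively weighted exploration bonus.
    return r_env + coeff * r_int
```

With this rule, an agent that already collects sufficient environment reward drives the coefficient toward zero, removing the exploration bias; an agent stuck below the baseline sees the coefficient rise, restoring the exploration bonus.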

Author Information

Eric Chen (Massachusetts Institute of Technology)
Zhang-Wei Hong (MIT)
Joni Pajarinen (Aalto University)
Pulkit Agrawal (MIT)
