Learning Guidance Rewards with Trajectory-space Smoothing
Tanmay Gangwani · Yuan Zhou · Jian Peng

Tue Dec 08 09:00 PM -- 11:00 PM (PST) @ Poster Session 2 #597

Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense "guidance" rewards that can be used in place of the sparse or delayed environmental rewards. This paper is in the same vein: starting with a surrogate RL objective that involves smoothing in the trajectory space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation and can be obtained without training any additional neural networks. Because they are easy to integrate, we use the guidance rewards in a few popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present results in single-agent and multi-agent tasks that elucidate the benefit of our approach when the environmental rewards are sparse or delayed.
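
The abstract does not state the formula for the guidance rewards, so the following is only an illustrative sketch of how a dense reward could be derived without any extra network. It assumes, purely for illustration, that every transition in an episode is credited with the min-max normalized return of that episode; the function name `guidance_rewards` and its arguments are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def guidance_rewards(
    episode_returns: Sequence[float],
    episode_lengths: Sequence[int],
) -> List[List[float]]:
    """Assign each step of an episode a dense reward (illustrative assumption).

    Each step receives the episode's return, min-max normalized over all
    stored episodes. No neural network is trained to produce these values.
    """
    lowest, highest = min(episode_returns), max(episode_returns)
    span = highest - lowest
    dense = []
    for ret, length in zip(episode_returns, episode_lengths):
        # If all episodes have the same return, there is no signal to spread.
        value = (ret - lowest) / span if span > 0 else 0.0
        dense.append([value] * length)
    return dense


# Three stored episodes with delayed (episodic) returns and their lengths.
rewards = guidance_rewards([0.0, 10.0, 5.0], [2, 3, 1])
```

Under this assumption, the per-step values could replace the sparse environmental reward when training an off-policy learner such as Q-learning; whether this matches the paper's exact procedure should be checked against the paper itself.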

Author Information

Tanmay Gangwani (University of Illinois, Urbana-Champaign)

I am a Ph.D. student in Computer Science at the University of Illinois, Urbana-Champaign, supervised by Jian Peng. I'm interested in machine learning, especially reinforcement learning. My research focuses on designing algorithms that efficiently leverage expert demonstrations for RL (imitation learning), address the exploration challenge in complex environments, and use generative modeling methods for model-based RL. For details, please visit https://tgangwani.github.io

Yuan Zhou (UIUC)
Jian Peng (University of Illinois at Urbana-Champaign)
