DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
Aviral Kumar, Abhishek Gupta, Sergey Levine
Spotlight presentation: Orals & Spotlights Track 04: Reinforcement Learning
on 2020-12-07T19:20:00-08:00 - 2020-12-07T19:30:00-08:00
Poster Session 1
on 2020-12-07T21:00:00-08:00 - 2020-12-07T23:00:00-08:00
GatherTown: Reinforcement learning and planning ( Town D0 - Spot A1 )
Abstract: Deep reinforcement learning can learn effective policies for a wide range of tasks, but it is notoriously difficult to use because of instability and sensitivity to hyperparameters. The reasons for this remain unclear. In this paper, we study how RL methods based on bootstrapped Q-learning can suffer from a pathological interaction between function approximation and the data distribution used to train the Q-function. With standard supervised learning, online data collection should induce corrective feedback, where new data corrects mistakes in old predictions. With dynamic programming methods like Q-learning, such feedback may be absent. This can lead to instability, sub-optimal convergence, and poor results when learning from noisy, sparse, or delayed rewards. Based on these observations, we propose a new algorithm, DisCor, which explicitly optimizes for data distributions that can correct for accumulated errors in the value function. DisCor computes a tractable approximation to the distribution that optimally induces corrective feedback, which we show amounts to reweighting samples based on the estimated accuracy of their target values. Training with this distribution, DisCor yields substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals.
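The reweighting idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not stated in the abstract: the exponential form of the weights, the `gamma` and `temperature` parameters, the hypothetical function names, and the premise that a per-sample estimate of target error is already available. Transitions whose bootstrapped targets are estimated to be less accurate receive smaller weights in the Bellman regression loss.

```python
import math


def corrective_weights(target_errors, gamma=0.99, temperature=1.0):
    """Turn estimated target errors into normalized sample weights.

    Assumption (not from the abstract): weights are proportional to
    exp(-gamma * error / temperature), so a larger estimated error in a
    sample's target value gives that sample a smaller weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not target_errors:
        return []
    scores = [-gamma * err / temperature for err in target_errors]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_bellman_loss(q_values, targets, weights):
    """Weighted squared error between Q-value predictions and their targets."""
    return sum(w * (q - y) ** 2 for q, y, w in zip(q_values, targets, weights))


if __name__ == "__main__":
    # Hypothetical estimated errors of the targets for three sampled transitions.
    errors = [0.1, 1.0, 5.0]
    weights = corrective_weights(errors, gamma=0.99, temperature=1.0)
    loss = weighted_bellman_loss([1.0, 2.0, 3.0], [0.5, 2.5, 0.0], weights)
    print(weights, loss)
```

In this sketch the temperature controls how sharply the weighting favors samples with accurate targets: a large value approaches uniform weighting, while a small value concentrates nearly all weight on the samples with the lowest estimated error. How the error estimates themselves are obtained is left out of the sketch.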