Structural credit assignment in neural networks is a long-standing problem, and a variety of alternatives to backpropagation have been proposed to allow nodes to be trained locally. One of the early strategies was to treat each node as an agent and use the reinforcement learning method REINFORCE to update each node locally using only a global reward signal. In this work, we revisit this approach and investigate whether other reinforcement learning methods can be leveraged to improve learning. We first formalize training a neural network as a finite-horizon reinforcement learning problem and discuss how this formulation facilitates using ideas from reinforcement learning, such as off-policy learning. We show that the standard on-policy REINFORCE approach, even with a variety of variance reduction techniques, learns suboptimal solutions. We then introduce an off-policy approach that facilitates reasoning about each agent's greedy action and helps overcome the stochasticity of the other agents. We conclude by showing that these networks of agents can be more robust to correlated samples when learning online.
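To make the node-as-agent idea concrete, the following is a minimal sketch, not the paper's actual algorithm or code, of on-policy REINFORCE applied to a single hidden layer of Bernoulli-logistic units, each updated locally from one shared global reward. All names, step sizes, and the running-average baseline are illustrative assumptions.

```python
import numpy as np

# Sketch only: one hidden layer of Bernoulli-logistic "agents" trained with
# REINFORCE from a single global reward; the output layer is deterministic and
# trained with an ordinary gradient step. Hyperparameters are hypothetical.
rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
W_h = rng.normal(scale=0.1, size=(n_hidden, n_in))  # one weight row per hidden agent
w_out = rng.normal(scale=0.1, size=n_hidden)        # deterministic output weights
alpha, beta, baseline = 0.05, 0.05, 0.0             # step sizes and reward baseline

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, y):
    """One online update from a single (input, target) pair."""
    global W_h, w_out, baseline
    p = sigmoid(W_h @ x)                           # each agent's probability of firing
    h = (rng.random(n_hidden) < p).astype(float)   # sampled binary activations
    y_hat = w_out @ h                              # network prediction
    reward = -(y_hat - y) ** 2                     # global reward broadcast to all agents
    # REINFORCE update per agent: grad of log Pr(h_i | x) w.r.t. row i is (h_i - p_i) * x
    W_h += alpha * (reward - baseline) * np.outer(h - p, x)
    # Ordinary gradient ascent on the reward for the deterministic output layer
    w_out += beta * 2.0 * (y - y_hat) * h
    baseline += 0.01 * (reward - baseline)         # running-average baseline for variance reduction
    return reward

# Usage example: fit a random linear target from the binary-agent features.
W_true = rng.normal(size=n_in)
for t in range(5000):
    x = rng.normal(size=n_in)
    step(x, W_true @ x)
```

Each hidden unit only ever sees its own input, its own sampled action, and the scalar reward, which is the locality property that motivates the agent view; the abstract's off-policy variant would replace the on-policy REINFORCE update above.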
Author Information
Dhawal Gupta (University of Alberta)
Gabor Mihucz (University of Alberta)
Matthew Schlegel (University of Alberta)
An AI and coffee enthusiast with research experience in RL and ML. Currently pursuing a PhD at the University of Alberta! Excited about off-policy policy evaluation, general value functions, understanding the behavior of artificial neural networks, and cognitive science (specifically cognitive neuroscience).
James Kostas (University of Massachusetts, Amherst)
Philip Thomas (University of Massachusetts Amherst)
Martha White (University of Alberta)
More from the Same Authors
- 2022: Optimization using Parallel Gradient Evaluations on Multiple Parameters
  Yash Chandak · Shiv Shankar · Venkata Gandikota · Philip Thomas · Arya Mazumdar
- 2022 Poster: Off-Policy Evaluation for Action-Dependent Non-stationary Environments
  Yash Chandak · Shiv Shankar · Nathaniel Bastian · Bruno da Silva · Emma Brunskill · Philip Thomas
- 2021: Q&A for Philip Thomas
  Philip Thomas
- 2021: Advances in (High-Confidence) Off-Policy Evaluation
  Philip Thomas
- 2021: Invited Speaker Panel
  Sham Kakade · Minmin Chen · Philip Thomas · Angela Schoellig · Barbara Engelhardt · Doina Precup · George Tucker
- 2021 Poster: SOPE: Spectrum of Off-Policy Estimators
  Christina Yuan · Yash Chandak · Stephen Giguere · Philip Thomas · Scott Niekum
- 2021 Poster: Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs
  harsh satija · Philip Thomas · Joelle Pineau · Romain Laroche
- 2021 Poster: Continual Auxiliary Task Learning
  Matthew McLeod · Chunlok Lo · Matthew Schlegel · Andrew Jacobsen · Raksha Kumaraswamy · Martha White · Adam White
- 2021 Poster: Universal Off-Policy Evaluation
  Yash Chandak · Scott Niekum · Bruno da Silva · Erik Learned-Miller · Emma Brunskill · Philip Thomas
- 2020 Tutorial: (Track3) Policy Optimization in Reinforcement Learning Q&A
  Sham M Kakade · Martha White · Nicolas Le Roux
- 2020 Poster: Towards Safe Policy Improvement for Non-Stationary MDPs
  Yash Chandak · Scott Jordan · Georgios Theocharous · Martha White · Philip Thomas
- 2020 Spotlight: Towards Safe Policy Improvement for Non-Stationary MDPs
  Yash Chandak · Scott Jordan · Georgios Theocharous · Martha White · Philip Thomas
- 2020 Poster: Security Analysis of Safe and Seldonian Reinforcement Learning Algorithms
  Pinar Ozisik · Philip Thomas
- 2020 Tutorial: (Track3) Policy Optimization in Reinforcement Learning
  Sham M Kakade · Martha White · Nicolas Le Roux
- 2019 Poster: Offline Contextual Bandits with High Probability Fairness Guarantees
  Blossom Metevier · Stephen Giguere · Sarah Brockman · Ari Kobren · Yuriy Brun · Emma Brunskill · Philip Thomas
- 2019 Poster: Importance Resampling for Off-policy Prediction
  Matthew Schlegel · Wesley Chung · Daniel Graves · Jian Qian · Martha White
- 2019 Poster: A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning
  Francisco Garcia · Philip Thomas
- 2018 Poster: Context-dependent upper-confidence bounds for directed exploration
  Raksha Kumaraswamy · Matthew Schlegel · Adam White · Martha White
- 2017 Poster: Multi-view Matrix Factorization for Linear Dynamical System Estimation
  Mahdi Karami · Martha White · Dale Schuurmans · Csaba Szepesvari
- 2015 Poster: Policy Evaluation Using the Ω-Return
  Philip Thomas · Scott Niekum · Georgios Theocharous · George Konidaris
- 2013 Poster: Projected Natural Actor-Critic
  Philip Thomas · William C Dabney · Stephen Giguere · Sridhar Mahadevan
- 2011 Poster: TDγ: Re-evaluating Complex Backups in Temporal Difference Learning
  George Konidaris · Scott Niekum · Philip Thomas
- 2011 Poster: Policy Gradient Coagent Networks
  Philip Thomas