
Biological and Artificial Reinforcement Learning
Raymond Chua · Sara Zannone · Feryal Behbahani · Rui Ponte Costa · Claudia Clopath · Blake Richards · Doina Precup

Fri Dec 13 08:00 AM -- 06:00 PM (PST) @ West Ballroom C
Event URL: https://sites.google.com/view/biologicalandartificialrl/

Reinforcement learning (RL) algorithms learn through rewards and a process of trial-and-error. This approach was strongly inspired by the study of animal behaviour and has led to outstanding achievements in machine learning (e.g. in games, robotics, science). However, artificial agents still struggle with a number of difficulties, such as sample efficiency, learning in dynamic environments and over multiple timescales, and generalizing and transferring knowledge. Biological agents, on the other hand, excel at these tasks. The brain has evolved to adapt and learn in dynamic environments, while integrating information and learning on different timescales and for different durations. Animals and humans are able to extract information from the environment in efficient ways by directing their attention and actively choosing what to focus on. They can achieve complicated tasks by solving sub-problems and combining knowledge, as well as by representing the environment in efficient ways and planning their decisions off-line. Neuroscience and cognitive science research has largely focused on elucidating the workings of these mechanisms. Learning more about the neural and cognitive underpinnings of these functions could be key to developing more intelligent and autonomous agents. Similarly, having a computational and theoretical framework, together with a normative perspective to refer to, could and does contribute to elucidating the mechanisms used by animals and humans to perform these tasks. Building on the connection between biological and artificial reinforcement learning, our workshop will bring together leading and emerging researchers from Neuroscience, Psychology and Machine Learning to share: (i) how neural and cognitive mechanisms can provide insights to tackle challenges in RL research and (ii) how machine learning advances can help further our understanding of the brain and behaviour.

Fri 9:00 a.m. - 9:15 a.m.
Opening Remarks (Talk)
Raymond Chua, Feryal Behbahani, Sara Zannone, Rui Ponte Costa, Claudia Clopath, Doina Precup, Blake Richards
Fri 9:15 a.m. - 9:45 a.m.
Invited Talk #1: From brains to agents and back (Talk)
Jane Wang
Fri 9:45 a.m. - 10:30 a.m.
Coffee Break & Poster Session (Poster Session)
Samia Mohinta, Andrea Agostinelli, Alexandra Moringen, Jee Hang Lee, Richie Lo, Wolfgang Maass, Blue Sheffer, Colin Bredenberg, Benjamin Eysenbach, Liyu Xia, Efstratios Markou, Malte Lichtenberg, Pierre Richemond, Tony Zhang, J.B. Lanier, Baihan Lin, Liam Fedus, Glen Berseth, Marta Sarrico, Matthew Crosby, Stephen McAleer, Sina Ghiassian, Franz Scherr, Guillaume Bellec, Darjan Salaj, Arinbjörn Kolbeinsson, Matthew Rosenberg, Jaehoon Shin, Sang Wan Lee, Guillermo Cecchi, Irina Rish, Elias Hajek
Fri 10:30 a.m. - 10:45 a.m.

Humans are great at using prior knowledge to solve novel tasks, but how they do so is not well understood. Recent work showed that in contextual multi-armed bandit environments, humans create simple one-step policies that they can transfer to new contexts by inferring context clusters. However, the daily tasks humans face are often temporally extended, and demand more complex, hierarchically structured skills. The options framework provides a potential solution for representing such transferable skills. Options are abstract multi-step policies, assembled from simple actions or other options, that can represent meaningful reusable skills. We developed a novel two-stage decision-making protocol to test if humans learn and transfer multi-step options. We found transfer effects at multiple levels of policy complexity that could not be explained by flat reinforcement learning models. We also devised an option model that can qualitatively replicate the transfer effects in human participants. Our results provide evidence that humans create options, and use them to explore in novel contexts, consequently transferring past knowledge and speeding up learning.

Liyu Xia
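
As a rough illustration of the options idea described in the abstract above (a multi-step policy with a start condition and a stop condition), the following Python sketch chains two options in a toy corridor. Everything in it is an assumption made for illustration: the corridor, the `Option` fields, and the names `go_to_door` and `go_to_goal` are hypothetical and are not the protocol or the model used in the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Option:
    """A temporally extended action (hypothetical minimal form)."""
    name: str
    can_start: Callable[[int], bool]    # initiation set: states where the option may begin
    policy: Callable[[int], int]        # maps a state to a primitive action
    should_stop: Callable[[int], bool]  # termination condition


def run_option(option: Option, state: int,
               step: Callable[[int, int], int], max_steps: int = 100) -> List[int]:
    """Follow the option's policy until it terminates; return the visited states."""
    if not option.can_start(state):
        raise ValueError(f"{option.name} cannot start in state {state}")
    trajectory = [state]
    for _ in range(max_steps):
        if option.should_stop(state):
            break
        state = step(state, option.policy(state))
        trajectory.append(state)
    return trajectory


def corridor_step(state: int, action: int) -> int:
    """Toy environment (assumed): a corridor of states 0..9, actions are -1 or +1."""
    return min(max(state + action, 0), 9)


# Two reusable skills; a higher-level policy only has to pick which one to run.
go_to_door = Option("go_to_door", lambda s: s < 5, lambda s: +1, lambda s: s == 5)
go_to_goal = Option("go_to_goal", lambda s: s >= 5, lambda s: +1, lambda s: s == 9)

path = run_option(go_to_door, 2, corridor_step)
path += run_option(go_to_goal, path[-1], corridor_step)[1:]
print(path)  # [2, 3, 4, 5, 6, 7, 8, 9]
```

A flat learner would have to choose a primitive action at every one of these steps; with options, a choice is made only when an option terminates, which is what makes a learned skill reusable in a new context.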
Fri 10:45 a.m. - 11:00 a.m.

Recurrent neural networks underlie the astounding information processing capabilities of the brain, and play a key role in many state-of-the-art algorithms in deep reinforcement learning. But it has remained an open question how such networks could learn from rewards in a biologically plausible manner, with synaptic plasticity that is both local and online. We describe such an algorithm that approximates actor-critic policy gradient in recurrent neural networks. Building on e-prop, an approximation of backpropagation through time (BPTT), and using the equivalence between the forward and backward views in reinforcement learning (RL), we formulate a novel learning rule for RL that is both online and local, called reward-based e-prop. This learning rule uses neuroscience-inspired slow processes and top-down signals, while still being rigorously derived as an approximation to actor-critic policy gradient. To empirically evaluate this algorithm, we consider a delayed reaching task, where an arm is controlled using a recurrent network of spiking neurons. In this task, we show that reward-based e-prop performs as well as an agent trained with actor-critic policy gradient with biologically implausible BPTT.

Wolfgang Maass
Fri 11:00 a.m. - 11:30 a.m.

In the 1950s, Daniel Berlyne wrote extensively about the importance of curiosity – our intrinsic desire to know. To understand curiosity, Berlyne argued, we must explain why humans exert so much effort to obtain knowledge, and how they decide which questions to explore, given that exploration is difficult and its long-term benefits are impossible to ascertain. I propose that these questions, although relatively neglected in neuroscience research, are key to understanding cognition and complex decision making of the type that humans routinely engage in and autonomous agents only aspire to. I will describe our investigations of these questions in two types of paradigms. In one paradigm, agents are placed in contexts with different levels of uncertainty and reward probability and can sample information about the eventual outcome. We find that, in humans and monkeys, information sampling is partially sensitive to uncertainty but is also biased by Pavlovian tendencies, which push agents to engage with signals predicting positive outcomes and avoid those predicting negative outcomes in ways that interfere with a reduction of uncertainty. In a second paradigm, agents are given several tasks of different difficulty and can freely organize their exploration in order to learn. In these contexts, uncertainty-based heuristics become ineffective, and optimal strategies are instead based on learning progress – the ability to first engage with and later reduce uncertainty. I will show evidence that humans are motivated to select difficult tasks consistent with learning maximization, but they guide their task selection according to success rates rather than learning progress per se, which risks trapping them in tasks with too high levels of difficulty (e.g., random unlearnable tasks). 
Together, the results show that information demand has consistent features that can be quantitatively measured at various levels of complexity, and a research agenda exploring these features will greatly expand our understanding of complex decision strategies.

Jacqueline Gottlieb
Fri 11:30 a.m. - 12:00 p.m.

Reinforcement Learning's principles of temporal difference learning can drive representation learning, even in the absence of rewards. Representation learning is especially important in problems that require a cognitive map (Tolman, 1947), common in mammalian spatial navigation and non-spatial inference, e.g., shortcut and latent learning, policy revaluation, and remapping. Here I focus on models of predictive cognitive maps that learn successor representations (SR) at multiple scales, and use replay to update SR maps similar to Dyna models (SR-Dyna). SR- and SR-Dyna-based representation learning capture biological representation learning reflected in place, grid, and distance-to-goal cell firing patterns (Stachenfeld et al. 2017, Momennejad and Howard 2018), the interaction between boundary vector cells and place cells (De Cothi and Barry 2019), subgoal learning (Weinstein and Botvinick 2014), remapping, policy revaluation, and latent learning behavior (Momennejad et al. 2017; Russek, Momennejad et al. 2017). The SR framework makes testable predictions about representation learning in biological systems: e.g., about how predictive features are extracted from visual experience and abstracted into spatial representations that guide navigation. Specifically, the SR is sensitive to the policy the animal has taken during navigation, generating predictions about the representation of goals and how rewarding locations distort the predictive map. Finally, deep RL using SR has been shown to support option discovery, which is especially useful for empowering agents with intrinsic motivation in environments that have sparse rewards and complex structures. These findings can lead to novel directions of human and animal experimentation. I will summarize behavioral and neural findings in human and rodent studies by us and other groups and discuss the road ahead.

Ida Momennejad
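
To make concrete the claim in the abstract above that temporal difference learning can shape a representation without any reward, here is a minimal NumPy sketch of a tabular successor representation learned with TD(0). The ring-shaped random-walk environment, the function names, and the learning-rate and discount values are assumptions chosen for illustration; multi-scale SRs and the replay step of SR-Dyna are not shown.

```python
import numpy as np


def learn_sr_td(transitions, n_states, gamma=0.9, alpha=0.1):
    """Estimate the successor representation M from (s, s_next) pairs with TD(0).

    M[s, j] approximates the expected discounted number of future visits to
    state j when starting in s. No reward appears anywhere in the update.
    """
    M = np.zeros((n_states, n_states))
    eye = np.eye(n_states)
    for s, s_next in transitions:
        # TD target: occupancy of the current state plus discounted future occupancy.
        target = eye[s] + gamma * M[s_next]
        M[s] += alpha * (target - M[s])
    return M


def random_walk_ring(n_states, n_steps, rng):
    """Hypothetical environment: an unbiased random walk on a ring of states."""
    s = 0
    transitions = []
    for _ in range(n_steps):
        s_next = int((s + rng.choice([-1, 1])) % n_states)
        transitions.append((s, s_next))
        s = s_next
    return transitions


rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
M_td = learn_sr_td(random_walk_ring(n_states, 50_000, rng),
                   n_states, gamma=gamma, alpha=0.05)

# For a fixed policy with transition matrix T, the SR has the closed form
# M = (I - gamma * T)^-1, which the TD estimate should approach.
T = np.zeros((n_states, n_states))
for s in range(n_states):
    T[s, (s - 1) % n_states] = 0.5
    T[s, (s + 1) % n_states] = 0.5
M_exact = np.linalg.inv(np.eye(n_states) - gamma * T)

# Once a reward vector is known, values follow from one matrix-vector product.
reward = np.zeros(n_states)
reward[2] = 1.0
V_td = M_td @ reward
```

Because `M` depends on the transitions the agent actually experienced, a different policy gives a different map, which is the policy sensitivity the abstract refers to. Changing `reward` changes `V_td` immediately without relearning `M`.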
Fri 12:00 p.m. - 2:00 p.m.
Lunch Break & Poster Session
Fri 2:00 p.m. - 2:30 p.m.

AI and robotics have made inspiring progress in recent years on training systems to solve specific, well-defined tasks. But the need to specify tasks bounds the level of complexity that can ultimately be reached in training with such an approach. The sharp distinction between training and deployment stages likewise limits the degree to which these systems can improve and adapt after training. In my talk, I will advocate for multi-agent interaction and online optimization processes as key ingredients towards overcoming these limitations.

In the first part, I will show that through multi-agent competition, a simple objective such as a hide-and-seek game, and standard reinforcement learning algorithms at scale, agents can create a self-supervised autocurriculum with multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. Multi-agent interaction leads to behaviors that center around more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation, and holds the promise of open-ended growth of complexity.

In the second part, I will argue for the usefulness and generality of online optimization processes and show examples of incorporating them in model-based control and generative modeling contexts via energy-based models. I will show intriguing advantages, such as compositionality and robustness to distribution shift, non-stationarity, and adversarial attacks in generative modeling problems, and planned exploration and fast adaptation to changing environments in control problems.

This is joint work with many wonderful colleagues and students at OpenAI, MIT, University of Washington, and UC Berkeley.

Igor Mordatch
Fri 2:30 p.m. - 3:00 p.m.

I will describe how alternatives to conventional neural networks that are very loosely biologically inspired can improve meta-learning, including continual learning. First, I will summarize differentiable Hebbian learning and differentiable neuromodulated Hebbian learning (aka “backpropamine”). Both are techniques for training deep neural networks with synaptic plasticity, meaning the weights can change during meta-testing/inference. Whereas meta-learned RNNs can only store within-episode information in their activations, such plastic Hebbian networks can store information in their weights in addition to their activations, improving performance on some classes of problems. Second, I will describe a new, unpublished method that improves the state of the art in continual learning. ANML (A Neuromodulated Meta-Learning algorithm) meta-learns a neuromodulatory network that gates the activity of the main prediction network, enabling the learning of up to 600 simple tasks sequentially.

Jeff Clune
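
The idea in the abstract above that a plastic network can store within-episode information in its weights can be sketched with a single layer whose effective weight is a fixed part plus a Hebbian trace. This forward-pass-only NumPy sketch assumes a common form of differentiable plasticity (effective weight `w + alpha * hebb`, with a running-average trace); the class name, the sizes, and the scalar `modulation` gate are hypothetical simplifications, not the exact backpropamine or ANML update. In the methods the talk summarizes, parameters such as `w` and `alpha` would be trained by gradient descent, which is omitted here.

```python
import numpy as np


class PlasticLayer:
    """One layer with fixed weights plus a fast Hebbian trace (illustrative sketch)."""

    def __init__(self, n_in, n_out, eta=0.1, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.w = rng.normal(scale=0.1, size=(n_in, n_out))      # slow weights
        self.alpha = rng.normal(scale=0.1, size=(n_in, n_out))  # per-synapse plasticity
        self.eta = eta                                           # trace update rate
        self.hebb = np.zeros((n_in, n_out))                      # fast trace, zero at episode start

    def step(self, x, modulation=1.0):
        """Compute the output, then update the trace from pre/post activity.

        `modulation` is an assumed scalar gate in [0, 1]; 0 freezes the trace.
        """
        y = np.tanh(x @ (self.w + self.alpha * self.hebb))
        rate = self.eta * modulation
        self.hebb = (1.0 - rate) * self.hebb + rate * np.outer(x, y)
        return y


layer = PlasticLayer(n_in=4, n_out=3)
x = np.ones(4)
first = layer.step(x)
second = layer.step(x)  # same input, different output: the trace has changed
```

Presenting the same input twice gives different outputs because `hebb` was updated in between, which is the sense in which information lives in the weights during inference. Passing `modulation=0.0` leaves the trace untouched.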
Fri 3:00 p.m. - 3:30 p.m.

Combining a multi-armed bandit task and Bayesian computational modeling, we find that humans systematically under-estimate reward availability in the environment. This apparent pessimism turns out to be an optimism bias in disguise, and one that compensates for other idiosyncrasies in human learning and decision-making under uncertainty, such as a default tendency to assume non-stationarity in environmental statistics as well as the adoption of a simplistic decision policy. In particular, reward rate underestimation discourages the decision-maker from switching away from a “good” option, thus achieving near-optimal behavior (which never switches away after a win). Furthermore, we demonstrate that the Bayesian model that best predicts human behavior is equivalent to a particular form of Q-learning often used in the brain sciences, thus providing statistical, normative grounding to phenomenological models of human and animal behavior.

Angela Yu
Fri 3:30 p.m. - 4:15 p.m.
Coffee Break & Poster Session (Poster Session)
Fri 4:15 p.m. - 4:30 p.m.

Modern Reinforcement Learning (RL) algorithms, even those with intrinsic reward bonuses, suffer performance plateaus in hard-exploration domains, suggesting these algorithms have reached their ceiling. However, in what we describe as the MEMENTO observation, we find that new agents launched from the position where the previous agent saturated can reliably make further progress. We show that this is not an artifact of limited model capacity or training duration, but rather indicative of interference in learning dynamics between various stages of the domain [Schaul et al., 2019], signatures of multi-task and continual learning. To mitigate interference, we design an end-to-end learning agent which partitions the environment into various segments, and models the value function separately in each score context per Jain et al. [2019]. We demonstrate increased learning performance by this ensemble of agents on Montezuma’s Revenge and further show how this ensemble can be distilled into a single agent with the same model capacity as the original learner. Since the solution is empirically expressible by the original network, this provides evidence of interference, and our approach validates an avenue to circumvent it.

Liam Fedus
Fri 4:30 p.m. - 5:00 p.m.
Invited Talk #7: Richard Sutton (Talk)
Richard Sutton
Fri 5:00 p.m. - 6:00 p.m.
Panel Discussion led by Grace Lindsay (Discussion Panel)
Grace Lindsay, Blake Richards, Doina Precup, Jacqueline Gottlieb, Jeff Clune, Jane Wang, Richard Sutton, Angela Yu, Ida Momennejad

Author Information

Raymond Chua (McGill University / Mila)
Sara Zannone (ICL)
Feryal Behbahani (DeepMind)
Rui Ponte Costa (University of Bristol)
Claudia Clopath (Imperial College London)
Blake Richards (University of Toronto)
Doina Precup (McGill University / Mila / DeepMind Montreal)

More from the Same Authors