Affinity Workshop: Women in Machine Learning

Adapting the Function Approximation Architecture in Online Reinforcement Learning

John Martin · Joseph Modayil · Fatima Davelouis · Michael Bowling


One of the main learning tasks in Reinforcement Learning (RL) is to approximate the value function: a mapping from the present observation to the expected sum of future rewards. Neural network architectures for value function approximation typically impose sparse connections chosen using prior knowledge of observational structure. When this structure is known, architectures such as convolutional networks, transformers, and graph neural networks can encode it as an inductive bias through fixed connections. However, there are times when observational structure is unavailable or too difficult to encode as an architectural bias, for instance when relating sensors that are randomly dispersed in space. In these situations it is still desirable to approximate value functions with a sparsely connected architecture for computational efficiency. An important open question is whether equally useful representations can be constructed when observational structure is unknown, particularly in the incremental, online setting without access to a replay buffer.

Our work is concerned with how an RL system could construct a value function approximation architecture in the absence of observational structure. We propose an online algorithm that adapts the connections of a neural network using information derived strictly from the learner's experience stream, by way of many parallel auxiliary predictions. Auxiliary predictions are specified as General Value Functions (GVFs) [11], and their weights are used to relate inputs and form subsets we call neighborhoods. Each neighborhood serves as the input to a fully connected, random subnetwork, and these subnetworks provide nonlinear features for a main value function. We validate our algorithm in a synthetic domain with high-dimensional stochastic observations. Results show that our method can adapt an approximation architecture without incurring substantial performance loss, while also discovering a degree of local spatial structure in the observations without prior knowledge.
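
The pipeline described above (GVF weights, then neighborhoods, then random subnetworks, then a main value function) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say how weights are turned into neighborhoods, so the top-k selection by absolute weight, the tanh nonlinearity, the linear TD(0) update for the main value function, and all function names here are assumptions.

```python
import numpy as np


def form_neighborhoods(gvf_weights, k):
    """Select, for each auxiliary prediction, the k inputs with the largest |weight|.

    Assumption: gvf_weights[i, j] is the weight that auxiliary prediction i
    places on input j (one linear GVF per row).
    """
    order = np.argsort(-np.abs(gvf_weights), axis=1)
    return np.sort(order[:, :k], axis=1)


def make_subnetworks(n_neighborhoods, k, n_hidden, rng):
    """Draw fixed random weights for one small dense layer per neighborhood."""
    return rng.standard_normal((n_neighborhoods, n_hidden, k)) / np.sqrt(k)


def features(obs, neighborhoods, subnet_weights):
    """Feed each neighborhood's inputs through its own random subnetwork."""
    local = obs[neighborhoods]  # shape: (n_neighborhoods, k)
    hidden = np.einsum("nhk,nk->nh", subnet_weights, local)
    return np.tanh(hidden).ravel()  # concatenated nonlinear features


def td0_update(v_weights, phi, reward, phi_next, gamma, step_size):
    """One incremental TD(0) step for a linear main value function (assumed)."""
    td_error = reward + gamma * (v_weights @ phi_next) - v_weights @ phi
    return v_weights + step_size * td_error * phi, td_error
```

In this sketch the connectivity is sparse because each subnetwork sees only k inputs rather than the full observation. Calling `form_neighborhoods` again as the GVF weights change would be one way to adapt the architecture online.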
