floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
Abstract
A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models, which enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL), which typically represent value functions in a monolithic fashion. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it with techniques from flow-matching, typically used in generative modeling. This velocity field is trained via TD-learning, bootstrapping from values produced by a target flow that is computed by running multiple steps of numerical integration. By appropriately setting the number of integration steps, floq allows finer-grained control and scaling of Q-function capacity than monolithic architectures. Across a suite of 50 challenging offline RL and online fine-tuning tasks, floq demonstrates superior performance, improving by ~2x on hard tasks. floq also scales capacity far better than standard Q-function architectures, highlighting the potential of iterative computation for value learning.
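To make the computation pattern described above concrete, the following is a minimal, hypothetical sketch in PyTorch of a Q-function obtained by numerically integrating a learned velocity field, with a TD target produced by integrating a separate target velocity field. The class and function names (VelocityField, integrate_q, td_flow_matching_loss), the Euler integrator, the straight-line interpolant, and all hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: names, shapes, and the flow-matching objective
# here are assumptions for exposition, not the paper's actual code.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predicts the flow velocity for a scalar 'value particle' x at flow
    time t, conditioned on a state-action pair."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, x, t):
        # obs: (B, obs_dim), act: (B, act_dim), x: (B, 1), t: (B, 1) in [0, 1]
        return self.net(torch.cat([obs, act, x, t], dim=-1))


def integrate_q(vf, obs, act, num_steps=8):
    """Euler integration of the velocity field from x(0) = 0 to x(1),
    treating the endpoint as the Q-estimate. More integration steps mean
    more iterative computation spent on each value estimate."""
    batch = obs.shape[0]
    x = torch.zeros(batch, 1)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt)
        x = x + dt * vf(obs, act, x, t)
    return x


def td_flow_matching_loss(vf, target_vf, batch, gamma=0.99, num_steps=8):
    """Train the velocity field to transport samples toward the TD target
    r + gamma * Q_target(s', a'), where Q_target comes from integrating the
    target flow (assumed batch layout: rew/done shaped (B, 1))."""
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        next_q = integrate_q(target_vf, next_obs, next_act, num_steps)
        td_target = rew + gamma * (1.0 - done) * next_q
    # Flow-matching-style regression: sample a time t, form the straight-line
    # interpolant between a noise sample x0 and the TD target, and match the
    # constant velocity (td_target - x0) at that point.
    x0 = torch.randn_like(td_target)
    t = torch.rand_like(td_target)
    x_t = (1.0 - t) * x0 + t * td_target
    v_pred = vf(obs, act, x_t, t)
    return ((v_pred - (td_target - x0)) ** 2).mean()
```

In this sketch, capacity is scaled simply by increasing `num_steps`, which reuses the same velocity network for more integration steps rather than widening or deepening a monolithic Q-network.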