Workshop: Foundation Models for Decision Making

Scaling Offline Q-Learning with Vision Transformers

Yingjie Miao · Jordi Orbay · Rishabh Agarwal · Aviral Kumar · George Tucker · Aleksandra Faust

[ Project Page ]
presentation: Foundation Models for Decision Making
Fri 15 Dec 6:15 a.m. PST — 3:30 p.m. PST


Offline RL methods such as conservative Q-learning (CQL) have been shown to scale favorably for training generalist agents with a ResNet backbone. Recent vision and natural language processing research shows that transformer-based models scale more favorably than domain-specific models with strong inductive biases (such as convolutional and recurrent neural networks). In this paper, we investigate how well Vision Transformers (ViTs) serve as backbones for CQL when training single-game agents. We enhance the ViT for image-based RL by introducing spatio-temporal attention layers. We further investigate the impact of various embedding sequence aggregation methods on ViT performance and demonstrate that the prevalent mean pooling aggregation is suboptimal. Overall, our modified ViT outperforms the standard ViT in the single-game Atari setting.
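To make "embedding sequence aggregation" concrete, the following is a minimal sketch of three common ways to reduce a sequence of token embeddings to a single feature vector. The abstract does not say which alternatives to mean pooling were evaluated, so the function names (`mean_pool`, `first_token`, `flatten`) and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
from typing import List

# A token sequence: one embedding (list of floats) per token.
# Conceptual shape: (sequence_length, embedding_dim).
Tokens = List[List[float]]


def mean_pool(tokens: Tokens) -> List[float]:
    """Average the token embeddings over the sequence axis."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]


def first_token(tokens: Tokens) -> List[float]:
    """Use one designated token (for example, a class token placed first)."""
    return list(tokens[0])


def flatten(tokens: Tokens) -> List[float]:
    """Concatenate all token embeddings, preserving each token's position."""
    return [x for tok in tokens for x in tok]


# Hypothetical toy input: the same two embeddings in opposite order.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.0, 1.0], [1.0, 0.0]]

print(mean_pool(a) == mean_pool(b))  # True: the pooled vectors are identical
print(flatten(a) == flatten(b))      # False: concatenation keeps the order
```

The toy comparison shows one property of mean pooling: the pooled output does not record which position an embedding came from. This is offered only as intuition for why the choice of aggregation can matter, not as the explanation given in the paper.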
