
Workshop: Foundation Models for Decision Making

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

William Chen · Oier Mees · Aviral Kumar · Sergey Levine


Intelligent beings have the ability to quickly learn new behaviors and tasks by leveraging background world knowledge. This stands in contrast to most agents trained with reinforcement learning (RL), which typically learn behaviors from scratch. Therefore, we would like to endow RL agents with a similar ability to leverage contextual prior information. To this end, we propose a novel approach that uses the vast amounts of general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using such models as sources of *promptable representations*: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex RL tasks in Minecraft. We find that policies trained on promptable embeddings significantly outperform equivalent policies trained on generic, non-promptable image encoder features. Moreover, we show that promptable representations extracted from general-purpose VLMs outperform both domain-specific representations and instruction-following methods. In ablations, we find that both VLM promptability and text generation are important for yielding good representations for RL. Finally, we give a simple method for evaluating and optimizing the prompts used by our approach for a given task without running expensive RL trials, ensuring that the approach extracts task-relevant semantic features from the VLM.
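The core idea above, conditioning a policy on a VLM embedding that is itself conditioned on a task-context prompt, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `vlm_promptable_embedding` function is a hypothetical stand-in (a deterministic random projection) for the hidden state a real pre-trained VLM would produce given an image and a prompt, and the policy head is a plain linear layer over that embedding.

```python
import zlib
import numpy as np

def vlm_promptable_embedding(image: np.ndarray, prompt: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a pre-trained VLM's promptable representation.

    In the actual approach, this would be an internal embedding of a VLM
    conditioned on both the visual observation and a task-context prompt.
    Here we mock it with a deterministic pseudo-random projection keyed on
    (image, prompt), so the sketch runs without any VLM dependency.
    """
    seed = zlib.crc32(image.tobytes() + prompt.encode("utf-8"))
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

class PromptConditionedPolicy:
    """A linear policy head over the promptable representation (illustrative).

    In practice this head would be trained with an RL algorithm; here the
    weights are just randomly initialized to show the data flow.
    """

    def __init__(self, embed_dim: int, n_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_actions, embed_dim)) * 0.01

    def act(self, image: np.ndarray, prompt: str) -> int:
        # Observation -> promptable VLM embedding -> action logits -> action.
        z = vlm_promptable_embedding(image, prompt, dim=self.W.shape[1])
        logits = self.W @ z
        return int(np.argmax(logits))

# Example: the same observation can yield different behavior under different
# task prompts, because the prompt changes the representation the policy sees.
policy = PromptConditionedPolicy(embed_dim=64, n_actions=8)
obs = np.zeros((16, 16, 3), dtype=np.uint8)  # placeholder Minecraft frame
a1 = policy.act(obs, "Task: chop down a tree. Is a tree visible?")
a2 = policy.act(obs, "Task: find a cave. Is a cave entrance visible?")
```

The key point the sketch captures is that the prompt is part of the observation pipeline, not just a one-off instruction: changing the prompt changes the semantic features the (mocked) VLM exposes to the policy.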
