Foundation Models for Decision Making

Mengjiao (Sherry) Yang · Yilun Du · Jack Parker-Holder · Siddharth Karamcheti · Igor Mordatch · Shixiang (Shane) Gu · Ofir Nachum

Room 391
Abstract Workshop Website
Sat 3 Dec, 6:50 a.m. PST


Humans acquire vision, language, and decision making abilities through years of experience, arguably corresponding to millions of video frames, audio clips, and interactions with the world. Following this data-driven approach, recent foundation models trained on large and diverse datasets have demonstrated emergent capabilities and fast adaptation to a wide range of downstream vision and language tasks (e.g., BERT, DALL-E, GPT-3, CLIP). Meanwhile in the decision making and reinforcement learning (RL) literature, foundation models have yet to fundamentally shift the traditional paradigm in which an agent learns from its own or others’ collected experience, typically on a single-task and with limited prior knowledge. Nevertheless, there has been a growing body of foundation-model-inspired research in decision making that often involves collecting large amounts of interactive data for self-supervised learning at scale. For instance, foundation models such as BERT and GPT-3 have been applied to modeling trajectory sequences of agent experience, and ever-larger datasets have been curated for learning multimodel, multitask, and generalist agents. These works demonstrate the potential benefits of foundation models on a broad set of decision making applications such as autonomous driving, healthcare systems, robotics, goal-oriented dialogue, robotics, and recommendation systems.

Despite early signs of success, foundation models for decision making remain largely underexplored, underutilized, and lacking solid empirical and theoretical grounding. The challenges faced by existing research are as follows:
1. Many traditional decision making benchmarks are (near-)Markovian (i.e., historyless), and this brings the value of sequence modeling into question. The true power of foundation models may require more complex tasks.
2. Decision making tasks are composed of multi-modal data. At minimum, the states (observations), actions, and rewards of a task are each of different types. Moreover, across different tasks, states and actions can be highly distinct (image vs. text observations, discrete vs. continuous actions).
3. Unlike vision and language, decision making agents can further interact with the environment to collect additional experience in conjunction with learning on existing data. How such an interactive component should be integrated with foundation models is not clear.
4. There already exhibits a large gap between theory and practice in decision making. Hastily applying large models to decision making might create an even greater gap.

Goal of the workshop: The goal of this workshop is to bring together the decision making community and the foundation models community in vision and language to confront the challenges in decision making at scale. The workshop will span high-level discussions on how foundation models can help decision making (if at all) and low-level algorithmic differences of decision, vision, and language which might lead to both opportunities or challenges for applying foundation models to decision making. More specific topics will include but are not limited to:
1. Common or distinct properties of vision, language, and decision making tasks that reassure or challenge the value of foundation models in decision making.
2. Introduction or proposals for new benchmarks to facilitate better research for foundation models for decision making.
3. How decision making can benefit from techniques already popular for foundation models, such as autoregressive sequence models, diffusion models, contrastive pretraining, masked autoencoders, prompting, etc.
4. Lessons learned from developing engineering frameworks, datasets and benchmarks, and evaluation protocols for foundation models in vision and language, and how can the decision making community benefit from these lessons.
5. How foundation models relate to the theoretical foundations of sequential decision making.

Live content is unavailable. Log in and register to view live content

Timezone: America/Los_Angeles »


Log in and register to view live content