
Workshop: Foundation Models for Decision Making

Skill Acquisition by Instruction Augmentation on Offline Datasets

Ted Xiao · Harris Chan · Pierre Sermanet · Ayzaan Wahid · Anthony Brohan · Karol Hausman · Sergey Levine · Jonathan Tompson


In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Commonly, such methods learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models like CLIP have been applied to robotics in the form of learned representations and planners. Can these pretrained models also be used to cheaply impart internet-scale knowledge onto offline datasets, providing access to skills that were not reflected in the ground-truth labels? To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we use semi-supervised language labelling that leverages the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data, and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions than expensive human labelling, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain, enabling imitation learning policies to acquire new capabilities and generalize to 80 novel instructions unseen in the original dataset.
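The relabelling step described above can be sketched as scoring candidate instructions against an episode's visual embedding and keeping the best matches. The sketch below is a minimal illustration, not the paper's implementation: `relabel_episode`, the embeddings, and the candidate list are all hypothetical stand-ins (in DIAL, the embeddings would come from a CLIP-style vision-language model, here they are plain NumPy vectors).

```python
import numpy as np

def cosine_sim(query, batch):
    # Cosine similarity between one embedding and a batch of embeddings.
    query = query / np.linalg.norm(query)
    batch = batch / np.linalg.norm(batch, axis=1, keepdims=True)
    return batch @ query

def relabel_episode(frame_embedding, candidates, candidate_embeddings, k=2):
    """Return the k candidate instructions whose text embeddings best
    match the episode's image embedding (a CLIP-style score in DIAL)."""
    scores = cosine_sim(frame_embedding, candidate_embeddings)
    top = np.argsort(scores)[::-1][:k]
    return [(candidates[i], float(scores[i])) for i in top]

# Toy example with hand-made 2-D "embeddings".
frame = np.array([1.0, 0.0])
candidates = ["pick up the apple", "open the drawer", "move arm near apple"]
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
labels = relabel_episode(frame, candidates, embeddings)
```

The selected instruction-episode pairs would then be added back to the offline dataset before training the language-conditioned policy.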
