
Workshop: Foundation Models for Decision Making

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Pierre Sermanet · Tianli Ding · Jeffrey Zhao · Fei Xia · Debidatta Dwibedi · Keerthana Gopalakrishnan · Gabriel Dulac-Arnold · Sharath Maddineni · Nikhil Joshi · Pete Florence · Wei Han · Robert Baruch · Yao Lu · Suvir Mirchandani · Peng Xu · Pannag Sanketi · Karol Hausman · Izhak Shafran · Brian Ichter · Yuan Cao

[ Project Page ]
Presentation: Foundation Models for Decision Making
Fri 15 Dec 6:15 a.m. PST — 3:30 p.m. PST


We present a scalable, bottom-up, and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons, and that has 2.2x higher throughput than traditional narrow, top-down step-by-step collection. We collect realistic data by performing arbitrary user requests throughout three office buildings and using multiple embodiments (robot, human, human with grasping tool). With this data, we show that models trained on all embodiments perform better than ones trained on robot data only, even when evaluated solely on robot episodes. We explore the economics of collection costs and find that, for a fixed budget, it is beneficial to take advantage of cheaper human collection alongside robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA, containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even when imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa trained on our dataset that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings, with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a large amount of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks.
Thanks to video conditioning and dataset diversity, the model can be used as a general video value function (e.g. for success and affordance) in situations where actions need to be recognized rather than states, expanding capabilities and environment understanding for robots. Data and videos are available at
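Since every evaluation step is either completed autonomously or flagged as requiring human help, the single performance metric described above reduces to a simple ratio of intervened steps to total steps. A minimal sketch of that computation follows; the episode representation and function name here are our own assumptions for illustration, not the paper's actual evaluation code:

```python
def intervention_rate(episodes):
    """Fraction of steps requiring human intervention across all episodes.

    episodes: list of episodes, each a list of per-step booleans
              (True = a human overseer had to intervene at that step).
    Returns a value in [0, 1]; lower is better.
    """
    total_steps = sum(len(ep) for ep in episodes)
    interventions = sum(sum(ep) for ep in episodes)
    return interventions / total_steps if total_steps else 0.0

# Hypothetical example: two supervised long-horizon episodes.
episodes = [
    [False, False, True, False],                # 1 intervention in 4 steps
    [False, True, False, False, False, True],   # 2 interventions in 6 steps
]
print(intervention_rate(episodes))  # 3 interventions / 10 steps = 0.3
```

Because intervention allows every task to run to completion, this one number summarizes performance while keeping the system deployable under human oversight even when the model is imperfect.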
