Oral in Workshop: Foundation Models for Decision Making

Vision-and-Language Navigation in Real World using Foundation Models

Chengguang Xu ⋅ Hieu T. Nguyen ⋅ Christopher Amato ⋅ Lawson Wong


Abstract

As mobile robots become ubiquitous, they will occasionally encounter unseen environments. Equipping mobile robots with the ability to follow language instructions will improve decision-making efficiency in previously unseen scenarios. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation and neglect the complexity of the real world. Directly transferring SOTA navigation policies learned in simulation to the real world is challenging due to the visual domain gap and the absence of prior knowledge about unseen environments. In this work, we propose a novel navigation framework that addresses the VLN task in the real world using powerful foundation models. Specifically, the proposed framework includes four key components: (1) a large language model (LLM)-based instruction parser that converts a language instruction into a sequence of pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a spatial and semantic map of the unseen environment using large visual-language models (VLMs), (3) a language indexing-based localizer that grounds each macro-action description to a waypoint location on the map, and (4) a pre-trained DD-PPO-based local controller that predicts the action. Evaluated on an Interbotix LoCoBot WX250 in an unseen lab environment, without any fine-tuning, our framework significantly outperforms the SOTA VLN baseline in the real world.
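The data flow between the four components can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`parse_instruction`, `SemanticMap`, `localize`, `local_controller`, `plan_waypoints`) is hypothetical, and each learned model is replaced by a simple stand-in. A regular-expression split stands in for the LLM parser, a hand-filled dictionary for the VLM-built map, word overlap for language indexing, and a heading-based rule for the DD-PPO controller. The action names are likewise assumed.

```python
import math
import re
from dataclasses import dataclass, field


@dataclass
class SemanticMap:
    """Toy stand-in for the online visual-language map: landmark label -> (x, y)."""
    landmarks: dict = field(default_factory=dict)

    def add(self, label, xy):
        self.landmarks[label] = xy


def parse_instruction(instruction):
    """Stand-in for the LLM-based parser: split text into macro-action descriptions."""
    parts = re.split(r"\s*(?:,|\band then\b|\bthen\b)\s*", instruction.lower())
    cleaned = [p.strip(" .") for p in parts]
    return [p for p in cleaned if p]


def localize(description, semantic_map):
    """Stand-in for language indexing: pick the landmark sharing the most words."""
    words = set(description.split())
    best_label, best_score = None, 0
    for label in semantic_map.landmarks:
        score = len(words & set(label.split()))
        if score > best_score:
            best_label, best_score = label, score
    # None signals that no landmark on the map matches the description.
    return semantic_map.landmarks[best_label] if best_label else None


def local_controller(pose, waypoint, reach=0.2, heading_tol=math.radians(15)):
    """Stand-in for the DD-PPO controller: one discrete action toward the waypoint."""
    x, y, theta = pose
    dx, dy = waypoint[0] - x, waypoint[1] - y
    if math.hypot(dx, dy) < reach:
        return "stop"
    err = math.atan2(dy, dx) - theta
    err = math.atan2(math.sin(err), math.cos(err))  # wrap to [-pi, pi]
    if err > heading_tol:
        return "turn_left"
    if err < -heading_tol:
        return "turn_right"
    return "move_forward"


def plan_waypoints(instruction, semantic_map):
    """Chain parser and localizer: one (description, waypoint) pair per macro-action."""
    return [(d, localize(d, semantic_map)) for d in parse_instruction(instruction)]


if __name__ == "__main__":
    env_map = SemanticMap()
    env_map.add("sofa", (2.0, 0.0))
    env_map.add("door", (2.0, 3.0))
    for description, waypoint in plan_waypoints(
        "Go to the sofa, then stop at the door.", env_map
    ):
        action = local_controller((0.0, 0.0, 0.0), waypoint)
        print(description, waypoint, action)
```

The sketch only captures how the stages hand data to one another: instruction text becomes macro-action descriptions, each description is grounded to a waypoint on the map, and the controller turns the current pose and waypoint into an action. In the framework described above, each stage is backed by a foundation model or a pre-trained policy rather than these hand-written rules.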
