Workshop
6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models
Dhruv Shah · Paula Wulkop · Claas Voelcker · Georgia Chalvatzaki · Alex Bewley · Hamidreza Kasaei · Ransalu Senanayake · Julien PEREZ · Jonathan Tompson
Hall B2 (level 1)
The workshop focuses on the intersection of machine learning (ML) and robotics, under this year's focus topic: “Pretraining, Fine-Tuning, and Generalization with Large Scale Models.” Embodied AI and robotics pose unique challenges and opportunities for utilizing large pre-trained models. We seek to host a diverse set of views and approaches from across the robotics domain and dive deep into questions such as: What sources of data can be used for training large models in robotics? What role should pre-training play in robotics pipelines? How far can pre-trained models generalize when faced with novel tasks and environments? What is currently missing from the pre-training paradigm for embodied systems?
Schedule
Sat 6:15 a.m. - 6:20 a.m. | Opening Remarks (Presentation)
Sat 6:20 a.m. - 6:45 a.m. | Keynote: Masha Itkina (Talk)
Affiliation: Research Scientist at the Toyota Research Institute. Research focus: uncertainty-aware learning algorithms for robots in human environments, focusing on the self-driving and assistive robotics domains.
Sat 6:45 a.m. - 7:10 a.m. | Keynote: Jesse Thomason (Talk)
Affiliation: Assistant Professor at the University of Southern California (USC). Research focus: bringing together natural language processing and robotics to connect language to the world (RoboNLP). He is interested in connecting language to agent perception and action, and in lifelong learning through interaction.
Sat 7:10 a.m. - 7:35 a.m. | Keynote: Dhruv Batra and Arjun Majumdar (Talk)
Affiliation: Dhruv: Associate Professor in the School of Interactive Computing at Georgia Tech and a Research Director leading the Embodied AI and robotics efforts on the Fundamental AI Research (FAIR) team at Meta. Arjun: PhD student at Georgia Tech. Research focus: the intersection of machine learning and computer vision, with forays into robotics and natural language processing.
Sat 7:35 a.m. - 8:00 a.m. | Keynote: Deepak Pathak (Talk)
Affiliation: Assistant Professor at Carnegie Mellon University in the School of Computer Science. Research focus: artificial intelligence at the intersection of computer vision, machine learning, and robotics, with the ultimate goal of building agents with a human-like ability to generalize in real and diverse environments.
Sat 8:00 a.m. - 9:00 a.m. | Poster Session + Robot Demos I (Poster Session)
Sat 9:00 a.m. - 9:40 a.m. | Oral Spotlights (Presentation)
11:00-11:05: MultiReAct: Multimodal Tools Augmented Reasoning-Acting Traces for Embodied Agent Planning; Zhouliang Yu, Fu, Mu, Wang, Shao, Yang
11:06-11:11: How to Prompt Your Robot: A PromptBook for Manipulation Skills with Code as Policies; Montse Gonzalez Arenas, Xiao, Singh, Jain, Ren, Vuong, Varley, Herzog, Leal, Kirmani, Sadigh, Sindhwani, Rao, Liang, Zeng
11:12-11:17: RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot; Hao-Shu Fang, Fang, Tang, Liu, Wang, Wang, Zhu, Lu
11:18-11:23: Exploitation-Guided Exploration for Semantic Embodied Navigation; Justin Wasserman, Chowdhary, Gupta, Jain
11:24-11:29: Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning; Yang, Sobol Mark, Vu, Archit Sharma, Bohg, Finn
11:30-11:40: Q&A
Sat 9:40 a.m. - 10:10 a.m. | Panel: How much are physical robots still needed in current robot learning research? (Panel)
Panelists: Deepak Pathak, Dhruv Batra, Montse Gonzalez Arenas, Andrey Kolobov
Sat 10:10 a.m. - 11:30 a.m. | Lunch Break (Break)
Sat 11:30 a.m. - 11:35 a.m. | Keynote: Suraj Nair (Talk)
Affiliation: Research Scientist on the ML Research Team at the Toyota Research Institute. Research focus: the intersection of machine learning, robotics, and computer vision, with a focus on leveraging language, video, and robot data to train foundation models for embodied AI.
Sat 11:55 a.m. - 12:20 p.m. | Keynote: Matt Barnes (Talk)
Affiliation: Google Research. Research focus: currently working on applied health and safety AI; has published in computer vision, time-series forecasting, imitation learning, clustering, cross-validation techniques, and reinforcement learning.
Sat 12:20 p.m. - 12:45 p.m. | Keynote: Keerthana Gopalakrishnan and Montserrat Gonzalez Arenas (Talk)
Affiliation: Google Brain. Research focus: how to scale robotics and build general-purpose intelligence in the physical world.
Sat 12:45 p.m. - 2:15 p.m. | Poster Session + Robot Demos II (Poster Session)
Sat 2:15 p.m. - 3:15 p.m. | Debate: Scaling models and data size is sufficient for deploying robots in the real world (Debate)
Debaters: Suraj Nair, Jesse Thomason, Masha Itkina, Fei Xia
Sat 3:15 p.m. - 3:30 p.m. | Closing Remarks & Awards (Presentation)
- Decomposing the Generalization Gap in Imitation Learning for Visual Robotic Manipulation (Poster)
What makes generalization hard for imitation learning in visual robotic manipulation? This question is difficult to approach at face value, but the environment from the perspective of a robot can often be decomposed into enumerable factors of variation, such as the lighting conditions or the placement of the camera. Empirically, generalization to some of these factors has presented a greater obstacle than others, but existing work sheds little light on precisely how much each factor contributes to the generalization gap. Towards an answer to this question, we study imitation learning policies in simulation and on a real-robot language-conditioned manipulation task to quantify the difficulty of generalization to different (sets of) factors. We also design a new simulated benchmark of 19 tasks with 11 factors of variation to facilitate more controlled evaluations of generalization. From our study, we determine an ordering of factors based on generalization difficulty that is consistent across simulation and our real robot setup.
Annie Xie · Lisa Lee · Ted Xiao · Chelsea Finn
- Pre-Trained Binocular ViTs for Image-Goal Navigation (Poster)
Most recent work in visual goal-oriented navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact map-like representations that generalize to unseen environments and high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is given as an exemplar image (Image Goal), as the perception module needs to learn a comparison strategy, which requires solving an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem using two pretext tasks, which serve as a prior for what we argue is one of the main bottlenecks in perception: wide-baseline relative pose estimation and visibility prediction in complex scenes. Our first pretext task, cross-view completion, is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and localization directly. We propose a new dual encoder making use of a binocular ViT model. Experiments show significant improvements in Image Goal navigation performance.
Guillaume Bono · Leonid Antsfeld · Boris Chidlovskii · Philippe Weinzaepfel · Christian Wolf
- Sample-Efficient Online Imitation Learning using Pretrained Behavioural Cloning Policies (Poster)
Recent advances in robot learning have been enabled by learning rich generative and recurrent policies from expert demonstrations, such as human teleoperation. These policies are capable of solving many complex tasks by accurately modelling human behaviour, which may be multimodal and non-Markovian. However, this imitation learning approach of behavioural cloning (BC) is limited to being offline, which increases the requirement for large expert demonstration datasets and does not enable the policy to learn from its own experience. In this work, we review the recent imitation learning algorithm coherent soft imitation learning (CSIL) and outline how it could be applied to more complex policy architectures. CSIL demonstrates that inverse reinforcement learning can be achieved using only a behaviour cloning policy, which means that its learned reward can be used to further improve a BC policy using additional online interactions. However, CSIL has only been demonstrated using simple feedforward network policies, so we discuss how such an imitation learning algorithm could be applied to more complex policy architectures, such as those including transformers and diffusion models.
Joe Watson · Jan Peters
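The key mechanism here, recovering a reward from a behavioural-cloning policy, can be sketched compactly. Below is a minimal illustration assuming a Gaussian BC policy and a uniform action prior on [-1, 1]; the network sizes, the prior, and the scale `alpha` are illustrative assumptions rather than CSIL's published construction.

```python
# Sketch: reward = alpha * (log pi_BC(a|s) - log prior(a)), built from a
# pretrained BC policy and then usable for online RL fine-tuning.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianBCPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim),
        )
        self.act_dim = act_dim

    def dist(self, obs: torch.Tensor) -> Normal:
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5, 2).exp())

def csil_style_reward(policy: GaussianBCPolicy, obs, act, alpha=1.0):
    """Uniform prior over actions in [-1, 1]^d gives log prior = -d * log 2."""
    log_prob = policy.dist(obs).log_prob(act).sum(-1)
    log_prior = -policy.act_dim * torch.log(torch.tensor(2.0))
    return alpha * (log_prob - log_prior)
```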
- DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models (Poster)
We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://sites.google.com/view/dinobot/.
Norman Di Palo · Edward Johns
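The retrieval step described above reduces to a nearest-neighbour lookup in feature space. A minimal sketch follows, with `extract_dino_features` as a hypothetical stand-in for a real DINO ViT forward pass (e.g. its image-level embedding); the subsequent alignment stage is not shown.

```python
# Sketch: cosine-similarity retrieval of the most visually similar
# demonstrated object, given precomputed DINO features per demo object.
import numpy as np

def extract_dino_features(image: np.ndarray) -> np.ndarray:
    raise NotImplementedError("plug in a DINO ViT here, e.g. its image embedding")

def retrieve_demo(novel_image, demo_features: np.ndarray, demo_ids: list):
    """demo_features: (N, D) array, one row per demonstrated object."""
    query = extract_dino_features(novel_image)
    sims = demo_features @ query / (
        np.linalg.norm(demo_features, axis=1) * np.linalg.norm(query) + 1e-8)
    best = int(np.argmax(sims))          # most similar demonstrated object
    return demo_ids[best], sims[best]
```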
- MultiReAct: Multimodal Tools Augmented Reasoning-Acting Traces for Embodied Agent Planning (Poster & Spotlight)
Large Language Models (LLMs) have demonstrated impressive proficiency in tasks involving simple reasoning. However, they face significant challenges when confronted with longer-horizon tasks described in abstract instructions. These challenges stem from two main limitations. Firstly, text-only LLMs struggle to cope with the demands of complex embodied tasks that require nuanced multimodal reasoning. Secondly, LLMs encounter difficulties in recognizing and autonomously recovering from intermediate execution failures. To overcome these limitations and enhance the planning capabilities of LLMs in embodied scenarios, we propose a novel approach called MultiReAct. Our framework makes the following contributions: We utilize a parameter-efficient adaptation of a pre-trained visual language model, enabling it to tackle embodied planning tasks by converting visual demonstrations into sequences of actionable language commands. By leveraging CLIP as a reward model, we identify instances of sub-instruction execution failure, significantly increasing the success rate in achieving final objectives. We introduce an adaptable paradigm for embodied planning through in-context learning from demonstration, independent of the specific Visual Language Model (VLM) and low-level actor. Our framework supports two distinct low-level actors: an imitation learning agent and a code generation-based actor. We apply the MultiReAct framework to a diverse set of long-horizon planning tasks and demonstrate superior performance compared to previous LLM-based methods. The extensive experimental results underscore the effectiveness of our approach in addressing long-horizon embodied planning.
Zhouliang Yu · Jie Fu · Yao Mu · Chenguang Wang · Lin Shao · Yaodong Yang
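One concrete piece of the pipeline, using CLIP as a reward model to flag sub-instruction execution failure, can be sketched with an off-the-shelf CLIP. The checkpoint name and threshold below are illustrative assumptions, not the paper's settings.

```python
# Sketch: score how well the current camera frame matches the sub-instruction
# text with CLIP, and flag execution failure below a threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subtask_succeeded(frame: Image.Image, sub_instruction: str,
                      threshold: float = 25.0) -> bool:
    inputs = processor(text=[sub_instruction], images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        score = model(**inputs).logits_per_image.item()  # image-text similarity
    return score >= threshold
```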
- Reinforcement-learning robotic sailboats: simulator and preliminary results (Poster)
This work focuses on the main challenges and problems in developing a virtual oceanic environment that reproduces real experiments using Unmanned Surface Vehicle (USV) digital twins. We introduce the key features for building virtual worlds with Reinforcement Learning (RL) agents for autonomous navigation and control in mind. The main problems concern the definition of the simulation equations (physics and mathematics), their effective implementation, and how to include strategies for simulated control and perception (sensors) to be used with RL. We present the modeling and implementation steps and challenges required to create a functional digital twin based on a real robotic sailing vessel. The application is immediate for developing navigation algorithms based on RL to be applied to real boats.
Eduardo Vasconcellos · Ronald M. Sampaio · ANDRE PAULO ARAUJO · Esteban CLUA · philippe preux · Luiz Marcos Garcia Goncalves · Luis Martí
- Learning to Act from Actionless Videos through Dense Correspondences (Poster)
In this work, we present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments from a few video demonstrations, without using any action annotations. Our method leverages images as a task-agnostic representation, encoding both state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that “hallucinate” a robot executing actions, in combination with dense correspondences between frames, our approach can infer the closed-form action to execute in an environment without the need for any explicit action labels. This unique capability allows us to train the policy solely from RGB videos and deploy learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies for table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day.
Po-Chen Ko · Jiayuan Mao · Yilun Du · Shao-Hua Sun · Josh Tenenbaum
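A rough sketch of the correspondence-to-action step: given matched keypoints between the current frame and the next synthesized frame, a closed-form rigid transform can be recovered with a Procrustes/Kabsch solve. The 2D formulation here is an assumption for illustration; the paper's exact recovery procedure may differ.

```python
# Sketch: recover the rigid motion (rotation + translation) implied by
# matched keypoints between two frames, via the Kabsch algorithm.
import numpy as np

def rigid_transform_from_matches(src: np.ndarray, dst: np.ndarray):
    """src, dst: (N, 2) matched keypoints. Returns R (2x2) and t (2,)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:            # keep a proper rotation (no reflection)
        Vt[-1] *= -1
        R = (U @ Vt).T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t
```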
- EvIL: Evolution Strategies for Generalisable Imitation Learning (Poster)
We present Evolutionary Imitation Learning (EvIL), a general approach to imitation learning (IL) able to predict agent behaviour across changing environment dynamics. In EvIL, we use Evolution Strategies to jointly meta-optimise the parameters (e.g. reward functions and dynamics) fed to an inner-loop reinforcement learning procedure. In effect, this allows us to inherit some of the benefits of the inverse reinforcement learning approach to imitation learning while being significantly more flexible. Specifically, our algorithm can be applied with any policy optimisation method, without requiring the reward or training procedure to be differentiable. Our method succeeds at recovering a reward that induces expert-like behaviour across a variety of environments, even when the environment dynamics are not fully known. We test our method's effectiveness and generalisation capabilities in several tabular environments and continuous control settings and find that it outperforms both offline approaches, like behavioural cloning, and traditional inverse reinforcement learning techniques.
Silvia Sapora · Chris Lu · Gokul Swamy · Yee Whye Teh · Jakob Foerster
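The outer loop the abstract describes, Evolution Strategies over reward parameters wrapped around an inner RL procedure, follows the standard ES gradient estimator. Below is a minimal sketch with `train_and_score` as a hypothetical stand-in for "train a policy on this reward, then measure similarity to the expert"; population size and step sizes are illustrative.

```python
# Sketch: one OpenAI-ES-style update on reward parameters, where each
# perturbed candidate is evaluated by running the (stubbed) inner RL loop.
import numpy as np

def train_and_score(reward_params: np.ndarray) -> float:
    raise NotImplementedError("inner-loop RL + similarity-to-expert metric")

def es_step(theta, sigma=0.1, lr=0.02, pop=16, rng=np.random.default_rng(0)):
    eps = rng.standard_normal((pop, theta.size))
    scores = np.array([train_and_score(theta + sigma * e) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize fitness
    return theta + lr / (pop * sigma) * eps.T @ scores
```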
- A Statistical Guarantee for Representation Transfer in Multitask Imitation Learning (Poster)
Transferring representations for multitask imitation learning has the potential to provide improved sample efficiency on new tasks, compared to learning from scratch. In this work, we provide a statistical guarantee indicating that we can indeed achieve improved sample efficiency on the target task when a representation is trained using sufficiently diverse source tasks. Our theoretical results can be readily extended to account for commonly used neural network architectures under realistic assumptions. We conduct empirical analyses that align with our theoretical findings on four simulated environments: in particular, leveraging more data from source tasks can improve sample efficiency on learning the new task.
Bryan Chan · James Bergstra · Karime Pereida
- CAJun: Continuous Adaptive Jumping using a Learned Centroidal Controller (Poster)
We present CAJun, a novel hierarchical learning and control framework that enables legged robots to jump continuously with adaptive jumping distances. CAJun consists of a high-level centroidal policy and a low-level leg controller. In particular, we use reinforcement learning (RL) to train the centroidal policy, which specifies the gait timing, base velocity, and swing foot position for the leg controller. The leg controller optimizes motor commands for the swing and stance legs according to the gait timing to track the swing foot target and base velocity commands. Additionally, we reformulate the stance leg optimizer in the leg controller to speed up policy training by an order of magnitude. Our system combines the versatility of learning with the robustness of optimal control. We show that after 20 minutes of training, CAJun can achieve continuous, long jumps with adaptive distances on a Go1 robot with small sim-to-real gaps. Moreover, the robot can jump across gaps with a maximum width of 70 cm, which is over 40% wider than existing methods.
Yuxiang Yang · Guanya Shi · Xiangyun Meng · Wenhao Yu · Tingnan Zhang · Jie Tan · Byron Boots
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks (Poster)
Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging, particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL is capable of solving 20+ challenging single- and multi-stage robotics tasks on four benchmarks at success rates of over 80% from raw visual input, outperforming language-based, classical, and end-to-end approaches. Video results and code are available at https://planseqlearn.github.io/.
Murtaza Dalal · Tarun Chiruvolu · Devendra Singh Chaplot · Russ Salakhutdinov
- TD-MPC2: Scalable, Robust World Models for Continuous Control (Poster)
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. Explore videos, models, data, code, and more at https://tdmpc2.com.
Nicklas Hansen · Hao Su · Xiaolong Wang
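The local trajectory optimization at the heart of TD-MPC-style control can be sketched with a generic cross-entropy-method planner in latent space (the actual algorithm uses an MPPI-style optimizer). `dynamics`, `reward`, and `value` stand in for the learned latent models; all sizes are illustrative.

```python
# Sketch: CEM planning in a learned latent space. Sample action sequences,
# roll them out through the latent dynamics, refit a Gaussian to the elites.
import numpy as np

def cem_plan(z0, dynamics, reward, value, horizon=5, samples=256,
             elites=32, iters=6, act_dim=6, rng=np.random.default_rng(0)):
    mu, std = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
    for _ in range(iters):
        acts = mu + std * rng.standard_normal((samples, horizon, act_dim))
        returns = np.zeros(samples)
        for i in range(samples):
            z = z0
            for t in range(horizon):
                returns[i] += reward(z, acts[i, t])
                z = dynamics(z, acts[i, t])
            returns[i] += value(z)              # bootstrap beyond the horizon
        elite = acts[np.argsort(returns)[-elites:]]
        mu, std = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]                                # execute only the first action
```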
- Reasoning with Latent Diffusion in Offline Reinforcement Learning (Poster)
Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from large static datasets, without the need for further environment interactions. This is especially critical for robotics, where online learning can be prohibitively expensive. However, a key challenge in offline RL lies in effectively stitching portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors arising due to a lack of support in the dataset. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes with multi-modal data. We show that the learned temporally-abstract latent space encodes richer task-specific information for offline RL tasks than raw state-actions. This improves credit assignment and facilitates faster reward propagation during Q-learning. Our method demonstrates state-of-the-art performance on the D4RL benchmarks, particularly excelling in long-horizon, sparse-reward tasks.
Siddarth Venkatraman · Shivesh Khaitan · Ravi Tej Akella · John Dolan · Jeff Schneider · Glen Berseth
- Knolling bot 2.0: Enhancing Object Organization with Self-supervised Graspability Estimation (Poster)
Building on recent advancements in transformer-based approaches for domestic robots performing 'knolling' (the art of organizing scattered items into neat arrangements), this paper introduces Knolling bot 2.0. Recognizing the challenges posed by piles of objects or items situated closely together, this upgraded system incorporates a self-supervised graspability estimation model. If objects are deemed ungraspable, an additional behavior is executed to separate the objects before knolling the table. By integrating this grasp prediction mechanism with existing visual perception and transformer-based knolling models, we demonstrate an advanced system capable of decluttering and organizing even more complex and densely populated table settings. Experimental evaluations demonstrate the effectiveness of this module, yielding a graspability prediction accuracy of 95.7%.
Yuhang Hu · Zhizhuo Zhang · Hod Lipson
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models (Poster)
Quan Vuong · Ajinkya Jain · Alex Bewley · Alexander Irpan · Alexander Khazatsky · Anant Rai · Anikait Singh · Antonin Raffin · Ayzaan Wahid · Beomjoon Kim · Bernhard Schölkopf · brian ichter · Cewu Lu · Charles Xu · Chelsea Finn · Chenfeng Xu · Cheng Chi · Chenguang Huang · Chuer Pan · Chuyuan Fu · Coline Devin · Danny Driess · Deepak Pathak · Dhruv Shah · Dieter Büchler · Dmitry Kalashnikov · Dorsa Sadigh · Edward Johns · Federico Ceola · Fei Xia · Freek Stulp · Gaoyue Zhou · Gaurav Sukhatme · Gautam Salhotra · Ge Yan · Giulio Schiavi · Hao Su · Hao-Shu Fang · Haochen Shi · Heni Ben Amor · Henrik Christensen · Hiroki Furuta · Homer Walke · Hongjie Fang · Igor Mordatch · Ilija Radosavovic · Isabel Leal · Jacky Liang · Jaehyung Kim · Jan Schneider · Jasmine Hsu · Jeannette Bohg · Jiajun Wu · Jialin Wu · Jianlan Luo · Jiayuan Gu · Jie Tan · Jitendra Malik · Jonathan Tompson · Jonathan Yang · Joseph Lim · João Silvério · Junhyek Han · Kanishka Rao · Karl Pertsch · Karol Hausman · Keegan Go · Keerthana Gopalakrishnan · Ken Goldberg · Kevin Zhang · Keyvan Majd · Krishan Rana · Krishnan Srinivasan · Lawrence Yunliang Chen · Lerrel Pinto · Liam Tan · Lionel Ott · Lisa Lee · Masayoshi TOMIZUKA · Michael Ahn · Mingyu Ding · Mohan Kumar Srirama · Mohit Sharma · Moo J Kim · Nicklas Hansen · Nicolas Heess · Nikhil Joshi · Niko Suenderhauf · Norman Di Palo · Nur Muhammad Shafiullah · Oier Mees · Oliver Kroemer · Pannag Sanketi · Paul Wohlhart · Peng Xu · Pierre Sermanet · Priya Sundaresan · Rafael Rafailov · Ran Tian · Ria Doshi · Roberto Martín-Martín · Russell Mendonca · Rutav Shah · Ryan Hoque · Ryan Julian · Samuel Bustamante · Sean Kirmani · Sergey Levine · Sherry Q Moore · Shikhar Bahl · Shivin Dass · Shuran Song · Sichun Xu · Siddhant Haldar · Simeon Adebola · Simon Guist · Soroush Nasiriany · Stefan Schaal · Stefan Welker · Stephen Tian · Sudeep Dasari · Suneel Belkhale · Takayuki Osa · Tatsuya Harada · Tatsuya Matsushima · Ted Xiao · Tianhe Yu · Tianli Ding · Todor Davchev · Tony Zhao · Trevor Darrell · Vidhi Jain · Vincent Vanhoucke · Wei Zhan · Wenxuan Zhou · Wolfram Burgard · Xi Chen · Xiaolong Wang · Xinghao Zhu · Xuanlin Li · Yao Lu · Yevgen Chebotar · Yifan Zhou · Yifeng Zhu · Yonatan Bisk · Yoonyoung Cho · Youngwoon Lee · Yuchen Cui · Yueh-Hua Wu · Yujin Tang · Yuke Zhu · Yunzhu Li · Yusuke Iwasawa · Yutaka Matsuo · Zhuo Xu · Zichen Cui · Alexander Herzog · Abhishek Padalkar · Acorn Pooley · Anthony Brohan · Ben Burgess-Limerick · Christine Chan · Jeffrey Bingham · Jihoon Oh · Kendra Byrne · Kenneth Oslund · Kento Kawaharazuka · Maximilian Du · Mingtong Zhang · Naoaki Kanazawa · Travis Armstrong · Ying Xu · Yixuan Wang · Jan Peters
- T3GDT: Three-Tier Tokens to Guide Decision Transformer for Offline Meta Reinforcement Learning (Poster)
Offline meta-reinforcement learning (OMRL) aims to generalize an agent's knowledge from training tasks with offline data to a new unknown RL task with few demonstration trajectories. This paper proposes T3GDT: Three-Tier Tokens to Guide Decision Transformer for OMRL. First, our approach learns a global token from a task's demonstrations to summarize its transition dynamics and reward pattern. This global token specifies the task identity and is prepended as the first token to prompt the task's RL roll-out. Second, for each time step $t$, we learn adaptive tokens retrieved from the most relevant experiences in the demonstrations. These tokens are fused to improve action prediction at time step $t$. Third, we replace the lookup-table-based time embedding with a TimetoVec embedding that incorporates temporal neighborhood relationships into a better time representation for RL. Empirically, we compare T3GDT with prompt decision transformer variants and MACAW across five RL environments from the MuJoCo control and MetaWorld benchmarks.
Zhe Wang · Haozhu Wang · Yanjun Qi
- Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models (Poster & Spotlight)
We introduce Dream2Real, a robotics framework which integrates 2D vision-language models into a 3D object rearrangement method. The robot autonomously constructs a 3D NeRF-based representation of the scene, where objects can be rendered in novel arrangements. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world via pick-and-place. Real-world results show that this framework enables zero-shot rearrangement, avoiding the need to collect a dataset of example arrangements.
Ivan Kapelyukh · Yifei Ren · Ignacio Alzugaray · Edward Johns
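The selection loop the abstract describes is essentially render-and-score. Here is a minimal sketch, where `render_arrangement` and `vlm_score` are hypothetical stand-ins for the NeRF renderer and the VLM; the full system includes components (e.g. candidate pose generation and pick-and-place execution) not shown here.

```python
# Sketch: render the scene under each candidate object pose and keep the
# arrangement the VLM rates as best matching the user's instruction.
def best_arrangement(candidate_poses, instruction, render_arrangement, vlm_score):
    return max(candidate_poses,
               key=lambda pose: vlm_score(render_arrangement(pose), instruction))
```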
- Policy-Guided Diffusion (Poster)
Model-free methods for offline reinforcement learning typically suffer from value overestimation, resulting from generalization to out-of-sample state-action pairs. On the other hand, model-based methods must contend with compounding errors in transition dynamics, as the policy is rolled out using the learned model. As a solution, we propose policy-guided diffusion (PGD). Our method generates entire trajectories using a diffusion model, with an additional policy guidance term that biases samples towards the policy being trained. Evaluating PGD on the Adroit manipulation environment, we show that guidance dramatically increases trajectory likelihood under the target policy, without increasing model error. When training offline RL agents on purely synthetic data, our early results show that guidance leads to improvements in performance across datasets. We believe this approach is a step towards the training of offline agents on predominantly synthetic experience, minimizing the principal drawbacks of offline RL.
Matthew T Jackson · Michael Matthews · Cong Lu · Jakob Foerster · Shimon Whiteson
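The guidance term can be sketched as a classifier-guidance-style modification of the reverse process: add the gradient of the target policy's log-likelihood to each denoising step. This is a hedged sketch; `denoise_step` and `policy_log_prob` are stand-ins, and the scale `lam` and exactly where the gradient enters are illustrative assumptions rather than PGD's published update.

```python
# Sketch: one guided reverse-diffusion step over a trajectory tensor,
# biasing samples towards actions the target policy finds likely.
import torch

def guided_denoise_step(traj, t, denoise_step, policy_log_prob, lam=0.1):
    traj = traj.detach().requires_grad_(True)
    logp = policy_log_prob(traj).sum()          # log pi(actions | states) over traj
    grad = torch.autograd.grad(logp, traj)[0]
    with torch.no_grad():
        mean = denoise_step(traj, t)            # model's reverse-process mean
        return mean + lam * grad                # nudge samples towards the policy
```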
- $\texttt{PREMIER-TACO}$ is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss (Poster)
We introduce $\texttt{Premier-TACO}$, a novel multitask feature representation learning methodology aiming to enhance the efficiency of few-shot policy learning in sequential decision-making tasks. $\texttt{Premier-TACO}$ pretrains a general feature representation using a small subset of relevant multitask offline datasets, capturing essential environmental dynamics. This representation can then be fine-tuned to specific tasks with few expert demonstrations. Building upon the recent temporal action contrastive learning (TACO) objective, which obtains state-of-the-art performance in visual control tasks, $\texttt{Premier-TACO}$ additionally employs a simple yet effective negative example sampling strategy. This key modification ensures computational efficiency and scalability for large-scale multitask offline pretraining. Experimental results from both the DeepMind Control Suite and MetaWorld domains underscore the effectiveness of $\texttt{Premier-TACO}$ for pretraining visual representations, facilitating efficient few-shot imitation learning of unseen tasks. On the DeepMind Control Suite, $\texttt{Premier-TACO}$ achieves an average improvement of 101% in comparison to a carefully implemented learn-from-scratch baseline, and a 24% improvement compared with the most effective baseline pretraining method. Similarly, on MetaWorld, $\texttt{Premier-TACO}$ obtains an average improvement of 74% against learn-from-scratch and a 40% increase in comparison to the best baseline pretraining method.
Ruijie Zheng · Yongyuan Liang · Xiyao Wang · Shuang Ma · Hal Daumé III · Huazhe Xu · John Langford · Praveen Palanisamy · Kalyan Basu · Furong Huang
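The objective family the abstract builds on can be sketched as an InfoNCE loss that ties the current latent state plus an action sequence to the latent state several steps ahead. The single sampled negative per example mirrors the negative-sampling idea described above; the projection head, similarity measure, and temperature are illustrative assumptions.

```python
# Sketch: a temporal action-driven contrastive loss with one sampled
# negative per example, in the spirit of TACO-style objectives.
import torch
import torch.nn.functional as F

def taco_style_loss(z_t, u_t, z_tk, z_neg, proj, temp=0.1):
    """z_t: (B,D) state latents, u_t: (B,A) encoded action sequence,
    z_tk: (B,D) latents k steps ahead, z_neg: (B,D) one negative each."""
    anchor = proj(torch.cat([z_t, u_t], dim=-1))     # predict the future latent
    pos = F.cosine_similarity(anchor, z_tk) / temp
    neg = F.cosine_similarity(anchor, z_neg) / temp
    logits = torch.stack([pos, neg], dim=1)          # (B, 2)
    labels = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)           # positive is index 0
```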
- IG-Net: Image-Goal Network for Offline Visual Navigation on A Large-Scale Game Map (Poster)
Navigating vast and visually intricate gaming environments poses unique challenges, especially when agents are deprived of absolute positions and orientations during testing. This paper addresses the challenge of training agents in such environments using a limited set of offline navigation data and a more substantial set of offline position data. We introduce the Image-Goal Network (IG-Net), an innovative solution tailored for these challenges. IG-Net is designed as an image-goal-conditioned navigation agent, trained end-to-end, which directly outputs actions based on inputs without intermediary mapping steps. Furthermore, IG-Net harnesses position, path, and distance prediction to bolster representation learning, implicitly encoding spatial map information, an aspect overlooked in prior works. Results demonstrate IG-Net's potential in navigating large-scale gaming environments, providing both advancements in the field and tools for the broader research community.
Pushi Zhang · Baiting Zhu · Xin-Qiang Cai · Li Zhao · Masashi Sugiyama · Jiang Bian
- Robotic Task Generalization via Hindsight Trajectory Sketches (Poster)
Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface for communicating with robotic policies: they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.
Jiayuan Gu · Sean Kirmani · Paul Wohlhart · Yao Lu · Montserrat Gonzalez Arenas · Kanishka Rao · Wenhao Yu · Chuyuan Fu · Keerthana Gopalakrishnan · Zhuo Xu · Priya Sundaresan · Peng Xu · Hao Su · Karol Hausman · Chelsea Finn · Quan Vuong · Ted Xiao
- Hybrid Inverse Reinforcement Learning (Poster)
The inverse reinforcement learning approach to imitation learning is a double-edged sword. On one hand, it allows the learner to find policies that are robust to compounding errors. On the other hand, it requires that the learner repeatedly solve a computationally expensive reinforcement learning (RL) problem. Often, much of this computation is spent exploring parts of the state space the expert never visited and is therefore wasted. In this work, we propose using hybrid reinforcement learning to curtail this unnecessary exploration. More formally, we derive a reduction from inverse RL to hybrid RL that allows us to dramatically reduce interaction during the inner policy search loop while still maintaining a degree of robustness to compounding errors. Empirically, on a suite of continuous control tasks, we find that our approaches are far more sample efficient than standard inverse RL and several other baselines that require stronger assumptions.
Juntao Ren · Gokul Swamy · Steven Wu · J. Bagnell · Sanjiban Choudhury
- RoboAgent: Towards Sample Efficient Robot Manipulation with Semantic Augmentations and Action Chunking (Poster)
The grand aim of having a single robot that can manipulate arbitrary objects in diverse settings is at odds with the paucity of robotics datasets. Acquiring and growing such datasets is strenuous due to manual efforts, operational costs, and safety challenges. A path toward such a universal agent requires an efficient framework capable of generalization but within a reasonable data budget. In this paper, we develop an efficient framework (MT-ACT) for training universal agents capable of multi-task manipulation skills using (a) semantic augmentations that can rapidly multiply existing datasets and (b) action representations that can extract performant policies from small yet diverse multi-modal datasets without overfitting. In addition, reliable task conditioning and an expressive policy architecture enable our agent to exhibit a diverse repertoire of skills in novel situations specified using task commands. Using merely 7,500 demonstrations, we are able to train a single policy, RoboAgent, capable of 12 unique skills, and demonstrate its generalization over 38 tasks spread across common daily activities in diverse kitchen scenes. On average, RoboAgent outperforms prior methods by over 40% in unseen situations while being more sample efficient.
Homanga Bharadhwaj · Jay Vakil · Mohit Sharma · Abhinav Gupta · Shubham Tulsiani · Vikash Kumar
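Action chunking, one of the two ingredients named above, is easy to sketch: the policy predicts a short horizon of future actions at every step, and overlapping predictions for the same timestep are blended. The exponential weighting below follows ACT-style temporal ensembling, which is an assumption about MT-ACT's variant; the decay constant is illustrative.

```python
# Sketch: blend overlapping action-chunk predictions for timestep t.
import numpy as np

def temporal_ensemble(chunks: dict, t: int, decay: float = 0.01) -> np.ndarray:
    """chunks: {t_pred: (H, act_dim) chunk predicted at time t_pred}."""
    preds, weights = [], []
    for t_pred, chunk in chunks.items():
        if 0 <= t - t_pred < len(chunk):
            preds.append(chunk[t - t_pred])
            # ACT-style: older predictions get higher weight after normalization
            weights.append(np.exp(decay * (t - t_pred)))
    w = np.array(weights) / np.sum(weights)
    return (np.array(preds) * w[:, None]).sum(0)
```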
- Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment (Poster)
To succeed in the real world, robots must cope with situations that differ from those seen during training. We study the problem of adapting on-the-fly to such novel scenarios during deployment, by drawing upon a diverse repertoire of previously-learned behaviors. Our approach, RObust Autonomous Modulation (ROAM), introduces a mechanism based on the perceived value of pre-trained behaviors to select and adapt pre-trained behaviors to the situation at hand. Crucially, this adaptation process happens entirely within a single episode at test time, without any human supervision. We provide theoretical analysis of our selection mechanism and demonstrate that ROAM enables a robot to adapt rapidly to changes in dynamics both in simulation and on a real Go1 quadruped, even successfully moving forward with roller skates on its feet. Our approach adapts over 2x as efficiently as existing methods when facing a variety of out-of-distribution situations during deployment, by effectively choosing and adapting relevant behaviors on-the-fly.
Annie Chen · Govind Chada · Laura Smith · Archit Sharma · Zipeng Fu · Sergey Levine · Chelsea Finn
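The value-based selection mechanism described above admits a very small greedy reduction: each pre-trained behavior carries its own critic, and at every step the behavior whose critic rates the current state highest gets to act. ROAM's actual mechanism, including any test-time adaptation of the chosen behavior, is richer than this sketch.

```python
# Sketch: greedy value-based behavior selection at deployment time.
import numpy as np

def select_behavior(state, behaviors):
    """behaviors: list of (policy_fn, value_fn) pairs from pre-training."""
    values = np.array([value_fn(state) for _, value_fn in behaviors])
    policy_fn, _ = behaviors[int(np.argmax(values))]
    return policy_fn(state)                    # action from the chosen behavior
```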
- D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation (Poster)
Scene representation has been a crucial design choice in robotic manipulation systems. An ideal representation should be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields: dynamic 3D descriptor fields. These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from foundation models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to a wide range of robotic manipulation tasks in a zero-shot manner. Through extensive evaluation in both real-world scenarios and simulations, we demonstrate that D$^3$Fields are both generalizable and effective for zero-shot robotic manipulation tasks. In quantitative comparisons with state-of-the-art dense descriptors, such as Dense Object Nets and DINO, D$^3$Fields exhibit significantly better generalization abilities and manipulation accuracy.
Yixuan Wang · Zhuoran Li · Mingtong Zhang · Katherine Driggs-Campbell · Jiajun Wu · Fei-Fei Li · Yunzhu Li
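The descriptor-field lookup the abstract describes, projecting a 3D workspace point into each calibrated view and interpolating 2D features, can be sketched with a pinhole camera model. Intrinsics `K`, world-to-camera extrinsics `(R, t)`, and per-view feature maps are assumed given; averaging across views is an illustrative fusion choice, not necessarily the paper's.

```python
# Sketch: fuse a descriptor for an arbitrary 3D point from multi-view
# 2D feature maps via pinhole projection and bilinear interpolation.
import numpy as np

def bilinear(feat, u, v):
    """feat: (H, W, D) feature map; (u, v) continuous pixel coordinates."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1-du)*(1-dv)*feat[v0, u0] + du*(1-dv)*feat[v0, u0+1]
            + (1-du)*dv*feat[v0+1, u0] + du*dv*feat[v0+1, u0+1])

def fused_descriptor(point, views):
    """views: list of dicts with 'K', 'R', 't', 'feat' for each camera."""
    descs = []
    for cam in views:
        p_cam = cam['R'] @ point + cam['t']
        if p_cam[2] <= 0:                       # behind this camera: skip
            continue
        u, v, _ = cam['K'] @ (p_cam / p_cam[2])
        h, w = cam['feat'].shape[:2]
        if 0 <= u < w - 1 and 0 <= v < h - 1:
            descs.append(bilinear(cam['feat'], u, v))
    return np.mean(descs, axis=0) if descs else None
```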
- How to Prompt Your Robot: A PromptBook for Manipulation Skills with Code as Policies (Poster & Spotlight)
Large Language Models (LLMs) have demonstrated the ability to perform semantic reasoning, planning, and code writing for robotics tasks. However, most methods rely on pre-existing primitives (i.e., pick, open drawer), which heavily limits their scalability to new scenarios. Additionally, existing approaches like Code as Policies (CaP) rely on examples of robot code in the prompt to write code for new tasks, and assume that LLMs can infer task information, constraints, and API usage from examples alone. But examples can be costly, and too few or too many can bias the LLM in the wrong direction. Recent research has demonstrated that prompting LLMs with APIs and documentation enables code writing for successful zero-shot tool use. However, documenting robotics tasks and naively providing full robot APIs presents a challenge to context-length limits in LLMs. In this work, we introduce PromptBook, a recipe that combines LLM prompting paradigms (examples, APIs, documentation, and chain of thought) to generate code for planning a sorting task with a higher success rate than previous works. We further demonstrate that PromptBook enables LLMs to write code for new low-level manipulation primitives in a zero-shot manner: from picking diverse objects, to opening/closing drawers, to whisking and waving hello. We evaluate the new skills on a mobile manipulator, with an 83% success rate at picking, 50-71% at opening drawers, and 100% at closing them. Notably, the LLM is able to infer gripper orientation for grasping a drawer handle (z-axis aligned) vs. a top-down grasp (x-axis aligned). Finally, we provide guidelines to leverage human feedback and LLMs to write PromptBook prompts.
Montserrat Gonzalez Arenas · Ted Xiao · Sumeet Singh · Vidhi Jain · Allen Z. Ren · Quan Vuong · Jake Varley · Alexander Herzog · Isabel Leal · Sean Kirmani · Dorsa Sadigh · Vikas Sindhwani · Kanishka Rao · Jacky Liang · Andy Zeng
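The recipe named above, combining examples, APIs, documentation, and chain of thought, amounts to structured prompt assembly. The sketch below is a hypothetical illustration of that structure; the section markers and wording are not the paper's published prompt.

```python
# Sketch: assemble a PromptBook-style code-writing prompt from its four
# ingredients. All section headers and phrasing are illustrative.
def build_prompt(api_signatures, documentation, examples, task):
    sections = [
        "# Robot API\n" + "\n".join(api_signatures),
        "# Documentation\n" + documentation,
        "# Examples\n" + "\n\n".join(examples),
        "# Task\n" + task + "\nReason step by step, then write the robot code.",
    ]
    return "\n\n".join(sections)
```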
- World Model Based Sim2Real Transfer for Visual Navigation (Poster)
Sim2Real transfer has gained popularity because it helps transfer from inexpensive simulators to the real world. This paper presents a novel system that fuses components of a traditional World Model into a robust system, trained entirely within a simulator, that transfers zero-shot to the real world. To facilitate transfer, we use an intermediary representation based on Bird's Eye View (BEV) images. Our robot learns to navigate in a simulator by first learning to translate complex First-Person View (FPV) RGB images into BEV representations, then learning to navigate using those representations. When tested in the real world, the robot uses the perception model to translate FPV RGB images into embeddings that are used by the downstream policy. The incorporation of state-checking modules using anchor images and a Mixture Density LSTM not only interpolates uncertain and missing observations but also enhances the robustness of the model when exposed to the real-world environment. We trained the model using data collected with a differential-drive robot in the CARLA simulator. Our methodology's effectiveness is shown through the deployment of trained models on a real-world differential-drive robot. Lastly, we release a comprehensive codebase, dataset, and models for training and deployment, available to the public.
Kiran Lekkala · Chen Liu · Laurent Itti
- Robotic Offline RL from Internet Videos via Value-Function Pre-Training (Poster)
Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. For robotics applications, data remains limited, and video, the largest prior source of data available, offers observation-only experience without the action or reward annotations that robotic learning methods require, and so cannot easily be incorporated. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in policies that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Videos can be found at https://sites.google.com/view/v-ptr.
Chethan Bhateja · Derek Guo · Dibya Ghosh · Anikait Singh · Manan Tomar · Quan Vuong · Yevgen Chebotar · Sergey Levine · Aviral Kumar
- RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot (Poster & Spotlight)
A key challenge for robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots. Recent progress in one-shot imitation learning and robotic foundation models has shown promise in transferring trained policies to new tasks based on demonstrations. This feature is attractive for enabling robots to acquire new skills and improve their manipulative ability. However, due to limitations in the training dataset, the current focus of the community has mainly been on simple cases, such as push or pick-place tasks, relying solely on visual guidance. In reality, there are many complex skills, some of which may even require both visual and tactile perception to solve. This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception. To achieve this, we have collected a dataset comprising over 110,000 contact-rich robot manipulation sequences across diverse skills, contexts, robots, and camera viewpoints, all collected in the real world. Each sequence in the dataset includes visual, force, audio, and action information. Moreover, we also provide a corresponding human demonstration video and a language description for each robot sequence. We have invested significant effort in calibrating all the sensors and ensuring a high-quality dataset. The dataset is made publicly available.
Hao-Shu Fang · Hongjie Fang · Zhenyu Tang · Jirong Liu · Chenxi Wang · Junbo Wang · Haoyi Zhu · Cewu Lu
- Exploitation-Guided Exploration for Semantic Embodied Navigation (Poster & Spotlight)
Amid recent progress in embodied navigation, modular policies have emerged as a de facto framework. However, there is more to compositionality than the decomposition of the learning load into modular components. In this work, we investigate a principled way to syntactically combine these components. In particular, we propose Exploitation-Guided Exploration (XGX), where separate modules for exploration and exploitation come together in a novel and intuitive manner. We configure the exploitation module to take over in the deterministic final steps of navigation, i.e., when the goal becomes visible. Crucially, the exploitation module teacher-forces the exploration module and continues driving an overridden policy optimization. XGX, with effective decomposition and novel guidance, improves state-of-the-art performance on the challenging object navigation task from 70% to 73%. Finally, we show sim-to-real transfer to robot hardware, where XGX performs over two-fold better than the best baseline from simulation benchmarking. Project page: XGXvisnav.github.io
Justin Wasserman · Girish Chowdhary · Abhinav Gupta · Unnat Jain
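The hand-off rule described above is small enough to state directly: exploitation takes over for the deterministic final approach once the goal becomes visible. The sketch below shows only inference-time switching; the teacher-forcing of the exploration module during training is omitted, and all callables are assumed stand-ins.

```python
# Sketch: exploitation overrides exploration once the goal is visible.
def xgx_style_step(obs, goal_visible, explore_policy, exploit_policy):
    if goal_visible(obs):
        return exploit_policy(obs)   # deterministic last-mile navigation
    return explore_policy(obs)       # semantic exploration elsewhere
```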
- A$^2$Nav: Action-Aware Zero-Shot Robot Navigation Using Vision-Language Ability of Foundation Models (Poster)
We tackle the challenging task of zero-shot vision-and-language navigation (ZS-VLN), where an agent learns to follow complex path instructions without annotated data. We introduce A$^2$Nav, an action-aware ZS-VLN method leveraging foundation models such as GPT and CLIP. Our approach includes an instruction parser and an action-aware navigation policy. The parser breaks down complex instructions into action-aware sub-tasks, which are executed using the learned action-specific navigation policy. Extensive experiments show that A$^2$Nav achieves promising ZS-VLN performance and even surpasses some supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.
Peihao Chen · Xinyu Sun · Hongyan Zhi · Runhao Zeng · Thomas Li · Mingkui Tan · Chuang Gan
- LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion (Poster)
Imitation Learning (IL) holds great promise for enabling agile locomotion in embodied agents. However, many existing locomotion benchmarks primarily focus on simplified toy tasks, often failing to capture the complexity of real-world scenarios and steering research toward unrealistic domains. To advance research in IL for locomotion, we present a novel benchmark designed to facilitate rigorous evaluation and comparison of IL algorithms. This benchmark encompasses a diverse set of environments, including quadrupeds, bipeds, and musculoskeletal human models, each accompanied by comprehensive datasets, such as real noisy motion capture data, ground truth expert data, and ground truth sub-optimal data, enabling evaluation across a spectrum of difficulty levels. To increase the robustness of learned agents, we provide an easy interface for dynamics randomization and offer a wide range of partially observable tasks to train agents across different embodiments. Finally, we provide handcrafted metrics for each task and ship our benchmark with state-of-the-art baseline algorithms to ease evaluation and enable fast benchmarking.
Firas Al-Hafez · Davide Tateo · Jan Peters
- Swarm-GPT: Combining Large Language Models with Safe Motion Planning for Robot Choreography Design (Poster)
This paper presents Swarm-GPT, a system that integrates large language models (LLMs) with safe swarm motion planning, offering an automated and novel approach to deployable drone swarm choreography. Swarm-GPT enables users to automatically generate synchronized drone performances through natural language instructions. With an emphasis on safety and creativity, Swarm-GPT addresses a critical gap in the field of drone choreography by integrating the creative power of generative models with the effectiveness and safety of model-based planning algorithms. This goal is achieved by prompting the LLM to generate a unique set of waypoints based on extracted audio data. A trajectory planner processes these waypoints to guarantee collision-free and feasible motion. Results can be viewed in simulation prior to execution and modified through dynamic re-prompting. To date, Swarm-GPT has been successfully showcased at three live events, exemplifying safe real-world deployment of pre-trained models.
Aoran Jiao · Tanmay Patel · Sanjmi Khurana · Anna-Mariya Korol · Lukas Brunke · Vivek Adajania · Utku Culha · SiQi Zhou · Angela Schoellig 🔗 |
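The generate-and-verify structure of the system lends itself to a short control loop: the LLM proposes waypoints, the model-based planner certifies them, and infeasible proposals trigger a re-prompt. A minimal sketch, with llm_waypoints, plan_trajectories, is_feasible, and the feedback format all as hypothetical stand-ins:

def choreograph(music_features, llm_waypoints, plan_trajectories,
                is_feasible, max_retries=3):
    """Creative stage proposes, safety stage disposes: nothing is flown
    unless the model-based planner certifies the motion collision-free."""
    feedback = None
    for _ in range(max_retries):
        waypoints = llm_waypoints(music_features, feedback)  # LLM proposal
        trajectories = plan_trajectories(waypoints)          # safe planning
        if is_feasible(trajectories):
            return trajectories
        # Dynamic re-prompting: tell the LLM why its last proposal failed.
        feedback = "previous waypoints violated spacing or velocity limits"
    raise RuntimeError("no feasible choreography within the retry budget")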
-
|
Human Scene Transformer
(
Poster
)
>
link
In this work, we present a human-centric scene transformer that predicts future human trajectories from input features including human positions and 3D skeletal keypoints derived from onboard, in-the-wild robot sensory information. The resulting model captures the inherent uncertainty in future human trajectory prediction and achieves state-of-the-art performance on common prediction benchmarks and on a human tracking dataset captured from a mobile robot. Furthermore, we identify agents with limited historical data as a major contributor to error; by leveraging multi-modal data, our approach reduces this error by up to 11%. |
Tim Salzmann · Hao-Tien Lewis Chiang · Markus Ryll · Dorsa Sadigh · Carolina Parada · Alex Bewley 🔗 |
-
|
Low-Cost Exoskeletons for Learning Whole-Arm Manipulation in the Wild
(
Poster
)
>
link
While humans can use parts of their arms other than the hands for manipulations like gathering and supporting, whether robots can effectively learn and perform the same type of operations remains relatively unexplored. As these manipulations require joint-level control to regulate the complete poses of the robots, we develop AirExo, a low-cost, adaptable, and portable dual-arm exoskeleton for teleoperation and demonstration collection. Since collecting teleoperated data is expensive and time-consuming, we further leverage AirExo to collect cheap in-the-wild demonstrations at scale. Under our in-the-wild learning framework, we show that with only 3 minutes of teleoperated demonstrations, augmented by diverse and extensive in-the-wild data collected by AirExo, robots can learn a policy that is comparable to, or even better than, one learned from over 20 minutes of teleoperated demonstrations. Experiments demonstrate that our approach enables the model to learn a more general and robust policy across the various stages of the task, enhancing success rates in task completion even in the presence of disturbances. |
Hongjie Fang · Hao-Shu Fang · Yiming Wang · Jieji Ren · Jingjing Chen · Ruo Zhang · Weiming Wang · Cewu Lu 🔗 |
-
|
Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models
(
Poster
)
>
link
As autonomous driving technology matures, end-to-end methodologies have emerged as a leading strategy, promising seamless integration from perception to control via deep learning. However, existing systems grapple with challenges such as unexpected open-set environments and the complexity of black-box models. At the same time, the evolution of deep learning has introduced larger, multimodal foundation models, offering multi-modal visual and textual understanding. In this paper, we harness these multimodal foundation models to enhance the robustness and adaptability of autonomous driving systems. We introduce a method to extract nuanced spatial features from transformers and to incorporate latent space simulation for improved training and policy debugging. We use pixel/patch-aligned feature descriptors to expand foundation model capabilities into an end-to-end multimodal driving model, demonstrating unparalleled results in diverse tests. Our solution combines language with visual perception and achieves significantly greater robustness in out-of-distribution situations. Check our video at https://drive.google.com/file/d/1B8N7mUVsECkGfrEFRKJpgsOD8LXbcG9y/view?usp=sharing. |
Tsun-Hsuan Johnson Wang · Alaa Maalouf · Wei Xiao · Yutong Ban · Alexander Amini · Guy Rosman · Sertac Karaman · Daniela Rus 🔗 |
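One plausible way to obtain patch-aligned feature descriptors from a frozen foundation model is to read out the per-patch token embeddings of a vision transformer. A minimal sketch using CLIP's vision tower via Hugging Face transformers; the paper's actual extraction scheme may differ:

import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def patch_features(image):
    """Return a (grid, grid, dim) tensor of per-patch embeddings that a
    downstream driving policy can consume in place of raw pixels."""
    inputs = processor(images=image, return_tensors="pt")
    tokens = model(**inputs).last_hidden_state[0, 1:]  # drop the CLS token
    grid = int(tokens.shape[0] ** 0.5)                 # 7x7 for ViT-B/32 at 224 px
    return tokens.reshape(grid, grid, -1)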
-
|
LLM Augmented Hierarchical Agents
(
Poster
)
>
link
Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (tabula rasa learning). Humans can generate and execute plans with temporally-extended actions and quickly learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have this same ability. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to perform impressive in-context learning and reasoning. However, using LLMs to solve real-world tasks is hard because they are not grounded in the current task. In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. Rather than relying on LLMs completely, we use them to guide the high-level policy, making learning significantly more sample efficient. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks. We show that agents trained using our approach outperform other baseline methods and, once trained, don't need access to LLMs during deployment. |
Bharat Prakash · Tim Oates · Tinoosh Mohsenin 🔗 |
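One way to realize "LLMs guide a high-level policy" is to treat the LLM's suggestion as a soft prior over skills rather than a hard override, so the prior can be annealed away and dropped at deployment. A minimal sketch, with the weighting scheme as an illustrative assumption, not the paper's exact formulation:

import numpy as np

def choose_skill(q_values, llm_suggestion, guidance_weight=1.0):
    """Sample a high-level skill with the LLM's suggestion as a soft prior.

    q_values: the learned high-level policy's scores over skills.
    llm_suggestion: index of the skill the LLM recommends right now.
    Annealing guidance_weight to zero removes the LLM entirely, matching
    the abstract's claim that deployment needs no LLM access.
    """
    logits = np.asarray(q_values, dtype=float)
    logits[llm_suggestion] += guidance_weight  # soft bias, not a hard override
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)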
-
|
Formalizing Lines of Research on Generalization in Deep Reinforcement Learning
(
Poster
)
>
link
Reinforcement learning research has achieved significant success and attention through the utilization of deep neural networks to solve problems in high-dimensional state or action spaces. While deep reinforcement learning policies are currently being deployed in many different fields, from medical applications to self-driving vehicles, the field is still trying to answer open questions on the generalization capabilities of deep reinforcement learning policies. In this paper, we go over the fundamental reasons why deep reinforcement learning policies encounter overfitting problems that limit their generalization capabilities. Furthermore, we formalize and unify the manifold of solution approaches for increasing generalization and overcoming overfitting in deep reinforcement learning policies. We believe our study can provide a compact, systematic, and unified analysis of the current advancements in deep reinforcement learning and help to construct robust deep neural policies with improved generalization abilities. |
Ezgi Korkmaz 🔗 |
-
|
Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning
(
Poster
)
>
link
The pre-train and fine-tune approach in machine learning has been highly successful across various domains, enabling rapid task learning by utilizing existing data and pre-trained models from the internet. We seek to apply this approach to robotic reinforcement learning, allowing robots to learn new tasks with minimal human involvement by leveraging online resources. We introduce RoboFuME, a reset-free fine-tuning system that pre-trains a versatile manipulation policy from diverse prior experience datasets and autonomously learns a target task with minimal human input. In real-world robot manipulation tasks, our method can incorporate data from an external robot dataset and improve performance on a target task in as little as 3 hours of autonomous real-world experience. We also evaluate our method against various baselines in simulation experiments. Website: https://tinyurl.com/robofume |
Jingyun Yang · Max Sobol Mark · Brandon Vu · Archit Sharma · Jeannette Bohg · Chelsea Finn 🔗 |
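The autonomous phase of such a system reduces to a reset-free online loop: collect experience, label it with a learned reward, update the policy. A minimal sketch, with env, policy, reward_model, replay, and update as hypothetical stand-ins for RoboFuME's actual components:

import time

def finetune_autonomously(env, policy, reward_model, replay, update, hours=3.0):
    """Reset-free online fine-tuning: collect, label, update, repeat."""
    obs = env.observe()
    deadline = time.time() + hours * 3600
    while time.time() < deadline:
        action = policy(obs)
        next_obs = env.step(action)       # no manual resets between attempts
        reward = reward_model(next_obs)   # learned reward, e.g. a VLM score
        replay.add(obs, action, reward, next_obs)
        update(policy, replay)            # off-policy RL update
        obs = next_obs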
-
|
Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning
(
Spotlight
)
>
link
The pre-train and fine-tune approach in machine learning has been highly successful across various domains, enabling rapid task learning by utilizing existing data and pre-trained models from the internet. We seek to apply this approach to robotic reinforcement learning, allowing robots to learn new tasks with minimal human involvement by leveraging online resources. We introduce RoboFuME, a reset-free fine-tuning system that pre-trains a versatile manipulation policy from diverse prior experience datasets and autonomously learns a target task with minimal human input. In real-world robot manipulation tasks, our method can incorporate data from an external robot dataset and improve performance on a target task in as little as 3 hours of autonomous real-world experience. We also evaluate our method against various baselines in simulation experiments. Website: https://tinyurl.com/robofume |
Jingyun Yang · Max Sobol Mark · Brandon Vu · Archit Sharma · Jeannette Bohg · Chelsea Finn 🔗 |
-
|
Vision-Language Models Provide Promptable Representations for Reinforcement Learning
(
Poster
)
>
link
Intelligent beings have the ability to quickly learn new behaviors and tasks by leveraging background world knowledge. We would like to endow RL agents with a similar ability to use contextual prior information. To this end, we propose a novel approach that uses the vast amounts of general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using such models as sources of promptable representations: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on RL tasks in Minecraft and find that policies trained on promptable embeddings significantly outperform equivalent policies trained on generic, non-promptable image encoder features and instruction-following methods. In ablations, we find that VLM promptability and text generation are both important in yielding good representations for RL. Finally, we give a simple method for evaluating the prompts used by our approach without running expensive RL trials, ensuring that it extracts task-relevant semantic features from the VLM. |
William Chen · Oier Mees · Aviral Kumar · Sergey Levine 🔗 |
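The core interface is small: embed the observation together with a task-context prompt and hand the resulting vector to the policy in place of generic image features. A minimal sketch, with vlm_embed as a hypothetical wrapper around a pre-trained VLM's internal representation:

def promptable_representation(image, task_prompt, vlm_embed):
    """Build a task-conditioned policy input: a VLM embedding of the
    observation, elicited through a prompt that supplies task context.
    vlm_embed is a hypothetical callable returning the VLM's internal
    representation for the given image/prompt pair."""
    prompt = (f"{task_prompt}\n"
              "List the visible objects relevant to this task.")
    return vlm_embed(image, prompt)

# The RL policy then consumes this vector in place of generic image
# features, e.g. action = policy(promptable_representation(obs, ctx, vlm_embed)).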
-
|
Trajeglish: Learning the Language of Driving Scenarios
(
Poster
)
>
link
A longstanding challenge for self-driving development is the ability to simulate dynamic driving scenarios seeded from recorded driving logs. Given an initial scene observed during a test drive, we seek the ability to sample plausible scene-consistent future trajectories for all agents in the scene, even when the actions for a subset of agents are chosen by an external source, such as a black-box self-driving planner. In order to model the complicated spatial and temporal interaction across agents in driving scenarios, we propose to tokenize the motion of dynamic agents and use tools from language modeling to model the full sequence of multi-agent actions. Our traffic model explicitly captures intra-timestep dependence between agents, which is essential for simulation given only a single frame of historical scene context, as well as enabling improvements when provided longer historical context. We demonstrate competitive results sampling scenarios given initializations from the Waymo Open Dataset with full autonomy and partial autonomy, highlighting the ability of our model to interact with agents outside its control. |
Jonah Philion · Xue Bin Peng · Sanja Fidler 🔗 |
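The tokenization step can be illustrated by discretizing per-timestep displacements onto a small grid, yielding one "word" per agent per timestep for a standard language model to consume. A minimal sketch with an illustrative bin layout; the paper's tokenizer differs in detail:

import numpy as np

def tokenize_motion(positions, bins_per_axis=13, max_step=2.0):
    """Map an agent's per-timestep (dx, dy) displacements to discrete
    token ids so a standard language model can model the sequence."""
    deltas = np.diff(np.asarray(positions, dtype=float), axis=0)
    edges = np.linspace(-max_step, max_step, bins_per_axis + 1)
    ix = np.clip(np.digitize(deltas[:, 0], edges) - 1, 0, bins_per_axis - 1)
    iy = np.clip(np.digitize(deltas[:, 1], edges) - 1, 0, bins_per_axis - 1)
    return ix * bins_per_axis + iy  # one token per timestep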
-
|
Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models
(
Poster
)
>
link
If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller attains. Specifically, we fine-tune InstructPix2Pix on robot data such that it outputs a hypothetical future observation given the robot's current observation and a language command. We then use the same robot data to train a low-level goal-conditioned policy to reach a given image observation. We find that when these components are combined, the resulting system exhibits robust generalization capabilities. The high-level planner utilizes its Internet-scale pre-training and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization than conventional language-conditioned policies. We demonstrate that this approach solves real robot control tasks involving novel objects, distractors, and even environments, both in the real world and in simulation. The project website can be found at http://subgoal-image-editing.github.io. |
Kevin Black · Mitsuhiko Nakamoto · Pranav Atreya · Homer Walke · Chelsea Finn · Aviral Kumar · Sergey Levine 🔗 |
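The resulting controller alternates between the two components: the fine-tuned image editor proposes a subgoal image, and the goal-conditioned policy chases it for a fixed horizon. A minimal sketch, with env, edit_model, and goal_policy as hypothetical stand-ins for the system's components:

def susie_control_loop(env, edit_model, goal_policy, command,
                       subgoal_horizon=20, max_subgoals=10):
    """Alternate between subgoal proposal (image-editing diffusion model)
    and subgoal pursuit (goal-conditioned low-level policy)."""
    obs = env.observe()
    for _ in range(max_subgoals):
        # Hypothetical planner call: "what should the scene look like next?"
        subgoal = edit_model(image=obs, instruction=command)
        for _ in range(subgoal_horizon):
            obs = env.step(goal_policy(obs, subgoal))
        if env.task_done():
            break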
-
|
Causal Influence Aware Counterfactual Data Augmentation
(
Poster
)
>
link
Pre-recorded data and human demonstrations are practical resources for teaching robots complex behaviors. However, the combinatorial nature of real-world scenarios requires a huge amount of data to prevent neural network policies from picking up on spurious and non-causal factors. We propose CAIAC, a data augmentation method that creates synthetic samples from a fixed dataset without the need to perform new environment interactions. Motivated by the fact that an agent may only modify the environment through its actions, we swap causally action-unaffected parts of the state-space between different observed trajectories. In several environment benchmarks, we observe an increase in generalization capabilities and sample efficiency. |
Núria Armengol Urpí · Georg Martius 🔗 |
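The swap itself is a single indexing operation once a causal-influence measure has identified which state dimensions the agent's actions never touched. A minimal sketch, with unaffected_idx assumed given rather than computed:

import numpy as np

def counterfactual_swap(traj_a, traj_b, unaffected_idx):
    """Copy the causally action-unaffected state dimensions of traj_b into
    traj_a, producing a synthetic but plausible training sample."""
    synthetic = np.array(traj_a, dtype=float, copy=True)
    synthetic[:, unaffected_idx] = np.asarray(traj_b, dtype=float)[:, unaffected_idx]
    return synthetic

# Intuition: entities the agent never interacted with in either trajectory
# can be exchanged without invalidating the recorded actions.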
-
|
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation
(
Poster
)
>
link
The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference. Furthermore, there is a key modality-bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate observation feedback during robot execution into the LLM's closed-loop planning, which has been less studied in prior work. We investigate two methods of bridging the modality gap: caption generation and a learnable interface, for incorporating explicit and implicit observation feedback into the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve most tasks, indicating that long-horizon manipulation tasks are still challenging for current popular models. We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks. |
Shengqiang Zhang · Philipp Wicke · Lütfi Kerem Senel · Luis Figueredo · Abdeldjallil Naceri · Sami Haddadin · Barbara Plank · Hinrich Schuetze 🔗 |
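The caption-generation baseline amounts to a closed loop in which each executed step is verbalized and appended to the LLM planner's context. A minimal sketch, with llm_plan_step, captioner, and execute as hypothetical stand-ins for the benchmark's baseline components:

def closed_loop_plan(instruction, llm_plan_step, captioner, execute, max_steps=20):
    """Caption-based modality bridge: verbalize each new observation and
    feed the text back into the LLM planner's context."""
    history = [f"Task: {instruction}"]
    for _ in range(max_steps):
        step = llm_plan_step("\n".join(history))  # e.g. "put the red block on the blue bowl"
        if step.strip().lower() == "done":
            break
        observation = execute(step)
        history += [f"Action: {step}",
                    f"Observation: {captioner(observation)}"]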