Poster in Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)

Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think

Massimo Caccia ⋅ Megh Thakkar ⋅ Léo Boisvert ⋅ Thibault de Chezelles ⋅ Alexandre Piche ⋅ Nicolas Chapados ⋅ Alexandre Drouin ⋅ Maxime Gasse ⋅ Alexandre Lacoste

Keywords: LLM fine-tuning, web agents, behavior cloning


Abstract

Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing digital tasks through web interfaces in a human-like manner. However, even the strongest closed-source models often struggle to achieve robust results on several benchmarks, while a notable performance gap exists between them and open-source counterparts. This study investigates the potential of fine-tuning to enhance the performance of a smaller, lower-performing but cost-efficient LLM by leveraging successful traces from stronger LLMs, referred to as experts. We outline a comprehensive pipeline for data collection, filtering, and supervised fine-tuning and explore various behavior cloning parameters. Our experiments provide key insights into the challenges of fine-tuning LLMs into web agents on benchmarks like MiniWoB and WorkArena. Notably, we find that the fine-tuned agents' ability to predict expert trajectories does not consistently lead to improved downstream task performance. This raises issues such as off-policy bias and the loss of reasoning abilities during fine-tuning. We discuss potential solutions to these challenges and make both the codebase and a dataset of 140M tokens open-source for the community to build upon.
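As a rough illustration of the filtering step mentioned in the abstract, the sketch below keeps only successful expert traces and flattens them into (prompt, target) pairs for supervised fine-tuning. It is not taken from the released codebase: the `Step` and `Trace` classes, the `build_sft_pairs` function, the prompt layout, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str  # hypothetical: text rendering of the page shown to the agent
    action: str       # hypothetical: the expert's output for this step


@dataclass
class Trace:
    task: str
    steps: List[Step]
    success: bool  # whether the expert completed the task


def build_sft_pairs(traces: List[Trace]) -> List[Tuple[str, str]]:
    """Keep successful expert traces and flatten them into (prompt, target) pairs."""
    pairs = []
    for trace in traces:
        if not trace.success:
            continue  # filtering: drop failed episodes
        for step in trace.steps:
            # Assumed prompt layout; a real agent prompt would carry more context.
            prompt = f"Task: {trace.task}\nObservation: {step.observation}"
            pairs.append((prompt, step.action))
    return pairs


# Made-up example data: one successful and one failed expert episode.
traces = [
    Trace("click the OK button", [Step("<button id=1>OK</button>", "click('1')")], True),
    Trace("submit the form", [Step("<form>...</form>", "noop()")], False),
]
pairs = build_sft_pairs(traces)
```

Each resulting pair is one behavior cloning example: the smaller model is trained to reproduce the expert's action given the same prompt. Because every prompt comes from states the expert visited, a dataset built this way is off-policy with respect to the fine-tuned agent, which is one of the issues the abstract raises.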
