

Poster

APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets

Zuxin Liu · Thai Hoang · Jianguo Zhang · Ming Zhu · Tian Lan · Shirley Kokane · Juntao Tan · Weiran Yao · Zhiwei Liu · Yihao Feng · Rithesh R N · Liangwei Yang · Silvio Savarese · Juan Carlos Niebles · Huan Wang · Shelby Heinecke · Caiming Xiong


Abstract:

The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to produce verifiable, high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each entry in our dataset is verified through three hierarchical stages: format checking, actual function execution, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained on our curated datasets, even with only 7B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agents.
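The three-stage hierarchical verification described in the abstract can be sketched as a simple pipeline: a candidate sample passes only if it is well-formed, its function call executes successfully against a real API, and a judge deems the result semantically consistent with the query. This is a minimal illustrative sketch, not the authors' implementation; the sample schema (`query`/`tool_call` fields), the `api_registry` mapping, and the `judge` callable are all assumed names for illustration.

```python
import json


def check_format(raw):
    """Stage 1: format checking — the sample must parse as JSON and
    contain a query plus a structurally valid tool call (assumed schema)."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(key in sample for key in ("query", "tool_call")):
        return None
    call = sample["tool_call"]
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return None
    return sample


def execute_call(sample, api_registry):
    """Stage 2: actual function execution — run the named call against a
    registry of executable APIs; any exception counts as a failure.
    (In this sketch a legitimate None return is also treated as failure.)"""
    call = sample["tool_call"]
    fn = api_registry.get(call["name"])
    if fn is None:
        return None
    try:
        return fn(**call["arguments"])
    except Exception:
        return None


def semantic_check(sample, result, judge):
    """Stage 3: semantic verification — a judge (e.g. an LLM in the paper's
    setting; any callable here) decides whether the execution result
    actually answers the query."""
    return judge(sample["query"], result)


def verify(raw, api_registry, judge):
    """Run all three stages in order; a sample survives only if every
    stage passes."""
    sample = check_format(raw)
    if sample is None:
        return False
    result = execute_call(sample, api_registry)
    if result is None:
        return False
    return bool(semantic_check(sample, result, judge))
```

For example, with a registry containing a toy `add` API and a trivial judge, a well-formed sample that calls `add` passes all three stages, while malformed JSON or a call to an unregistered function is filtered out at stage 1 or 2 respectively.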
