Importance of Synthesizing High-quality Data for Text-to-SQL Parsing

Yiyun Zhao ⋅ Jiarong Jiang ⋅ Yiqun Hu ⋅ Wuwei Lan ⋅ Henghui Zhu ⋅ Anuj Chauhan ⋅ Hanbo Li ⋅ Lin Pan ⋅ Jun Wang ⋅ Chung-Wei Hang ⋅ Sheng Zhang ⋅ Mingwen Dong ⋅ Joseph Lilien ⋅ Patrick Ng ⋅ Zhiguo Wang ⋅ Vittorio Castelli ⋅ Bing Xiang

2022 Poster in Workshop: Synthetic Data for Empowering ML Research

Abstract

There has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we examine existing synthesized datasets and find that state-of-the-art text-to-SQL algorithms do not improve further on popular benchmarks when trained with augmented synthetic data. We observe two shortcomings in these datasets: illogical synthetic SQL queries that result from independent column sampling, and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, they gain significant accuracy and achieve new state-of-the-art performance on Spider.
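To make the idea of schema-distance-weighted column sampling concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting, so the toy schema, the foreign-key list, the helper names (`table_distances`, `column_weights`, `sample_column`), and the exponential `decay` are all assumptions chosen for illustration. The sketch measures distance as the number of foreign-key hops between tables and makes columns in nearby tables more likely to be sampled together, in contrast to sampling columns independently.

```python
import random
from collections import deque

# Hypothetical toy schema (an assumption, not taken from the paper):
# table name -> list of column names.
SCHEMA = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "singer_id", "stadium_id", "year"],
    "stadium": ["stadium_id", "name", "capacity"],
}
# Hypothetical foreign-key links between tables (undirected for distance purposes).
FOREIGN_KEYS = [("concert", "singer"), ("concert", "stadium")]


def table_distances(start):
    """Breadth-first search over foreign-key links: hops from `start` to each reachable table."""
    neighbors = {table: set() for table in SCHEMA}
    for a, b in FOREIGN_KEYS:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in sorted(neighbors[cur]):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def column_weights(anchor_table, decay=0.5):
    """Weight each reachable column by decay ** (hops from the anchor table).

    The exponential decay is an assumed weighting, used only for illustration.
    Tables with no foreign-key path to the anchor get no weight at all, so they
    can never be sampled into the same query.
    """
    weights = {}
    for table, hops in table_distances(anchor_table).items():
        for col in SCHEMA[table]:
            weights[(table, col)] = decay ** hops
    return weights


def sample_column(anchor_table, rng, decay=0.5):
    """Draw one (table, column) pair, favoring tables close to the anchor."""
    weights = column_weights(anchor_table, decay)
    candidates = sorted(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]
```

For example, `sample_column("singer", random.Random(0))` returns a `(table, column)` pair in which columns of `singer` are twice as likely as columns of `concert` and four times as likely as columns of `stadium` under the assumed decay of 0.5. Restricting candidates to tables reachable through foreign keys is one simple way to avoid arbitrary joins.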
