
Workshop: Synthetic Data Generation with Generative AI

CALICO: Conversational Agent Localization via Synthetic Data Generation

Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach

Keywords: [ Natural Language Processing ] [ Large Language Models ] [ Synthetic Data Generation ] [ Multilingual ]


We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For named entities, CALICO supports three operations: verbatim copy, literal translation, and localization, i.e., generating entity values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. To demonstrate the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 6 languages. We show that our new HL version is more challenging than the original human-translated (HT) version of the test set. We also show that CALICO outperforms the state-of-the-art LINGUIST method (which relies on literal slot translation out of context) in both cases: on the HT test set, CALICO generates more accurate slot translations, and on the HL test set, CALICO generates localized entities that are closer to the human-localized ones.
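To make the three named-entity operations concrete, the sketch below shows how each one would treat a slot value when moving an English utterance to French. It is only an illustration of the distinction between copy, translation, and localization: CALICO itself is a fine-tuned LLM, whereas the `SlotOp` enum, the `apply_slot_op` function, and the toy lookup tables here are hypothetical stand-ins that do not come from the paper.

```python
from enum import Enum


class SlotOp(Enum):
    """The three slot operations named in the abstract."""
    COPY = "copy"            # keep the source value verbatim
    TRANSLATE = "translate"  # literal translation of the value
    LOCALIZE = "localize"    # replace with a value natural in the target locale


# Hypothetical toy tables standing in for what a fine-tuned LLM would generate.
TOY_TRANSLATIONS = {("morning", "fr"): "matin"}
TOY_LOCALIZATIONS = {("city", "fr"): "Lyon"}


def apply_slot_op(op, slot_type, value, target_lang):
    """Return the target-language slot value under the chosen operation.

    Falls back to the source value when the toy tables have no entry.
    """
    if op is SlotOp.COPY:
        return value
    if op is SlotOp.TRANSLATE:
        return TOY_TRANSLATIONS.get((value, target_lang), value)
    if op is SlotOp.LOCALIZE:
        return TOY_LOCALIZATIONS.get((slot_type, target_lang), value)
    raise ValueError(f"unknown slot operation: {op!r}")


# An airline name is copied, a time of day is translated, a city is localized.
copied = apply_slot_op(SlotOp.COPY, "airline", "Delta", "fr")
translated = apply_slot_op(SlotOp.TRANSLATE, "period_of_day", "morning", "fr")
localized = apply_slot_op(SlotOp.LOCALIZE, "city", "Boston", "fr")
```

In this toy setup, the choice of operation per slot is passed in explicitly; how CALICO selects or is instructed to use each operation is not specified in the abstract.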
