Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)

Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks

Xiaodong Yu · Hao Cheng · Xiaodong Liu · Dan Roth · Jianfeng Gao


Although remarkable progress has been made in preventing LLM hallucinations through instruction tuning and retrieval augmentation, it remains difficult to measure the reliability of LLMs using available static data, which is often not challenging enough and may suffer from data leakage. Inspired by adversarial machine learning, this paper aims to develop an automatic method for generating new evaluation data by appropriately modifying existing data on which LLMs behave faithfully. Specifically, this paper presents AutoDebug, an LLM-based framework that uses prompt chaining to generate transferable adversarial attacks in the form of question-answering examples. We seek to understand the extent to which these examples trigger hallucination behavior in LLMs. We first implement our framework using ChatGPT and evaluate the resulting two variants of a popular open-domain question-answering dataset, Natural Questions (NQ), on a collection of open-source and proprietary LLMs under various prompting settings. Our generated evaluation data is human-readable and, as we show, humans can answer these modified questions well. Nevertheless, we observe pronounced accuracy drops across multiple LLMs, including GPT-4. Our experimental results confirm that LLMs are likely to hallucinate in two categories of question-answering scenarios: (1) when there are conflicts between the knowledge given in the prompt and their parametric knowledge, or (2) when the knowledge expressed in the prompt is complex. Finally, the adversarial examples generated by the proposed method are transferable across all considered LLMs, making our approach viable for LLM-based debugging using more cost-effective LLMs.
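
To make the idea of prompt chaining concrete, the following is a minimal Python sketch of one possible reading of scenario (1), where the evidence in the prompt is edited to conflict with what a model may have memorized. It is not the authors' implementation: the two-step chain, the prompt wording, the `LLM` callable type, the dictionary keys, and the substring-match scoring are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def make_adversarial(example: Dict[str, str], attacker: LLM) -> Dict[str, str]:
    """Two-step prompt chain (assumed): propose a new answer, then rewrite the evidence."""
    # Step 1: ask the attacker model for a different but plausible answer.
    new_answer = attacker(
        f"Question: {example['question']}\nAnswer: {example['answer']}\n"
        "Propose a different, plausible answer of the same type:"
    ).strip()
    # Step 2: feed the output of step 1 into a second prompt that edits the evidence.
    new_evidence = attacker(
        f"Evidence: {example['evidence']}\n"
        f"Rewrite the evidence so that it supports the answer '{new_answer}':"
    ).strip()
    return {
        "question": example["question"],
        "evidence": new_evidence,
        "answer": new_answer,
    }


def accuracy(examples: List[Dict[str, str]], target: LLM) -> float:
    """Fraction of examples whose gold answer appears in the target model's output."""
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        prediction = target(
            f"Evidence: {ex['evidence']}\nQuestion: {ex['question']}\nAnswer:"
        )
        # Simplified scoring (assumption): case-insensitive substring match.
        correct += ex["answer"].lower() in prediction.lower()
    return correct / len(examples)
```

Under these assumptions, comparing `accuracy(original, target)` with `accuracy([make_adversarial(e, attacker) for e in original], target)` gives the kind of accuracy drop the abstract refers to, and using a cheaper model as `attacker` than as `target` mirrors the transferability claim.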