
Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)

Evaluating Adversarial Defense in the Era of Large Language Models

Joachim Studnia · Simiao Zuo · Xiaodong Liu · Qiang Lou · Jian Jiao · Denis Charles


Large language models (LLMs) have demonstrated superior performance in many natural language processing tasks. Existing works have shown that LLMs are not robust to adversarial attacks, calling into question the applicability of these models in scenarios with safety concerns. However, one key aspect that has been overlooked is evaluating and developing defense mechanisms against adversarial attacks. In this work, we systematically study how LLMs react to different adversarial defense strategies. We also propose defenses tailored for LLMs that can significantly improve their robustness: first, we develop prompting methods to alert the LLM about potential adversarial content; second, we use neural models, such as the LLM itself, for typo correction; third, we propose an effective fine-tuning scheme to improve robustness against corrupted inputs. Extensive experiments are conducted to evaluate the adversarial defense approaches. We show that by using the proposed defenses, the robustness of LLMs can increase by up to 20%.
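The first defense described above (prompting the LLM to expect adversarial content) can be illustrated with a minimal sketch. The preamble wording and function names below are illustrative assumptions, not the paper's actual prompts:

```python
# Sketch of a prompt-based adversarial defense: prepend an instruction
# alerting the model that the input may contain adversarial perturbations
# (e.g. character-level typos), so it answers based on intended meaning.
# The preamble text and names here are hypothetical, not from the paper.

DEFENSE_PREAMBLE = (
    "Warning: the following input may contain adversarial typos or "
    "perturbations. Ignore character-level noise and respond based on "
    "the intended meaning.\n\n"
)

def wrap_with_defense(user_input: str) -> str:
    """Return a prompt that alerts the LLM to potential adversarial content."""
    return DEFENSE_PREAMBLE + user_input

# The wrapped prompt would then be sent to the LLM in place of the raw input.
defended_prompt = wrap_with_defense("Waht is teh captial of Frnace?")
```

The same wrapper pattern could front-end the typo-correction defense, where a neural model (possibly the LLM itself) first rewrites the noisy input before the task prompt is issued.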
