Workshop
Backdoors in Deep Learning: The Good, the Bad, and the Ugly
Khoa D Doan · Aniruddha Saha · Anh Tran · Yingjie Lao · Kok-Seng Wong · Ang Li · HARIPRIYA HARIKUMAR · Eugene Bagdasaryan · Micah Goldblum · Tom Goldstein
Room 203 - 205
Deep neural networks (DNNs) are revolutionizing almost all AI domains and have become the core of many modern AI systems. While having superior performance compared to classical methods, DNNs are also facing new security problems, such as adversarial and backdoor attacks, that are hard to discover and resolve due to their black-box-like property. Backdoor attacks, particularly, are a brand-new threat that was only discovered in 2017 but has gained attention quickly in the research community. The number of backdoor-related papers grew from 21 to around 110 after only one year (2019-2020). In 2022 alone, there were more than 200 papers on backdoor learning, showing a high research interest in this domain.Backdoor attacks are possible because of insecure model pretraining and outsourcing practices. Due to the complexity and the tremendous cost of collecting data and training models, many individuals/companies just employ models or training data from third parties. Malicious third parties can add backdoors into their models or poison their released data before delivering it to the victims to gain illegal benefits. This threat seriously damages the safety and trustworthiness of AI development. Lately, many studies on backdoor attacks and defenses have been conducted to prevent this critical vulnerability.While most works consider backdoor ``evil'', some studies exploit it for good purposes. A popular approach is to use the backdoor as a watermark to detect illegal use of commercialized data/models. A few works employ the backdoor as a trapdoor for adversarial defense. Learning the working mechanism of backdoor also elevates a deeper understanding of how deep learning models work.This workshop is designed to provide a comprehensive understanding of the current state of backdoor research. We also want to raise awareness of the AI community on this important security problem, and motivate researchers to build safe and trustful AI systems.
Schedule
Fri 7:00 a.m. - 7:30 a.m.
|
A Blessing in Disguise: Backdoor Attacks as Watermarks for Dataset Copyright Protection
(
Invited Talk
)
>
SlidesLive Video |
Yiming Li 🔗 |
Fri 7:30 a.m. - 8:00 a.m.
|
Recent Advances in Backdoor Defense and Benchmark
(
Invited Talk
)
>
SlidesLive Video In this talk, I will firstly introduce recent advances of backdoor defense, covering poisoned sample detection based defense at the pre-training stage, secure training based defense at the in-training stage, and backdoor mitigation based defense at the post-training stage. Then, I will introduce BackdoorBench, which is a comprehensive benchmark containing 30+ mainstream backdoor attack and defense methods, 10,000 pairs of attack-defense evaluations, as well as several interesting findings and analysis with 15+ analysis tools. The benchmark has been released at https://backdoorbench.github.io/. |
Baoyuan Wu 🔗 |
Fri 8:00 a.m. - 8:30 a.m.
|
COFFEE BREAK
(
COFFEE BREAK
)
>
|
🔗 |
Fri 8:30 a.m. - 9:00 a.m.
|
Invited Talk
(
Invited Talk
)
>
SlidesLive Video |
Jonas Geiping 🔗 |
Fri 9:00 a.m. - 9:15 a.m.
|
Effective Backdoor Mitigation Depends on the Pre-training Objective
(
Oral
)
>
link
SlidesLive Video Despite the remarkable capabilities of current machine learning (ML) models, they are still susceptible to adversarial and backdoor attacks. Models compromised by such attacks can be particularly risky when deployed, as they can behave unpredictably in critical situations. Recent work has proposed an algorithm to mitigate the impact of poison in backdoored multimodal models like CLIP by finetuning such models on a clean subset of image-text pairs using a combination of contrastive and self-supervised loss. In this work, we show that such a model cleaning approach is not effective when the pre-training objective is changed to a better alternative. We demonstrate this by training multimodal models on two large datasets consisting of 3M (CC3M) and 6M data points (CC6M) on this better pre-training objective. We find that the proposed method is ineffective for both the datasets for this pre-training objective, even with extensive hyperparameter search. Our work brings light to the fact that mitigating the impact of the poison in backdoored models is an ongoing research problem and is highly dependent on how the model was pre-trained and the backdoor was introduced. |
Sahil Verma · Gantavya Bhatt · Soumye Singhal · Arnav Das · Chirag Shah · John Dickerson · Jeff A Bilmes 🔗 |
Fri 9:15 a.m. - 9:45 a.m.
|
Universal jailbreak backdoors from poisoned human feedback
(
Invited Talk
)
>
SlidesLive Video Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. We consider the problem of poisoning the RLHF data to embed a backdoor trigger into the model. The trigger should act like a universal "sudo" command, enabling arbitrary harmful responses at test time. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors. |
Florian Tramer 🔗 |
Fri 9:45 a.m. - 11:00 a.m.
|
LUNCH BREAK
(
LUNCH BREAK
)
>
|
🔗 |
Fri 11:00 a.m. - 11:15 a.m.
|
VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models
(
Oral
)
>
link
SlidesLive Video Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. |
Sheng-Yen Chou · Pin-Yu Chen · Tsung-Yi Ho 🔗 |
Fri 11:15 a.m. - 11:30 a.m.
|
The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline
(
Oral
)
>
link
Diffusion models (DM) have increasingly demonstrated an ability to generate high-quality images, often indistinguishable from real ones. However, their complexity and vast parameter space have introduced potential copyright concerns. While measures have been introduced to prevent unauthorized access to copyrighted material, the efficacy of these solutions remains unverified. In this study, we examine the vulnerabilities associated with copyright in DMs, concentrating on the influence of backdoor data poisoning attacks during further fine-tuning on public datasets. We introduce \textbf{\ourmethod}, an innovative method for embedding backdoor data poisonings tailored for DMs. This method allows the finetuned models to recreate copyrighted images in response to particular trigger prompts by embedding components of copyrighted images across various images inconspicuously. In the inference process, DMs utilize their understanding of these prompts to regenerate the copyrighted images. Our empirical results indicate that the information of copyrighted data can be stealthily encoded into training data using \ourmethod, causing the fine-tuned DM to generate infringing content. These findings underline potential pitfalls in the prevailing copyright protection strategies and underscore the necessity for increased scrutiny and preventative measures against misuse of DMs. |
Haonan Wang · Qianli Shen · Yao Tong · Yang Zhang · Kenji Kawaguchi 🔗 |
Fri 11:30 a.m. - 12:00 p.m.
|
Is this model mine? On stealing and defending machine learning models.
(
Invited Talk
)
>
SlidesLive Video |
Adam Dziedzic 🔗 |
Fri 12:00 p.m. - 12:30 p.m.
|
Invited Talk
(
Invited Talk
)
>
SlidesLive Video |
Ruoxi Jia 🔗 |
Fri 12:30 p.m. - 1:00 p.m.
|
COFFEE BREAK
(
COFFEE BREAK
)
>
|
🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
On the Limitation of Backdoor Detection Methods
(
Poster
)
>
link
We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it analyze the feasibility of such problem, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability results for backdoor detection. We show a no-free-lunch theorem, proving that universal backdoor detection is impossible, except for very small alphabet sizes. Furthermore, we link our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem, establishing a formal connections between backdoor and out-of-distribution detection. |
Georg Pichler · Marco Romanelli · Divya Prakash Manivannan · Prashanth Krishnamurthy · Farshad Khorrami · Siddharth Garg 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
How to remove backdoors in diffusion models?
(
Poster
)
>
link
Diffusion models (DM) have become state-of-the-art generative models because of their capability of generating high-quality images from noises without adversarial training. However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework on over hundreds of DMs of 3 types including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach can have close to 100% detection accuracy and reduce the backdoor effects to close to zero without significantly sacrificing the model utility. |
Shengwei An · Sheng-Yen Chou · Kaiyuan Zhang · Qiuling Xu · Guanhong Tao · Guangyu Shen · Siyuan Cheng · Shiqing Ma · Pin-Yu Chen · Tsung-Yi Ho · Xiangyu Zhang
|
Fri 1:00 p.m. - 1:45 p.m.
|
Adversarial Robustness Unhardening via Backdoor Attacks in Federated Learning
(
Poster
)
>
link
In today's data-driven landscape, the delicate equilibrium between safeguarding user privacy and unleashing data potential stands as a paramount concern. Federated learning, which enables collaborative model training without necessitating data sharing, has emerged as a privacy-centric solution. This decentralized approach brings forth security challenges, notably poisoning and backdoor attacks where malicious entities inject corrupted data. Our research, initially spurred by test-time evasion attacks, investigates the intersection of adversarial training and backdoor attacks within federated learning, introducing Adversarial Robustness Unhardening (ARU). ARU is employed by a subset of adversaries to intentionally undermine model robustness during decentralized training, rendering models susceptible to a broader range of evasion attacks. We present extensive empirical experiments evaluating ARU's impact on adversarial training and existing robust aggregation defenses against poisoning and backdoor attacks. Our findings inform strategies for enhancing ARU to counter current defensive measures and highlight the limitations of existing defenses, offering insights into bolstering defenses against ARU. |
Taejin Kim · Jiarui Li · Nikhil Madaan · Shubhranshu Singh · Carlee Joe-Wong 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
How to Backdoor HyperNetwork in Personalized Federated Learning?
(
Poster
)
>
link
This paper explores previously unknown backdoor risks in HyperNet-based personalized federated learning (HyperNetFL) through poisoning attacks. Based upon that, we propose a novel model transferring attack (called HNTroj), i.e., the first of its kind, to transfer a local backdoor infected model to all legitimate and personalized local models, which are generated by the HyperNetFL model, through consistent and effective malicious local gradients computed across all compromised clients in the whole training process. As a result, HNTroj reduces the number of compromised clients needed to successfully launch the attack without any observable signs of sudden shifts or degradation regarding model utility on legitimate data samples, making our attack stealthy. To defend against HNTroj, we adapted several backdoor-resistant FL training algorithms into HyperNetFL. An extensive experiment that is carried out using several benchmark datasets shows that HNTroj significantly outperforms data poisoning and model replacement attacks and bypasses robust training algorithms even with modest numbers of compromised clients. |
Phung Lai · Hai Phan · Issa Khalil · Abdallah Khreishah · Xintao Wu 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Universal Trojan Signatures in Reinforcement Learning
(
Poster
)
>
link
We present a novel approach for characterizing Trojaned reinforcement learning (RL) agents. By monitoring for discrepancies in how an agent's policy evaluates state observations for choosing an action, we can reliably detect whether the policy is Trojaned. Experiments on the IARPA RL challenge benchmarks show that our approach can effectively detect Trojaned models even in transfer settings with novel RL environments and modified architectures. |
Manoj Acharya · Weichao Zhou · Anirban Roy · Xiao Lin · Wenchao Li · Susmit Jha 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Analyzing And Editing Inner Mechanisms of Backdoored Language Models
(
Poster
)
>
link
Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets.Trigger warning: Offensive language. |
Max Lamparth · Ann-Katrin Reuel 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Detecting Backdoors with Meta-Models
(
Poster
)
>
link
It is widely known that it is possible to implant backdoors into neural networks,by which an attacker can choose an input to produce a particular undesirable output (e.g.\ misclassify an image).We propose to use \emph{meta-models}, neural networks that take another network's parameters as input, to detect backdoors directly from model weights.To this end we present a meta-model architecture and train it on a dataset of approx.\ 4000 clean and backdoored CNNs trained on CIFAR-10.Our approach is simple and scalable, and is able to detect the presence of a backdoor with $>99\%$ accuracy when the test trigger pattern is i.i.d., with some success even on out-of-distribution backdoors.
|
Lauro Langosco · Neel Alex · William Baker · David Quarel · Herbie Bradley · David Krueger 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Benchmark Probing: Investigating Data Leakage in Large Language Models
(
Poster
)
>
link
Large language models have consistently demonstrated exceptional performance across a wide range of natural language processing tasks. However, concerns have been raised about whether LLMs rely on benchmark data during their training phase, potentially leading to inflated scores on these benchmarks. This phenomenon, known as data contamination, presents a significant challenge within the context of LLMs. In this paper, we present a novel investigation protocol named $\textbf{T}$estset $\textbf{S}$lot Guessing ($\textbf{TS-Guessing}$) on knowledge-required benchmark MMLU and TruthfulQA, designed to estimate the contamination of emerging commercial LLMs. We divide this protocol into two subtasks: (i) $\textit{Question-based}$ setting: guessing the missing portions for long and complex questions in the testset (ii) $\textit{Question-Multichoice}$ setting: guessing the missing option given both complicated questions and options. We find that commercial LLMs could surprisingly fill in the absent data and demonstrate a remarkable increase given additional metadata (from 22.28\% to 42.19\% for Claude-instant-1 and from 17.53\% to 29.49\% for GPT-4).
|
Chunyuan Deng · Yilun Zhao · Xiangru Tang · Mark Gerstein · Arman Cohan 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data
(
Poster
)
>
link
Backdoor attacks pose a serious security threat for training neural networks as they surreptitiously introduce hidden functionalities into a model. Such backdoors remain silent during inference on clean inputs, evading detection due to inconspicuous behavior. However, once a specific trigger pattern appears in the input data, the backdoor activates, causing the model to execute its concealed function. Detecting such poisoned samples within vast datasets is virtually impossible through manual inspection. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by utilizing the power of recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers. |
Lukas Struppek · Martin Bernhard Hentschel · Clifton Poth · Dominik Hintersdorf · Kristian Kersting 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
$D^3$: Detoxing Deep Learning Dataset
(
Poster
)
>
link
Data poisoning is a prominent threat to Deep Learning applications. In backdoor attack, training samples are poisoned with a specific input pattern or transformation called trigger such that the trained model misclassifies in the presence of trigger.Despite a broad spectrum of defense techniques against data poisoning and backdoor attacks, these defenses are often outpaced by the increasing complexity and sophistication of attacks. In response to this growing threat, this paper introduces $D^3$, a novel dataset detoxification technique that leverages differential analysis methodology to extract triggers from compromised test samples captured in the wild. Specifically, we formulate the challenge of poison extraction as a constrained optimization problem and use iterative gradient descent with semantic restrictions. Upon successful extraction, $D^3$ enhances the dataset by incorporating the poison into clean validation samples and builds a classifier to separate clean and poisoned training samples. This post-mortem approach provides a robust complement to existing defenses, particularly when they fail to detect complex, stealthy poisoning attacks. $D^3$ is evaluated on 42 poisoned datasets with 18 different types of poisons, including the subtle clean-label poisoning, dynamic attack, and input-aware attack. It achieves over 95\% precision and 95\% recall on average, substantially outperforming the state-of-the-art.
|
Lu Yan · Siyuan Cheng · Guangyu Shen · Guanhong Tao · Xuan Chen · Kaiyuan Zhang · Yunshu Mao · Xiangyu Zhang 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Defending Our Privacy With Backdoors
(
Poster
)
>
link
The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One of the concerns is that adversaries can extract information about the training data using privacy attacks. Unfortunately, the task of removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging.We propose a rather easy yet effective defense based on backdoor attacks to remove private information such as names of individuals from models, and focus in this work on text encoders. Specifically, through strategic insertion of backdoors, we align the embeddings of sensitive phrases with those of neutral terms--“a person” instead of the person's name.Our empirical results demonstrate the effectiveness of our backdoor-based defense on CLIP by assessing its performance using a specialized privacy attack for zero-shot classifiers.Our approach provides not only a new “dual-use” perspective on backdoor attacks, but also presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data. |
Dominik Hintersdorf · Lukas Struppek · Daniel Neider · Kristian Kersting 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Clean-label Backdoor Attacks by Selectively Poisoning with Limited Information from Target Class
(
Poster
)
>
link
Deep neural networks have been shown to be vulnerable to backdoor attacks, in which the adversary manipulates the training dataset to mislead the model when the trigger appears, while it still behaves normally on benign data. Clean label attacks can succeed without modifying the semantic label of poisoned data, which are more stealthy but, on the other hand, are more challenging. To control the victim model, existing works focus on adding triggers to a random subset of the dataset, neglecting the fact that samples contribute unequally to the success of the attack and, therefore do not exploit the full potential of the backdoor. Some recent studies propose different strategies to select samples by recording the forgetting events or looking for hard samples with a supervised trained model. However, these methods require training and assume that the attacker has access to the whole labeled training set, which is not always the case in practice. In this work, we consider a more practical setting where the attacker only provides a subset of the dataset with the target label and has no knowledge of the victim model, and propose a method to select samples to poison more effectively. Our method takes advantage of pretrained self-supervised models, therefore incurs no extra computational cost for training, and can be applied to any victim model. Experiments on benchmark datasets illustrate the effectiveness of our strategy in improving clean-label backdoor attacks. Our strategy helps SIG reach 91\% success rate with only 10\% poisoning ratio. |
Nguyen Hung-Quang · Ngoc-Hieu Nguyen · The Anh Ta · Thanh Nguyen-Tang · Hoang Thanh-Tung · Khoa D Doan 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
BadFusion: 2D-Oriented Backdoor Attacks against 3D Object Detection
(
Poster
)
>
link
3D object detection plays an important role in autonomous driving; however, its vulnerability to backdoor attacks has become evident. By injecting ''triggers'' to poison the training dataset, backdoor attacks manipulate the detector's prediction for inputs containing these triggers. Existing backdoor attacks against 3D object detection primarily poison 3D LiDAR signals, where large-sized 3D triggers are injected to ensure their visibility within the sparse 3D space, rendering them easy to detect and impractical in real-world scenarios. In this paper, we delve into the robustness of 3D object detection, exploring a new backdoor attack surface through 2D cameras. Given the prevalent adoption of camera and LiDAR signal fusion for high-fidelity 3D perception, we investigate the latent potential of camera signals to disrupt the process. Although the dense nature of camera signals enables the use of nearly imperceptible small-sized triggers to mislead 2D object detection, realizing 2D-oriented backdoor attacks against 3D object detection is non-trivial. The primary challenge emerges from the fusion process that transforms camera signals into a 3D space, thereby compromising the association with the 2D trigger to the target output. To tackle this issue, we propose an innovative 2D-oriented backdoor attack against LiDAR-camera fusion methods for 3D object detection, named BadFusion, aiming to uphold trigger effectiveness throughout the entire fusion process. Extensive experiments validate the effectiveness of BadFusion, achieving a significantly higher attack success rate compared to existing 2D-oriented attacks. |
Saket Sanjeev Chaturvedi · Lan Zhang · Wenbin Zhang · Pan He · Xiaoyong Yuan 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
(
Poster
)
>
link
Growing applications of large language models (LLMs) trained by a third party raise serious concerns on the security vulnerability of LLMs. It has been demonstrated that malicious actors can covertly exploit these vulnerabilities in LLMs through poisoning attacks aimed at generating undesirable outputs. While poisoning attacks have received significant attention in the image domain (e.g., object detection), and classification tasks, their implications for generative models, particularly in the realm of natural language generation (NLG) tasks, remain poorly understood. To bridge this gap, we perform a comprehensive exploration of various poisoning techniques to assess their effectiveness across a range of generative tasks. Furthermore, we introduce a range of metrics designed to quantify the success and stealthiness of poisoning attacks specifically tailored to NLG tasks. Through extensive experiments on multiple NLG tasks, LLMs and datasets, we show that it is possible to successfully poison an LLM during the fine-tuning stage using as little as 1\% of the total tuning data samples. Our paper presents the first systematic approach to comprehend poisoning attacks targeting NLG tasks considering a wide range of triggers and attack settings. We hope our findings will assist the AI security community in devising appropriate defenses against such threats. |
Shuli Jiang · Swanand Kadhe · Yi Zhou · Ling Cai · Nathalie Baracaldo 🔗 |
Fri 1:00 p.m. - 1:45 p.m.
|
From Trojan Horses to Castle Walls: Unveiling Bilateral Backdoor Effects in Diffusion Models
(
Poster
)
>
link
While state-of-the-art diffusion models (DMs) excel in image generation, concerns regarding their security persist. Earlier research highlighted DMs' vulnerability to backdoor attacks, but these studies placed stricter requirements than conventional methods like 'BadNets' in image classification. This is because the former necessitates modifications to the diffusion sampling and training procedures. Unlike the prior work, we investigate whether generating backdoor attacks in DMs can be as simple as BadNets, i.e., by only contaminating the training dataset without tampering the original diffusion process. In this more realistic backdoor setting, we uncover bilateral backdoor effects that not only serve an adversarial purpose (compromising the functionality of DMs) but also offer a defensive advantage (which can be leveraged for backdoor defense). On one hand, a BadNets-like backdoor attack remains effective in DMs for producing incorrect images that do not align with the intended text conditions. On the other hand, backdoored DMs exhibit an increased ratio of backdoor triggers, a phenomenon referred as 'trigger amplification', among the generated images. We show that the latter insight can be utilized to improve the existing backdoor detectors for the detection of backdoor-poisoned data points. Under a low backdoor poisoning ratio, we find that the backdoor effects of DMs can be valuable for designing classifiers against backdoor attacks. |
Zhuoshi Pan · Yuguang Yao · Gaowen Liu · Bingquan Shen · H. Vicky Zhao · Ramana Kompella · Sijia Liu 🔗 |
Fri 1:45 p.m. - 2:00 p.m.
|
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
(
Oral
)
>
link
SlidesLive Video Instruction-tuned Large Language Models (LLMs) have demonstrated remarkable abilities to modulate their responses based on human instructions. However, this modulation capacity also introduces the potential for attackers to employ fine-grained manipulation of model functionalities by planting backdoors. In this paper, we introduce Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt “Describe Joe Biden negatively.” for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden. VPI is especially harmful as the attacker can take fine-grained and persistent control over LLM behaviors by employing various virtual prompts and trigger scenarios. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. |
Jun Yan · Vikas Yadav · Shiyang Li · Lichang Chen · Zheng Tang · Hai Wang · Vijay Srinivasan · Xiang Ren · Hongxia Jin 🔗 |
Fri 2:00 p.m. - 2:15 p.m.
|
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
(
Oral
)
>
link
SlidesLive Video Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters. These advantages allow BadChain to be launched against commercial LLMs operated via API-only access and impose low computational overhead since BadChain does not need any model fine-tuning. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger is embedded in the query prompt. In particular, a subset of demonstrations will be manipulated to incorporate the backdoor reasoning step in COT prompting. Consequently, given any query prompt containing the backdoor trigger, the LLM will be misled to output unintended content. Empirically, we show the effectiveness of BadChain against four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) on six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning, compared with the ineffectiveness of the baseline backdoor attacks designed for simpler tasks such as semantic classification. Moreover, we demonstrate the interpretability of BadChain by showing that the relationship between the trigger and the backdoor reasoning step can be well-explained based on the output of the backdoored model. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of effective future defenses. |
Zhen Xiang · Fengqing Jiang · Zidi Xiong · Bhaskar Ramasubramanian · Radha Poovendran · Bo Li 🔗 |
Fri 2:15 p.m. - 2:45 p.m.
|
Decoding Backdoors in LLMs and Their Implications
(
Invited Talk
)
>
SlidesLive Video In the rapidly evolving landscape of artificial intelligence, generative AI has emerged as a powerful and transformative technology with significant potential across various applications, such as medical, financial, and autonomous driving. However, with this immense potential comes the imperative to ensure the safety and trustworthiness of generative models before their large-scale deployment. In particular, as large language models (LLMs) become increasingly prevalent in real-world applications, understanding and mitigating the risks associated with potential backdoors is paramount. This talk will delve into the critical examination of backdoors embedded in LLMs and explore their potential implications on the security and reliability of these models in different applications. Specifically, I will first talk about different strategies for injecting backdoors in LLMs and a series of CoT frameworks. I will then discuss potential defenses against known and unknown backdoors in LLM. I will provide an overview of how to assess, improve, and certify the resilience of LLMs against potential backdoors. |
Bo Li 🔗 |
Fri 2:45 p.m. -
|
PANEL DISCUSSION
(
PANEL DISCUSSION
)
>
SlidesLive Video |
🔗 |