Despite the great success of pre-trained language models (PLMs) in a wide range of natural language processing (NLP) tasks, there is growing concern about their security in real-world applications. A backdoor attack, which poisons a small number of training samples by inserting backdoor triggers, is a typical security threat. Trained on the poisoned dataset, a victim model performs normally on benign samples but predicts the attacker-chosen label on samples containing pre-defined triggers. The vulnerability of PLMs to backdoor attacks has been demonstrated by increasing evidence in the literature. In this paper, we present several simple yet effective training strategies that can defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is the observation that, when trained on a poisoned dataset, a PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, in which the model mainly learns the major features corresponding to the original task rather than the subsidiary features of backdoor triggers, and (2) an overfitting stage, in which both kinds of features are learned adequately. Therefore, if we can properly restrict the PLM's adaptation to the moderate-fitting stage, the model will neglect the backdoor triggers while still achieving satisfactory performance on the original task. To this end, we design three methods that defend against backdoor attacks by reducing the model capacity, the number of training epochs, and the learning rate, respectively. Experimental results demonstrate the effectiveness of our methods against several representative NLP backdoor attacks. We also perform visualization-based analysis to gain a deeper understanding of how the model learns different features, and we explore the effect of the poisoning ratio. Finally, we explore whether our methods can defend against backdoor attacks on pre-trained computer vision (CV) models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.
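To make the three defenses concrete, below is a minimal Python sketch (not the authors' released code; see the repository above for that) of how one might restrict fine-tuning to the moderate-fitting stage. It assumes a standard HuggingFace classification model and a hypothetical PyTorch `train_loader` yielding tokenized batches; freezing all but the classification head is one simple way to restrict capacity, and the epoch count and learning rate are illustrative values rather than the paper's settings.

```python
# Sketch of the three capacity/epoch/learning-rate restrictions from the abstract.
# Assumptions: `train_loader` is a DataLoader of dicts with input_ids,
# attention_mask, and labels; hyperparameters are illustrative only.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# (1) Reduce effective model capacity: freeze the PLM body and tune only the
#     classification head (one simple choice; the paper also considers
#     parameter-efficient tuning).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

# (2) Use a small learning rate and (3) few training epochs, so adaptation
#     stops before the overfitting stage where trigger features are learned.
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
num_epochs = 2

model.train()
for _ in range(num_epochs):
    for batch in train_loader:
        loss = model(**batch).loss  # standard cross-entropy from the model head
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```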
Author Information
Biru Zhu (Tsinghua University)
Yujia Qin (Tsinghua University)
Ganqu Cui (Tsinghua University)
Yangyi Chen (Huazhong University of Science and Technology)
Weilin Zhao (Tsinghua University)
Chong Fu (Zhejiang University)
Yangdong Deng (Tsinghua University)
Zhiyuan Liu (Tsinghua University)
Jingang Wang (Meituan)
Wei Wu (Meituan-Dianping Group)
Maosong Sun (Tsinghua University)
Ming Gu (Tsinghua University)