TDC 2023 (LLM Edition): The Trojan Detection Challenge
Mantas Mazeika · Andy Zou · Norman Mu · Long Phan · Zifan Wang · Chunru Yu · Adam Khoja · Fengqing Jiang · Aidan O'Gara · Zhen Xiang · Arezoo Rajabi · Dan Hendrycks · Radha Poovendran · Bo Li · David Forsyth

Sat Dec 16 11:30 AM -- 02:30 PM (PST) @ Room 357

The Trojan Detection Challenge (LLM Edition) aims to advance the understanding and development of methods for detecting hidden functionality in large language models (LLMs). The competition features two main tracks: the Trojan Detection Track and the Red Teaming Track. In the Trojan Detection Track, participants are given a large language model containing thousands of trojans and tasked with discovering the triggers for these trojans. In the Red Teaming Track, participants are challenged to elicit specific undesirable behaviors from a large language model fine-tuned to avoid those behaviors. TDC 2023 will include Base Model and Large Model subtracks to enable broader participation, and established trojan detection and red teaming baselines will be provided as a starting point. By uniting trojan detection and red teaming, TDC 2023 aims to foster collaboration between these communities to promote research on hidden functionality in LLMs and enhance the robustness and security of AI systems.

Author Information

Mantas Mazeika (University of Illinois Urbana-Champaign)
Andy Zou (CMU)
Norman Mu (UC Berkeley)
Long Phan (Center for AI Safety)
Zifan Wang (Center for AI Safety)
Chunru Yu (UIUC)
Adam Khoja (UC Berkeley)
Fengqing Jiang (University of Washington)
Aidan O'Gara (USC)
Zhen Xiang (University of Illinois Urbana-Champaign)
Arezoo Rajabi (UW)
Dan Hendrycks (Center for AI Safety)
Radha Poovendran (University of Washington, Seattle)
Bo Li (UChicago/UIUC)
David Forsyth (University of Illinois at Urbana-Champaign)
