NeurIPS Universal jailbreak backdoors from poisoned human feedback

Invited Talk in Workshop: Backdoors in Deep Learning: The Good, the Bad, and the Ugly

Universal jailbreak backdoors from poisoned human feedback

Florian Tramer

Abstract:

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. We consider the problem of poisoning the RLHF data to embed a backdoor trigger into the model. The trigger should act like a universal "sudo" command, enabling arbitrary harmful responses at test time. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

