Poster in Workshop: Safe Generative AI

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

Leo McKee-Reid · Christoph Sträter · Maria Martinez · Joe Needham · Mikita Balesni

Abstract
