

Poster in Workshop: Safe Generative AI

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

Leo McKee-Reid ⋅ Christoph Sträter ⋅ Maria Martinez ⋅ Joe Needham ⋅ Mikita Balesni

Abstract
