Algorithmic Oversight for Deceptive Reasoning
Abstract
This paper examines the detection of deliberately induced (adversarial) reasoning errors in large language models (LLMs) in the context of mathematical reasoning. Our research proceeds in two steps. First, we develop strategies to induce reasoning errors in LLMs, revealing vulnerabilities even in strong models. Second, we introduce defense mechanisms, including structured prompting, fine-tuning, and grey-box access, that significantly improve detection accuracy. We also present ProbShift, a novel algorithm that uses token probabilities to improve deceptive reasoning detection and that outperforms GPT-3.5 when coupled with LLM-based oversight. Our findings underscore the importance and effectiveness of algorithmic oversight mechanisms for LLMs in complex reasoning tasks.
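The abstract states only that ProbShift relies on token probabilities obtained via grey-box access; the algorithm itself is not described here. The following is a minimal, hypothetical sketch of one way such a detector could work, flagging reasoning steps whose token log-probabilities drop sharply relative to the rest of the solution. All names (Step, prob_shift_scores, flag_suspicious, threshold) and the scoring rule are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch: flag reasoning steps whose token log-probabilities
# (from grey-box access to the overseen model) are unusually low.
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Step:
    text: str                    # one reasoning step, e.g. one line of a solution
    token_logprobs: List[float]  # per-token log-probabilities for that step


def prob_shift_scores(steps: List[Step]) -> List[float]:
    """Score each step by how far its mean token log-prob falls below the
    mean over all steps, in standard-deviation units (higher = more suspicious)."""
    step_means = [mean(s.token_logprobs) for s in steps]
    mu, sigma = mean(step_means), pstdev(step_means) or 1.0
    return [(mu - m) / sigma for m in step_means]


def flag_suspicious(steps: List[Step], threshold: float = 1.5) -> List[int]:
    """Return indices of steps whose suspicion score exceeds the threshold."""
    return [i for i, s in enumerate(prob_shift_scores(steps)) if s > threshold]


if __name__ == "__main__":
    # Toy example with made-up log-probs: the third step is noticeably less likely.
    steps = [
        Step("2 + 3 = 5", [-0.10, -0.20, -0.15]),
        Step("5 * 4 = 20", [-0.12, -0.18, -0.20]),
        Step("20 - 7 = 12", [-2.30, -1.90, -2.70]),  # deliberate arithmetic error
        Step("So the answer is 12.", [-0.30, -0.25]),
    ]
    print(flag_suspicious(steps))  # expected: [2]
```

In such a scheme, the probability-based score could be passed to an LLM overseer as an additional signal, which is one reading of "coupled with LLM-based oversight" in the abstract.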