Algorithmic Oversight for Deceptive Reasoning
Abstract
This paper examines the detection of deliberately induced (adversarial) reasoning errors in large language models (LLMs) in the context of mathematical reasoning. Our research proceeds in two steps. First, we develop strategies to induce reasoning errors in LLMs, revealing vulnerabilities even in strong models. Second, we introduce defense mechanisms, including structured prompting, fine-tuning, and grey-box access, that significantly improve detection accuracy. We also present ProbShift, a novel algorithm that uses token probabilities to improve deceptive reasoning detection and that outperforms GPT-3.5 when coupled with LLM-based oversight. Our findings underscore the importance and effectiveness of algorithmic oversight mechanisms for LLMs in complex reasoning tasks.
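The abstract states only that ProbShift relies on token probabilities obtained via grey-box access; the algorithm itself is not described here. The following is a minimal, hypothetical sketch of one way such a detector could work, flagging reasoning steps whose token log-probabilities drop sharply relative to the rest of the solution. All names (Step, prob_shift_scores, flag_suspicious, threshold) and the scoring rule are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch: flag reasoning steps whose token log-probabilities
# (from grey-box access to the overseen model) are unusually low.
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Step:
    text: str                    # one reasoning step, e.g. one line of a solution
    token_logprobs: List[float]  # per-token log-probabilities for that step


def prob_shift_scores(steps: List[Step]) -> List[float]:
    """Score each step by how far its mean token log-prob falls below the
    mean over all steps, in standard-deviation units (higher = more suspicious)."""
    step_means = [mean(s.token_logprobs) for s in steps]
    mu, sigma = mean(step_means), pstdev(step_means) or 1.0
    return [(mu - m) / sigma for m in step_means]


def flag_suspicious(steps: List[Step], threshold: float = 1.5) -> List[int]:
    """Return indices of steps whose suspicion score exceeds the threshold."""
    return [i for i, s in enumerate(prob_shift_scores(steps)) if s > threshold]


if __name__ == "__main__":
    # Toy example with made-up log-probs: the third step is noticeably less likely.
    steps = [
        Step("2 + 3 = 5", [-0.10, -0.20, -0.15]),
        Step("5 * 4 = 20", [-0.12, -0.18, -0.20]),
        Step("20 - 7 = 12", [-2.30, -1.90, -2.70]),  # deliberate arithmetic error
        Step("So the answer is 12.", [-0.30, -0.25]),
    ]
    print(flag_suspicious(steps))  # expected: [2]
```

In such a scheme, the probability-based score could be passed to an LLM overseer as an additional signal, which is one reading of "coupled with LLM-based oversight" in the abstract.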