NeurIPS Investigating the Effectiveness of Self-critiquing in LLMs solving Planning Tasks

Oral in Workshop: Foundation Models for Decision Making

Karthik Valmeekam · Matthew Marquez · Subbarao Kambhampati

Abstract:

There have been widespread claims about Large Language Models (LLMs) being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of large language models in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that LLMs, when used as verifiers, produce a notable number of false positives, compromising system reliability. Additionally, self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers. The nature of feedback, whether binary or detailed, showed minimal impact on plan generation. Collectively, our results cast doubt on the effectiveness of LLMs as verifiers in an iterative, self-critiquing framework for planning tasks.
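
The setup the abstract describes, a generator whose candidate plans are checked by a verifier and re-prompted with its feedback, can be illustrated with a small sketch. This is not the authors' code: the toy problem, the candidate plans, and every name below (`backprompt_loop`, `false_positive_rate`, `sound_verifier`, `lenient_verifier`, `make_generator`) are hypothetical, and the stub functions stand in for the LLM calls. It only shows why a verifier that accepts invalid plans can end the loop early with a wrong answer, and how a false-positive rate against ground truth could be computed.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
Verdict = Tuple[bool, str]  # (accepted?, feedback text)

# Hypothetical toy instance: the only valid plan for this made-up problem.
VALID_PLAN: Plan = ["pickup A", "stack A B"]

# Hypothetical candidate plans that a generator might propose, in order.
CANDIDATES: List[Plan] = [
    ["stack A B"],
    ["pickup B", "stack A B"],
    ["pickup A", "stack A B"],
]


def make_generator(candidates: List[Plan]) -> Callable[[Optional[str]], Plan]:
    """Return a stand-in generator that proposes the candidates in order."""
    it = iter(candidates)

    def generate(feedback: Optional[str]) -> Plan:
        # A real generator would condition on the feedback; this stub ignores it.
        return next(it)

    return generate


def sound_verifier(plan: Plan) -> Verdict:
    """Ground-truth check for the toy problem: accept only the valid plan."""
    ok = plan == VALID_PLAN
    return ok, ("valid" if ok else "invalid")


def lenient_verifier(plan: Plan) -> Verdict:
    """Stand-in for an unreliable verifier: accepts any two-step plan ending in the goal action."""
    ok = len(plan) == 2 and plan[-1] == "stack A B"
    return ok, ("valid" if ok else "invalid")


def backprompt_loop(generate, verify, max_rounds: int) -> Tuple[Optional[Plan], int]:
    """Generate, verify, and re-prompt until the verifier accepts or the budget runs out."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        plan = generate(feedback)
        ok, feedback = verify(plan)
        if ok:
            return plan, round_no
    return None, max_rounds


def false_positive_rate(verifier, ground_truth, plans: List[Plan]) -> float:
    """Fraction of truly invalid plans that the verifier nevertheless accepts."""
    invalid = [p for p in plans if not ground_truth(p)[0]]
    if not invalid:
        return 0.0
    return sum(1 for p in invalid if verifier(p)[0]) / len(invalid)


rounds = len(CANDIDATES)
print(backprompt_loop(make_generator(CANDIDATES), sound_verifier, rounds))
print(backprompt_loop(make_generator(CANDIDATES), lenient_verifier, rounds))
print(false_positive_rate(lenient_verifier, sound_verifier, CANDIDATES))
```

In this toy run the loop with the sound verifier stops only at the valid plan, while the loop with the lenient verifier stops one round earlier on an invalid plan, which is the failure mode the abstract attributes to false positives.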
