Skip to yearly menu bar Skip to main content

Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models

Can Visual Scratchpads With Diagrammatic Abstractions Augment LLM Reasoning?

Joy Hsu · Gabriel Poesia · Jiajun Wu · Noah Goodman


When humans reason about complex text-based questions, we leverage diagrammatic abstractions drawn on a visual scratchpad. In this paper, we introduce and explore the capabilities of Visual-Scratchpad, a method that augments a large language foundation model (LLM) with diagrammatic execution and readout. We enable the LLM to generate drawing commands and to readout abstractions from the resulting picture. The visual readout operation uses a visual foundation model, optionally finetuned with expert iteration. Here, we show that although Visual-Scratchpad outperforms an inference-only LLM, it surprisingly yields worse performance compared to a single finetuned LLM. Through experiments, we propose that this gap is due to the failure mode of vision foundation models in understanding abstractions in diagrams.

Chat is not available.