Investigating Abstraction Capabilities of the o3 Model Using Textual and Visual Modalities
Abstract
OpenAI’s o3-preview model exceeded human accuracy on the ARC-AGI benchmark, but does that mean that o3 can recognize and reason with the same abstractions humans use on these tasks? Here we report on a preliminary study addressing this question, using the ConceptARC benchmark. We run o3 in settings that vary input modality, reasoning effort, and whether the model can use external Python tools in solving tasks. In each setting we evaluate performance with respect to (1) output accuracy and (2) correctness of generated rules, providing insight not only into whether the model solves tasks correctly, but also into its ability to infer the intended abstractions as opposed to shallower, less generalizable “shortcuts.” Our results illuminate the effects of these different settings on o3’s ability to solve tasks using the humanlike abstractions the tasks were designed to capture.