Sphinx: Visual Perception and Reasoning Gym
Abstract
We introduce \emph{Sphinx}, a synthetic gym for visual perception and reasoning that targets core cognitive primitives, including symmetry, transformation, and sequence induction. It procedurally generates motif- and tiling-based puzzles with verifiable solutions, enabling precise evaluation and scalable dataset creation. \emph{Sphinx} implements twelve tasks across four categories, spanning symmetry detection, visual sequences, spatial reasoning, and transformations. Benchmarking three ChatGPT models and five open-source vision–language models reveals that even the recent GPT-5 achieves only 45.2\% accuracy, with no model exceeding 70\% on any task. A preliminary study further demonstrates that reinforcement learning with verifiable rewards (RLVR) improves model performance, highlighting its potential for advancing multimodal reasoning.
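To make the "procedural generation with verifiable solutions" idea concrete, the following is a minimal, hypothetical sketch of one Sphinx-style task: a symmetry-detection puzzle whose ground-truth answer can be checked programmatically. The function and parameter names (\texttt{generate\_symmetry\_puzzle}, \texttt{verify}) are illustrative assumptions, not the gym's actual API.

\begin{verbatim}
# Hypothetical sketch of a Sphinx-style puzzle generator with a
# verifiable answer; names are illustrative, not the paper's API.
import random

def generate_symmetry_puzzle(size=8, symmetric=None, rng=None):
    """Generate a binary tile grid; the verifiable label is whether
    the grid has a vertical mirror symmetry."""
    rng = rng or random.Random()
    if symmetric is None:
        symmetric = rng.random() < 0.5
    half = [[rng.randint(0, 1) for _ in range(size // 2)]
            for _ in range(size)]
    if symmetric:
        # Mirror the left half to guarantee symmetry.
        grid = [row + row[::-1] for row in half]
    else:
        grid = [row + [rng.randint(0, 1) for _ in range(size // 2)]
                for row in half]
        # Re-sample until the grid is genuinely asymmetric.
        while all(row == row[::-1] for row in grid):
            grid = [[rng.randint(0, 1) for _ in range(size)]
                    for _ in range(size)]
    return grid, symmetric  # (puzzle, verifiable ground-truth answer)

def verify(grid, predicted):
    """Programmatic verifier: exact check of the symmetry property,
    usable as a binary reward signal (e.g., for RLVR)."""
    truth = all(row == row[::-1] for row in grid)
    return predicted == truth

if __name__ == "__main__":
    puzzle, answer = generate_symmetry_puzzle(size=8, symmetric=True)
    assert verify(puzzle, answer)
\end{verbatim}

Because every puzzle carries an exact, machine-checkable answer, the same verifier that scores benchmark accuracy can also serve as the reward function in the RLVR setting described above.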