Text-to-Scene with Large Reasoning Models
Abstract
Prompt-driven scene synthesis enables the creation of complete 3D environments directly from textual descriptions. Existing text-to-scene approaches often falter on complex geometries, object transformations, and adherence to compositional instructions. We introduce Reason-3D, a multimodal reasoning framework for text-to-scene generation that builds on large reasoning models (LRMs). Reason-3D retrieves objects grounded in their physical, functional, and contextual attributes, and arranges them according to both implicit and explicit spatial constraints. A collision-aware refinement stage further ensures geometric plausibility. Evaluated across a spectrum of instructions, Reason-3D substantially improves visual fidelity, constraint satisfaction, and asset retrieval over prior baselines. Beyond 3D synthesis, our work highlights the potential of LRMs as general-purpose engines for multimodal algorithmic reasoning in spatial and physical domains. We release our codebase to support further research on object retrieval and placement with LRMs.