Towards Visual Simulation in Multimodal Language Models
Catherine Finegan-Dollak
Abstract
Inspired by extensive evidence that humans use visual simulation to understand parts of language, this position paper explores analogous behavior in multimodal language models. Although grounding models in vision has been an active area of research in the AI community, visual simulation remains underexplored. We address this gap by formally defining visual simulation, identifying the architectural and training components needed to enable it, and proposing a multi-pronged approach to determine whether a model engages in visual simulation.