Scene Understanding via Scene Representation Generation with Vision-Language Models
Yuan Chen · Peng Shi
Abstract
Understanding complex environments requires capturing the arrangement of objects, their interactions, and contextual information. Early symbolic and data-driven approaches are limited by rigid designs or narrow applicability. Recent vision-language models (VLMs) provide rich priors and flexible reasoning, supporting the generation of structured scene descriptions that handle compositional arrangements, diverse categories, and realistic constraints. However, challenges remain in precise spatial reasoning, consistent object placement, and coherent geometry. We present a VLM-driven pipeline for scene representation generation, analyze its shortcomings through a case study, and suggest avenues for future enhancements.
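To make the idea of VLM-generated structured scene descriptions concrete, the sketch below illustrates one possible setup (not the authors' pipeline): a prompt asks a VLM to return a JSON scene representation, and the output is parsed into typed objects with positions and spatial relations. The function `query_vlm`, the prompt wording, the `SceneObject` schema, and the metre-based coordinates are all assumptions made for illustration.

```python
"""Illustrative sketch: prompting a VLM for a JSON scene representation
and validating the result. `query_vlm` is a hypothetical placeholder,
not a real API."""

import json
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    name: str              # object category, e.g. "table"
    position: List[float]  # assumed [x, y, z] in metres
    relations: List[str]   # spatial relations to other objects


SCENE_PROMPT = (
    "Describe the scene in the image as JSON with an 'objects' list. "
    "Each object needs 'name', 'position' ([x, y, z] in metres), and "
    "'relations' (spatial relations to other objects)."
)


def query_vlm(image_path: str, prompt: str) -> str:
    # Placeholder: substitute an actual VLM call here.
    # Returns a canned response so the sketch runs end to end.
    return json.dumps({
        "objects": [
            {"name": "table", "position": [0.0, 0.0, 0.4],
             "relations": ["standing on floor"]},
            {"name": "lamp", "position": [0.1, 0.0, 0.9],
             "relations": ["on top of table"]},
        ]
    })


def parse_scene(raw: str) -> List[SceneObject]:
    """Parse and lightly validate VLM output; malformed entries are skipped."""
    objects = []
    for entry in json.loads(raw).get("objects", []):
        try:
            objects.append(SceneObject(
                name=str(entry["name"]),
                position=[float(v) for v in entry["position"]],
                relations=[str(r) for r in entry.get("relations", [])],
            ))
        except (KeyError, TypeError, ValueError):
            continue  # drop entries with missing or malformed fields
    return objects


if __name__ == "__main__":
    scene = parse_scene(query_vlm("kitchen.jpg", SCENE_PROMPT))
    for obj in scene:
        print(obj)
```

Validating the generated JSON against a fixed schema is one simple way to catch the placement and geometry inconsistencies the abstract mentions before the representation is used downstream.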