Neural Universal Scene Descriptors
Abstract
Although recent progress in generative modeling has produced models capable of generating high-quality images conditioned on multiple modalities, there exists no common, portable representation format for specifying conditioning signals. Instead, conditioning techniques are usually tailor-made for specific model architectures and limit the user to a small set of control signals. Moreover, common approaches are not object-centric: the user cannot control individual objects in the image, and changing the conditioning signal leads to global, rather than local, changes. In contrast, the computer graphics community has developed standards such as Universal Scene Description (USD), which represents scenes and objects in a structured, hierarchical manner. Inspired by USD, we propose the “Neural Universal Scene Descriptor” (Neural USD), a flexible conditioning structure that accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach that ensures disentangled control signals and evaluate key design considerations for a universal conditioning format, demonstrating how Neural USD enables iterative and incremental workflows.