

Poster

Large Scene Model: Real-time Unposed Images to Semantic 3D

Zhiwen Fan · Jian Zhang · Wenyan Cong · Peihao Wang · Renjie Li · Kairun Wen · Shijie Zhou · Achuta Kadambi · Zhangyang "Atlas" Wang · Danfei Xu · Boris Ivanovic · Marco Pavone · Yue Wang

West Ballroom A-D #7205
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

A classical problem in computer vision is to reconstruct and understand the 3D structure of a scene from a limited number of images, accurately recovering its geometry, appearance, and semantics. Traditional approaches decompose this objective into multiple subtasks, each involving complicated mappings among different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) must first transform a set of images into key points and camera parameters before estimating structure, and 3D understanding relies on this lengthy reconstruction pipeline before feeding data into data- and task-specific neural networks. This paradigm incurs extensive processing time and engineering effort for each scene. In this work, we introduce the Large Scene Model (LSM), which directly processes unposed RGB images into semantic radiance fields. LSM simultaneously infers geometry, appearance, and semantics within a scene and synthesizes versatile label maps at novel views, all in a single feed-forward pass. To represent the scene, we employ a generic Transformer-based framework that integrates global geometry through pixel-aligned point maps. To improve the accuracy of scene-attribute regression, we adopt local context aggregation with multi-scale fusion. To address the scarcity of labeled 3D semantic data and enable scene manipulation via natural language, we lift a well-trained 2D model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of anisotropic Gaussians, enabling the rendering of scene attributes at novel views. Supervised end-to-end training and comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks, achieving real-time reconstruction and rendering while outperforming state-of-the-art baselines.
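To make the feed-forward flow concrete, here is a minimal PyTorch sketch of the kind of architecture the abstract describes: a shared Transformer attends jointly over unposed views, and per-pixel heads regress point maps, Gaussian attributes, and a distilled semantic feature field. All module names, head dimensions, and layer counts below are hypothetical illustrations chosen for brevity, not the authors' implementation; the Gaussian rasterizer itself is omitted.

```python
# Hypothetical sketch of LSM's feed-forward flow (not the authors' code):
# unposed RGB views -> joint Transformer -> per-pixel point map,
# anisotropic Gaussian parameters, and semantic features.
import torch
import torch.nn as nn

class LSMSketch(nn.Module):
    def __init__(self, dim=256, sem_dim=64, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Upsample patch tokens back to pixel resolution for per-pixel heads.
        self.up = nn.Upsample(scale_factor=patch, mode="bilinear",
                              align_corners=False)
        self.point_head = nn.Conv2d(dim, 3, 1)      # pixel-aligned 3D point map
        self.gauss_head = nn.Conv2d(dim, 11, 1)     # opacity(1)+scale(3)+rot(4)+rgb(3)
        self.sem_head = nn.Conv2d(dim, sem_dim, 1)  # distilled semantic features

    def forward(self, views):
        # views: (B, V, 3, H, W) unposed RGB images.
        B, V, _, H, W = views.shape
        x = self.embed(views.flatten(0, 1))          # (B*V, dim, H/p, W/p)
        d, h, w = x.shape[1], x.shape[2], x.shape[3]
        tokens = x.flatten(2).transpose(1, 2)        # (B*V, h*w, dim)
        # Attention across all views' tokens integrates global geometry.
        tokens = self.backbone(tokens.reshape(B, V * h * w, d))
        feat = tokens.reshape(B * V, h, w, d).permute(0, 3, 1, 2)
        feat = self.up(feat)                         # (B*V, dim, H, W)
        return {
            "points": self.point_head(feat),         # per-pixel 3D points
            "gaussians": self.gauss_head(feat),      # per-pixel Gaussian params
            "semantics": self.sem_head(feat),        # per-pixel feature field
        }

model = LSMSketch()
out = model(torch.randn(1, 2, 3, 64, 64))            # two unposed views
print({k: tuple(v.shape) for k, v in out.items()})
```

In this reading, every pixel contributes one anisotropic Gaussian anchored at its predicted 3D point, so a single forward pass yields a renderable scene; novel-view RGB and label maps would then come from splatting these Gaussians and decoding the semantic features, which is what permits real-time reconstruction and rendering without a per-scene optimization loop.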
