Spotlight Poster
Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection
Guowen Zhang · Lue Fan · Chenhang He · Zhen Lei · Zhao-Xiang Zhang · Lei Zhang
East Exhibit Hall A-C #1500
Window-based Voxel Transformers and Sparse Convolutional Neural Networks (SpCNNs) are the most popular architectures for point cloud understanding. While SpCNNs demonstrate high efficiency with linear complexity, Voxel Transformers are advantageous for point cloud modeling owing to their larger receptive fields and dynamic weights. It is thus highly desirable to develop a new architecture that possesses both high modeling capability and computational efficiency. Inspired by recent advances in state space models, we present a Voxel State Space Model, termed Voxel Mamba, to model long-range dependencies with larger receptive fields. In contrast to previous window-based grouping operations, we directly process the entire voxel sequence, avoiding inefficient partitioning. Voxel Mamba achieves voxel interaction across the entire scene and captures point cloud locality through space-filling curves. To address the variation in object sizes, we propose Asymmetrical State Space Models (ASSM) to capture multi-scale context information while maintaining the original resolution. We also introduce a simple yet effective region position embedding strategy to enhance the localization of tokens in long sequences. Voxel Mamba not only achieves higher accuracy than well-established window-based Transformers (e.g., DSVT) and SpCNN-based methods (e.g., PV-RCNN), but also shows significant gains in computational and memory efficiency, as demonstrated on the Waymo Open Dataset and the nuScenes dataset. Codes will be made publicly available.
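The abstract mentions serializing voxels with a space-filling curve so that the entire scene can be processed as one sequence while preserving spatial locality. As a hedged illustration of the general idea (not the authors' implementation; the paper may use a different curve, such as a Hilbert curve), the sketch below orders sparse voxel coordinates by their Morton (Z-order) code, a common space-filling curve, so that spatially nearby voxels tend to stay nearby in the resulting 1D sequence:

```python
import numpy as np

def part1by2(v: np.ndarray) -> np.ndarray:
    # Spread the low 10 bits of each integer so they occupy every third bit
    # (standard bit-interleaving trick for 3D Morton codes).
    v = v.astype(np.uint64) & 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3d(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> np.ndarray:
    # Interleave the bits of (x, y, z) into one Morton code per voxel.
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Toy voxel grid coordinates (x, y, z), e.g. from a voxelized point cloud.
coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [7, 7, 7]])
order = np.argsort(morton3d(coords[:, 0], coords[:, 1], coords[:, 2]))
serialized = coords[order]  # 1D voxel sequence fed to the sequence model
```

Once serialized this way, the whole scene becomes a single token sequence, which is what lets a state space model consume it without window partitioning.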