MCMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao · Teli Ma · Hongsheng Li · Ziyi Lin · Jifeng Dai · Yu Qiao

Hall J #628

Keywords: [ Vision transformer ] [ Masked auto-encoders ] [ Convolution Neural Networks ]

[ Abstract ]
[ Paper [ Slides [ Poster [ OpenReview
Wed 30 Nov 9 a.m. PST — 11 a.m. PST


Vision Transformers (ViT) become widely-adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation. In this paper, our MCMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme. However, directly using the original masking strategy leads to the heavy computational cost and pretraining-finetuning discrepancy. To tackle the issue, we adopt the masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to more directly supervise the multi-scale features of the encoder to boost multi-scale features. Based on our pretrained MCMAE models, MCMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, MCMAE-Base finetuned for only 25 epochs surpasses MAE-Base fined-tuned for 100 epochs by 2.9% box AP and 2.2% mask AP respectively. Code and pretrained models are available at \url{}.

Chat is not available.