[Re] Masked Autoencoders Are Small Scale Vision Learners: A Reproduction Under Resource Constraints

Athanasios Charisoudis · Simon Ekman von Huth · Emil Jansson

Great Hall & Hall B1+B2 (level 1) #807
[ ] [ Project Page ]
Tue 12 Dec 8:45 a.m. PST — 10:45 a.m. PST


Scope of Reproducibility — The Masked Autoencoder (MAE) was recently proposed as aframework for efficient self‐supervised pre‐training in Computer Vision [1]. In this pa‐per, we attempt a replication of the MAE under significant computational constraints.Specifically, we target the claim that masking out a large part of the input image yieldsa nontrivial and meaningful self‐supervisory task, which allows training models thatgeneralize well. We also present the Semantic Masked Autoencoder (SMAE), a novel yetsimple extension of MAE which uses perceptual loss to improve encoder embeddings.Methodology — The datasets and backbones we rely on are significantly smaller than thoseused by [1]. Our main experiments are performed on Tiny ImageNet (TIN) [2] and trans‐fer learning is performed on a low‐resolution version of CUB‐200‐2011 [3]. We use aViT‐Lite [4] as backbone. We also compare the MAE to DINO, an alternative frame‐work for self‐supervised learning [5]. The ViT, MAE, as well as perceptual loss wereimplemented from scratch, without consulting the original authors’ code. Our code isavailable at The computational budget for ourreproduction and extension was approximately 150 GPU hours.Results — This paper successfully reproduces the claim that the MAE poses a nontrivialand meaningful self‐supervisory task. We show that models trained with this frame‐work generalize well to new datasets and conclude that the MAE is reproducible withexception for some hyperparameter choices. We also demonstrate that MAE performswell with smaller backbones and datasets. Finally, our results suggest that the SMAEextension improves the downstream classification accuracy of the MAE on CUB (+5 pp)when coupled with an appropriate masking strategy.What was easy — Given prior experience with a deep learning framework, re‐implementingthe paper was relatively straightforward, with sufficient details given in the paper.What was difficult — We faced challenges implementing efficient patch shuffling and tun‐ing hyperparameters. The hyperparameter choices from [1] did not translate well to asmaller dataset and backbone.Communication with original authors — We have not had contact with the original authors.

Chat is not available.