Intrinsic dimension estimation for Radio Galaxy Zoo using diffusion models
Abstract
In this work we estimate the intrinsic dimension (iD) of subsets of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. Since RGZ is a large unlabelled dataset of radio sources, understanding its underlying geometry through iD can help us design better learning algorithms. We impose some structure on our analysis by examining the iD estimates as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the small labelled dataset of radio galaxies on which the BNN was trained. We find that the iD estimates increase as the radio sources shift further from the most in-distribution sources, which suggests that the more out-of-distribution sources have more unusual and complex morphologies. We also find that the iD estimates for the RGZ dataset are significantly higher than those reported for natural images in the literature. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.