SongMAE: Fine-Grained Syllable Discovery in Birdsong Using Asymmetric Patches
Abstract
Self-supervised bioacoustic encoders have been used for species classification but have not yet addressed syllable-level structure in birdsong. We introduce SongMAE, a compact MAE-ViT (Masked Autoencoder Vision Transformer) that operates on mel spectrograms at 2 ms temporal resolution. SongMAE is pre-trained with masked spectrogram reconstruction on diverse bioacoustic recordings. Despite a 14M-parameter footprint and a 2 s context window, its embeddings cluster some canary, zebra finch, and Bengalese finch syllable types, yielding syllable-separable latent spaces and indicating the possibility of zero-shot, syllable-level analysis suitable for on-device ecological monitoring. We discuss limitations, notably the lack of quantitative benchmarks and baseline comparisons, and outline directions toward unsupervised analysis of song components with a general pretrained model: patch aspect-ratio ablations, larger-scale pretraining, and multi-resolution training.
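
To make the asymmetric-patch idea concrete, the sketch below shows one way such a patch embedding could be set up. It is illustrative only: the abstract does not specify the mel configuration, patch shape, or embedding dimension, so the values used here (128 mel bins, 2 ms frames, 16x1 frequency-by-time patches, 192-dim tokens) are assumptions, not the authors' implementation.

    # Illustrative sketch, not SongMAE's actual code. Assumed values: 128 mel
    # bins, 2 ms frames, 16x1 (frequency x time) patches, 192-dim tokens.
    import torch
    import torch.nn as nn

    class AsymmetricPatchEmbed(nn.Module):
        """Split a mel spectrogram into tall, narrow patches (many frequency
        bins, a single 2 ms frame) so each token keeps full temporal resolution."""
        def __init__(self, patch_freq=16, patch_time=1, embed_dim=192):
            super().__init__()
            # Standard ViT patchify: a Conv2d whose kernel and stride equal the patch size.
            self.proj = nn.Conv2d(1, embed_dim,
                                  kernel_size=(patch_freq, patch_time),
                                  stride=(patch_freq, patch_time))

        def forward(self, spec):
            # spec: (batch, 1, n_mels, n_frames)
            x = self.proj(spec)                  # (batch, embed_dim, n_mels/16, n_frames)
            return x.flatten(2).transpose(1, 2)  # (batch, n_tokens, embed_dim)

    # Example: a 2 s clip at 2 ms per frame -> 1000 frames, 128 mel bins.
    spec = torch.randn(4, 1, 128, 1000)
    tokens = AsymmetricPatchEmbed()(spec)        # shape (4, 8 * 1000, 192)
    # During MAE pretraining, most tokens would be masked and only the visible
    # subset passed to the encoder, which keeps the small model tractable.
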