
Workshop: Machine Learning for Audio

Benchmarks and deep learning models for localizing rodent vocalizations in social interactions

Ralph Peterson · Aramis Tanelus · Aman Choudhri · Violet Ivan · Aaditya Prasad · David Schneider · Dan Sanes · Alex Williams


Social animals congregate in groups and vocalize to communicate. To study the dynamics of vocal communication and their neural basis, ethologists and neuroscientists have developed a multitude of approaches to attribute vocal calls to individual animals within an interacting social group. Invasive surgical procedures, such as affixing custom-built miniature sensors to each animal, are often needed to obtain precise measurements of which individual is vocalizing. In addition to being labor intensive and species specific, these surgeries are often not tractable in very small or young animals and may alter an animal’s natural behavioral repertoire. Thus, there is considerable interest in developing non-invasive sound source localization and vocal call attribution methods that work off-the-shelf in typical laboratory settings. To advance these aims in the domain of rodent neuroscience, we acquired synchronized video and multi-channel audio recordings with >300,000 annotated sound sources in small reverberant environments, and publicly release them as benchmarks. We then trained deep neural networks to localize and attribute vocal calls. This approach outperformed current protocols in the field, achieving ~5 mm accuracy on speaker-emitted sounds. Further, deep network ensembles produced well-calibrated estimates of uncertainty for each prediction. However, network performance was not robust to distributional shifts in the data, highlighting limitations and open challenges for future work.
