
Workshop: Machine Learning for Audio

Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech

Mohamed Osman · Tamer Nadeem · Ghada Khoriba

presentation: Machine Learning for Audio
Sat 16 Dec 6:20 a.m. PST — 3:30 p.m. PST


Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion detection methods often display biases when applied cross-corpus. To address this, our study combines 16 diverse datasets, yielding 375 hours of data across languages such as English, Chinese, and Japanese. We propose a soft labeling system that captures graded emotional intensities. Using the Whisper encoder and a data augmentation technique inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Validation on 4 multilingual datasets demonstrates notable zero-shot generalization. We further fine-tune on Hume-Prosody and report promising initial results.
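
As a rough illustration of what soft labeling can look like in practice, the sketch below turns several annotator votes for one utterance into a probability distribution and scores model logits against it with cross-entropy. The abstract does not specify how the soft labels are built, so the vote-counting scheme, the four-class label set, and the function names here are all assumptions made for illustration, not the authors' method.

```python
import math

# Hypothetical label set; the datasets in the paper may use different classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes):
    """Convert annotator votes into a probability distribution over EMOTIONS.

    Assumption: each vote counts equally; votes outside EMOTIONS are ignored.
    """
    counts = [votes.count(e) for e in EMOTIONS]
    total = sum(counts)
    if total == 0:
        raise ValueError("no votes for any known emotion")
    return [c / total for c in counts]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Three of four annotators heard "happy", one heard "neutral".
target = soft_label(["happy", "happy", "neutral", "happy"])
loss = soft_cross_entropy([0.1, 2.0, 0.5, -1.0], target)
```

Unlike a one-hot label, the target here keeps the minority "neutral" vote, so a model is penalized less for assigning some probability to a plausible secondary emotion.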
