
Workshop: Machine Learning for Audio

Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech

Mohamed Osman · Tamer Nadeem · Ghada Khoriba


Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion recognition methods often exhibit biases when applied across corpora. To address this, our study combines 16 diverse datasets, yielding 375 hours of data across languages including English, Chinese, and Japanese. We propose a soft labeling system that captures graded emotional intensities. Using the Whisper encoder and a data augmentation scheme inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Our validation on 4 multilingual datasets demonstrates notable zero-shot generalization. We further fine-tune on Hume-Prosody and report promising initial results.
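The abstract does not specify how the soft labels are constructed. As a rough illustration of the general idea only, the sketch below assumes each utterance carries several annotator votes, converts them into a probability distribution, and scores model logits against that distribution with cross-entropy. The label set, function names, and vote format are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical label set, chosen only for illustration.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, smoothing=0.0):
    """Turn per-annotator votes into a probability distribution over EMOTIONS.

    Votes outside EMOTIONS are ignored. `smoothing` optionally mixes in a
    uniform distribution (an assumption, not something the abstract states).
    """
    counts = [votes.count(e) for e in EMOTIONS]
    total = sum(counts)
    if total == 0:
        raise ValueError("no valid votes")
    k = len(EMOTIONS)
    return [(1 - smoothing) * c / total + smoothing / k for c in counts]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


# Three of four annotators heard "happy", one heard "neutral".
target = soft_label(["happy", "happy", "neutral", "happy"])
loss = soft_cross_entropy([0.1, 2.0, 0.5, -1.0], target)
```

Compared with a one-hot target, the distribution keeps the minority "neutral" vote as a weaker signal instead of discarding it, which is one plausible way to represent graded intensity.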
