
Workshop: Machine Learning for Audio

Self-Supervised Speech Enhancement using Multi-Modal Data

Yu-Lin Wei · Rajalaxmi Rajagopalan · Bashima Islam · Romit Roy Choudhury


We consider the problem of speech enhancement in earphones. While microphones are the classical speech sensors, motion sensors embedded in modern earphones also pick up faint components of the user’s speech. This faint motion data has generally been ignored, but we show that it can serve as a pathway for self-supervised speech enhancement. Our proposed model is an iterative framework in which the motion data offers a hint to the microphone (in the form of an estimated posterior); the microphone SNR improves from the hint, which in turn helps the motion data refine its next hint. Results show that this alternating self-supervision converges even in the presence of strong ambient noise, and that its performance is comparable to that of supervised denoisers. When only a small amount of training data is available, our model outperforms the same denoisers.
