Convex Neural Networks for Robust ASR Language Detection
Abstract
Globalization and multiculturalism have produced diverse dialects, such as Singaporean-accented English and regional Mandarin speech. These variants often remain under-represented even in high-resource language data. As a result, spoken dialogue systems misidentify the user's input language in up to 49.33\% of cases \cite{goh2016anatomy}, reducing response accuracy regardless of language model capability. We propose a robust ASR framework that handles dialectal variance with reduced computational overhead and lightweight training costs. Our Convex Language Detection (CLD) framework integrates a convex neural network whose training objective can be solved to global optimality in polynomial time; we solve it efficiently with ADMM implemented in JAX, maintaining sub-500 ms inference latency. CLD offers strong convergence guarantees, stability across runs, and reduced sample complexity. As a motivating case study, CLD significantly improves transcription accuracy on mixed-dialect inputs when integrated with Whisper encoders. These results enable more inclusive multilingual interaction and point to principled roles for statistical generalization and optimization theory in spoken dialogue systems.
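To make the ADMM component concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it applies standard ADMM updates to a generic lasso-style convex objective in NumPy (the paper's solver targets CLD's convex network objective and is implemented in JAX). All variable names (`A`, `b`, `lam`, `rho`) are generic placeholders, not quantities defined in the paper.

```python
import numpy as np

def soft_threshold(v, k):
    # Proximal operator of the l1 norm: shrink each entry toward zero by k.
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """Solve min_x 0.5*||Ax - b||^2 + lam*||x||_1 via ADMM.

    Illustrative stand-in for a convex training step; not the CLD
    objective from the paper.
    """
    m, n = A.shape
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    # Cache the Cholesky factor used by every x-update.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(iters):
        # x-update: ridge-regularized least squares via the cached factor.
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: proximal step on the l1 term.
        z = soft_threshold(x + u, lam / rho)
        # Dual ascent: accumulate the consensus residual x - z.
        u = u + x - z
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    x_true = np.zeros(20)
    x_true[:3] = [1.0, -2.0, 0.5]
    b = A @ x_true + 0.01 * rng.standard_normal(50)
    x_hat = admm_lasso(A, b, lam=0.1)
    print(np.round(x_hat[:5], 2))
```

The same alternating structure carries over to a JAX implementation, where the per-iteration updates can be jit-compiled; each subproblem has a closed-form solution, which is what keeps per-step cost low enough for the latency budget stated above.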