Abstract:
We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner, one data point after the other. More specifically, for a true parameter ฮธโฮธโ, we consider the corrupted Gaussian linear model $y = + \varepsilon + bwheretheadversarialnoisewheretheadversarialnoisebcantakeanyvaluewithprobabilitycantakeanyvaluewithprobability\etaandequalszerootherwise.Weconsiderthisadversarytobeoblivious(i.e.,andequalszerootherwise.Weconsiderthisadversarytobeoblivious(i.e.,bindependentofthedata)sincethisistheonlycontaminationmodelunderwhichconsistencyispossible.Currentalgorithmsrelyonhavingthewholedataathandinordertoidentifyandremovetheoutliers.Incontrast,weshowinthisworkthatstochasticgradientdescentonthel1lossconvergestothetrueparametervectorataindependentofthedata)sincethisistheonlycontaminationmodelunderwhichconsistencyispossible.Currentalgorithmsrelyonhavingthewholedataathandinordertoidentifyandremovetheoutliers.Incontrast,weshowinthisworkthatstochasticgradientdescentonthel1lossconvergestothetrueparametervectorata\tilde{O}( 1 / (1 - \eta)^2 n )$ rate which is independent of the values of the contaminated measurements. Our proof relies on the elegant smoothing of the l1 loss by the Gaussian data and a classical non-asymptotic analysis of Polyak-Ruppert averaged SGD. In addition, we provide experimental evidence of the efficiency of this simple and highly scalable algorithm.
Chat is not available.