

Poster

Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective

Jimmy Ba · Murat Erdogdu · Taiji Suzuki · Zhichao Wang · Denny Wu

Great Hall & Hall B1+B2 (level 1) #817

Abstract: We consider the learning of a single-index target function $f_*:\mathbb{R}^d\to\mathbb{R}$ under spiked covariance data:
$$f_*(\boldsymbol{x}) = \sigma_*\!\Big(\tfrac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle\Big), \qquad \boldsymbol{x}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top), \qquad \theta\asymp d^{\beta} \text{ for } \beta\in[0,1),$$
where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and the target depends on the projection of the input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge, $n,d\to\infty$, $n/d\to\psi\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be in order for (i) kernel methods and (ii) neural networks optimized by gradient descent to learn $f_*$? We show that for kernel ridge regression, $\beta\ge 1-p^{-1}$ is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, $\beta>1-k^{-1}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structure in the data. Furthermore, since $k\le p$ by definition, neural networks can adapt to such structure more effectively.
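The data model above is straightforward to simulate. The following is a minimal sketch, not from the paper: it samples inputs with covariance $\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top$ and evaluates a single-index target. The dimensions, the ratio `psi`, and the choice of link function (the degree-3 Hermite polynomial, so $p = k = 3$) are illustrative assumptions.

```python
import numpy as np

# Illustrative (hypothetical) finite-size values; the paper's setting is the
# joint limit n, d -> infinity with n/d -> psi in (0, infinity).
d = 512                  # input dimension
beta = 0.5               # spike exponent, beta in [0, 1)
theta = d ** beta        # spike magnitude, theta ≍ d^beta
psi = 2.0                # sample-to-dimension ratio n/d
n = int(psi * d)

rng = np.random.default_rng(0)

# Spike (signal) direction mu on the unit sphere.
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)

# Sample x ~ N(0, I_d + theta * mu mu^T) via x = z + c <z, mu> mu with
# z ~ N(0, I_d) and c = sqrt(1 + theta) - 1, which gives
# Cov(x) = I_d + (2c + c^2) mu mu^T = I_d + theta * mu mu^T.
Z = rng.standard_normal((n, d))
c = np.sqrt(1.0 + theta) - 1.0
X = Z + c * np.outer(Z @ mu, mu)

# The normalized projection <x, mu> / sqrt(1 + theta) is standard Gaussian.
s = (X @ mu) / np.sqrt(1.0 + theta)

# Example link function (an assumption, not specified by the paper): the
# degree-3 Hermite polynomial He_3(z) = z^3 - 3z, so p = 3 and k = 3.
y = s ** 3 - 3.0 * s
```

With this choice of $\sigma_*$, the kernel threshold reads $\beta \ge 1 - 1/3 \approx 0.67$ and the gradient-descent threshold $\beta > 1 - 1/3$; a link with a low-order Hermite component (e.g., $k = 1 < p$) would separate the two rates.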
