NeurIPS Poster Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective

Poster

Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective

Jimmy Ba · Murat Erdogdu · Taiji Suzuki · Zhichao Wang · Denny Wu

Great Hall & Hall B1+B2 (level 1) #817

[ Abstract ]

[ Paper] [ Poster] [ OpenReview]

Abstract: We consider the learning of a single-index target function

f_{*} : R^{d} \to R

$f_*: \mathbb{R}^d\to\mathbb{R}$ under spiked covariance data:

f_*(\boldsymbol{x}) = \textstyle\sigma_*(\frac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle), ~~ \boldsymbol{x}\overset{\small\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I_d} + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top), ~~ \theta\asymp d^{\beta} \text{ for } \beta\in[0,1),

where the link function

σ_{*} : R \to R

$\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-

p

$p$ polynomial with information exponent

k

$k$ (defined as the lowest degree in the Hermite expansion of

σ_{*}

$\sigma_*$ ), and it depends on the projection of input

x

$\boldsymbol{x}$ onto the spike (signal) direction

μ \in R^{d}

$\boldsymbol{\mu}\in\mathbb{R}^d$ . In the proportional asymptotic limit where the number of training examples

n

$n$ and the dimensionality

d

$d$ jointly diverge:

n, d \to \infty, n / d \to ψ \in (0, \infty)

$n,d\to\infty, n/d\to\psi\in(0,\infty)$ , we ask the following question: how large should the spike magnitude

θ

$\theta$ (i.e., the strength of the low-dimensional component) be, in order for

(i)

$(i)$ kernel methods,

(i i)

$(ii)$ neural networks optimized by gradient descent, to learn

f_{*}

$f_*$ ? We show that for kernel ridge regression,

β \geq 1 - \frac{1}{p}

$\beta\ge 1-\frac{1}{p}$ is both sufficient and necessary. Whereas for two-layer neural networks trained with gradient descent,

β > 1 - \frac{1}{k}

$\beta>1-\frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structures in the data. Further, since

k \leq p

$k\le p$ by definition, neural networks can adapt to such structures more effectively.

Chat is not available.