NeurIPS Poster Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

Poster

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

Chulhee Yun · Suvrit Sra · Ali Jadbabaie

East Exhibition Hall B, C #233

Keywords: [ Learning Theory ] [ Theory ] [ Optimization for Deep Networks; Theory ] [ Deep Learning ]

[ Abstract ]

Abstract: We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require

N

$N$ hidden nodes to memorize/interpolate arbitrary

N

$N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with

Ω (\sqrt{N})

$\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with

N

$N$ points. We also prove that width

Θ (\sqrt{N})

$\Theta(\sqrt{N})$ is necessary and sufficient for memorizing

N

$N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an

L

$L$ -layer network with

W

$W$ parameters in the hidden layers can memorize

N

$N$ data points if

W = Ω (N)

$W = \Omega(N)$ . Combined with a recent upper bound

O (W L log W)

$O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed

L

$L$ . Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of

N

$N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.

Live content is unavailable. Log in and register to view live content