We investigate the sample complexity of networks with bounds on the magnitude of its weights. In particular, we consider the class [ \cn = \left{Wt\circ\rho\circ W{t1}\circ\rho\ldots\circ \rho\circ W{1} : W1,\ldots,W{t1}\in M{d\times d}, Wt\in M{1,d} \right} ] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\rho$ is the sigmoid function $\frac{e^x}{1 + e^x}$ or the smoothened ReLU function $ \ln\left(1 + e^x\right)$. We show that for any depth $t$, if the inputs are in $[1,1]^d$, the sample complexity of $\cn$ is $\tilde O\left(\frac{dR^2}{\epsilon^2}\right)$. This bound is optimal up to logfactors, and substantially improves over the previous state of the art of $\tilde O\left(\frac{d^2R^2}{\epsilon^2}\right)$, that was established in a recent line of work.
We furthermore show that this bound remains valid if instead of considering the magnitude of the $Wi$'s, we consider the magnitude of $Wi  Wi^0$, where $Wi^0$ are some reference matrices, with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices in the onset of the training process, we get sample complexity bounds that are sublinear in the number of parameters, in many {\em typical} regimes of parameters.
To establish our results we develop a new technique to analyze the sample complexity of families $\ch$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f:\cx\to\reals^d$. We then show that if there is a way to approximately describe functions in a class $\ch$ using $d$ bits, then $\frac{d}{\epsilon^2}$ examples suffices to guarantee uniform convergence. Namely, that the empirical loss of all the functions in the class is $\epsilon$close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and nonlinear functions.
Author Information
Amit Daniely (Hebrew University and Google Research)
Elad Granot (Hebrew University)
Related Events (a corresponding poster, oral, or spotlight)

2019 Poster: Generalization Bounds for Neural Networks via Approximate Description Length »
Wed Dec 11th 05:00  07:00 PM Room East Exhibition Hall B + C
More from the Same Authors

2019 Poster: Locally Private Learning without Interaction Requires Separation »
Amit Daniely · Vitaly Feldman 
2017 Poster: SGD Learns the Conjugate Kernel Class of the Network »
Amit Daniely 
2016 Poster: Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity »
Amit Daniely · Roy Frostig · Yoram Singer 
2013 Poster: More data speeds up training time in learning halfspaces over sparse vectors »
Amit Daniely · Nati Linial · Shai ShalevShwartz 
2013 Spotlight: More data speeds up training time in learning halfspaces over sparse vectors »
Amit Daniely · Nati Linial · Shai ShalevShwartz 
2012 Poster: Multiclass Learning Approaches: A Theoretical Comparison with Implications »
Amit Daniely · Sivan Sabato · Shai ShalevShwartz 
2012 Spotlight: Multiclass Learning Approaches: A Theoretical Comparison with Implications »
Amit Daniely · Sivan Sabato · Shai ShalevShwartz