NeurIPS Poster Benign, Tempered, or Catastrophic: Toward a Refined Taxonomy of Overfitting

Poster

Benign, Tempered, or Catastrophic: Toward a Refined Taxonomy of Overfitting

Neil Mallinar · James Simon · Amirhesam Abedsoltan · Parthe Pandit · Misha Belkin · Preetum Nakkiran

Hall J (level 1) #623

Keywords: [ generalization ] [ overfitting ] [ kernel regression ] [ benign overfitting ] [ powerlaws ] [ eigenspectra ]

[ Abstract ]

[ Paper] [ Slides] [ Poster] [ OpenReview]

Abstract:

The practical success of overparameterized neural networks has motivated the recent scientific study of \emph{interpolating methods}-- learning methods which are able fit their training data perfectly. Empirically, certain interpolating methods can fit noisy training data without catastrophically bad test performance, which defies standard intuitions from statistical learning theory. Aiming to explain this, a large body of recent work has studied \emph{benign overfitting}, a behavior seen in certain asymptotic settings under which interpolating methods approach Bayes-optimality, even in the presence of noise. In this work, we argue that, while benign overfitting has been instructive to study, real interpolating methods like deep networks do not fit benignly. That is, noise in the train set leads to suboptimal generalization, suggesting that these methods fall in an intermediate regime between benign and catastrophic overfitting, in which asymptotic risk is neither is neither Bayes-optimal nor unbounded, with the confounding effect of the noise being tempered" but non-negligible. We call this behavior \textit{tempered overfitting}. We first provide broad empirical evidence for our three-part taxonomy, demonstrating that deep neural networks and kernel machines fit to noisy data can be reasonably well classified as benign, tempered, or catastrophic. We then specialize to kernel (ridge) regression (KR), obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors, demonstrating the consequences for KR with common kernels and trained neural networks of infinite width using experiments on natural and synthetic datasets.

Chat is not available.