Oral

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Kaiyue Wen · Zhiyuan Li · Tengyu Ma

Room R06-R09 (level 2)
[ Abstract ] [ Livestream: Visit Oral 1D DL Theory ]
Tue 12 Dec 8 a.m. — 8:15 a.m. PST

Despite extensive studies, the underlying reason as to why overparameterizedneural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thusa natural potential explanation is that flatness implies generalization. This workcritically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1)flatness provably implies generalization; (2) there exist non-generalizing flattestmodels and sharpness minimization algorithms fail to generalize poorly, and (3)perhaps most strikingly, there exist non-generalizing flattest models, but sharpnessminimization algorithms still generalize. Our results suggest that the relationshipbetween sharpness and generalization subtly depends on the data distributionsand the model architectures and sharpness minimization algorithms do not onlyminimize sharpness to achieve better generalization. This calls for the search forother explanations for the generalization of over-parameterized neural networks

Chat is not available.