When Does Gradient Descent Lead to Generalization for Shallow ReLU Networks?
Abstract
We study when gradient descent finds solutions that generalize for overparameterized two-layer ReLU networks through the lens of minima stability. Prior work links stable minima to a function class with data-dependent regularity and shows that, for data sampled uniformly from a ball, such solutions generalize, but with rates that suffer from the curse of dimensionality due to neural shattering. How this picture extends to broader data distributions has remained unclear. This paper advances this research direction with two complementary results: (1) gradient descent can adapt to low-dimensional structure, as captured by generalization bounds that depend on the intrinsic rather than the ambient dimension; and (2) for data distributed on the sphere, there always exist perfectly interpolating solutions that are arbitrarily flat, indicating that stability of minima alone may not protect against overfitting. Taken together, these findings suggest that the easier it is for the data to be shattered by ReLU atoms, the easier it is to overfit.
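For concreteness, the setting can be sketched as follows; the notation (width $k$, parameters $\theta = \{(a_j, w_j, b_j)\}_{j=1}^{k}$, step size $\eta$, empirical loss $\mathcal{L}$) is not fixed by the abstract and is only the standard formalization used in the minima-stability literature:
\[
f_{\theta}(x) \;=\; \sum_{j=1}^{k} a_j \,\bigl[\langle w_j, x\rangle + b_j\bigr]_+ ,
\qquad
\lambda_{\max}\!\bigl(\nabla^2_{\theta}\,\mathcal{L}(\theta^{*})\bigr) \;\le\; \frac{2}{\eta},
\]
where $[t]_+ = \max(t,0)$ is the ReLU activation and the second condition is the usual linear-stability requirement for a minimum $\theta^{*}$ to be stable under gradient descent with step size $\eta$; "arbitrarily flat" interpolants in result (2) refer to interpolating minima whose largest Hessian eigenvalue can be made arbitrarily small.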