Grassmannian Optimization Drives Generalization in Overparameterized DNNs
Abstract
We present an overview of a rigorous geometric theory that addresses the long-standing problem of why heavily overparameterized deep neural networks generalize despite their ability to perfectly fit random labels (Part 1). The theory shows that gradient-based algorithms (GD/SGD) implicitly restrict optimization to a low-dimensional iso-loss manifold shaped by the data, converging with high probability to an \emph{optimal subspace} on the Grassmannian, defined by the Hessian, that contains all approximately optimal networks. Hessian degeneracy emerges as the hallmark of generalization: the \emph{effective dimensionality}, defined as the rank of the Hessian at a minimizer, governs generalization in place of the conventional reliance on parameter count, and isotropic random initialization is identified as a driver of low dimensionality. Within this overview we highlight one key result: a closed-form, data-dependent expression for the generalization gap, resolving the decade-old open problem of gap estimation. This expression gives a precise relation among the generalization error, the data distribution, and the network architecture, eliminating the need for bounding arguments. The result is \emph{non-vacuous}, predicting test error with over 90\% accuracy in empirical studies, and improves upon VC, PAC-Bayesian, and spectral analyses by orders of magnitude. Beyond the NTK regime, the framework rigorously explains key empirical phenomena, including the role of flat minima, the implicit bias of SGD, and the origin of double-descent behavior.\footnote{This submission presents Part~1 of a broader framework. Part~2, \emph{Optimization Dynamics}, not included here, develops the full dynamics of generalization, linking optimization configurations (learning rate, batch size, momentum) directly to generalization outcomes.}
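As a hedged illustration (not the paper's construction), the sketch below shows one way to read the abstract's notion of \emph{effective dimensionality}, the rank of the Hessian at a minimizer: train a toy network with plain gradient descent from an isotropic random initialization and count the Hessian eigenvalues above a small threshold. The toy model, synthetic data, step count, and rank threshold are all illustrative assumptions.

\begin{verbatim}
# Illustrative sketch: effective dimensionality as the numerical rank of the
# loss Hessian at a trained minimizer. Model, data, and threshold are
# assumptions for illustration, not the paper's definitions.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def loss(params, X, y):
    # One-hidden-layer network with squared loss (purely illustrative).
    preds = jnp.tanh(X @ params["W1"]) @ params["W2"]
    return jnp.mean((preds.squeeze() - y) ** 2)

# Synthetic data and an isotropic random initialization.
X = jax.random.normal(jax.random.PRNGKey(0), (32, 4))
y = jnp.sin(X[:, 0])
params = {
    "W1": 0.1 * jax.random.normal(jax.random.PRNGKey(1), (4, 8)),
    "W2": 0.1 * jax.random.normal(jax.random.PRNGKey(2), (8, 1)),
}

# Plain gradient descent toward a minimizer.
grad_fn = jax.jit(jax.grad(loss))
for _ in range(500):
    g = grad_fn(params, X, y)
    params = jax.tree_util.tree_map(lambda p, gp: p - 0.1 * gp, params, g)

# Flatten parameters so the Hessian is an ordinary matrix, then count the
# eigenvalues above a small relative threshold (threshold is an assumption).
flat, unravel = ravel_pytree(params)
H = jax.hessian(lambda w: loss(unravel(w), X, y))(flat)
eigs = jnp.linalg.eigvalsh(H)
effective_dim = int(jnp.sum(eigs > 1e-6 * eigs.max()))
print(f"parameter count: {flat.size}, effective dimensionality: {effective_dim}")
\end{verbatim}

In such a toy run the counted rank is typically far below the raw parameter count, which is the qualitative picture the abstract describes; the paper's closed-form generalization-gap expression itself is not reproduced here.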