Affinity Workshop: WiML Workshop 1

Depth without the Magic: Inductive Biases of Natural Gradient Descent

Anna Mészáros · Anna Kerekes · Ferenc Huszar


In gradient descent, changing how we parametrise the model can lead to very different optimisation trajectories and even to qualitatively different optima. By exploiting only the form of over-parametrisation, gradient descent alone can produce a surprising range of meaningful behaviours: it can identify sparse classifiers or reconstruct low-rank matrices without explicit regularisation. This implicit regularisation has been hypothesised to be a contributing factor to good generalisation in deep learning. However, natural gradient descent with an infinitesimally small learning rate is invariant to parametrisation: it always follows the same trajectory and finds the same optimum. The question naturally arises: if we eliminate the role of parametrisation, which solution will be found, and what new properties emerge? We characterise the behaviour of natural gradient flow in linearly separable classification under logistic loss and discover new invariance properties. Some of our findings extend to nonlinear neural networks with sufficient but finite over-parametrisation. In addition, we demonstrate experimentally that there exist learning problems where natural gradient descent cannot reach good test performance, while gradient descent with the right architecture can.
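
The invariance claim can be illustrated with a minimal NumPy sketch. It is not the authors' code or experimental setup. The toy dataset, the elementwise reparametrisation w = sinh(theta), the step size, and all function names are assumptions chosen for illustration. The sketch trains a linear logistic classifier directly in w and through theta, once with plain gradient descent and once with natural gradient descent preconditioned by the Fisher information matrix. Because the step size is small but finite, the two natural gradient runs should agree only approximately, while the two gradient descent runs are expected to drift apart.

```python
import numpy as np

# Hypothetical toy data, linearly separable through the origin; labels in {0, 1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [1.5, 0.5],
              [-2.0, -1.0], [-1.0, -2.0], [-0.5, -1.5]])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])


def grad_and_fisher(w):
    """Gradient of the mean logistic loss and the Fisher matrix, both w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - y) / len(y)
    fisher = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
    return grad, fisher


def train(w0, lr, steps, natural, reparam):
    """Return the final weights w.

    If reparam is True, the optimised variable is theta with w = sinh(theta)
    (an assumed, smooth and invertible reparametrisation).
    If natural is True, the gradient is preconditioned by the inverse Fisher
    matrix expressed in the coordinates being optimised.
    """
    theta = np.arcsinh(w0) if reparam else w0.copy()
    for _ in range(steps):
        w = np.sinh(theta) if reparam else theta
        grad_w, fisher_w = grad_and_fisher(w)
        # Jacobian dw/dtheta: diagonal for the elementwise map, identity otherwise.
        jac = np.diag(np.cosh(theta)) if reparam else np.eye(len(w))
        grad = jac.T @ grad_w
        if natural:
            step = np.linalg.solve(jac.T @ fisher_w @ jac, grad)
        else:
            step = grad
        theta = theta - lr * step
    return np.sinh(theta) if reparam else theta


w0 = np.array([0.5, -0.3])
lr, steps = 0.01, 200

# Distance between the final weights reached under the two parametrisations.
gd_gap = np.linalg.norm(train(w0, lr, steps, False, False)
                        - train(w0, lr, steps, False, True))
ngd_gap = np.linalg.norm(train(w0, lr, steps, True, False)
                         - train(w0, lr, steps, True, True))
print(f"GD gap: {gd_gap:.4f}  NGD gap: {ngd_gap:.4f}")
```

The natural gradient gap comes from discretisation only, so under these assumptions it should shrink as the learning rate is reduced, while the gradient descent gap reflects genuinely different flows. No damping is added to the Fisher matrix, since damping would itself break the invariance being illustrated.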
