Skip to yearly menu bar Skip to main content


Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

Laurence Aitchison

Poster Session 3 #759


We formulate the problem of neural network optimization as Bayesian filtering, where the observations are backpropagated gradients. While neural network optimization has previously been studied using natural gradient methods which are closely related to Bayesian inference, they were unable to recover standard optimizers such as Adam and RMSprop with a root-mean-square gradient normalizer, instead getting a mean-square normalizer. To recover the root-mean-square normalizer, we find it necessary to account for the temporal dynamics of all the other parameters as they are optimized. The resulting optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like behaviour, automatically recovers AdamW, a state of the art variant of Adam with decoupled weight decay, and has generalisation performance competitive with SGD.

Chat is not available.