Poster

Escaping the Gravitational Pull of Softmax

Jincheng Mei ⋅ Chenjun Xiao ⋅ Bo Dai ⋅ Lihong Li ⋅ Csaba Szepesvari ⋅ Dale Schuurmans

2020 Poster

Paper PDF [ Paper ]

Abstract

The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization (softmax gravity well''), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (softmax damping''). Both findings are based on an analysis of convergence rates using the Non-uniform \L{}ojasiewicz (N\L{}) inequalities. To circumvent these shortcomings we investigate an alternative transformation, the \emph{escort} mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of N\L{} coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.

Video

Chat is not available.