Escaping the Gravitational Pull of Softmax
Jincheng Mei, Chenjun Xiao, Bo Dai, Lihong Li, Csaba Szepesvari, Dale Schuurmans
Oral presentation: Orals & Spotlights Track 09: Reinforcement Learning
on 2020-12-08T06:15:00-08:00 - 2020-12-08T06:30:00-08:00
Abstract: The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization ("softmax gravity well"), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence ("softmax damping"). Both findings are based on an analysis of convergence rates using the Non-uniform Łojasiewicz (NŁ) inequalities. To circumvent these shortcomings, we investigate an alternative transformation, the escort mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of the NŁ coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.
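To make the two transformations concrete, the following is a minimal sketch comparing the standard softmax with an escort-style mapping of the form |θ_i|^p / Σ_j |θ_j|^p. The function names and the choice p = 2 are illustrative assumptions; the paper's exact parameterization and recommended p should be taken from the full text.

```python
import numpy as np

def softmax(theta):
    # Standard softmax: exponentiate and normalize.
    z = np.exp(theta - np.max(theta))  # subtract max for numerical stability
    return z / z.sum()

def escort(theta, p=2):
    # Escort-style transform (assumed form): |theta_i|^p, normalized to sum to 1.
    # p >= 1 is a hyperparameter; p = 2 is used here purely for illustration.
    z = np.abs(theta) ** p
    return z / z.sum()

theta = np.array([2.0, 0.5, -1.0])
print(softmax(theta))      # categorical distribution from the softmax
print(escort(theta, p=2))  # categorical distribution from the escort mapping
```

Both functions map an arbitrary real-valued parameter vector to a point on the probability simplex; the paper's analysis concerns how gradient ascent behaves under each parameterization, not the mappings themselves.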