Poster
Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization
Dongruo Zhou · Pan Xu · Quanquan Gu
Room 210 #44
Keywords: [ Non-Convex Optimization ]
Abstract:
We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with the conventional stochastic variance reduced gradient (SVRG) algorithm, which uses two reference points to construct a semi-stochastic gradient with diminishing variance in each epoch, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient and further reduce its variance in each epoch. For smooth functions, the proposed algorithm converges to an $\epsilon$-approximate first-order stationary point (i.e., $\|\nabla F(\mathbf{x})\|_2 \le \epsilon$) within $\tilde O(n \wedge \epsilon^{-2} + \epsilon^{-3} \wedge n^{1/2}\epsilon^{-2})$\footnote{$\tilde O(\cdot)$ hides the logarithmic factors.} stochastic gradient evaluations, where $n$ is the number of component functions and $\epsilon$ is the optimization error. This improves the best known gradient complexity of SVRG, $O(n + n^{2/3}\epsilon^{-2})$, and the best gradient complexity of SCSG, $O(n \wedge \epsilon^{-2} + \epsilon^{-10/3} \wedge n^{2/3}\epsilon^{-2})$. For gradient dominated functions, our algorithm achieves a gradient complexity that again beats the existing best gradient complexity achieved by SCSG. Thorough experimental results on different nonconvex optimization problems back up our theory.
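As a rough illustration of the nested construction described in the abstract, below is a minimal Python sketch contrasting a single-reference SVRG-style estimator with a nested estimator built from $K+1$ reference points. The helpers grad_i, grad_batch, x_refs, and batches are hypothetical placeholders; this is a schematic sketch, not the authors' implementation (the paper's batch sizes, reference-update schedules, and step sizes are omitted).

def svrg_semi_stochastic_gradient(grad_i, x, x_ref, full_grad_ref, i):
    # Classical SVRG estimator: one stochastic gradient at the current iterate x,
    # corrected by a single reference point x_ref whose full gradient
    # full_grad_ref was precomputed at the start of the epoch.
    return grad_i(x, i) - grad_i(x_ref, i) + full_grad_ref

def nested_vr_gradient(grad_batch, x_refs, batches):
    # Schematic nested estimator with K+1 reference points.
    #   x_refs  : [x(0), ..., x(K)], where x(0) is refreshed least often and
    #             x(K) is the current iterate.
    #   batches : [B_0, ..., B_K], index sets whose sizes shrink with the level.
    # The estimator telescopes corrections between consecutive reference points:
    #   v = grad_{B_0}(x(0)) + sum_{l=1..K} [grad_{B_l}(x(l)) - grad_{B_l}(x(l-1))],
    # so each correction term has small variance when x(l) stays close to x(l-1).
    v = grad_batch(x_refs[0], batches[0])
    for l in range(1, len(x_refs)):
        v = v + grad_batch(x_refs[l], batches[l]) - grad_batch(x_refs[l - 1], batches[l])
    return v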