

Poster

How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Zeyuan Allen-Zhu

Room 210 #74

Keywords: [ Online Learning ] [ Stochastic Methods ]


Abstract: Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives f(x). However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when f(x) is convex. If f(x) is convex, to find a point with gradient norm ε, we design an algorithm SGD3 with a near-optimal rate Õ(ε^{-2}), improving the best known rate O(ε^{-8/3}). If f(x) is nonconvex, to find its ε-approximate local minimum, we design an algorithm SGD5 with rate Õ(ε^{-3.5}), where previously SGD variants only achieve Õ(ε^{-4}). This is no slower than the best known stochastic version of Newton's method in all parameter regimes.
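To make the access model concrete, below is a minimal sketch of the baseline SGD loop the abstract compares against: plain stochastic gradient steps run until the full gradient norm drops below a target ε. The least-squares objective, constant step size, and periodic exact-gradient check are illustrative assumptions only; they are not the paper's SGD3 or SGD5 procedures, which use more refined schemes to reach the improved Õ(ε^{-2}) and Õ(ε^{-3.5}) rates.

```python
# Baseline SGD sketch (illustrative assumptions: least-squares objective,
# constant step size, periodic full-gradient check; NOT the paper's SGD3/SGD5).
import numpy as np

rng = np.random.default_rng(0)

# Convex objective f(x) = (1/2n) * ||A x - b||^2, accessed only through
# noisy mini-batch gradients.
n, d = 200, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

def stochastic_gradient(x, batch=8):
    """Unbiased estimate of grad f(x) from a uniformly sampled mini-batch."""
    idx = rng.integers(0, n, size=batch)
    residual = A[idx] @ x - b[idx]
    return A[idx].T @ residual / batch

def sgd(x0, eps=1e-2, eta=1e-3, max_iters=200_000):
    """Plain SGD; stops once the full gradient norm is at most eps."""
    x = x0
    for t in range(max_iters):
        x = x - eta * stochastic_gradient(x)
        if (t + 1) % 1000 == 0:  # occasional exact check, for illustration only
            grad_norm = np.linalg.norm(A.T @ (A @ x - b) / n)
            if grad_norm <= eps:
                return x, t + 1
    return x, max_iters

x_hat, steps = sgd(np.zeros(d))
print("steps:", steps,
      "final grad norm:", np.linalg.norm(A.T @ (A @ x_hat - b) / n))
```

The point of the paper is that this vanilla loop is suboptimal for the gradient-norm criterion even in the convex case; the number of stochastic gradient queries it needs to guarantee ||∇f(x)|| ≤ ε is worse than the near-optimal Õ(ε^{-2}) achieved by SGD3.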
