Session: Optimization
The Marginal Value of Adaptive Gradient Methods in Machine Learning
Ashia C Wilson · Becca Roelofs · Mitchell Stern · Nati Srebro · Benjamin Recht
Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple over-parameterized problems, adaptive methods often find drastically different solutions than vanilla stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, SGD achieves zero test error, and AdaGrad and Adam attain test errors arbitrarily close to 1/2. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
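To make the contrast concrete, here is a minimal numpy sketch (not the paper's exact construction) comparing plain gradient descent with Adam on a toy over-parameterized, linearly separable problem; the dimensions, step sizes, and logistic loss are illustrative assumptions, and full-batch gradients stand in for SGD for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # far fewer examples than dimensions
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[0] = 1.0
y = np.sign(X @ w_true)               # linearly separable labels in {-1, +1}

def grad(w):
    # Gradient of the mean logistic loss (1/n) sum_i log(1 + exp(-y_i x_i.w));
    # margins are clipped only to avoid overflow warnings in np.exp.
    margins = np.clip(y * (X @ w), -60, 60)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def run_gd(lr=0.5, steps=2000):
    # Full-batch gradient descent, standing in for SGD in this sketch.
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(lr=0.5, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    w, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        w -= lr * mhat / (np.sqrt(vhat) + eps)   # per-coordinate rescaling
    return w

w_gd, w_adam = run_gd(), run_adam()
cos = w_gd @ w_adam / (np.linalg.norm(w_gd) * np.linalg.norm(w_adam))
print(f"cosine similarity of the two solutions: {cos:.3f}")
```

A cosine similarity well below 1 means the two methods fit the same training data with very different separators, which is the kind of divergence the abstract ties to the generalization gap.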
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian · Ce Zhang · Huan Zhang · Cho-Jui Hsieh · Wei Zhang · Ji Liu
Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies in the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has a total computational complexity comparable to that of C-PSGD but requires much less communication on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
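The following hedged numpy sketch shows the core D-PSGD step on a ring of workers: each node takes a local gradient step on its own data, then averages its parameters with its two neighbours via a doubly stochastic mixing matrix, so no node ever talks to more than two peers. The node count, mixing weights, and quadratic local losses are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, d = 8, 5
A = [rng.normal(size=(10, d)) for _ in range(n_nodes)]   # local data per node
b = [rng.normal(size=10) for _ in range(n_nodes)]

# Symmetric doubly stochastic mixing matrix for a ring topology: each node
# averages with its two neighbours, keeping communication local and bounded.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

x = np.zeros((n_nodes, d))           # one parameter copy per node
lr = 0.05
for _ in range(500):
    grads = np.stack([A[i].T @ (A[i] @ x[i] - b[i]) / len(b[i])
                      for i in range(n_nodes)])
    x = W @ (x - lr * grads)         # local step, then neighbour averaging

print("disagreement across nodes:", np.linalg.norm(x - x.mean(0)))
```

The gossip step drives the node copies toward consensus while the local gradient steps drive the consensus point toward a minimizer, which is why the busiest node never sees more than its two neighbours' traffic.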
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Fabian Pedregosa · RĂ©mi Leblond · Simon Lacoste-Julien
Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso, or empirical risk minimization with box constraints. Key technical issues explain this paucity, both in the design of such algorithms and in their asynchronous analysis. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance-reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 13x on a 20-core machine.
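As a rough illustration of the building block that ProxASAGA parallelizes, the sketch below runs a sequential (non-asynchronous) SAGA iteration with a proximal step for the l1 term of a Lasso problem; the problem sizes, regularization weight, and step size are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, lr = 200, 50, 1.0, 0.005   # illustrative sizes and step size
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1, the nonsmooth part of the Lasso.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.zeros(d)
memory = np.zeros((n, d))             # last gradient stored per example
avg = memory.mean(axis=0)
for _ in range(30 * n):
    i = rng.integers(n)
    g_i = (X[i] @ w - y[i]) * X[i]    # fresh gradient of example i
    v = g_i - memory[i] + avg         # SAGA variance-reduced estimate
    avg += (g_i - memory[i]) / n      # keep the running average in sync
    memory[i] = g_i
    w = soft_threshold(w - lr * v, lr * lam)   # proximal gradient step

print("nonzero coordinates in the Lasso solution:", int(np.count_nonzero(w)))
```

The paper's contribution is running many such iterations asynchronously and sparsely across cores while keeping the proximal step correct; this sketch only shows the sequential update being parallelized.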
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Alberto Bietti · Julien Mairal
Stochastic optimization algorithms with variance reduction have proven successful for minimizing large finite sums of functions. Unfortunately, these techniques are unable to deal with stochastic perturbations of input data, induced for example by data augmentation. In such cases, the objective is no longer a finite sum, and the main candidate for optimization is the stochastic gradient descent method (SGD). In this paper, we introduce a variance reduction approach for these settings when the objective is composite and strongly convex. The proposed method outperforms SGD: its convergence rate has a typically much smaller constant factor, which depends only on the variance of gradient estimates induced by perturbations of a single example.
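The sketch below illustrates the setting rather than the paper's exact algorithm: each access to example i draws a fresh random perturbation (as data augmentation would), so the objective is an expectation and a SAGA-style memory can only remove the across-example part of the variance; the residual variance from perturbing a single example is exactly the quantity the abstract's constant factor depends on. All problem sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma, lr = 100, 20, 0.05, 0.01   # illustrative sizes
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad_i(w, i):
    # Gradient of example i under a fresh perturbation (data augmentation):
    # the same i yields a different gradient on every access.
    x = X[i] + sigma * rng.normal(size=d)
    return (x @ w - y[i]) * x

w = np.zeros(d)
memory = np.zeros((n, d))               # anchor gradient per example
avg = memory.mean(axis=0)
for _ in range(50 * n):
    i = rng.integers(n)
    g = grad_i(w, i)
    # SAGA-style update: removes across-example variance, but the variance
    # from perturbing a single example remains (cf. the constant factor).
    w -= lr * (g - memory[i] + avg)
    avg += (g - memory[i]) / n
    memory[i] = g

print("relative residual:", np.linalg.norm(X @ w - y) / np.linalg.norm(y))
```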
Process-constrained batch Bayesian optimisation
Pratibha Vellanki · Santu Rana · Sunil Gupta · David Rubin · Alessandra Sutti · Thomas Dorin · Murray Height · Paul Sanders · Svetha Venkatesh
Prevailing batch Bayesian optimisation methods allow all the control variables to be freely altered at each iteration. Real-world experiments, however, have physical limitations that make it time-consuming to alter all settings for each recommendation in a batch. This gives rise to a unique problem in BO: in a recommended batch, a set of variables that are expensive to change experimentally need to be constrained, while the remaining control variables are varied. We formulate this as a process-constrained batch Bayesian optimisation problem. We propose algorithms pc-BO and pc-PEBO and show that the regret of pc-BO is sublinear. We demonstrate the performance of both pc-BO and pc-PEBO by optimising benchmark test functions, tuning hyper-parameters of the SVM classifier, optimising the heat-treatment process for an Al-Sc alloy to achieve target hardness, and optimising the short nano-fiber production process.
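A toy sketch of the batch-construction idea (standing in for pc-BO, with a scikit-learn GP surrogate and a UCB rule as assumed placeholders): the expensive-to-change variable is chosen once per batch and frozen, while the cheap variable varies across the batch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(4)
# Toy objective over (x0, x1); x0 plays the expensive-to-change variable.
f = lambda x: -((x[:, 0] - 0.3) ** 2 + (x[:, 1] - 0.7) ** 2)

X_obs = rng.uniform(size=(5, 2))                   # initial design
gp = GaussianProcessRegressor(normalize_y=True).fit(X_obs, f(X_obs))

def propose_batch(batch_size=4, n_cand=500):
    cand = rng.uniform(size=(n_cand, 2))
    mu, sd = gp.predict(cand, return_std=True)
    x0_fixed = cand[np.argmax(mu + 2.0 * sd), 0]   # UCB picks the setting once
    batch = rng.uniform(size=(batch_size, 2))
    batch[:, 0] = x0_fixed                         # frozen across the batch
    return batch

print(propose_batch())                             # first column is constant
```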
Safe Adaptive Importance Sampling
Sebastian Stich · Anant Raj · Martin Jaggi
Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants -- using importance values defined by the complete gradient information, which changes during optimization -- enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling, and (iii) efficiently computable -- in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate descent (CD) and stochastic gradient descent (SGD) can enjoy a significant speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.
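The sketch below shows the flavour of the scheme on coordinate descent for a quadratic: coordinates are sampled in proportion to (bounds on) their gradient magnitudes instead of uniformly. For brevity it uses the exact gradient magnitudes as a stand-in for the safe bounds the paper maintains cheaply, and it recomputes the full gradient each step, so it illustrates the sampling idea rather than an efficient implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 50
M = rng.normal(size=(d, d))
A = M.T @ M / d + np.eye(d)          # SPD quadratic: f(w) = 0.5 w.A.w - b.w
b = rng.normal(size=d)
L = np.diag(A)                        # per-coordinate curvature

w = np.zeros(d)
for _ in range(5000):
    g = A @ w - b                     # recomputed in full only for clarity
    bounds = np.abs(g) + 1e-12        # stand-in for the paper's safe bounds
    p = bounds / bounds.sum()         # favour coordinates with large gradients
    j = rng.choice(d, p=p)
    w[j] -= g[j] / L[j]               # exact minimization along coordinate j

print("gradient norm at the end:", np.linalg.norm(A @ w - b))
```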
Beyond Worst-case: A Probabilistic Analysis of Affine Policies in Dynamic Optimization
Omar El Housni · Vineet Goyal
Affine policies (or controls) are widely used as a solution approach in dynamic optimization, where computing an optimal adjustable solution is usually intractable. While the worst-case performance of affine policies can be quite poor, their empirical performance is observed to be near-optimal for a large class of problem instances. For instance, in the two-stage dynamic robust optimization problem with linear covering constraints and uncertain right-hand side, the worst-case approximation bound for affine policies is $O(\sqrt m)$, and this bound is tight (see Bertsimas and Goyal (2012)), whereas the observed empirical performance is near-optimal. In this paper, we aim to address this stark contrast between the worst-case and the empirical performance of affine policies. In particular, we show that affine policies give a good approximation for the two-stage adjustable robust optimization problem with high probability on random instances where the constraint coefficients are generated i.i.d. from a large class of distributions, thereby providing a theoretical justification of the observed empirical performance. On the other hand, we also present a distribution such that the performance bound for affine policies on instances generated according to that distribution is $\Omega(\sqrt m)$ with high probability; however, the constraint coefficients are not i.i.d. This demonstrates that the empirical performance of affine policies can depend on the generative model for instances.
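For readers unfamiliar with the model, the affine restriction can be written as follows (notation assumed, in the spirit of Bertsimas and Goyal (2012)): the fully adjustable second-stage decision $y(h)$ may depend arbitrarily on the uncertain right-hand side $h$, while an affine policy constrains it to the form $Ph + q$.

```latex
% Fully adjustable two-stage problem: the recourse y(h) is an arbitrary
% function of the uncertain right-hand side h drawn from the set U.
\[
z_{\mathrm{AR}} = \min_{x \ge 0} \; c^\top x
  + \max_{h \in \mathcal{U}} \; \min_{y(h) \ge 0} \; d^\top y(h)
  \quad \text{s.t.} \quad A x + B\, y(h) \ge h \quad \forall h \in \mathcal{U}.
\]
% Affine restriction: y is forced to be affine in h, which makes the
% problem tractable at the cost of the O(sqrt(m)) worst-case gap.
\[
z_{\mathrm{Aff}} = \min_{x \ge 0,\, P,\, q} \; c^\top x
  + \max_{h \in \mathcal{U}} \; d^\top (P h + q)
  \quad \text{s.t.} \quad A x + B (P h + q) \ge h,\;\; P h + q \ge 0
  \quad \forall h \in \mathcal{U}.
\]
```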
Straggler Mitigation in Distributed Optimization Through Data Encoding
Can Karakus · Yifan Sun · Suhas Diggavi · Wotao Yin
Slow-running, or straggler, tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling by embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternative approach in which we embed the redundancy in the data instead of the computation, allowing the nodes to operate completely obliviously to the encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample-path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the choice of the encoding matrix and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and replication strategies.
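A hedged numpy sketch of the encoding idea: redundancy is embedded in the data via an encoding matrix S (a random Gaussian sketch here; the paper proposes specific schemes), and gradient descent then runs obliviously on whatever encoded rows the non-straggling nodes return each iteration. The sizes, straggler rate, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, m = 100, 20, 150               # m > n: encoded data carries redundancy
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)
S = rng.normal(size=(m, n)) / np.sqrt(m)   # encoding matrix (Gaussian sketch)
SA, Sb = S @ A, S @ b                # nodes hold encoded rows, oblivious to S

x, lr = np.zeros(d), 0.3
for _ in range(300):
    alive = rng.random(m) > 0.3      # ~30% of encoded rows straggle this step
    G, r = SA[alive], SA[alive] @ x - Sb[alive]
    x -= lr * G.T @ r / alive.sum()  # gradient step from available rows only

print("relative error:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

Because the redundancy lives in the data, each iteration's available rows still describe (approximately) the same least-squares problem, so no iteration ever waits on the slow nodes.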