Session

Orals & Spotlights Track 17: Kernel Methods/Optimization

Chiranjib Bhattacharyya · Hossein Mobahi

Wed 9 Dec. 6:00 - 6:15 PST

Oral
Learning with Operator-valued Kernels in Reproducing Kernel Krein Spaces

Akash Saha · Balamurugan Palaniappan

Operator-valued kernels have shown promise in supervised learning problems with functional inputs and functional outputs. The crucial (and possibly restrictive) assumption of positive definiteness of operator-valued kernels has been instrumental in developing efficient algorithms. In this work, we consider operator-valued kernels which might not be necessarily positive definite. To tackle the indefiniteness of operator-valued kernels, we harness the machinery of Reproducing Kernel Krein Spaces (RKKS) of function-valued functions. A representer theorem is illustrated which yields a suitable loss stabilization problem for supervised learning with function-valued inputs and outputs. Analysis of generalization properties of the proposed framework is given. An iterative Operator based Minimum Residual (OpMINRES) algorithm is proposed for solving the loss stabilization problem. Experiments with indefinite operator-valued kernels on synthetic and real data sets demonstrate the utility of the proposed approach.
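
As background only, and not the authors' OpMINRES algorithm for operator-valued kernels: the scalar-valued analogue below fits a regularized kernel model whose kernel is symmetric but not positive definite by handing the linear system to MINRES, which, unlike Cholesky-based solvers, does not require positive definiteness. The kernel, regularization strength, and data are arbitrary placeholder choices.

```python
# Scalar-valued sketch: solve (K + lam * I) alpha = y with MINRES, which only
# needs the matrix to be symmetric, so an indefinite kernel is acceptable.
import numpy as np
from scipy.sparse.linalg import minres

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

def kernel(A, B):
    # Symmetric kernel that is not positive definite (illustrative choice).
    D = A[:, None, 0] - B[None, :, 0]
    return np.sinc(4 * D) - 0.3 * np.cos(5 * D)

K = kernel(X, X)
lam = 1e-2
alpha, info = minres(K + lam * np.eye(len(X)), y)   # info == 0 means converged
y_hat = K @ alpha
print("converged:", info == 0, "train MSE:", np.mean((y_hat - y) ** 2))
```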

Wed 9 Dec. 6:15 - 6:30 PST

Oral
Kernel Methods Through the Roof: Handling Billions of Points Efficiently

Giacomo Meanti · Luigi Carratino · Lorenzo Rosasco · Alessandro Rudi

Kernel methods provide an elegant and principled approach to nonparametric learning, but have so far been hard to apply to large-scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections. Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware. Towards this end, we designed a preconditioned gradient solver for kernel methods exploiting both GPU acceleration and parallelization with multiple GPUs, implementing out-of-core variants of common linear algebra operations to guarantee optimal hardware utilization. Further, we optimize the numerical precision of different operations and maximize the efficiency of matrix-vector multiplications. As a result we can experimentally show dramatic speedups on datasets with billions of points, while still guaranteeing state-of-the-art performance. Additionally, we make our software available as an easy-to-use library.
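
A sketch (not the authors' solver) of the kind of out-of-core primitive such a method is built on: a block-wise Gaussian kernel matrix-vector product in PyTorch that never materializes the full kernel matrix. Block size, bandwidth, and data sizes are placeholders kept small; the full preconditioned multi-GPU solver described above is considerably more involved.

```python
# Illustrative block-wise kernel matrix-vector product: only one (block, m)
# tile of the kernel matrix exists in memory at any time.
import torch

def kernel_matvec(X, Y, v, sigma=1.0, block=1024):
    """Compute K(X, Y) @ v for a Gaussian kernel without forming K."""
    out = torch.empty(X.shape[0], device=X.device, dtype=X.dtype)
    for start in range(0, X.shape[0], block):
        Xb = X[start:start + block]                    # (block, d) tile of inputs
        d2 = torch.cdist(Xb, Y).pow_(2)                # squared distances for the tile
        Kb = torch.exp(-d2 / (2 * sigma ** 2))         # Gaussian kernel tile
        out[start:start + block] = Kb @ v              # partial matrix-vector product
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(50_000, 10, device=device)
v = torch.randn(50_000, device=device)
print(kernel_matvec(X, X, v, sigma=2.0)[:5])
```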

Wed 9 Dec. 6:30 - 6:45 PST

Oral
A Group-Theoretic Framework for Data Augmentation

Shuxiao Chen · Edgar Dobriban · Jane Lee

Data augmentation has become an important part of modern deep learning pipelines and is typically needed to achieve state of the art performance for many learning tasks. It utilizes invariant transformations of the data, such as rotation, scale, and color shift, and the transformed images are added to the training set. However, these transformations are often chosen heuristically and a clear theoretical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a framework to explain data augmentation as averaging over the orbits of the group that keeps the data distribution approximately invariant, and show that it leads to variance reduction. We study finite-sample and asymptotic empirical risk minimization and work out as examples the variance reduction in certain two-layer neural networks. We further propose a strategy to exploit the benefits of data augmentation for general learning tasks.
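
A minimal sketch of the orbit-averaging view for one concrete finite group, the four 90-degree rotations of an image; the model, data, and group here are placeholder choices, not the paper's setup.

```python
# Data augmentation as orbit averaging: the per-batch loss is averaged over
# every element of a finite transformation group.
import torch
import torch.nn.functional as F

def orbit_averaged_loss(model, x, y):
    """Average the loss over the rotation group {0, 90, 180, 270 degrees}."""
    losses = []
    for k in range(4):
        x_g = torch.rot90(x, k, dims=(-2, -1))   # apply the k-th group element
        losses.append(F.cross_entropy(model(x_g), y))
    return torch.stack(losses).mean()            # empirical average over the orbit

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.randn(32, 1, 28, 28)                   # dummy image batch
y = torch.randint(0, 10, (32,))
orbit_averaged_loss(model, x, y).backward()
```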

Wed 9 Dec. 6:45 - 7:00 PST

Break

Wed 9 Dec. 7:00 - 7:10 PST

Spotlight
A mathematical model for automatic differentiation in machine learning

Jérôme Bolte · Edouard Pauwels

Automatic differentiation, as implemented today, does not have a simple mathematical model adapted to the needs of modern machine learning. In this work we articulate the relationships between differentiation of programs as implemented in practice, and differentiation of nonsmooth functions. To this end we provide a simple class of functions, a nonsmooth calculus, and show how they apply to stochastic approximation methods. We also evidence the issue of artificial critical points created by algorithmic differentiation and show how usual methods avoid these points with probability one.
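
The artificial-critical-point phenomenon can be reproduced in a few lines. The example below is an illustration under the convention relu'(0) = 0 used by common autodiff frameworks, not code taken from the paper.

```python
# f(x) = relu(x) - relu(-x) equals x everywhere, so the true derivative at 0 is 1,
# yet algorithmic differentiation (with relu'(0) = 0) reports a gradient of 0 there:
# an artificial critical point.
import torch

x = torch.tensor(0.0, requires_grad=True)
f = torch.relu(x) - torch.relu(-x)
f.backward()
print(f.item(), x.grad.item())   # 0.0 0.0  (value is correct, gradient is spurious)
```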

Wed 9 Dec. 7:10 - 7:20 PST

Spotlight
A kernel test for quasi-independence

Tamara Fernandez · Wenkai Xu · Marc Ditzhaus · Arthur Gretton

We consider settings in which the data of interest correspond to pairs of ordered times, e.g., the birth times of the first and second child, the times at which a new user creates an account and makes the first purchase on a website, and the entry and survival times of patients in a clinical trial. In these settings, the two times are not independent (the second occurs after the first), yet it is still of interest to determine whether there exists significant dependence "beyond" their ordering in time. We refer to this notion as "quasi-(in)dependence." For instance, in a clinical trial, to avoid biased selection, we might wish to verify that recruitment times are quasi-independent of survival times, where dependencies might arise due to seasonal effects. In this paper, we propose a nonparametric statistical test of quasi-independence. Our test considers a potentially infinite space of alternatives, making it suitable for complex data where the nature of the possible quasi-dependence is not known in advance. Standard parametric approaches are recovered as special cases, such as the classical conditional Kendall's tau and log-rank tests. The tests apply in the right-censored setting: an essential feature in clinical trials, where patients can withdraw from the study. We provide an asymptotic analysis of our test statistic, and demonstrate in experiments that our test obtains better power than existing approaches, while being more computationally efficient.
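
For context only: the sketch below is a standard HSIC-style kernel permutation test for ordinary (unconditional) independence. It handles neither the ordering constraint nor right-censoring, which is exactly what the proposed quasi-independence test adds; all bandwidths and sample sizes are placeholder choices.

```python
# Biased HSIC estimate with a permutation null distribution.
import numpy as np

def gaussian_gram(z, sigma):
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x + 0.5 * rng.normal(size=300)             # dependent toy data

K, L = gaussian_gram(x, 1.0), gaussian_gram(y, 1.0)
stat = hsic(K, L)
null = []
for _ in range(200):                           # permute y to simulate the null
    p = rng.permutation(300)
    null.append(hsic(K, L[np.ix_(p, p)]))
print("HSIC =", stat, "permutation p-value =", np.mean([s >= stat for s in null]))
```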

Wed 9 Dec. 7:20 - 7:30 PST

Spotlight
Fourier Sparse Leverage Scores and Approximate Kernel Learning

Tamas Erdelyi · Cameron Musco · Christopher Musco

We prove new explicit upper bounds on the leverage scores of Fourier sparse functions under both the Gaussian and Laplace measures. In particular, we study $s$-sparse functions of the form $f(x) = \sum_{j=1}^s a_j e^{i \lambda_j x}$ for coefficients $a_j \in \mathbb{C}$ and frequencies $\lambda_j \in \mathbb{R}$. Bounding Fourier sparse leverage scores under various measures is of pure mathematical interest in approximation theory, and our work extends existing results for the uniform measure [Erd17,CP19a]. Practically, our bounds are motivated by two important applications in machine learning: 1. Kernel Approximation. They yield a new random Fourier features algorithm for approximating Gaussian and Cauchy (rational quadratic) kernel matrices. For low-dimensional data, our method uses a near optimal number of features, and its runtime is polynomial in the *statistical dimension* of the approximated kernel matrix. It is the first "oblivious sketching method" with this property for any kernel besides the polynomial kernel, resolving an open question of [AKM+17,AKK+20b]. 2. Active Learning. They can be used as non-uniform sampling distributions for robust active learning when data follows a Gaussian or Laplace distribution. Using the framework of [AKM+19], we provide essentially optimal results for bandlimited and multiband interpolation, and Gaussian process regression. These results generalize existing work that only applies to uniformly distributed data.
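
For contrast with the leverage-score-based sampling above, the sketch below is the classical random Fourier features construction of Rahimi and Recht for the Gaussian kernel, with frequencies drawn from the kernel's spectral density; feature count, bandwidth, and data are placeholder choices.

```python
# Classical random Fourier features: Z @ Z.T approximates the Gaussian kernel matrix.
import numpy as np

def rff_features(X, num_features, sigma, rng):
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))  # frequencies ~ spectral density
    b = rng.uniform(0, 2 * np.pi, size=num_features)           # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Z = rff_features(X, 2000, sigma=1.5, rng=rng)
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * 1.5 ** 2))
print("max abs approximation error:", np.abs(Z @ Z.T - K_exact).max())
```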

Wed 9 Dec. 7:30 - 7:40 PST

Spotlight
BOSS: Bayesian Optimization over String Spaces

Henry Moss · David Leslie · Daniel Beck · Javier González · Paul Rayson

This article develops a Bayesian optimization (BO) method which acts directly over raw strings, proposing the first uses of string kernels and genetic algorithms within BO loops. Recent applications of BO over strings have been hindered by the need to map inputs into a smooth and unconstrained latent space. Learning this projection is computationally and data-intensive. Our approach instead builds a powerful Gaussian process surrogate model based on string kernels, naturally supporting variable length inputs, and performs efficient acquisition function maximization for spaces with syntactic constraints. Experiments demonstrate considerably improved optimization over existing approaches across a broad range of constraints, including the popular setting where syntax is governed by a context-free grammar.
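
Not the paper's string kernel or BO loop: the toy n-gram "spectrum" kernel below only illustrates how a kernel can operate directly on raw, variable-length strings without first mapping them into a latent space.

```python
# Toy spectrum kernel: similarity = number of shared (overlapping) n-grams.
from collections import Counter

def spectrum_kernel(s, t, n=3):
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

candidates = ["CCOC(=O)N", "CCOC(=O)O", "c1ccccc1"]   # arbitrary SMILES-like strings
for a in candidates:
    print([spectrum_kernel(a, b) for b in candidates])
```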

Wed 9 Dec. 7:40 - 7:50 PST

Q&A
Joint Q&A for Preceding Spotlights

Wed 9 Dec. 7:50 - 8:00 PST

Spotlight
Fast geometric learning with symbolic matrices

Jean Feydy · Alexis Glaunès · Benjamin Charlier · Michael Bronstein

Geometric methods rely on tensors that can be encoded using a symbolic formula and data arrays, such as kernel and distance matrices. We present an extension for standard machine learning frameworks that provides comprehensive support for this abstraction on CPUs and GPUs: our toolbox combines a versatile, transparent user interface with fast runtimes and low memory usage. Unlike general-purpose acceleration frameworks such as XLA, our library turns generic Python code into binaries whose performance is competitive with state-of-the-art geometric libraries, such as FAISS for nearest neighbor search, with the added benefit of flexibility. We perform an extensive evaluation on a broad class of problems: Gaussian modelling, K-nearest neighbors search, geometric deep learning, non-Euclidean embeddings and optimal transport theory. In practice, for geometric problems that involve 1k to 1M samples in dimension 1 to 100, our library speeds up baseline GPU implementations by up to two orders of magnitude.
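
The abstract does not name the library, so rather than guess its API the sketch below shows the underlying idea in plain PyTorch: a brute-force K-nearest-neighbor search computed tile by tile, so the full n-by-m distance matrix (the "symbolic" object) is never stored. Sizes are placeholders.

```python
# Tiled brute-force KNN: only one (block, m) tile of distances is materialized at a time.
import torch

def knn_search(queries, points, k=10, block=256):
    idx_chunks = []
    for start in range(0, queries.shape[0], block):
        d = torch.cdist(queries[start:start + block], points)   # distance tile
        idx_chunks.append(d.topk(k, largest=False).indices)     # k smallest per query
    return torch.cat(idx_chunks)

device = "cuda" if torch.cuda.is_available() else "cpu"
pts = torch.randn(100_000, 3, device=device)
queries = torch.randn(10_000, 3, device=device)
print(knn_search(queries, pts, k=10).shape)   # torch.Size([10000, 10])
```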

Wed 9 Dec. 8:00 - 8:10 PST

Spotlight
Training Stronger Baselines for Learning to Optimize

Tianlong Chen · Weiyi Zhang · Zhou Jingyang · Shiyu Chang · Sijia Liu · Lisa Amini · Zhangyang Wang

Learning to optimize (L2O) is gaining increased attention because classical optimizers require laborious, problem-specific design and hyperparameter tuning. However, there are significant performance and practicality gaps between manually designed optimizers and existing L2O models. Specifically, learned optimizers are applicable to only a limited class of problems, often exhibit instability, and generalize poorly. As research efforts focus on increasingly sophisticated L2O models, we argue for an orthogonal, under-explored theme: improved training techniques for L2O models. We first present a progressive, curriculum-based training scheme, which gradually increases the optimizer unroll length to mitigate the well-known L2O dilemma of truncation bias (shorter unrolling) versus gradient explosion (longer unrolling). Secondly, we present an off-policy imitation learning based approach to guide the L2O learning, by learning from the behavior of analytical optimizers. We evaluate our improved training techniques with a variety of state-of-the-art L2O models and immediately boost their performance, without making any change to their model structures. We demonstrate that, using our improved training techniques, one of the earliest and simplest L2O models can be trained to outperform even the latest and most complex L2O models on a number of tasks. Our results demonstrate a greater potential of L2O yet to be unleashed, and prompt a reconsideration of recent L2O model progress. Our codes are publicly available at: https://github.com/VITA-Group/L2O-Training-Techniques.
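
A minimal sketch of the curriculum-over-unroll-length idea, with a placeholder optimizee (random quadratics) and a deliberately tiny learned optimizer; the architectures, schedules, and the imitation-learning component described above are not reproduced here.

```python
# Train a tiny learned update rule by unrolled optimization, growing the unroll
# length stage by stage.
import torch

update_rule = torch.nn.Sequential(            # learned optimizer: gradient -> update
    torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
meta_opt = torch.optim.Adam(update_rule.parameters(), lr=1e-3)

def unrolled_loss(unroll_steps, dim=10):
    A = torch.randn(dim, dim)
    A = A @ A.T / dim + torch.eye(dim)        # random positive definite quadratic
    x = torch.randn(dim, requires_grad=True)
    total = 0.0
    for _ in range(unroll_steps):
        f = 0.5 * x @ A @ x
        (g,) = torch.autograd.grad(f, x, create_graph=True)
        x = x + update_rule(g.unsqueeze(-1)).squeeze(-1)   # apply the learned update
        total = total + 0.5 * x @ A @ x
    return total / unroll_steps

for unroll in (5, 10, 20):                    # curriculum: progressively longer unrolls
    for _ in range(100):
        meta_opt.zero_grad()
        unrolled_loss(unroll).backward()
        meta_opt.step()
    print("finished curriculum stage with unroll length", unroll)
```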

Wed 9 Dec. 8:10 - 8:20 PST

Spotlight
Learning Linear Programs from Optimal Decisions

Yingcong Tan · Daria Terekhov · Andrew Delong

We propose a flexible gradient-based framework for learning linear programs from optimal decisions. Linear programs are often specified by hand, using prior knowledge of relevant costs and constraints. In some applications, linear programs must instead be learned from observations of optimal decisions. Learning from optimal decisions is a particularly challenging bilevel problem, and much of the related inverse optimization literature is dedicated to special cases. We tackle the general problem, learning all parameters jointly while allowing flexible parameterizations of costs, constraints, and loss functions. We also address challenges specific to learning linear programs, such as empty feasible regions and non-unique optimal decisions. Experiments show that our method successfully learns synthetic linear programs and minimum-cost multi-commodity flow instances for which previous methods are not directly applicable. We also provide a fast batch-mode PyTorch implementation of the homogeneous interior point algorithm, which supports gradients by implicit differentiation or backpropagation.
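
Not the paper's homogeneous interior-point layer: the toy example below illustrates the general idea of differentiating an optimal decision with respect to problem data. The LP is smoothed with a small quadratic term so that, with only equality constraints, the KKT conditions form a linear system, and autograd through torch.linalg.solve supplies the implicit gradients. All problem data are made up.

```python
# Learn the cost vector of a (smoothed) equality-constrained LP from an observed
# optimal decision by differentiating through the KKT linear system.
import torch

def solve_regularized_lp(c, A, b, eps=1e-2):
    """argmin_x  c.x + (eps/2)||x||^2  subject to  A x = b."""
    n, m = c.shape[0], b.shape[0]
    KKT = torch.cat([
        torch.cat([eps * torch.eye(n), A.T], dim=1),
        torch.cat([A, torch.zeros(m, m)], dim=1),
    ])
    rhs = torch.cat([-c, b])
    return torch.linalg.solve(KKT, rhs)[:n]    # optimal (smoothed) decision

c = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)   # unknown costs to learn
A = torch.tensor([[1.0, 1.0, 1.0]])
b = torch.tensor([1.0])
x_obs = torch.tensor([0.2, 0.1, 0.7])                   # observed optimal decision

opt = torch.optim.Adam([c], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = ((solve_regularized_lp(c, A, b) - x_obs) ** 2).sum()
    loss.backward()                            # gradients flow through the solve
    opt.step()
print(c.detach())
```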

Wed 9 Dec. 8:20 - 8:30 PST

Spotlight
Automatically Learning Compact Quality-aware Surrogates for Optimization Problems

Kai Wang · Bryan Wilder · Andrew Perrault · Milind Tambe

Solving optimization problems with unknown parameters often requires learning a predictive model to predict the values of the unknown parameters and then solving the problem using these values. Recent work has shown that including the optimization problem as a layer in the model training pipeline results in predictions of the unobserved parameters that lead to higher decision quality. Unfortunately, this process comes at a large computational cost because the optimization problem must be solved and differentiated through in each training iteration; furthermore, it may also sometimes fail to improve solution quality due to non-smoothness issues that arise when training through a complex optimization layer. To address these shortcomings, we learn a low-dimensional surrogate model of a large optimization problem by representing the feasible space in terms of meta-variables, each of which is a linear combination of the original variables. By training a low-dimensional surrogate model end-to-end, and jointly with the predictive model, we achieve: i) a large reduction in training and inference time; and ii) improved performance by focusing attention on the more important variables in the optimization and learning in a smoother space. Empirically, we demonstrate these improvements on a non-convex adversary modeling task, a submodular recommendation task and a convex portfolio optimization task.
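
A minimal sketch of the meta-variable idea with a placeholder task and architecture (not the paper's surrogate): decisions over n variables are re-parameterized through k << n meta-variables via a learned matrix P, the small inner problem is solved with a few differentiable gradient steps, and P is trained jointly with the predictive model against realized decision quality.

```python
# Jointly train a predictive model and a low-dimensional decision surrogate.
import torch

n_assets, k_meta, n_feat = 50, 5, 8
predict = torch.nn.Linear(n_feat, n_assets)                    # features -> predicted returns
P = torch.nn.Parameter(torch.randn(n_assets, k_meta) * 0.1)    # meta-variable basis
opt = torch.optim.Adam(list(predict.parameters()) + [P], lr=1e-3)

def decision(pred_returns, inner_steps=20, lr=0.5):
    y = torch.zeros(k_meta, requires_grad=True)                # meta-variables
    for _ in range(inner_steps):                               # differentiable inner solve
        x = torch.softmax(P @ y, dim=0)                        # decision on the simplex
        obj = pred_returns @ x - 0.1 * (x ** 2).sum()          # return minus a risk penalty
        (g,) = torch.autograd.grad(obj, y, create_graph=True)
        y = y + lr * g                                         # gradient ascent in meta-space
    return torch.softmax(P @ y, dim=0)

for _ in range(100):                                           # end-to-end training loop
    feats = torch.randn(n_feat)
    true_returns = torch.randn(n_assets)                       # stand-in for real data
    x = decision(predict(feats))
    loss = -(true_returns @ x)                                 # negative realized return
    opt.zero_grad()
    loss.backward()
    opt.step()
```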

Wed 9 Dec. 8:30 - 8:40 PST

Q&A
Joint Q&A for Preceding Spotlights

Wed 9 Dec. 8:40 - 9:00 PST

Break