Outstanding Paper
Michal Derezinski · Rajiv Khanna · Michael Mahoney

[ Orals & Spotlights: Learning Theory ]

The Column Subset Selection Problem (CSSP) and the Nystrom method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing. A fundamental question in this area is: how well can a data subset of size k compete with the best rank k approximation? We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees which go beyond the standard worst-case analysis. Our approach leads to significantly better bounds for datasets with known rates of singular value decay, e.g., polynomial or exponential decay.
Our analysis also reveals an intriguing phenomenon: the approximation factor as a function of k may exhibit multiple peaks and valleys, which we call a multiple-descent curve.
A lower bound we establish shows that this behavior is not an artifact of our analysis, but rather it is an inherent property of the CSSP and Nystrom tasks. Finally, using the example of a radial basis function (RBF) kernel, we show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.

Outstanding Paper
Andrea Celli · Alberto Marchesi · Gabriele Farina · Nicola Gatti

[ Orals & Spotlights: Learning Theory ]

The existence of simple, uncoupled no-regret dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form (that is, tree-form) games generalize normal-form games by modeling both sequential and simultaneous moves, as well as private information. Because of the sequential nature and presence of partial information in the game, extensive-form correlation has significantly different properties than the normal-form counterpart, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to normal-form correlated equilibrium. However, it was currently unknown whether EFCE emerges as the result of uncoupled agent dynamics. In this paper, we give the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. First, we introduce a notion of trigger regret in extensive-form games, which extends that of internal regret in normal-form games. When each player has low trigger regret, the empirical frequency of play …

Test of Time Award
Benjamin Recht · Christopher Ré · Stephen Wright · Feng Niu

[ Orals & Spotlights: Optimization ]

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.

Outstanding Paper
Andrea Celli · Alberto Marchesi · Gabriele Farina · Nicola Gatti

[ Poster Session 1 ]

The existence of simple, uncoupled no-regret dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form (that is, tree-form) games generalize normal-form games by modeling both sequential and simultaneous moves, as well as private information. Because of the sequential nature and presence of partial information in the game, extensive-form correlation has significantly different properties than the normal-form counterpart, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to normal-form correlated equilibrium. However, it was currently unknown whether EFCE emerges as the result of uncoupled agent dynamics. In this paper, we give the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. First, we introduce a notion of trigger regret in extensive-form games, which extends that of internal regret in normal-form games. When each player has low trigger regret, the empirical frequency of play …

Outstanding Paper
Michal Derezinski · Rajiv Khanna · Michael Mahoney

[ Poster Session 4 ]

The Column Subset Selection Problem (CSSP) and the Nystrom method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing. A fundamental question in this area is: how well can a data subset of size k compete with the best rank k approximation? We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees which go beyond the standard worst-case analysis. Our approach leads to significantly better bounds for datasets with known rates of singular value decay, e.g., polynomial or exponential decay.
Our analysis also reveals an intriguing phenomenon: the approximation factor as a function of k may exhibit multiple peaks and valleys, which we call a multiple-descent curve.
A lower bound we establish shows that this behavior is not an artifact of our analysis, but rather it is an inherent property of the CSSP and Nystrom tasks. Finally, using the example of a radial basis function (RBF) kernel, we show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.

Outstanding Paper
Tom B Brown · Benjamin Mann · Nick Ryder · Melanie Subbiah · Jared Kaplan · Prafulla Dhariwal · Arvind Neelakantan · Pranav Shyam · Girish Sastry · Amanda Askell · Sandhini Agarwal · Ariel Herbert-Voss · Gretchen M Krueger · Tom Henighan · Rewon Child · Aditya Ramesh · Daniel Ziegler · Jeffrey Wu · Clemens Winter · Chris Hesse · Mark Chen · Eric Sigler · Mateusz Litwin · Scott Gray · Benjamin Chess · Jack Clark · Christopher Berner · Sam McCandlish · Alec Radford · Ilya Sutskever · Dario Amodei

[ Orals & Spotlights: Language/Audio Applications ]

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.

Outstanding Paper
Tom B Brown · Benjamin Mann · Nick Ryder · Melanie Subbiah · Jared Kaplan · Prafulla Dhariwal · Arvind Neelakantan · Pranav Shyam · Girish Sastry · Amanda Askell · Sandhini Agarwal · Ariel Herbert-Voss · Gretchen M Krueger · Tom Henighan · Rewon Child · Aditya Ramesh · Daniel Ziegler · Jeffrey Wu · Clemens Winter · Chris Hesse · Mark Chen · Eric Sigler · Mateusz Litwin · Scott Gray · Benjamin Chess · Jack Clark · Christopher Berner · Sam McCandlish · Alec Radford · Ilya Sutskever · Dario Amodei

[ Poster Session 0 ]

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.