Timezone: »

I Can’t Believe It’s Not Better! Bridging the gap between theory and empiricism in probabilistic machine learning
Jessica Forde · Francisco Ruiz · Melanie Fernandez Pradier · Aaron Schein · Finale Doshi-Velez · Isabel Valera · David Blei · Hanna Wallach

Sat Dec 12 04:45 AM -- 02:45 PM (PST) @ None
Event URL: https://i-cant-believe-its-not-better.github.io/ »

We’ve all been there. A creative spark leads to a beautiful idea. We love the idea, we nurture it, and name it. The idea is elegant: all who hear it fawn over it. The idea is justified: all of the literature we have read supports it. But, lo and behold: once we sit down to implement the idea, it doesn’t work. We check our code for software bugs. We rederive our derivations. We try again and still, it doesn’t work. We Can’t Believe It’s Not Better [1].

In this workshop, we will encourage probabilistic machine learning researchers who Can’t Believe It’s Not Better to share their beautiful idea, tell us why it should work, and hypothesize why it does not in practice. We also welcome work that highlights pathologies or unexpected behaviors in well-established practices. This workshop will stress the quality and thoroughness of the scientific procedure, promoting transparency, deeper understanding, and more principled science.

Focusing on the probabilistic machine learning community will facilitate this endeavor, not only by gathering experts that speak the same language, but also by exploiting the modularity of probabilistic framework. Probabilistic machine learning separates modeling assumptions, inference, and model checking into distinct phases [2]; this facilitates criticism when the final outcome does not meet prior expectations. We aim to create an open-minded and diverse space for researchers to share unexpected or negative results and help one another improve their ideas.

Sat 4:45 a.m. - 5:00 a.m.
Intro (Welcome Intro)
Aaron Schein, Melanie F. Pradier
Sat 5:00 a.m. - 5:30 a.m.

Two years ago we embarked on a project called LIAR. LIAR was going to quantify uncertainty of a network through interval arithmetic (IA) calculations (which are an official IEEE standard). IA has the beautiful property that the answer of your computation is guaranteed to lie in a computed interval, and as such quantifies very precisely the numerical precision of your computation. Captured by this elegant idea we applied this to neural networks. In particular, the idea was to add a regularization term to the objective that would try to keep the interval of the network’s output small. This is particularly interesting in the context of quantization, where we quite naturally have intervals for the weights, activations and inputs due to their limited precision. By training a full precision neural network with intervals that represent the quantization error, and by encouraging the network to keep the resultant variation in the predictions small, we hoped to learn networks that were inherently robust to quantization noise. So far the good news. In this talk I will try to reconstruct the process of how the project ended up on the scrap pile. I will also try to produce some “lessons learned” from this project and hopefully deliver some advice for those who are going through a similar situation. I still can’t believe it didn’t work better ;-)

Max Welling
Sat 5:30 a.m. - 6:00 a.m.

This talk presents an overview of probabilistic graphical modelling as a strategy for understanding heterogeneous subgroups of patients. The identification of such subgroups may elucidate underlying causal mechanisms which may lead to more targeted treatment and intervention strategies. We will look at (1) the ideal of personalisation within the context of machine learning for healthcare (2) “From the ideal to the reality” and (3) some of the possible pathways to progress for making the ideal of personalised healthcare to reality. The last part of this talk focuses on the pipeline of personalisation and looks at probabilistic graphical models are part of a pipeline.

Danielle Belgrave
Sat 6:00 a.m. - 6:30 a.m.

This talk considers adding supervision to well-known generative latent variable models (LVMs), including both classic LVMs (e.g. mixture models, topic models) and more recent “deep” flavors (e.g. variational autoencoders). The standard way to add supervision to LVMs would be to treat the added label as another observed variable generated by the graphical model, and then maximize the joint likelihood of both labels and features. We find that across many models, this standard supervision leads to surprisingly negligible improvement in prediction quality over a more naive baseline that first fits an unsupervised model, and then makes predictions given that model’s learned low-dimensional representation. We can’t believe it is not better! Further, this problem is not properly solved by previous approaches that just upweight or “replicate” labels in the generative model (the problem is not just that we have more observed features than labels). Instead, we suggest the problem is related to model misspecification, and that the joint likelihood objective does not properly encode the desired performance goals at test time (we care about predicting labels from features, but not features from labels). This motivates a new training objective we call prediction constrained training, which can prioritize the label-from-feature prediction task while still delivering reasonable generative models for the observed features. We highlight promising results of our proposed prediction-constrained framework including recent extensions to semi-supervised VAEs and model-based reinforcement learning.

Mike Hughes
Sat 6:30 a.m. - 6:33 a.m.

The deep Gaussian mixture model (DGMM) is a framework directly inspired by the finite mixture of factor analysers model (MFA) and the deep learning architecture composed of multiple layers. The MFA is a generative model that considers a data point as arising from a latent variable (termed the score) which is sampled from a standard multivariate Gaussian distribution and then transformed linearly. The linear transformation matrix (termed the loading matrix) is specific to a component in the finite mixture. The DGMM consists of stacking MFA layers, in the sense that the latent scores are no longer assumed to be drawn from a standard Gaussian, but rather are drawn from a mixture of factor analysers model. Thus the latent scores are at one point considered to be the input of an MFA and also to have latent scores themselves. The latent scores of the DGMM's last layer only are considered to be drawn from a standard multivariate Gaussian distribution. In recent years, the DGMM gained prominence in the literature: intuitively, this model should be able to capture distributions more precisely than a simple Gaussian mixture model. We show in this work that while the DGMM is an original and novel idea, in certain cases it is challenging to infer its parameters. In addition, we give some insights to the probable reasons of this difficulty. Experimental results are provided on github: https://github.com/ansubmissions/ICBINB, alongside an R package that implements the algorithm and a number of ready-to-run R scripts.

Margot Selosse
Sat 6:33 a.m. - 6:36 a.m.

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. Data science folk wisdom tells us that a finite mixture model (FMM) with a prior on the number of components will fail to recover the true, data-generating number of components under model misspecification. But practitioners still widely use FMMs to learn the number of components, and statistical machine learning papers can be found recommending such an approach. Increasingly, though, data science papers suggest potential alternatives beyond vanilla FMMs, such as power posteriors, coarsening, and related methods. In this work we start by adding rigor to folk wisdom and proving that, under even the slightest model misspecification, the FMM component-count posterior diverges: the posterior probability of any particular finite number of latent components converges to 0 in the limit of infinite data. We use the same theoretical techniques to show that power posteriors with fixed power face the same undesirable divergence, and we provide a proof for the case where the power converges to a non-zero constant. We illustrate the practical consequences of our theory on simulated and real data. We conjecture how our methods may be applied to lend insight into other component-count robustification techniques.

Diana Cai
Sat 6:36 a.m. - 6:39 a.m.

The power of neural networks lies in their ability to generalize to unseen data, yet the underlying reasons for this phenomenon remain elusive. Numerous rigorous attempts have been made to explain generalization, but available bounds are still quite loose, and analysis does not always lead to true understanding. The goal of this work is to make generalization more intuitive. Using visualization methods, we discuss the mystery of generalization, the geometry of loss landscapes, and how the curse (or, rather, the blessing) of dimensionality causes optimizers to settle into minima that generalize well.

W. Ronny Huang
Sat 6:39 a.m. - 6:42 a.m.

We identify a fundamental drawback of natural extensions of Upper Confidence Bound (UCB) algorithms to the multi-agent bandit problem in which multiple agents facing the same explore-exploit problem can share information. We provide theoretical guarantees that when agents use a natural extension of the UCB sampling rule, sharing information about the optimal option degrades their performance. For K the number of agents and T the time horizon, we prove that when agents share information only about the optimal option they suffer an expected group cumulative regret of O(KlogT + KlogK), whereas when they do not share any information they only suffer a group regret of O(KlogtT). Further, while information sharing about all options yields much better performance than with no information sharing, we show that including information about the optimal option is not as good as sharing information only about suboptimal options.

Udari Madhushani
Sat 6:42 a.m. - 6:45 a.m.

Selective classification, in which models are allowed to abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five datasets from computer vision and NLP. Surprisingly, increasing the abstention rate can even decrease accuracies on some groups. To better understand when selective classification improves or worsens accuracy on a group, we study its margin distribution, which captures the model’s confidences over all predictions. For example, when the margin distribution is symmetric, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we term left-log-concavity. Our analysis also shows that selective classification tends to magnify accuracy disparities that are present at full coverage. Fortunately, we find that it uniformly improves each group when applied to distributionally-robust models that achieve similar full-coverage accuracies across groups. Altogether, our results imply selective classification should be used with care and underscore the importance of models that perform equally well across groups at full coverage.

Erik Jones
Sat 6:45 a.m. - 6:48 a.m.

Recent advances in modeling multiagent trajectories combine graph architectures such as graph neural networks (GNNs) with conditional variational models (CVMs) such as variational RNNs (VRNNs). Originally, CVMs have been proposed to facilitate learning with multi-modal and structured data and thus seem to perfectly match the requirements of multi-modal multiagent trajectories with their structured output spaces. Empirical results of VRNNs on trajectory data support this assumption. In this paper, we revisit experiments and proposed architectures with additional rigour, ablation runs and baselines. In contrast to common belief, we show that both, historic and current results with CVMs on trajectory data are misleading. Given a neural network with a graph architecture and/or structured output function, variational autoencoding does not contribute statistically significantly to empirical performance. Instead, we show that well-known emission functions do contribute, while coming with less complexity, engineering and computation time.

Yannick Rudolph
Sat 6:50 a.m. - 7:00 a.m.
Coffee Break (Gather.town available: https://bit.ly/3gxkLA7) (Coffee Break)
Sat 7:00 a.m. - 8:00 a.m.

Link to access the gather town: https://bit.ly/3gxkLA7

Sat 8:00 a.m. - 8:15 a.m.

Thanks to the tractability of their likelihood, some deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for out-of-distribution detection relies on strong and implicit hypotheses and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.

Charline Le Lan
Sat 8:15 a.m. - 8:30 a.m.

The learning and evaluation of energy-based latent variable models (EBLVMs) without any structural assumptions are highly challenging, because the true posteriors and the partition functions in such models are generally intractable. This paper presents variational estimates of the score function and its gradient with respect to the model parameters in a general EBLVM, referred to as VaES and VaGES respectively. The variational posterior is trained to minimize a certain divergence to the true model posterior and the bias in both estimates can be bounded by the divergence theoretically. With a minimal model assumption, VaES and VaGES can be applied to the kernelized Stein discrepancy (KSD) and score matching (SM)-based methods to learn EBLVMs. Besides, VaES can also be used to estimate the exact Fisher divergence between the data and general EBLVMs.

Fan Bao
Sat 8:30 a.m. - 8:45 a.m.

Bayesian Reinforcement Learning (BRL) offers a decision-theoretic solution to the reinforcement learning problem. While ''model-based'' BRL algorithms have focused either on maintaining a posterior distribution on models, BRL ''model-free'' methods try to estimate value function distributions but make strong implicit assumptions or approximations. We describe a novel Bayesian framework, \emph{inferential induction}, for correctly inferring value function distributions from data, which leads to a new family of BRL algorithms. We design an algorithm, Bayesian Backwards Induction (BBI), with this framework. We experimentally demonstrate that BBI is competitive with the state of the art. However, its advantage relative to existing BRL model-free methods is not as great as we have expected, particularly when the additional computational burden is taken into account.

Emilio Jorge
Sat 9:00 a.m. - 10:00 a.m.
Lunch Break (Gather.town available: https://bit.ly/3gxkLA7) (Lunch)
Sat 10:00 a.m. - 10:30 a.m.

We can’t fit the models we want to fit because it takes too long to fit them on our computer. Also, we don’t know what models we want to fit until we try a few. I share some stories of struggles with data-partitioning and parameter-partitioning algorithms, what kinda worked and what didn’t.

Andrew Gelman
Sat 10:30 a.m. - 11:00 a.m.

In the pre-AlexNet days of deep learning, second-order optimization gave dramatic speedups and enabled training of deep architectures that seemed to be inaccessible to first-order optimization. But today, despite algorithmic advances such as K-FAC, nearly all modern neural net architectures are trained with variants of SGD and Adam. What’s holding us back from using second-order optimization? I’ll discuss three challenges to applying second-order optimization to modern neural nets: difficulty of implementation, implicit regularization effects of gradient descent, and the effect of gradient noise. All of these factors are significant, though not in the ways commonly believed.

Roger Grosse
Sat 11:00 a.m. - 11:30 a.m.

While deep learning has demonstrable success on many tasks, the point estimates provided by standard deep models can lead to overfitting and provide no uncertainty quantification on predictions. However, when models are applied to critical domains such as autonomous driving, precision health care, or criminal justice, reliable measurements of a model’s predictive uncertainty may be as crucial as correctness of its predictions. In this talk, we examine a number of deep (Bayesian) models that promise to capture complex forms for predictive uncertainties, we also examine metrics commonly used to such uncertainties. We aim to highlight strengths and limitations of these models as well as the metrics; we also discuss ideas to improve both in meaningful ways for downstream tasks.

Weiwei Pan
Sat 11:30 a.m. - 11:33 a.m.

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, there has been recent controversy over the question whether they might be to blame for the undesirable cold posterior effect. We study this question empirically and find that for densely connected networks, Gaussian priors are indeed less well suited than more heavy-tailed ones. Conversely, for convolutional architectures, Gaussian priors seem to perform well and thus cannot fully explain the cold posterior effect. These findings coincide with the empirical maximum-likelihood weight distributions discovered by standard gradient-based training.

Vincent Fortuin
Sat 11:33 a.m. - 11:36 a.m.

The recent, counter-intuitive discovery that deep generative models (DGMs) can frequently assign a higher likelihood to outliers has implications for both outlier detection applications as well as our overall understanding of generative modeling. In this work, we present a possible explanation for this phenomenon, starting from the observation that a model's typical set and high-density region may not coincide. From this vantage point we propose a novel outlier test, the empirical success of which suggests that the failure of existing likelihood-based outlier tests does not necessarily imply that the corresponding generative model is uncalibrated. We also conduct additional experiments to help disentangle the impact of low-level texture versus high-level semantics in differentiating outliers. In aggregate, these results suggest that modifications to the standard evaluation practices and benchmarks commonly applied in the literature are needed.

Ziyu Wang
Sat 11:36 a.m. - 11:39 a.m.

Reducing bias while learning and inference is an important requirement to achieve generalizable and better performing models. The method of stacking took the first step towards creating such models by reducing inference bias but the question of combining stacking with a model that reduces learning bias is still largely unanswered. In statistical relational learning, ensemble models of relational trees such as boosted relational dependency networks (RDN-Boost) are shown to reduce the learning bias. We combine RDN-Boost and stacking methods with the aim of reducing both learning and inference bias subsequently resulting in better overall performance. However, our evaluation on three relational data sets shows no significant performance improvement over the baseline models.

Siwen Yan
Sat 11:39 a.m. - 11:42 a.m.

Recent advancements in deep generative modeling make it possible to learn prior distributions from complex data that subsequently can be used for Bayesian inference. However, we find that distributions learned by deep generative models for audio signals do not exhibit the right properties that are necessary for tasks like audio source separation using a probabilistic approach. We observe that the learned prior distributions are either discriminative and extremely peaked or smooth and non-discriminative. We quantify this behavior for two types of deep generative models on two audio datasets.

Maurice Frank
Sat 11:42 a.m. - 11:45 a.m.

In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the practitioners on the model's performance. A common method to treat imbalanced datasets is under- and oversampling. In this process, samples are either removed from the majority class or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning. We take proposals of deep generative models, and study the ability of these approaches to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that the improvements in terms of performance metric, while shown to be significant when ranking the methods like in the literature, often are minor in absolute terms, especially compared to the required effort. Furthermore, we notice that a large part of the improvement is due to undersampling, not oversampling.

Ramiro Camino
Sat 11:45 a.m. - 11:48 a.m.

Actor-Critic methods are a prominent class of modern reinforcement learning algorithms based on the classic Policy Iteration procedure. Despite many successful cases, Actor-Critic methods tend to require a gigantic number of experiences and can be very unstable. Recent approaches have advocated learning and using a world model to improve sample efficiency and reduce reliance on the value function estimate. However, learning an accurate dynamics model of the world remains challenging, often requiring computationally costly and data-hungry models. More recent work has shown that learning an everywhere accurate model is unnecessary and often detrimental to the overall task; instead, the agent should improve the world model on task-critical regions. For example, in Iterative Value-Aware Model Learning, the authors extend model-based value iteration by incorporating the value function (estimate) into the model loss function, showing the novel model objective reflects improved performance in the end task. Therefore, it seems natural to expect that model-based Actor-Critic methods can benefit equally from learning value-aware models, improving overall task performance, or reducing the need for large, expensive models. However, we show empirically that combining Actor-Critic and value-aware model learning can be quite difficult and that naive approaches such as maximum likelihood estimation often achieve superior performance with less computational cost. Our results suggest that, despite theoretical guarantees, learning a value-aware model in continuous domains does not ensure better performance on the overall task.

Ângelo Lovatto
Sat 11:50 a.m. - 12:00 p.m.
Coffee Break (Gather.town available: https://bit.ly/3gxkLA7) (Break)
Sat 12:00 p.m. - 12:15 p.m.

Bayesian nonparametric models based on completely random measures (CRMs) offers flexibility when the number of clusters or latent components in a data set is unknown. However, managing the infinite dimensionality of CRMs often leads to slow computation during inference. Practical inference typically relies on either integrating out the infinite-dimensional parameter or using a finite approximation: a truncated finite approximation (TFA) or an independent finite approximation (IFA). The atom weights of TFAs are constructed sequentially, while the atoms of IFAs are independent, which facilitates more convenient inference schemes. While the approximation error of TFA has been systematically addressed, there has not yet been a similar study of IFA. We quantify the approximation error between IFAs and two common target nonparametric priors (beta-Bernoulli process and Dirichlet process mixture model) and prove that, in the worst-case, TFAs provide more component-efficient approximations than IFAs. However, in experiments on image denoising and topic modeling tasks with real data, we find that the error of Bayesian approximation methods overwhelms any finite approximation error, and IFAs perform very similarly to TFAs.

Stan Nguyen
Sat 12:15 p.m. - 12:30 p.m.

Standard first-order stochastic optimization algorithms base their updates solely on the average mini-batch gradient, and it has been shown that tracking additional quantities such as the curvature can help de-sensitize common hyperparameters. Based on this intuition, we explore the use of exact per-sample Hessian-vector products and gradients to construct optimizers that are self-tuning and hyperparameter-free. Based on a dynamics model of the gradient, we derive a process which leads to a curvature-corrected, noise-adaptive online gradient estimate. The smoothness of our updates makes it more amenable to simple step size selection schemes, which we also base off of our estimates quantities. We prove that our model-based procedure converges in the noisy quadratic setting. Though we do not see similar gains in deep learning tasks, we can match the performance of well-tuned optimizers and ultimately, this is an interesting step for constructing self-tuning optimizers.

Ricky T. Q. Chen
Sat 12:30 p.m. - 12:45 p.m.

Modern deep learning is primarily an experimental science, in which empirical advances occasionally come at the expense of probabilistic rigor. Here we focus on one such example; namely the use of the categorical cross-entropy loss to model data that is not strictly categorical, but rather takes values on the simplex. This practice is standard in neural network architectures with label smoothing and actor-mimic reinforcement learning, amongst others. Drawing on the recently discovered {continuous-categorical} distribution, we propose probabilistically-inspired alternatives to these models, providing an approach that is a more principled and theoretically appealing. Through careful experimentation, including an ablation study, we identify the potential for outperformance in these models, thereby highlighting the importance of a proper probabilistic treatment, as well as illustrating some of the failure modes thereof.

Elliott Gordon-Rodriguez
Sat 12:45 p.m. - 1:45 p.m.

Link to access the Gather.town: https://neurips.gather.town/app/5163xhrHdSWrUZsG/ICBINB

Sat 1:15 p.m. - 1:45 p.m.

Link to access the Gather.town: https://neurips.gather.town/app/5163xhrHdSWrUZsG/ICBINB

Sat 1:45 p.m. - 2:45 p.m.

A panel discussion moderated by Hanna Wallach (MSR New York).

Panelists: -- Tamera Broderick (MIT) -- Laurent Dinh (Google Brain) -- Neil Lawrence (Cambridge) -- Kristian Lum (Human Rights Data Analysis Group) -- Sinead Williamson (UT Austin)

Tamara Broderick, Laurent Dinh, Neil Lawrence, Kristian Lum, Hanna Wallach, Sinead Williamson

Author Information

Jessica Forde (Brown University)
Francisco Ruiz (DeepMind)
Melanie Fernandez Pradier (Microsoft Research)
Aaron Schein (Columbia University)
Finale Doshi-Velez (Harvard)
Isabel Valera (Saarland University & MPI for Intelligent Systems)
David Blei (Columbia University)

David Blei is a Professor of Statistics and Computer Science at Columbia University, and a member of the Columbia Data Science Institute. His research is in statistical machine learning, involving probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference algorithms for massive data. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data. David has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), Blavatnik Faculty Award (2013), and ACM-Infosys Foundation Award (2013). He is a fellow of the ACM.

Hanna Wallach (Microsoft)

More from the Same Authors