
Your Model is Wrong: Robustness and misspecification in probabilistic modeling
Diana Cai · Sameer Deshpande · Michael Hughes · Tamara Broderick · Trevor Campbell · Nick Foti · Barbara Engelhardt · Sinead Williamson

Tue Dec 14 04:55 AM -- 02:40 PM (PST)
Event URL: https://sites.google.com/view/robustbayes-neurips21/home

Probabilistic modeling is a foundation of modern data analysis -- due in part to the flexibility and interpretability of these methods -- and has been applied to numerous application domains, such as the biological sciences, social and political sciences, engineering, and health care. However, any probabilistic model relies on assumptions that are necessarily a simplification of complex real-life processes; thus, any such model is inevitably misspecified in practice. In addition, as data set sizes grow and probabilistic models become more complex, applying a probabilistic modeling analysis often relies on algorithmic approximations, such as approximate Bayesian inference, numerical approximations, or data summarization methods. Thus, in many cases, the approximations used for efficient computation lead to fitting a misspecified model by design (e.g., variational inference). Importantly, in some cases this misspecification still yields useful model inferences, but in others it may lead to misleading and potentially harmful inferences that are then used for important downstream tasks, such as scientific inference or policy decisions.

The goal of the workshop is to bring together researchers focused on methods, applications, and theory to outline some of the core problems in specifying and applying probabilistic models in modern data contexts, along with current state-of-the-art solutions. Participants will leave the workshop with (i) exposure to recent advances in the field, (ii) an idea of the current major challenges in the field, and (iii) an introduction to methods meeting these challenges. These goals will be accomplished through a series of invited and contributed talks, poster spotlights, and poster sessions, as well as ample time for discussion and live Q&A.

Tue 4:55 a.m. - 5:00 a.m.
Welcome remarks (Talk)
Diana Cai
Tue 5:00 a.m. - 5:30 a.m.
Invited Talk 1 (Talk)
Chris C Holmes
Tue 5:30 a.m. - 5:35 a.m.
Invited Talk 1 Q&A (Q&A)
Tue 5:35 a.m. - 6:05 a.m.
Invited Talk 2 (Talk)
Ilse Ipsen
Tue 6:05 a.m. - 6:10 a.m.
Invited Talk 2 Q&A (Q&A)
Tue 6:10 a.m. - 6:45 a.m.
Individual discussions in Gathertown (Gathertown discussion)
Tue 6:45 a.m. - 7:00 a.m.
Contributed talk 1 (Talk)
Michail Spitieris
Tue 7:00 a.m. - 7:15 a.m.
Contributed talk 2 (Talk)
Masha Naslidnyk
Tue 10:30 a.m. - 10:45 a.m.
Contributed talk 3 (Talk)
Maria Cervera
Tue 10:45 a.m. - 11:00 a.m.
Contributed talk 4 (Talk)
Jackson Killian
Tue 11:00 a.m. - 11:30 a.m.
Invited Talk 3 (Talk)
Andres Masegosa
Tue 11:30 a.m. - 11:35 a.m.
Invited Talk 3 Q&A (Q&A)
Tue 11:35 a.m. - 11:40 a.m.
Invited Talk 4 Q&A (Q&A)
Tue 12:00 p.m. - 12:15 p.m.
Contributed talk 5 (Talk)
Tue 12:15 p.m. - 12:30 p.m.
Contributed talk 6 (Talk)
Eli N Weinstein
Tue 12:30 p.m. - 1:00 p.m.
Invited Talk 4 (Talk)
Jonathan Huggins
Tue 1:30 p.m. - 2:00 p.m.
Invited Talk 5 (Talk)
Lester Mackey
Tue 2:00 p.m. - 2:05 p.m.
Invited Talk 5 Q&A (Q&A)
Tue 2:05 p.m. - 2:35 p.m.
Invited Talk 6 (Talk)
Yixin Wang
Tue 2:35 p.m. - 2:40 p.m.
Invited Talk 6 Q&A (Q&A)

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic - such as a subset of variables - that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback--Leibler divergence. We prove that the SVC is consistent for data selection. We apply the SVC to the analysis of single-cell RNA sequencing datasets using a spin glass model of gene regulation.

Eli N Weinstein, Jeffrey Miller
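
As a rough illustration of the kernelized Stein discrepancy that the SVC builds on, here is a minimal numpy sketch of the V-statistic estimator for a one-dimensional standard-normal model with an RBF kernel (the kernel, bandwidth, and model are assumptions for the example, not the authors' setup):

```python
import numpy as np

def ksd_vstat(x, score, h=1.0):
    """V-statistic estimate of the squared kernelized Stein discrepancy
    between samples x and the model with score function `score`,
    using an RBF kernel with bandwidth h."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    dkx = -d / h**2 * k                    # d/dx k(x, y)
    dky = d / h**2 * k                     # d/dy k(x, y)
    dkxy = (1 / h**2 - d**2 / h**4) * k    # d2/dxdy k(x, y)
    s = score(x)
    u = (s[:, None] * s[None, :] * k
         + s[:, None] * dky + s[None, :] * dkx + dkxy)
    return u.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                       # score of the model N(0, 1)
ksd_good = ksd_vstat(rng.normal(0, 1, 300), score)
ksd_bad = ksd_vstat(rng.normal(2, 1, 300), score)
print(ksd_good, ksd_bad)                   # the mismatched sample scores far worse
```

Because the discrepancy depends on the model only through its score function, no normalising constant (and hence no nonparametric background fit) is needed.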

Although neural networks are powerful function approximators, the underlying modelling assumptions ultimately define the likelihood and thus the model class they are parameterizing. In classification, these assumptions are minimal, as the commonly employed softmax is capable of representing any discrete distribution over a finite set of outcomes. In regression, however, restrictive assumptions are typically placed on the type of continuous distribution to be realized, such as the dominant choice of training via mean-squared error and its underlying Gaussianity assumption. Recently, modelling advances allow regression models to be agnostic to the type of continuous distribution to be modelled, granting regression the flexibility of classification models. While past studies stress the benefit of such flexible regression models in terms of performance, here we study the effect of the model choice on uncertainty estimation. We highlight that under model misspecification, aleatoric uncertainty is not properly captured, and that a Bayesian treatment of a misspecified model leads to unreliable epistemic uncertainty estimates. Overall, our study provides an overview of how modelling choices in regression may influence uncertainty estimation and thus any downstream decision-making process.

Maria Cervera, Rafael Dätwyler, Francesco D'Angelo, Hamza Keurti, Benjamin F. Grewe, Christian Henning

BayesBag has been established as a useful tool for robust Bayesian model selection. However, computing BayesBag can be prohibitively expensive for large datasets. Here, we propose a fast approximation of BayesBag model selection. This approximation, based on Taylor approximations of the log marginal likelihood, can achieve results comparable to BayesBag in a fraction of the time.

Neil Spencer, Jeffrey Miller

Ensembles are widely used in machine learning and usually provide state-of-the-art performance in many prediction tasks. From the very beginning, diversity of ensemble members has been identified as a key factor for the superior performance of an ensemble. But the exact role that diversity plays in an ensemble model is not fully understood and remains an open question. In this work, we employ a second-order PAC-Bayesian analysis to shed light on this problem in the context of neural network ensembles. More precisely, we provide sound theoretical answers to the following questions: how to measure diversity, how diversity relates to the generalization error, and how diversity can be promoted by ensemble learning algorithms. This analysis covers three widely used loss functions, namely the squared loss, the cross-entropy loss, and the 0-1 loss, and two widely used model combination strategies, namely model averaging and weighted majority vote. We empirically validate this theoretical analysis on ensembles of neural networks.

Luis Antonio Ortega Andrés, Andres Masegosa, Rafael Cabañas Cabañas
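
One classical way diversity enters for model averaging under the squared loss is the ambiguity decomposition (ensemble error = average member error minus diversity); the paper's second-order PAC-Bayesian analysis is far more general, but a numpy sketch of this exact identity (with hypothetical polynomial regressors standing in for network ensemble members) makes the role of diversity concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)

# Three hypothetical ensemble members: noisy polynomial fits of different degrees.
members = [np.polyval(np.polyfit(x, y + rng.normal(0, 0.3, x.size), d), x)
           for d in (1, 3, 5)]
preds = np.stack(members)
f_bar = preds.mean(axis=0)                     # model-averaged prediction

ens_err = np.mean((f_bar - y) ** 2)            # ensemble squared error
avg_err = np.mean((preds - y) ** 2)            # average member squared error
diversity = np.mean((preds - f_bar) ** 2)      # spread of members around the average

# Ambiguity decomposition: ensemble error = average error - diversity (exactly).
print(ens_err, avg_err - diversity)
```

The identity shows that, for averaging with squared loss, more diverse members can only help relative to the average member; for cross-entropy, 0-1 loss, and majority vote the relationship is subtler, which is where the PAC-Bayesian analysis comes in.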

This work proposes and evaluates a shared parameter model (SPM) to account for data being missing not at random (MNAR) for a predictive model based on a longitudinal population study. The aim is to model systolic blood pressure ten years ahead based on current observations and is inspired by and evaluated for data from the Nord-Trøndelag Health Study (HUNT). The proposed SPM consists of a linear model for the systolic blood pressure and a logistic model for the drop-out process connected through a shared random effect. To evaluate the SPM we compare the parameter estimates and predictions of the SPM with a naive linear Bayesian model using the same explanatory variables while ignoring the drop-out process. This corresponds to assuming data to be missing at random (MAR). In addition, a simulation study is performed in which the naive model and the SPM are tested on data with known parameters when missingness is assumed to be MNAR. The SPM indicates that participants with higher systolic blood pressure than expected from the explanatory variables at the time of the follow-up study have a higher probability of dropping out, suggesting that the data are MNAR. Further, the SPM and the naive model result in different parameter estimates for the explanatory variables. The simulation study validates that the SPM is identifiable for the estimates obtained by the predictive model based on the HUNT study.

Aurora Hofman

BCART (Bayesian Classification and Regression Trees) and BART (Bayesian Additive Regression Trees) are popular modern regression models. Their popularity is intimately tied to the ability to flexibly model complex responses depending on high-dimensional inputs while simultaneously being able to quantify uncertainties. However, surprisingly little work has been done to evaluate the sensitivity of these modern regression models to violations of modeling assumptions. In particular, we consider influential observations and propose methods for detecting influentials and adjusting predictions to not be unduly affected by such problematic data. We consider two detection diagnostics for Bayesian tree models, one an analogue of Cook's distance and the other taking the form of a divergence measure, and then propose an importance sampling algorithm to re-weight previously sampled posterior draws so as to remove the effects of influential data. Finally, our methods are demonstrated on real-world data where blind application of models can lead to poor predictions.

Matthew Pratola

Bayesian quadrature (BQ) is a model-based numerical integration method that is able to increase sample efficiency by encoding and leveraging known structure of the integration task at hand. In this paper, we explore priors that encode invariance of the integrand under a set of bijective transformations in the input domain, in particular some unitary transformations, such as rotations, axis-flips, or point symmetries. We show initial results demonstrating superior performance in comparison to standard Bayesian quadrature on several synthetic problems and one real-world application.

Masha Naslidnyk, Javier González, Maren Mahsereci
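
One standard way to build such an invariance into a GP/BQ prior is to symmetrize the kernel over the transformation group; the following minimal sketch (an assumption for illustration, not necessarily the authors' construction) encodes axis-flip invariance in 1-D, so that prior samples satisfy f(x) = f(-x):

```python
import numpy as np

def k_rbf(x, y, h=1.0):
    return np.exp(-(x - y) ** 2 / (2 * h ** 2))

def k_flip(x, y, h=1.0):
    # Average the RBF kernel over the group {identity, axis flip}.
    # The resulting prior puts all its mass on even functions, f(x) = f(-x).
    return 0.25 * (k_rbf(x, y, h) + k_rbf(-x, y, h)
                   + k_rbf(x, -y, h) + k_rbf(-x, -y, h))

print(k_flip(0.7, 1.3), k_flip(-0.7, 1.3))  # equal: the kernel is flip-invariant
```

An integrand evaluation at x then also informs the model about -x, which is where the gain in sample efficiency comes from.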

Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of inference methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. One set of tools which can help are goodness-of-fit tests, where we test whether a dataset could have been generated by a fixed distribution. Kernel-based tests have been developed for this problem, and these are popular due to their flexibility, strong theoretical guarantees, and ease of implementation in a wide range of scenarios. In this paper, we extend this line of work to the more challenging composite goodness-of-fit problem, where we are instead interested in whether the data comes from any distribution in some parametric family. This is equivalent to testing whether a parametric model is well-specified for the data.

Oscar Key, Tamara Fernandez, Arthur Gretton, Francois-Xavier Briol
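
To make the composite idea concrete, here is a numpy sketch of one kernel discrepancy (the squared MMD, V-statistic form) comparing heavy-tailed data against draws from the best-fitting member of a Gaussian family; the kernel, bandwidth, and estimate-then-compare recipe are simplifying assumptions for the example, not the paper's test:

```python
import numpy as np

def mmd2(x, y, h=1.0):
    """V-statistic (biased) estimate of the squared MMD with an RBF kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(2)
data = rng.standard_t(df=3, size=400)                # heavy-tailed "real" data

# Composite-test idea (sketch): first fit the parametric family to the data,
# then compare the data against draws from the fitted model.
fitted = rng.normal(data.mean(), data.std(), 400)    # best-fitting Gaussian
m_fit = mmd2(data, fitted)
print(m_fit)
```

A proper composite test must additionally account for the fact that the parameters were estimated from the same data, e.g. via a parametric bootstrap, which is part of what makes the composite problem harder than the fixed-distribution one.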

The Gaussian mixture model (GMM) is a widely used probabilistic model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this work, we examine the performance of both Expectation Maximization (EM) and Gradient Descent (GD) on unconstrained Gaussian mixture models when there is misspecification. Our simulation study reveals a previously unreported class of "inferior" clustering solutions, different from spurious solutions, that occurs due to asymmetry in the fitted component variances.

Siva Rajesh Kasa, Vaibhav Rajan
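
The misspecified setting the abstract describes can be reproduced in a few lines: below is a minimal numpy EM for an unconstrained two-component 1-D GMM, fit to heavy-tailed (t-distributed) clusters, so the Gaussian components are misspecified by construction (the data-generating choices here are assumptions for illustration, not the paper's simulation design):

```python
import numpy as np

def em_gmm2(x, iters=200):
    """EM for a two-component 1-D Gaussian mixture with unconstrained variances."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of weights, means, and variances.
        n_k = r.sum(axis=0)
        pi = n_k / x.size
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return pi, mu, var

rng = np.random.default_rng(3)
# Misspecified: two heavy-tailed t(3) clusters fit with Gaussian components.
x = np.concatenate([rng.standard_t(3, 500) - 3, rng.standard_t(3, 500) + 3])
pi, mu, var = em_gmm2(x)
print(pi, np.sort(mu), var)
```

With well-separated clusters EM still recovers the locations here; the inferior solutions the paper studies show up as asymmetric fitted variances, which one can probe by re-running from different initializations.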

Statistical tasks such as density estimation and approximate Bayesian inference often involve densities with unknown normalising constants. Score-based methods, including score matching, are popular techniques as they are free of normalising constants. Although these methods enjoy theoretical guarantees, a little-known fact is that they suffer from practical failure modes when the unnormalised distribution of interest has isolated components: they cannot discover isolated components or identify the correct mixing proportions between components. We demonstrate these findings using simple distributions and present heuristic attempts to address these issues. We hope to bring the attention of theoreticians and practitioners to these issues when developing new algorithms and applications.

Li Kevin Wenliang, Heishiro Kanagawa
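
The mixing-proportion failure has a simple mechanism that a few lines of numpy can show: near either component of a well-separated mixture, the score (the gradient of the log density) is essentially independent of the mixture weight, so any objective built only from scores in high-density regions cannot identify it (the specific two-Gaussian setup below is an illustrative assumption, in the spirit of the paper's simple distributions):

```python
import numpy as np

def score_mix(x, w, m=8.0):
    """Score d/dx log p(x) of the mixture w*N(-m,1) + (1-w)*N(m,1)."""
    p1 = w * np.exp(-(x + m) ** 2 / 2)
    p2 = (1 - w) * np.exp(-(x - m) ** 2 / 2)
    return (-(x + m) * p1 - (x - m) * p2) / (p1 + p2)

x = np.linspace(-10, -6, 5)      # points near the left component only
print(score_mix(x, w=0.2))
print(score_mix(x, w=0.8))       # nearly identical: the score "forgets" the weight
```

The weight only affects the score in the near-empty region between the components, which samples from the distribution almost never visit, so score matching fits both weights equally well.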

Markov chain Monte Carlo (MCMC) methods are a powerful tool in Bayesian computation. They provide asymptotically consistent estimates as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in sampling methods such as approximate MCMC, which trade off asymptotic consistency for improved computational speed. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our sample quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for high-dimensional linear regression and high-dimensional logistic regression.

Niloy Biswas, Lester Mackey
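
The coupling idea can be seen in a toy setting (a deliberately simplified assumption, not the paper's estimators): run an exact chain and a biased chain on shared randomness, and use the average gap between them as an empirical upper bound on the 1-Wasserstein distance between their limits. For Gaussian AR(1) chains targeting N(0,1) and N(b,1), the true W1 distance is exactly |b|:

```python
import numpy as np

rng = np.random.default_rng(4)
rho, b, T = 0.9, 0.5, 20_000
x = y = 0.0
gaps = []
for t in range(T):
    z = rng.normal() * np.sqrt(1 - rho ** 2)   # shared innovation (the coupling)
    x = rho * x + z                            # exact chain, targets N(0, 1)
    y = b + rho * (y - b) + z                  # "biased" chain, targets N(b, 1)
    if t > T // 2:                             # discard burn-in
        gaps.append(abs(x - y))
bound = np.mean(gaps)   # empirical upper bound on W1(N(0,1), N(b,1)) = |b|
print(bound)
```

Here the common-random-numbers coupling is so tight that the bound matches the true distance; for realistic samplers the bound is conservative but still computable from runs of the two chains alone.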

We introduce the semi-adversarial framework for sequential prediction with expert advice, where data are generated from distributions varying arbitrarily within an unknown constraint set. We quantify relaxations of the classical i.i.d. assumption along a spectrum induced by this framework, with i.i.d. sequences at one extreme and adversarial mechanisms at the other. The Hedge algorithm, which corresponds to using an expert-valued Bayesian power posterior to make decisions, was recently shown to be simultaneously optimal for both i.i.d. and adversarial data. We demonstrate that Hedge is suboptimal at all points of the spectrum in between these endpoints. Further, we introduce a novel algorithm and prove that it achieves the minimax optimal rate of regret at all points along the semi-adversarial spectrum, without advance knowledge of the constraint set. This algorithm corresponds to follow-the-regularized-leader, constructed by replacing the Shannon entropy regularizer of Hedge with the square root of the Shannon entropy.

Blair Bilodeau, Jeffrey Negrea, Daniel Roy
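
For readers unfamiliar with the baseline being improved on, the Hedge (exponential weights) algorithm fits in a dozen lines; this sketch uses the standard tuning for losses in [0, 1] (the synthetic loss sequence is an assumption for illustration):

```python
import numpy as np

def hedge(losses, eta):
    """Hedge / exponential weights. `losses` is a (T, K) array of per-expert
    losses in [0, 1]; returns the cumulative expected loss of the learner."""
    T, K = losses.shape
    w = np.ones(K) / K
    total = 0.0
    for t in range(T):
        total += w @ losses[t]               # expected loss of the random play
        w = w * np.exp(-eta * losses[t])     # multiplicative weight update
        w /= w.sum()
    return total

rng = np.random.default_rng(5)
T, K = 1000, 4
losses = rng.uniform(0, 1, (T, K))
losses[:, 0] *= 0.5                          # expert 0 is best on average
eta = np.sqrt(8 * np.log(K) / T)             # standard worst-case tuning
regret = hedge(losses, eta) - losses.sum(axis=0).min()
print(regret)                                # guaranteed at most sqrt((T/2) ln K)
```

The update is exactly a power posterior over experts with likelihood exp(-eta * loss); the paper's algorithm instead runs follow-the-regularized-leader with the square root of the Shannon entropy as regularizer, which is what closes the gap between the i.i.d. and adversarial extremes.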

Gaussian processes (GPs) are used to make medical and scientific decisions, including in cardiac care and monitoring of carbon dioxide emissions. But the choice of GP kernel is often somewhat arbitrary. In particular, uncountably many kernels typically align with qualitative prior knowledge (e.g. function smoothness or stationarity). But in practice, data analysts choose among a handful of convenient standard kernels (e.g. squared exponential). In the present work, we ask: Would decisions made with a GP differ under other, qualitatively interchangeable kernels? We show how to formulate this sensitivity analysis as a constrained optimization problem over a finite-dimensional space. We can then use standard optimizers to identify substantive changes in relevant decisions made with a GP. We demonstrate in both synthetic and real-world examples that decisions made with a GP can exhibit substantial sensitivity to kernel choice, even when prior draws are qualitatively interchangeable to a user.

Will Stephenson, Soumya Ghosh, Tin Nguyen, Mikhail Yurochkin, Sameer Deshpande, Tamara Broderick
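
The phenomenon is easy to reproduce in miniature: fit GP regression to the same data under two "qualitatively interchangeable" kernels and compare predictions away from the training points (the kernels, lengthscales, and data below are assumptions for the example; the paper's contribution is optimizing over a whole space of such kernels rather than comparing two by hand):

```python
import numpy as np

def gp_mean(x, y, xs, kern, noise=1e-4):
    """Posterior mean of a zero-mean GP with kernel `kern` at test points xs."""
    K = kern(x[:, None], x[None, :]) + noise * np.eye(x.size)
    Ks = kern(xs[:, None], x[None, :])
    return Ks @ np.linalg.solve(K, y)

rbf = lambda a, b: np.exp(-(a - b) ** 2 / 2)
matern32 = lambda a, b: (1 + np.sqrt(3) * np.abs(a - b)) \
                        * np.exp(-np.sqrt(3) * np.abs(a - b))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(x)
xs = np.linspace(-3, 3, 61)                  # includes extrapolation regions
gap = np.max(np.abs(gp_mean(x, y, xs, rbf) - gp_mean(x, y, xs, matern32)))
print(gap)   # both kernels interpolate the data, yet predictions differ elsewhere
```

Both kernels are smooth, stationary, and fit the five observations essentially exactly, yet their predictions diverge between and beyond the data, which is precisely the kind of decision sensitivity the paper quantifies.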

Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. Such models are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in misspecified settings. In this paper, we propose a novel approach based on the posterior bootstrap which gives a highly-parallelisable Bayesian inference algorithm for simulator-based models. Our approach is based on maximum mean discrepancy estimators, which also allows us to inherit their robustness properties.

Charita Dellaporta, Jeremias Knoblauch, Theodoros Damoulas, Francois-Xavier Briol

Restless multi-arm bandits (RMABs) are receiving renewed attention for their potential to model real-world planning problems under resource constraints. However, few RMAB models have surpassed theoretical interest, since they make the limiting assumption that model parameters are perfectly known. In the real world, model parameters often must be estimated via historical data or expert input, introducing uncertainty. In this light, we introduce a new paradigm, Robust RMABs, a challenging generalization of RMABs that incorporates interval uncertainty over parameters of the dynamic model of each arm. This uncovers several new challenges for RMABs and inspires new algorithmic techniques of general interest. Our contributions are: (i) we introduce the Robust Restless Bandit problem with interval uncertainty and solve a minimax regret objective; (ii) we tackle the complexity of the robust objective via a double oracle (DO) approach and analyze its convergence; (iii) to enable our DO approach, we introduce RMABPPO, a novel deep reinforcement learning (RL) algorithm for solving RMABs, of potential general interest; (iv) we design the first adversary algorithm for RMABs, required to implement the notoriously difficult minimax regret adversary oracle and also of general interest, by formulating it as a multi-agent RL problem and solving it with a multi-agent extension of RMABPPO.

Jackson Killian, Lily Xu, Arpita Biswas, Milind Tambe

We introduce a computationally efficient, data-driven framework suitable for quantifying the uncertainty in the physical parameters of computer models, represented by differential equations. We construct physics-informed priors for time-dependent differential equations, which are multi-output Gaussian process (GP) priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework which allows quantifying the uncertainty of physical parameters and model predictions. Since physical models are usually imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions we use Hamiltonian Monte Carlo (HMC) sampling.

This work is primarily motivated by the need for interpretable parameters for the hemodynamics of the heart for personalized treatment of hypertension. The model used is the arterial Windkessel model, which represents the hemodynamics of the heart through differential equations with physically interpretable parameters of medical interest. Like most physical models, the Windkessel model is an imperfect description of the real process. To demonstrate our approach, we simulate noisy data from a more complex physical model with known mathematical connections to our modeling choice. We show that without accounting for discrepancy, the posterior of the physical parameters deviates from the true value, while accounting for discrepancy gives reasonable quantification of physical parameter uncertainty and reduces the uncertainty in subsequent model predictions.

Michail Spitieris
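
For orientation, the simplest (two-element) Windkessel model is a single linear ODE, C dP/dt = Q(t) - P/R, relating arterial pressure P to inflow Q through compliance C and peripheral resistance R; a forward-Euler sketch with hypothetical parameter values and a hypothetical pulsatile inflow (not the paper's data or inference) looks like:

```python
import numpy as np

# Two-element Windkessel: C * dP/dt = Q(t) - P / R, with physically
# interpretable compliance C and resistance R (values here are hypothetical).
R, C = 1.0, 1.1
dt, T = 1e-3, 10.0
t = np.arange(0, T, dt)
# Hypothetical pulsatile inflow: a half-sine ejection each 0.8 s cardiac cycle.
Q = np.where((t % 0.8) < 0.3, 5.0 * np.sin(np.pi * (t % 0.8) / 0.3), 0.0)

P = np.empty_like(t)
P[0] = 80.0
for i in range(t.size - 1):
    P[i + 1] = P[i] + dt * (Q[i] - P[i] / R) / C   # forward Euler step

print(P.min(), P.max())   # pressure decays and settles into a periodic waveform
```

In the paper's framework this forward model sits inside a GP prior with a discrepancy term, so that R and C retain their physical meaning even though the ODE is an imperfect description of the circulation.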

Scientists have long recognized deficiencies in their models, particularly in those that seek to describe the full distribution of a set of data. Statistics is replete with ways to address these deficiencies, including adjusting the data (e.g., removing outliers), expanding the class of models under consideration, and the use of robust methods. In this work, we pursue a different path, searching for a recognizable portion of a model that is approximately correct and which aligns with the goal of inference. Once such a model portion has been found, traditional statistical theory applies and suggests effective methods. We illustrate this approach with linear discriminant analysis and show much better performance than one gets by ignoring the deficiency in the model or by working in a large enough space to capture the main deficiency in the model.

Jiae Kim, Steve MacEachern

There are two orthogonal paradigms for hyperparameter inference: either make a joint estimation in a larger hierarchical Bayesian model, or optimize the tuning parameter with respect to cross-validation metrics. Both are limited: the “full Bayes” strategy is conceptually unjustified in misspecified models and may severely under- or over-fit observations; the cross-validation strategy, besides its computational cost, typically results in a point estimate, ignoring the uncertainty in hyperparameters. To bridge the two extremes, we present a general paradigm: a full-Bayes model on top of the cross-validated log likelihood. This prediction-aware approach incorporates additional regularization during hyperparameter tuning and facilitates Bayesian workflows in many otherwise black-box learning algorithms. We develop theoretical justification and discuss its application in a model averaging example.

Yuling Yao, Aki Vehtari
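
A minimal numpy sketch of the "full Bayes on top of the CV log likelihood" idea, for the hypothetical problem of choosing a kernel-density bandwidth (the prior, grid, and example are assumptions for illustration, not the paper's construction): treat the leave-one-out log likelihood plus a log prior as an unnormalized log posterior over the hyperparameter, rather than just reporting its maximizer.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0, 1, 200)

def loo_loglik(x, h):
    """Leave-one-out log likelihood of a Gaussian KDE with bandwidth h."""
    d2 = (x[:, None] - x[None, :]) ** 2
    k = np.exp(-d2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)                      # leave each point out
    return np.log(k.sum(axis=1) / (x.size - 1)).sum()

hs = np.linspace(0.05, 1.5, 60)
cv = np.array([loo_loglik(x, h) for h in hs])

# Pseudo-posterior over the hyperparameter: CV log likelihood + log prior.
log_prior = -0.5 * np.log(hs) ** 2                # hypothetical log-normal prior
post = np.exp(cv + log_prior - (cv + log_prior).max())
post /= post.sum()
h_mode = hs[post.argmax()]
print(h_mode)      # point estimate, with full uncertainty available in `post`
```

Compared with plain cross-validation, the whole vector `post` is retained, so downstream predictions can average over bandwidths instead of conditioning on a single tuned value.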

A Bayesian hierarchical model (BHM) is typically formulated by specifying the data model, the parameter model, and the prior distributions. The posterior inference of a BHM depends both on the model specification and on the computation algorithm used. The most straightforward way to test the reliability of a BHM inference is to compare the posterior distributions with the ground truth value of the model parameters, when available. However, when dealing with experimental data, the true value of the underlying parameters is typically unknown. In these situations, numerical experiments based on synthetic datasets generated from the model itself offer a natural approach to check model performance and posterior estimates. Surprisingly, validation of BHMs with high-dimensional parameter spaces and non-Gaussian distributions is unexplored. In this paper, we show how to test the reliability of a BHM. We introduce a change in the model assumptions to allow for prior contamination and develop a simulation-based evaluation framework to assess the reliability of the inference of a given BHM. We illustrate our approach on a specific BHM used for the analysis of Single-cell Sequencing Data (BASiCS).

Sijia Li

Structural econometric models are used to combine economic theory and data to estimate parameters and counterfactuals, e.g. the effect of a policy change. These models typically make functional form assumptions, e.g. the distribution of latent variables. I propose a framework to characterize the sensitivity of structural estimands with respect to misspecification of distributional assumptions of the model. Specifically, I characterize the lower and upper bounds on the estimand as the assumption is perturbed infinitesimally on the tangent space and locally in a neighborhood of the model's assumption. I compute bounds by finding the gradient of the estimand, and integrate these iteratively to construct the gradient flow curve through neighborhoods of the model's assumption. My framework covers models with general smooth dependence on the distributional assumption, allows sensitivity perturbations over neighborhoods described by a general metric, and is computationally tractable; in particular, it is not required to re-solve the model under any alternative distributional assumption. I illustrate the framework with an application to the Rust (1987) model of optimal replacement of bus engines.

Yaroslav Mukhin

Generalised Bayesian inference updates prior beliefs using a loss function rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo.

Takuo Matsubara, Jeremias Knoblauch, Francois-Xavier Briol, Chris Oates
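
The general recipe, a posterior proportional to prior times exp(-loss) instead of prior times likelihood, can be sketched on a grid; for a self-contained example the Stein discrepancy is swapped for a Huber loss (an assumption purely for illustration, since evaluating a Stein discrepancy is more involved), which already shows the robustness payoff of loss-based updating:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(1.0, 1.0, 100)
x[:5] = 15.0                             # gross outliers: the Gaussian likelihood is misspecified

thetas = np.linspace(-2, 4, 601)         # grid over the location parameter
log_prior = -0.5 * thetas ** 2 / 10.0    # N(0, 10) prior

def gen_post_mode(loss, beta=1.0):
    """Mode of the generalised posterior  prior(theta) * exp(-beta * loss(theta))."""
    lp = log_prior - beta * loss
    return thetas[lp.argmax()]

resid = x[:, None] - thetas
nll = 0.5 * (resid ** 2).sum(axis=0)     # Gaussian negative log likelihood
huber = np.where(np.abs(resid) < 1.0,    # robust Huber loss
                 0.5 * resid ** 2,
                 np.abs(resid) - 0.5).sum(axis=0)

mode_lik, mode_rob = gen_post_mode(nll), gen_post_mode(huber)
print(mode_lik, mode_rob)                # the loss-based posterior shrugs off the outliers
```

The standard posterior mode is dragged toward the outliers while the generalised posterior stays near the true location of 1.0; the paper's choice of a Stein discrepancy loss additionally removes any dependence on the normalisation constant.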

Variational autoencoders (VAEs) have been successfully applied to complex input data such as images and videos. Counterintuitively, their application to simpler, heterogeneous data, where features are of different types, often leads to underwhelming results. While the goal in the heterogeneous case is to accurately approximate all observed features, VAEs often perform poorly on a subset of them. In this work, we study this feature overlooking problem through the lens of multitask learning (MTL), relating it to the problem of negative transfer and the interaction between gradients from different features. With these new insights, we propose to train VAEs by leveraging off-the-shelf solutions from the MTL literature based on multi-objective optimization. Furthermore, we empirically demonstrate how these solutions significantly boost the performance of different VAE models and training objectives on a large variety of heterogeneous datasets.

Adrián Javaloy, Maryam Meghdadi, Isabel Valera

The Bayesian posterior minimizes the "inferential risk," which itself bounds the "predictive risk." This bound is tight when the likelihood and prior are well specified. However, since misspecification induces a gap, the Bayesian posterior predictive distribution may have poor generalization performance. This work develops a multi-sample loss (PAC^m) which can close the gap by spanning a trade-off between the two risks. The loss is computationally favorable and offers PAC generalization guarantees. An empirical study demonstrates improvement to the predictive distribution.

Joshua V Dillon, Warren R Morningstar, Alex Alemi

Author Information

Diana Cai (Princeton University)
Sameer Deshpande (Wharton Statistics)
Mike Hughes (Tufts University)
Tamara Broderick (MIT)
Trevor Campbell (UBC)
Nick Foti (Apple & University of Washington)
Barbara Engelhardt (Princeton University)
Sinead Williamson (University of Texas at Austin)
