Workshop
Your Model is Wrong: Robustness and misspecification in probabilistic modeling
Diana Cai · Sameer Deshpande · Michael Hughes · Tamara Broderick · Trevor Campbell · Nick Foti · Barbara Engelhardt · Sinead Williamson
Probabilistic modeling is a foundation of modern data analysis, due in part to the flexibility and interpretability of these methods, and has been applied to numerous application domains, such as the biological sciences, social and political sciences, engineering, and health care. However, any probabilistic model relies on assumptions that are necessarily a simplification of complex real-life processes; thus, any such model is inevitably misspecified in practice. In addition, as data set sizes grow and probabilistic models become more complex, applying a probabilistic modeling analysis often relies on algorithmic approximations, such as approximate Bayesian inference, numerical approximations, or data summarization methods. Thus, in many cases, approximations used for efficient computation lead to fitting a misspecified model by design (e.g., variational inference). Importantly, in some cases, this misspecification leads to useful model inferences, but in others it may lead to misleading and potentially harmful inferences that may then be used for important downstream tasks, e.g., making scientific inferences or policy decisions.
The goal of the workshop is to bring together researchers focused on methods, applications, and theory to outline some of the core problems in specifying and applying probabilistic models in modern data contexts along with current state-of-the-art solutions. Participants will leave the workshop with (i) exposure to recent advances in the field, (ii) an idea of the current major challenges in the field, and (iii) an introduction to methods meeting these challenges. These goals will be accomplished through a series of invited and contributed talks, poster spotlights, poster sessions, as well as ample time for discussion and live Q&A.
Schedule
Tue 4:55 a.m. - 5:00 a.m.

Welcome remarks (Talk)
Diana Cai
Tue 5:00 a.m. - 5:30 a.m.

How to train your model when it's wrong: Bayesian nonparametric learning in M-open (Invited Talk)
Chris C Holmes
Tue 5:30 a.m. - 5:35 a.m.

Invited Talk 1 Q&A (Q&A)
Chris C Holmes
Tue 5:35 a.m. - 6:05 a.m.

BayesCG: A probabilistic numeric linear solver (Invited Talk)
We present the probabilistic numeric solver BayesCG, for solving linear systems with real symmetric positive definite coefficient matrices. BayesCG is an uncertainty-aware extension of the conjugate gradient (CG) method that performs solution-based inference with Gaussian distributions to capture the uncertainty in the solution due to early termination. Under a structure-exploiting 'Krylov' prior, BayesCG produces the same iterates as CG. The Krylov posterior covariances have low rank, and are maintained in factored form to preserve symmetry and positive semidefiniteness. This allows efficient generation of accurate samples to probe uncertainty in subsequent computation.
Speaker bio: Ilse C.F. Ipsen received a BS from the University of Kaiserslautern in Germany and a Ph.D. from Penn State, both in Computer Science. She is a Professor of Mathematics at NC State, with affiliate appointments in Statistics and the Institute for Advanced Analytics. Her research interests include numerical linear algebra, randomized algorithms, and probabilistic numerics. She is a Fellow of the AAAS and SIAM.
Ilse Ipsen
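For readers unfamiliar with the conjugate gradient method that the BayesCG abstract builds on, a minimal sketch of plain CG follows. It is not BayesCG: it carries no Gaussian prior or posterior covariance. The function names, the tolerance, and the 2x2 example system are illustrative assumptions. The sketch only shows that stopping early leaves a nonzero residual, the kind of error whose uncertainty the abstract says BayesCG is designed to capture.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, max_iters=None, tol=1e-10):
    """Plain CG for a symmetric positive definite matrix A.

    Returns the current iterate and the norm of its residual b - A x.
    Illustrative only; this is not the BayesCG solver from the talk.
    """
    n = len(b)
    if max_iters is None:
        max_iters = n
    x = [0.0] * n
    r = list(b)            # residual for the starting guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction, conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, rs_old ** 0.5


# Hypothetical 2x2 symmetric positive definite system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_early, res_early = conjugate_gradient(A, b, max_iters=1)  # stopped early
x_full, res_full = conjugate_gradient(A, b)                 # run for n steps
```

Comparing `res_early` with `res_full` shows the gap left by early termination; a probabilistic solver would attach a distribution to that gap instead of reporting a single point estimate.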
Tue 6:05 a.m. - 6:10 a.m.

Invited Talk 2 Q&A (Q&A)
Ilse Ipsen
Tue 6:10 a.m. - 6:45 a.m.

Individual discussions in Gathertown (Gathertown discussion)
Tue 6:45 a.m. - 7:00 a.m.

Bayesian Calibration of imperfect computer models using Physics-informed priors (Contributed Talk)
We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters of computer models represented by differential equations. We construct physics-informed priors for time-dependent differential equations, which are multi-output Gaussian process (GP) priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework which allows quantifying the uncertainty of physical parameters and model predictions. Since physical models are usually imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions we use Hamiltonian Monte Carlo (HMC) sampling. This work is primarily motivated by the need for interpretable parameters for the hemodynamics of the heart for personal treatment of hypertension. The model used is the arterial Windkessel model, which represents the hemodynamics of the heart through differential equations with physically interpretable parameters of medical interest. Like most physical models, the Windkessel model is an imperfect description of the real process. To demonstrate our approach we simulate noisy data from a more complex physical model with known mathematical connections to our modeling choice. We show that without accounting for discrepancy, the posterior of the physical parameters deviates from the true value, while accounting for discrepancy gives reasonable quantification of physical parameter uncertainty and reduces the uncertainty in subsequent model predictions.
Michail Spitieris
Tue 7:00 a.m. - 7:15 a.m.

Invariant Priors for Bayesian Quadrature (Contributed Talk)
Bayesian quadrature (BQ) is a model-based numerical integration method that is able to increase sample efficiency by encoding and leveraging known structure of the integration task at hand. In this paper, we explore priors that encode invariance of the integrand under a set of bijective transformations in the input domain, in particular some unitary transformations, such as rotations, axis-flips, or point symmetries. We show initial results on superior performance in comparison to standard Bayesian quadrature on several synthetic applications and one real-world application.
Masha Naslidnyk
Tue 7:15 a.m. - 8:30 a.m.

Poster Session I in Gathertown (Poster session)
Tue 8:30 a.m. - 9:35 a.m.

Research panel (Discussion panel)
Join us for a discussion with David Dunson, Marta Kwiatkowska, Steve MacEachern, Jeffrey Miller, and Briana Stephenson. Moderated by: Anirban Bhattacharya.
David Dunson · Marta Kwiatkowska · Steven MacEachern · Jeffrey Miller · Briana Joy Stephenson · Anirban Bhattacharya
Tue 9:35 a.m. - 10:30 a.m.

Individual discussions in Gathertown (Gathertown discussion)
Tue 10:30 a.m. - 10:45 a.m.

Uncertainty estimation under model misspecification in neural network regression (Contributed Talk)
Although neural networks are powerful function approximators, the underlying modelling assumptions ultimately define the likelihood and thus the model class they are parameterizing. In classification, these assumptions are minimal as the commonly employed softmax is capable of representing any discrete distribution over a finite set of outcomes. In regression, however, restrictive assumptions on the type of continuous distribution to be realized are typically placed, like the dominant choice of training via mean-squared error and its underlying Gaussianity assumption. Recently, modelling advances allow one to be agnostic to the type of continuous distribution to be modelled, granting regression the flexibility of classification models. While past studies stress the benefit of such flexible regression models in terms of performance, here we study the effect of the model choice on uncertainty estimation. We highlight that under model misspecification, aleatoric uncertainty is not properly captured, and that a Bayesian treatment of a misspecified model leads to unreliable epistemic uncertainty estimates. Overall, our study provides an overview of how modelling choices in regression may influence uncertainty estimation and thus any downstream decision-making process.
Maria Cervera
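The link the abstract draws between mean-squared error training and a Gaussianity assumption can be written out in one line. The notation here is an assumption introduced for illustration and does not come from the talk: $f_\theta(x)$ denotes the network's prediction and $\sigma^2$ a fixed noise variance.

```latex
% Negative log-likelihood of one observation y under a Gaussian with
% mean f_\theta(x) and fixed variance \sigma^2 (illustrative notation).
-\log \mathcal{N}\bigl(y \mid f_\theta(x), \sigma^2\bigr)
  = \frac{\bigl(y - f_\theta(x)\bigr)^2}{2\sigma^2}
  + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
```

With $\sigma^2$ held fixed, the second term does not depend on $\theta$, so minimizing this quantity over $\theta$ is the same as minimizing the squared error. In that sense a mean-squared-error objective implicitly commits the model to a Gaussian likelihood with constant variance.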
Tue 10:45 a.m. - 11:00 a.m.

Your Bandit Model is Not Perfect: Introducing Robustness to Restless Bandits Enabled by Deep Reinforcement Learning (Contributed Talk)
Restless multi-arm bandits (RMABs) are receiving renewed attention for their potential to model real-world planning problems under resource constraints. However, few RMAB models have surpassed theoretical interest, since they make the limiting assumption that model parameters are perfectly known. In the real world, model parameters often must be estimated via historical data or expert input, introducing uncertainty. In this light, we introduce a new paradigm, Robust RMABs, a challenging generalization of RMABs that incorporates interval uncertainty over parameters of the dynamic model of each arm. This uncovers several new challenges for RMABs and inspires new algorithmic techniques of general interest. Our contributions are: (i) we introduce the Robust Restless Bandit problem with interval uncertainty and solve a minimax regret objective; (ii) we tackle the complexity of the robust objective via a double oracle (DO) approach and analyze its convergence; (iii) to enable our DO approach, we introduce RMABPPO, a novel deep reinforcement learning (RL) algorithm for solving RMABs, of potential general interest; (iv) we design the first adversary algorithm for RMABs, required to implement the notoriously difficult minimax regret adversary oracle and also of general interest, by formulating it as a multi-agent RL problem and solving with a multi-agent extension of RMABPPO.
Jackson Killian
Tue 11:00 a.m. - 11:30 a.m.

Bayesian Model Averaging is not Model Combination: A PAC-Bayesian Analysis of Deep Ensembles (Invited Talk)
Almost twenty years ago, Thomas Minka nicely illustrated that Bayesian model averaging (BMA) is different from model combination. Model combination works by enriching the model space, because it considers all possible linear combinations of all the models in the model class, while BMA represents the inability to know which is the best single model when using a limited amount of data. However, twenty years later, this distinction is no longer so clear in the context of ensembles of deep neural networks: are deep ensembles performing a crude approximation of a highly multimodal Bayesian posterior? Or are they exploiting an enriched model space and, in consequence, should they be interpreted in terms of model combination? In this talk, we will introduce recently published theoretical analyses that will shed some light on these questions. As you will see in this talk, whether your model is wrong or not plays a crucial role in the answers to these questions.
Speaker bio: Andres R. Masegosa is an associate professor at the Department of Computer Science at Aalborg University (Copenhagen Campus, Denmark). Previously, he was an assistant professor at the University of Almería (Spain). He got his PhD in Computer Science at the University of Granada in 2009. He is broadly interested in modelling intelligent agents that learn from experience using a probabilistic approach. He has published more than sixty papers in international journals and conferences in the field of machine learning.
Andres Masegosa
Tue 11:30 a.m. - 11:35 a.m.

Invited Talk 3 Q&A (Q&A)
Andres Masegosa
Tue 11:35 a.m. - 12:00 p.m.

Individual discussions in Gathertown (Gathertown discussion)
Tue 12:00 p.m. - 12:15 p.m.

PAC^m-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regime (Contributed Talk)
The Bayesian posterior minimizes the "inferential risk" which itself bounds the "predictive risk." This bound is tight when the likelihood and prior are well-specified. However, since misspecification induces a gap, the Bayesian posterior predictive distribution may have poor generalization performance. This work develops a multi-sample loss (PAC^m) which can close the gap by spanning a trade-off between the two risks. The loss is computationally favorable and offers PAC generalization guarantees. An empirical study demonstrates improvement to the predictive distribution.
Alexander Alemi
Tue 12:15 p.m. - 12:30 p.m.

Bayesian Data Selection (Contributed Talk)
Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic, such as a subset of variables, that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection. We apply the SVC to the analysis of single-cell RNA sequencing datasets using a spin glass model of gene regulation.
Eli N Weinstein
Tue 12:30 p.m. - 1:00 p.m.

Statistically Robust Inference with Stochastic Gradient Algorithms (Invited Talk)
Stochastic gradient algorithms are widely used for large-scale learning and inference problems. However, their use in practice is typically guided by heuristics and trial-and-error rather than rigorous, generalizable theory. We take a step toward better understanding the effect of the tuning parameters of these algorithms by characterizing the large-sample behavior of iterates of a very general class of preconditioned stochastic gradient algorithms with fixed step size, including stochastic gradient descent with and without additional Gaussian noise, momentum, and/or acceleration. We show that near a local optimum, the iterates converge weakly to paths of an Ornstein–Uhlenbeck process, and provide sufficient conditions for the stationary distributions of the finite-sample processes to converge weakly to that of the limiting process. In particular, we show that with appropriate choices of tuning parameters, the limiting stationary covariance can match either the Bernstein–von Mises limit of the posterior, adjustments to the posterior for model misspecification, or the asymptotic distribution of the maximum likelihood estimate, and that with a naive tuning, the limit corresponds to none of these. Moreover, we argue that, in the large-sample regime, an essentially independent sample from the stationary distribution can be obtained after a fixed number of passes over the dataset. Our results show that properly tuned stochastic gradient algorithms offer a practical approach to obtaining inferences that are computationally efficient and statistically robust.
Speaker bio: Jonathan Huggins is an Assistant Professor in the Department of Mathematics & Statistics, a Data Science Faculty Fellow, and a Founding Member of the Faculty of Computing & Data Sciences at Boston University. Prior to joining BU, he was a Postdoctoral Research Fellow in the Department of Biostatistics at Harvard. He completed his Ph.D. in Computer Science at the Massachusetts Institute of Technology in 2018. Previously, he received a B.A. in Mathematics from Columbia University and an S.M. in Computer Science from the Massachusetts Institute of Technology. His research centers on the development of fast, trustworthy machine learning and AI methods that balance the need for computational efficiency and the desire for statistical optimality with the inherent imperfections that come from real-world problems, large datasets, and complex models. His current applied work is focused on methods to enable more effective scientific discovery from high-throughput and multimodal genomic data.
Jonathan Huggins
Tue 1:00 p.m. - 1:05 p.m.

Invited Talk 4 Q&A (Q&A)
Jonathan Huggins
Tue 1:05 p.m. - 1:30 p.m.

Individual discussions in Gathertown (Gathertown discussion)
Tue 1:30 p.m. - 2:00 p.m.

Your Model is Wrong (but Might Still Be Useful) (Invited Talk)
To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. However, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. To address these challenges, I'll describe how Stein's method, a tool developed to prove central limit theorems, can be adapted to assess and improve the quality of practical inference procedures. Along the way, I’ll highlight applications to Markov chain Monte Carlo sampler selection, goodness-of-fit testing, and black-box importance sampling.
Speaker bio: Lester Mackey is a Principal Researcher at Microsoft Research, where he develops machine learning methods, models, and theory for large-scale learning tasks driven by applications from climate forecasting, healthcare, and the social good. Lester moved to Microsoft from Stanford University, where he was an assistant professor of Statistics and (by courtesy) of Computer Science. He earned his PhD in Computer Science and MA in Statistics from UC Berkeley and his BSE in Computer Science from Princeton University. He co-organized the second-place team in the Netflix Prize competition for collaborative filtering, won the Prize4Life ALS disease progression prediction challenge, won prizes for temperature and precipitation forecasting in the year-long real-time Subseasonal Climate Forecast Rodeo, and received best paper and best student paper awards from the ACM Conference on Programming Language Design and Implementation and the International Conference on Machine Learning.
Lester Mackey
Tue 2:00 p.m. - 2:05 p.m.

Invited Talk 5 Q&A (Q&A)
Lester Mackey
Tue 2:05 p.m. - 2:35 p.m.

Statistical and Computational Tradeoffs in Variational Bayes (Invited Talk)
Variational inference has recently emerged as a popular alternative to Markov chain Monte Carlo (MCMC) in large-scale Bayesian inference. A core idea of variational inference is to trade statistical accuracy for computational efficiency. It aims to approximate the posterior, as opposed to targeting the exact posterior as in MCMC. Approximating the exact posterior by a restricted inferential model (a.k.a. variational approximating family) reduces computation costs but sacrifices statistical accuracy. In this work, we develop a theoretical characterization of this statistical-computational tradeoff in variational inference. We focus on a case study of Bayesian linear regression using inferential models with different degrees of flexibility. From a computational perspective, we find that less flexible variational families speed up computation. They reduce the variance in stochastic optimization and, in turn, accelerate convergence. From a statistical perspective, however, we find that less flexible families suffer in approximation quality, but provide better statistical generalization. This is joint work with Kush Bhatia, Nikki Kuang, and Yian Ma.
Speaker bio: Yixin Wang is an LSA Collegiate Fellow in Statistics at the University of Michigan. She works in the fields of Bayesian statistics, machine learning, and causal inference. Previously, she was a postdoctoral researcher with Professor Michael Jordan at the University of California, Berkeley. She completed her PhD in statistics at Columbia, advised by Professor David Blei, and her undergraduate studies in mathematics and computer science at the Hong Kong University of Science and Technology. Her research has received several awards, including the INFORMS data mining best paper award, the Blackwell-Rosenbluth Award from the junior section of ISBA, student paper awards from the ASA Biometrics Section and Bayesian Statistics Section, and the ICSA conference young researcher award.
Yixin Wang
Tue 2:35 p.m. - 2:40 p.m.

Invited Talk 6 Q&A (Q&A)
Yixin Wang
Tue 3:15 p.m. - 4:30 p.m.

Poster session II in Gathertown + End (Poster session)


Bayesian Data Selection (Poster)
Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic, such as a subset of variables, that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection. We apply the SVC to the analysis of single-cell RNA sequencing datasets using a spin glass model of gene regulation.
Eli N Weinstein · Jeffrey Miller


Uncertainty estimation under model misspecification in neural network regression (Poster)
Although neural networks are powerful function approximators, the underlying modelling assumptions ultimately define the likelihood and thus the model class they are parameterizing. In classification, these assumptions are minimal as the commonly employed softmax is capable of representing any discrete distribution over a finite set of outcomes. In regression, however, restrictive assumptions on the type of continuous distribution to be realized are typically placed, like the dominant choice of training via mean-squared error and its underlying Gaussianity assumption. Recently, modelling advances allow one to be agnostic to the type of continuous distribution to be modelled, granting regression the flexibility of classification models. While past studies stress the benefit of such flexible regression models in terms of performance, here we study the effect of the model choice on uncertainty estimation. We highlight that under model misspecification, aleatoric uncertainty is not properly captured, and that a Bayesian treatment of a misspecified model leads to unreliable epistemic uncertainty estimates. Overall, our study provides an overview of how modelling choices in regression may influence uncertainty estimation and thus any downstream decision-making process.
Maria Cervera · Rafael Dätwyler · Francesco D'Angelo · Hamza Keurti · Benjamin F. Grewe · Christian Henning


Fast approximate BayesBag model selection via Taylor expansions (Poster)
BayesBag has been established as a useful tool for robust Bayesian model selection. However, computing BayesBag can be prohibitively expensive for large datasets. Here, we propose a fast approximation of BayesBag model selection. This approximation, based on Taylor approximations of the log marginal likelihood, can achieve results comparable to BayesBag in a fraction of the time.
Neil Spencer · Jeffrey Miller


Diversity and Generalization in Neural Network Ensembles (Poster)
Ensembles are widely used in machine learning and, usually, provide state-of-the-art performance in many prediction tasks. From the very beginning, diversity of ensemble members has been identified as a key factor for the superior performance of an ensemble. But the exact role that diversity plays in an ensemble model is not fully understood and is still an open question. In this work, we employ a second-order PAC-Bayesian analysis to shed light on this problem in the context of neural network ensembles. More precisely, we provide sound theoretical answers to the following questions: how to measure diversity, how diversity relates to the generalization error, and how diversity can be promoted by ensemble learning algorithms. This analysis covers three widely used loss functions, namely, the squared loss, the cross-entropy loss, and the 0-1 loss; and two widely used model combination strategies, namely, model averaging and weighted majority vote. We empirically validate this theoretical analysis on ensembles of neural networks.
Luis Antonio Ortega Andrés · Andres Masegosa · Rafael Cabañas


A shared parameter model accounting for dropout not at random in a predictive model for systolic blood pressure using the HUNT study (Poster)
This work proposes and evaluates a shared parameter model (SPM) to account for data being missing not at random (MNAR) for a predictive model based on a longitudinal population study. The aim is to model systolic blood pressure ten years ahead based on current observations, and the work is inspired by and evaluated on data from the Nord-Trøndelag Health Study (HUNT). The proposed SPM consists of a linear model for the systolic blood pressure and a logistic model for the dropout process, connected through a shared random effect. To evaluate the SPM we compare the parameter estimates and predictions of the SPM with a naive linear Bayesian model using the same explanatory variables while ignoring the dropout process. This corresponds to assuming data to be missing at random (MAR). In addition, a simulation study is performed in which the naive model and the SPM are tested on data with known parameters when missingness is assumed to be MNAR. The SPM indicates that participants with higher systolic blood pressure than expected from the explanatory variables at the time of the follow-up study have a higher probability of dropping out, suggesting that the data are MNAR. Further, the SPM and the naive model result in different parameter estimates for the explanatory variables. The simulation study validates that the SPM is identifiable for the estimates obtained by the predictive model based on the HUNT study.
Aurora Christine Hofman


Influential Observations in Bayesian Regression Tree Models (Poster)
BCART (Bayesian Classification and Regression Trees) and BART (Bayesian Additive Regression Trees) are popular modern regression models. Their popularity is intimately tied to the ability to flexibly model complex responses depending on high-dimensional inputs while simultaneously being able to quantify uncertainties. However, surprisingly little work has been done to evaluate the sensitivity of these modern regression models to violations of modeling assumptions. In particular, we consider influential observations and propose methods for detecting influentials and adjusting predictions to not be unduly affected by such problematic data. We consider two detection diagnostics for Bayesian tree models, one an analogue of Cook's distance and the other taking the form of a divergence measure, and then propose an importance sampling algorithm to reweight previously sampled posterior draws so as to remove the effects of influential data. Finally, our methods are demonstrated on real-world data where blind application of models can lead to poor predictions.
Matthew Pratola


Invariant Priors for Bayesian Quadrature (Poster)
Bayesian quadrature (BQ) is a model-based numerical integration method that is able to increase sample efficiency by encoding and leveraging known structure of the integration task at hand. In this paper, we explore priors that encode invariance of the integrand under a set of bijective transformations in the input domain, in particular some unitary transformations, such as rotations, axis-flips, or point symmetries. We show initial results on superior performance in comparison to standard Bayesian quadrature on several synthetic applications and one real-world application.
Masha Naslidnyk · Javier González · Maren Mahsereci


Composite Goodness-of-fit Tests with Kernels (Poster)
Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of inference methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. One set of tools which can help are goodness-of-fit tests, where we test whether a dataset could have been generated by a fixed distribution. Kernel-based tests have been developed for this problem, and these are popular due to their flexibility, strong theoretical guarantees and ease of implementation in a wide range of scenarios. In this paper, we extend this line of work to the more challenging composite goodness-of-fit problem, where we are instead interested in whether the data comes from any distribution in some parametric family. This is equivalent to testing whether a parametric model is well-specified for the data.
Oscar Key · Tamara Fernandez · Arthur Gretton · Francois-Xavier Briol


Inferior Clusterings in Misspecified Gaussian Mixture Models (Poster)
The Gaussian Mixture Model (GMM) is a widely used probabilistic model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this work, we examine the performance of both Expectation Maximization (EM) and Gradient Descent (GD) on unconstrained Gaussian Mixture Models when there is misspecification. Our simulation study reveals a previously unreported class of inferior clustering solutions, different from spurious solutions, that occurs due to asymmetry in the fitted component variances.
Siva Rajesh Kasa · Vaibhav Rajan


Blindness of score-based methods to isolated components and mixing proportions (Poster)
Statistical tasks such as density estimation and approximate Bayesian inference often involve densities with unknown normalising constants. Score-based methods, including score matching, are popular techniques as they are free of normalising constants. Although these methods enjoy theoretical guarantees, a little-known fact is that they suffer from practical failure modes when the unnormalised distribution of interest has isolated components: they cannot discover isolated components or identify the correct mixing proportions between components. We demonstrate these findings using simple distributions and present heuristic attempts to address these issues. We hope to bring the attention of theoreticians and practitioners to these issues when developing new algorithms and applications.
Li Kevin Wenliang · Heishiro Kanagawa


Bounding Wasserstein distance with couplings (Poster)
Markov chain Monte Carlo (MCMC) methods are a powerful tool in Bayesian computation. They provide asymptotically consistent estimates as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in sampling methods such as approximate MCMC, which trade off asymptotic consistency for improved computational speed. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our sample quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for high-dimensional linear regression and high-dimensional logistic regression.
Niloy Biswas · Lester Mackey


Relaxing the I.I.D. Assumption: Adaptively Minimax Optimal Regret via Root-Entropic Regularization
(
Poster
)
We introduce the semi-adversarial framework for sequential prediction with expert advice, where data are generated from distributions varying arbitrarily within an unknown constraint set. We quantify relaxations of the classical i.i.d. assumption along a spectrum induced by this framework, with i.i.d. sequences at one extreme and adversarial mechanisms at the other. The Hedge algorithm, which corresponds to using an expert-valued Bayesian power posterior to make decisions, was recently shown to be simultaneously optimal for both i.i.d. and adversarial data. We demonstrate that Hedge is suboptimal at all points of the spectrum in between these endpoints. Further, we introduce a novel algorithm and prove that it achieves the minimax optimal rate of regret at all points along the semi-adversarial spectrum, without advance knowledge of the constraint set. This algorithm corresponds to follow-the-regularized-leader, constructed by replacing the Shannon entropy regularizer of Hedge with the square root of the Shannon entropy.
Blair Bilodeau · Jeffrey Negrea · Dan Roy 🔗 
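
For readers unfamiliar with the Hedge baseline named in the abstract above, the following is a minimal sketch of the standard exponential-weights update, not the root-entropic algorithm the authors propose. The function names `hedge_weights` and `run_hedge`, the learning rate `eta`, and the assumption that losses lie in [0, 1] are illustrative choices, not taken from the paper.

```python
import numpy as np


def hedge_weights(cumulative_losses, eta):
    """Exponential-weights (Hedge) distribution over experts.

    Each expert's weight is proportional to exp(-eta * cumulative loss).
    """
    z = -eta * np.asarray(cumulative_losses, dtype=float)
    z -= z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def run_hedge(losses, eta):
    """Run Hedge on a (T, N) array of per-round expert losses.

    Returns the learner's total expected loss and its regret
    relative to the best single expert in hindsight.
    """
    losses = np.asarray(losses, dtype=float)
    n_rounds, n_experts = losses.shape
    cumulative = np.zeros(n_experts)
    total = 0.0
    for t in range(n_rounds):
        w = hedge_weights(cumulative, eta)  # weights use past losses only
        total += float(w @ losses[t])       # expected loss this round
        cumulative += losses[t]
    regret = total - cumulative.min()
    return total, regret


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 500, 5
    sample_losses = rng.uniform(size=(T, N))  # hypothetical loss sequence
    eta = np.sqrt(8 * np.log(N) / T)          # a common worst-case tuning
    print(run_hedge(sample_losses, eta))
```

The abstract's proposed method replaces the Shannon entropy regularizer implicit in this update with its square root; that variant is not reproduced here.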


Measuring the sensitivity of Gaussian processes to kernel choice
(
Poster
)
Gaussian processes (GPs) are used to make medical and scientific decisions, including in cardiac care and monitoring of carbon dioxide emissions. But the choice of GP kernel is often somewhat arbitrary. In particular, uncountably many kernels typically align with qualitative prior knowledge (e.g. function smoothness or stationarity). But in practice, data analysts choose among a handful of convenient standard kernels (e.g. squared exponential). In the present work, we ask: Would decisions made with a GP differ under other, qualitatively interchangeable kernels? We show how to formulate this sensitivity analysis as a constrained optimization problem over a finite-dimensional space. We can then use standard optimizers to identify substantive changes in relevant decisions made with a GP. We demonstrate in both synthetic and real-world examples that decisions made with a GP can exhibit substantial sensitivity to kernel choice, even when prior draws are qualitatively interchangeable to a user.
Will Stephenson · Soumya Ghosh · Tin Nguyen · Mikhail Yurochkin · Sameer Deshpande · Tamara Broderick 🔗 


Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap
(
Poster
)
Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. Such models are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in misspecified settings. In this paper, we propose a novel approach based on the posterior bootstrap which gives a highly parallelisable Bayesian inference algorithm for simulator-based models. Our approach is based on maximum mean discrepancy estimators, which also allows us to inherit their robustness properties.
Harita Dellaporta · Jeremias Knoblauch · Theodoros Damoulas · Francois-Xavier Briol 🔗 
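
As background for the maximum mean discrepancy (MMD) mentioned in the abstract above, here is a minimal sketch of the biased (V-statistic) estimate of squared MMD between two samples. It illustrates only the discrepancy itself, not the posterior bootstrap procedure; the Gaussian kernel, the `lengthscale` parameter, and the function names are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, lengthscale=1.0):
    """Gaussian kernel matrix between rows of x (n, d) and y (m, d)."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale**2))


def mmd2_biased(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    Both inputs must be 2-D arrays with one observation per row.
    """
    k_xx = gaussian_kernel(x, x, lengthscale).mean()
    k_yy = gaussian_kernel(y, y, lengthscale).mean()
    k_xy = gaussian_kernel(x, y, lengthscale).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = rng.normal(size=(200, 1))            # stand-in for real data
    simulated = rng.normal(loc=0.5, size=(200, 1))  # stand-in for simulator output
    print(mmd2_biased(observed, simulated))
```

In a simulator-based setting, a parameter estimate could be obtained by minimising such a discrepancy between observed data and simulator draws; how that is embedded in a bootstrap is specific to the paper and not shown here.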


Your Bandit Model is Not Perfect: Introducing Robustness to Restless Bandits Enabled by Deep Reinforcement Learning
(
Poster
)
Restless multi-arm bandits (RMABs) are receiving renewed attention for their potential to model real-world planning problems under resource constraints. However, few RMAB models have surpassed theoretical interest, since they make the limiting assumption that model parameters are perfectly known. In the real world, model parameters often must be estimated via historical data or expert input, introducing uncertainty. In this light, we introduce a new paradigm, Robust RMABs, a challenging generalization of RMABs that incorporates interval uncertainty over parameters of the dynamic model of each arm. This uncovers several new challenges for RMABs and inspires new algorithmic techniques of general interest. Our contributions are: (i) we introduce the Robust Restless Bandit problem with interval uncertainty and solve a minimax regret objective; (ii) we tackle the complexity of the robust objective via a double oracle (DO) approach and analyze its convergence; (iii) to enable our DO approach, we introduce RMABPPO, a novel deep reinforcement learning (RL) algorithm for solving RMABs, of potential general interest; (iv) we design the first adversary algorithm for RMABs, required to implement the notoriously difficult minimax regret adversary oracle and also of general interest, by formulating it as a multi-agent RL problem and solving with a multi-agent extension of RMABPPO.
Jackson Killian · Lily Xu · Arpita Biswas · Milind Tambe 🔗 


Bayesian Calibration of imperfect computer models using Physics-informed priors
(
Poster
)
We introduce a computationally efficient, data-driven framework suitable for the quantification of the uncertainty in physical parameters of computer models, represented by differential equations. We construct physics-informed priors for time-dependent differential equations, which are multi-output Gaussian process (GP) priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework which allows quantifying the uncertainty of physical parameters and model predictions. Since physical models are usually imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. To obtain the posterior distributions we use Hamiltonian Monte Carlo (HMC) sampling. This work is primarily motivated by the need for interpretable parameters for the hemodynamics of the heart for personal treatment of hypertension. The model used is the arterial Windkessel model, which represents the hemodynamics of the heart through differential equations with physically interpretable parameters of medical interest. Like most physical models, the Windkessel model is an imperfect description of the real process. To demonstrate our approach we simulate noisy data from a more complex physical model with known mathematical connections to our modeling choice. We show that without accounting for discrepancy, the posterior of the physical parameters deviates from the true value, while accounting for discrepancy gives reasonable quantification of physical parameter uncertainty and reduces the uncertainty in subsequent model predictions.
Michail Spitieris 🔗 


Forcing a model to be correct for classification
(
Poster
)
Scientists have long recognized deficiencies in their models, particularly in those that seek to describe the full distribution of a set of data. Statistics is replete with ways to address these deficiencies, including adjusting the data (e.g., removing outliers), expanding the class of models under consideration, and the use of robust methods. In this work, we pursue a different path, searching for a recognizable portion of a model that is approximately correct and which aligns with the goal of inference. Once such a model portion has been found, traditional statistical theory applies and suggests effective methods. We illustrate this approach with linear discriminant analysis and show much better performance than one gets by ignoring the deficiency in the model or by working in a large enough space to capture the main deficiency in the model. 
Jiae Kim · Steven MacEachern 🔗 


Make cross-validation Bayes again
(
Poster
)
There are two orthogonal paradigms for hyperparameter inference: either to make a joint estimation in a larger hierarchical Bayesian model or to optimize the tuning parameter with respect to cross-validation metrics. Both are limited: the “full Bayes” strategy is conceptually unjustified in misspecified models, and may severely under- or overfit observations; the cross-validation strategy, besides its computational cost, typically results in a point estimate, ignoring the uncertainty in hyperparameters. To bridge the two extremes, we present a general paradigm: a full-Bayes model on top of the cross-validated log likelihood. This prediction-aware approach incorporates additional regularization during hyperparameter tuning, and facilitates Bayesian workflow in many otherwise black-box learning algorithms. We develop theoretical justification and discuss its application in a model averaging example.
Yuling Yao · Aki Vehtari 🔗 


Evaluating Bayesian Hierarchical Models for scRNA-seq Data
(
Poster
)
A Bayesian hierarchical model (BHM) is typically formulated by specifying the data model, the parameter model and the prior distributions. The posterior inference of a BHM depends both on the model specification and on the computation algorithm used. The most straightforward way to test the reliability of a BHM inference is to compare the posterior distributions with the ground truth value of the model parameters, when available. However, when dealing with experimental data, the true value of the underlying parameters is typically unknown. In these situations, numerical experiments based on synthetic datasets generated from the model itself offer a natural approach to check model performance and posterior estimates. Surprisingly, validation of BHMs with high-dimensional parameter spaces and non-Gaussian distributions is unexplored. In this paper, we show how to test the reliability of a BHM. We introduce a change in the model assumptions to allow for prior contamination and develop a simulation-based evaluation framework to assess the reliability of the inference of a given BHM. We illustrate our approach on a specific BHM used for the analysis of Single-cell Sequencing Data (BASiCS).
Sijia Li 🔗 


On Robustness of Counterfactuals in Structural Models
(
Poster
)
Structural econometric models are used to combine economic theory and data to estimate parameters and counterfactuals, e.g. the effect of a policy change. These models typically make functional form assumptions, e.g. the distribution of latent variables. I propose a framework to characterize the sensitivity of structural estimands with respect to misspecification of distributional assumptions of the model. Specifically, I characterize the lower and upper bounds on the estimand as the assumption is perturbed infinitesimally on the tangent space and locally in a neighborhood of the model's assumption. I compute bounds by finding the gradient of the estimand, and integrate these iteratively to construct the gradient flow curve through neighborhoods of the model's assumption. My framework covers models with general smooth dependence on the distributional assumption, allows sensitivity perturbations over neighborhoods described by a general metric, and is computationally tractable: in particular, it is not required to re-solve the model under any alternative distributional assumption. I illustrate the framework with an application to the Rust (1987) model of optimal replacement of bus engines.
Yaroslav Mukhin 🔗 


Robust Generalised Bayesian Inference for Intractable Likelihoods
(
Poster
)
Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. 
Takuo Matsubara · Jeremias Knoblauch · Francois-Xavier Briol · Chris Oates 🔗 
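
To make the loss-based update described in the abstract above concrete, a generalised posterior is commonly written in the following form. The notation is an assumption for illustration and may differ from the paper's: $\pi$ is the prior, $L_n$ is a loss computed from $n$ observations (here, a Stein discrepancy between the model and the data), and $\beta > 0$ is a weighting constant.

```latex
\pi_n^{L}(\theta) \;\propto\; \pi(\theta)\,\exp\bigl\{-\beta\, n\, L_n(\theta)\bigr\}
```

Taking $\beta = 1$ and $L_n(\theta) = -\tfrac{1}{n}\sum_{i=1}^{n} \log p(x_i \mid \theta)$ recovers the standard Bayesian posterior, which shows the sense in which the loss replaces the likelihood.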


Boosting heterogeneous VAEs via multi-objective optimization
(
Poster
)
Variational autoencoders (VAEs) have been successfully applied to complex input data such as images and videos. Counterintuitively, their application to simpler, heterogeneous data, where features are of different types, often leads to underwhelming results. While the goal in the heterogeneous case is to accurately approximate all observed features, VAEs often perform poorly on a subset of them. In this work, we study this feature overlooking problem through the lens of multi-task learning (MTL), relating it to the problem of negative transfer and the interaction between gradients from different features. With these new insights, we propose to train VAEs by leveraging off-the-shelf solutions from the MTL literature based on multi-objective optimization. Furthermore, we empirically demonstrate how these solutions significantly boost the performance of different VAE models and training objectives on a large variety of heterogeneous datasets.
Adrián Javaloy · Maryam Meghdadi · Isabel Valera 🔗 


PAC^m-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regime
(
Poster
)
The Bayesian posterior minimizes the "inferential risk" which itself bounds the "predictive risk." This bound is tight when the likelihood and prior are well-specified. However, since misspecification induces a gap, the Bayesian posterior predictive distribution may have poor generalization performance. This work develops a multi-sample loss (PAC^m) which can close the gap by spanning a trade-off between the two risks. The loss is computationally favorable and offers PAC generalization guarantees. Empirical study demonstrates improvement to the predictive distribution.
Joshua Dillon · Warren Morningstar · Alexander Alemi 🔗 