Session: Spotlights (Stefan Schaal)
Catching Up Faster in Bayesian Model Selection and Model Averaging
Tim van Erven · Peter Grünwald · Steven de Rooij
Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the "catch-up phenomenon" as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian marginal distribution. We prove that in many situations model selection and prediction based on the switch-distribution are both consistent and achieve optimal convergence rates, thereby resolving the AIC/BIC dilemma. The method is practical; we give an efficient implementation.
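In sequential (prequential) terms, the log Bayes factor between two models equals the difference of their cumulative one-step log-losses, which is what makes the catch-up phenomenon visible. A minimal sketch, not the paper's construction: a fixed Bernoulli(0.5) distribution stands in for the "simple" model and a Bayesian Bernoulli learner (Laplace's rule of succession) for the "complex" one; we track both cumulative losses and the first sample index at which the learner's total loss undercuts the fixed model's.

```python
import math
import random

random.seed(0)
n = 2000
data = [1 if random.random() < 0.7 else 0 for _ in range(n)]  # true p = 0.7

loss_simple = loss_complex = 0.0   # cumulative sequential log-losses (nats)
ones = 0
crossover = None                   # first t where the learner's total loss is lower
for t, x in enumerate(data):
    p = (ones + 1) / (t + 2)       # Laplace's rule of succession
    loss_simple += -math.log(0.5)
    loss_complex += -math.log(p if x else 1 - p)
    ones += x
    if crossover is None and loss_complex < loss_simple:
        crossover = t

# loss_simple - loss_complex is the log Bayes factor in favour of the learner
```

The switch-distribution's intuition in this picture: predict with the simple model until the complex one has caught up, paying only a modest one-off switching cost rather than the learner's accumulated early regret.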
A wide variety of Dirichlet-multinomial `topic' models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic identifiability. The most accurate variational technique thus far, collapsed variational LDA (CV-LDA) (Teh, Newman and Welling, 2006), deals neither with model selection nor with inference for the hyperparameters. We generalize the technique to the hierarchical Dirichlet process (HDP) to address these issues. The result is a collapsed variational inference technique that can add topics indefinitely, for instance through split and merge heuristics, while automatically removing clusters that are not supported by the data. Experiments show a very significant improvement in accuracy relative to CV-LDA.
Collective Inference on Markov Models for Modeling Bird Migration
Daniel Sheldon · M.A. Saleh Elmohamed · Dexter Kozen
We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem, which arise by revealing different information to the observer and imposing different requirements for the reconstruction of the sample paths. Our algorithms are analogous to the classical Viterbi algorithm for hidden Markov models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations.
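For reference, the classical Viterbi recursion the abstract alludes to, in a minimal log-space form; the two-state example at the end uses made-up transition and emission numbers purely for illustration.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Single most probable hidden-state path given observations (log space)."""
    V = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[-2][r] + log_trans[r][s])
            back[-1][s] = prev
            V[-1][s] = V[-2][prev] + log_trans[prev][s] + log_emit[s][o]
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):           # follow backpointers to recover the path
        path.append(bp[path[-1]])
    path.reverse()
    return path, V[-1][last]

# toy two-state HMM: R(ain)/S(un) hidden, u(mbrella)/n(one) observed
lg = math.log
states = ('R', 'S')
log_init = {'R': lg(0.5), 'S': lg(0.5)}
log_trans = {'R': {'R': lg(0.7), 'S': lg(0.3)},
             'S': {'R': lg(0.3), 'S': lg(0.7)}}
log_emit = {'R': {'u': lg(0.9), 'n': lg(0.1)},
            'S': {'u': lg(0.2), 'n': lg(0.8)}}
path, score = viterbi(['u', 'u', 'n'], states, log_init, log_trans, log_emit)
```

The collective variant in the paper replaces this single-path decoding with a joint reconstruction of many paths that must together agree with aggregate observations.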
Distributed Inference for Latent Dirichlet Allocation
David Newman · Arthur Asuncion · Padhraic Smyth · Max Welling
We investigate the problem of learning a widely-used latent-variable model -- the Latent Dirichlet Allocation (LDA) or topic model -- using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated by different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates---it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors---it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using three real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors, and speedup experiments in which topics are learned from a 100-million-word corpus using 16 processors.
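The first (approximate) scheme can be sketched in a few lines: each processor runs a collapsed Gibbs sweep over its own documents against a stale private copy of the topic-word counts, and the copies are then reconciled by adding every processor's increments back onto the old global table. A toy single-iteration sketch, assuming symmetric Dirichlet hyperparameters; the corpus, vocabulary, and all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, n_wk, n_k, n_dk, alpha, beta):
    """One collapsed Gibbs sweep over one processor's documents, updating
    its private copies (n_wk, n_k) of the global topic counts in place."""
    W, K = n_wk.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_wk[w, k] -= 1; n_k[k] -= 1; n_dk[d, k] -= 1
            p = (n_wk[w] + beta) / (n_k + W * beta) * (n_dk[d] + alpha)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_wk[w, k] += 1; n_k[k] += 1; n_dk[d, k] += 1

def merge(n_wk_old, local_copies):
    """Reconciliation: new global counts = old counts + sum of local deltas."""
    return n_wk_old + sum(lc - n_wk_old for lc in local_copies)

# toy corpus: 4 documents over a 5-word vocabulary, 2 topics, 2 "processors"
W, K, alpha, beta = 5, 2, 0.1, 0.01
docs = [[0, 1, 1, 2], [2, 3, 0], [3, 4, 4], [1, 0, 4, 2]]
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n_wk = np.zeros((W, K)); n_dk = np.zeros((len(docs), K))
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        n_wk[w, k] += 1; n_dk[d, k] += 1

local_copies = []
for part in (slice(0, 2), slice(2, 4)):     # each "processor" owns two documents
    lwk = n_wk.copy()
    lk = lwk.sum(axis=0)
    gibbs_sweep(docs[part], z[part], lwk, lk, n_dk[part], alpha, beta)
    local_copies.append(lwk)

n_wk = merge(n_wk, local_copies)            # periodic global update
n_k = n_wk.sum(axis=0)
```

The merge preserves the total token count, since each processor's sweep only moves its own tokens between topics.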
This article presents hierarchical penalization, a generic framework for incorporating prior information into the fitting of statistical models when the explanatory variables are organized in a hierarchical structure. The penalizer, derived from an adaptive penalization formulation, is a convex functional that performs soft selection at the group level and shrinks variables within each group, favoring solutions with few leading terms in the final combination. The framework, originally derived to take prior knowledge into account, is also shown to be useful in kernel regression when several parameters are used to model the influence of features.
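"Soft selection at the group level" can be made concrete with a simpler, related penalty; this is not the paper's penalizer, but block soft-thresholding for the group-lasso penalty, which exhibits the same behaviour of zeroing whole groups while shrinking the survivors.

```python
import numpy as np

def prox_group(beta, groups, lam):
    """Proximal operator of the group-lasso penalty lam * sum_g ||beta_g||_2.
    Groups whose Euclidean norm falls below lam are zeroed out entirely;
    the remaining groups are uniformly shrunk toward zero."""
    out = np.zeros_like(beta)
    for g in groups:
        norm = np.linalg.norm(beta[g])
        if norm > lam:
            out[g] = (1 - lam / norm) * beta[g]
    return out
```

For example, with `groups = [[0, 1], [2, 3]]` and `lam = 1.0`, a coefficient vector whose second group is small gets that whole group removed from the model while the first group survives, shrunk.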
We investigate the use of message-passing algorithms for the problem of finding the max-weight independent set (MWIS) in a graph. First, we study the performance of loopy max-product belief propagation. We show that, if it converges, the quality of the estimate is closely related to the tightness of an LP relaxation of the MWIS problem. We use this relationship to obtain sufficient conditions for correctness of the estimate. We then develop a modification of max-product -- one that converges to an optimal solution of the dual of the MWIS problem. We also develop a simple iterative algorithm for estimating the max-weight independent set from this dual solution. We show that the MWIS estimate obtained by using these two algorithms in conjunction is correct when the graph is bipartite and the MWIS is unique. Finally, we show that any problem of MAP estimation for probability distributions over finite domains can be reduced to an MWIS problem. We believe this reduction will yield new insights and algorithms for MAP estimation.
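The log-domain max-product recursion for MWIS is compact enough to sketch. Messages on each directed edge carry two components (receiver off / receiver on), and the "on" component forbids the sender from also being on. This is a generic illustration of plain loopy max-product, not the paper's convergent modification.

```python
def max_product_mwis(weights, edges, iters=50):
    """Loopy max-product for max-weight independent set, in log domain.
    msg[(i, j)] = (message value at x_j = 0, message value at x_j = 1)."""
    nbrs = {i: set() for i in range(len(weights))}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)
    msg = {(i, j): (0.0, 0.0) for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for i in nbrs:
            for j in nbrs[i]:
                in0 = sum(msg[(k, i)][0] for k in nbrs[i] - {j})
                in1 = sum(msg[(k, i)][1] for k in nbrs[i] - {j})
                m0 = max(in0, weights[i] + in1)  # x_i unconstrained when x_j = 0
                m1 = in0                         # x_j = 1 forces x_i = 0
                new[(i, j)] = (0.0, m1 - m0)     # normalize by the "off" value
        msg = new
    # threshold the max-marginal beliefs to read off the estimate
    est = set()
    for i in nbrs:
        b1 = weights[i] + sum(msg[(k, i)][1] for k in nbrs[i])
        b0 = sum(msg[(k, i)][0] for k in nbrs[i])
        if b1 > b0:
            est.add(i)
    return est
```

On the path graph 0--1--2 with weights (1, 1.5, 1), a bipartite instance with a unique optimum, the estimate is {0, 2}, matching the abstract's correctness condition.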
Privacy-Preserving Belief Propagation and Sampling
Michael Kearns · Jinsong Tan · Jennifer Wortman Vaughan
We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms --- distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else.
We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani (2005), applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(NM^2) training complexity---asymptotically the same as related sparse methods such as the informative vector machine (Lawrence et al., 2003), but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following Snelson and Ghahramani, we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.
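The SPGP prior covariance structure the abstract leans on can be written in a few lines: a rank-M Nyström term built from the pseudo-inputs, plus a diagonal correction that restores the exact marginal variances. A sketch assuming an RBF kernel; the classification-specific machinery is omitted.

```python
import numpy as np

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix between row-vector inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def fitc_cov(X, Xu, jitter=1e-8):
    """SPGP prior covariance: Q = K_nm K_mm^{-1} K_mn (rank M), plus a
    diagonal correction so the marginal variances match the full GP
    exactly -- the structure exploited for O(N M^2) training cost."""
    Kmm = rbf(Xu, Xu) + jitter * np.eye(len(Xu))
    Knm = rbf(X, Xu)
    Q = Knm @ np.linalg.solve(Kmm, Knm.T)
    kdiag = np.full(len(X), rbf(X[:1], X[:1])[0, 0])  # k(x,x) is constant for RBF
    return Q + np.diag(kdiag - Q.diagonal())
```

When the pseudo-inputs coincide with the training inputs (Xu = X), the approximation recovers the full GP prior covariance, which is one way to sanity-check an implementation.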