This tutorial will provide a practical overview of state-of-the-art approaches for analyzing massive datasets using Bayesian statistical methods. The first focus area will be algorithms for data with very large sample sizes (large n), and the second will be approaches for very high-dimensional data (large p). A particular emphasis will be on maintaining a valid characterization of uncertainty, which rules out many popular methods, such as (most) variational approximations and approaches for maximum a posteriori estimation. I will briefly review classical large-sample approximations to posterior distributions (e.g., Laplace’s method, the Bayesian central limit theorem), and will then turn to conceptually and practically simple approaches for scaling up commonly used Markov chain Monte Carlo (MCMC) algorithms. The focus is on making posterior computation much faster for huge datasets while maintaining accuracy guarantees. Useful classes of algorithms with increasing theoretical and practical support include embarrassingly parallel (EP) MCMC, approximate MCMC, stochastic approximation, hybrid optimization and sampling, and modularization. Applications to computational advertising, genomics, neuroscience, and other areas will provide concrete motivation. Code and notes will be made available, and research problems of ongoing interest will be highlighted.
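To give a flavor of what "embarrassingly parallel MCMC" can mean, here is a minimal sketch, not taken from the tutorial materials. It assumes a toy model (Gaussian data with known standard deviation and a flat prior on the mean), runs an independent random-walk Metropolis chain on each data shard, and combines the draws by precision-weighted averaging, one of several combination rules proposed in the literature. All function names and settings are hypothetical, and the shards are processed sequentially here although each chain could run on a separate worker.

```python
import numpy as np

def rw_metropolis(y, sigma, n_iter, step, rng):
    """Random-walk Metropolis for theta in y_i ~ N(theta, sigma^2), flat prior."""
    def log_post(theta):
        # Log subposterior up to a constant (a flat prior adds nothing).
        return -0.5 * np.sum((y - theta) ** 2) / sigma**2

    theta = 0.0
    lp = log_post(theta)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()
        lp_prop = log_post(prop)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws[i] = theta
    return draws

def consensus_combine(sub_draws):
    """Precision-weighted average of aligned draws from each shard's chain."""
    sub_draws = np.asarray(sub_draws)              # shape (n_shards, n_draws)
    weights = 1.0 / sub_draws.var(axis=1, ddof=1)  # estimated subposterior precisions
    return weights @ sub_draws / weights.sum()

def ep_mcmc_demo(seed=0, n=4000, n_shards=4, n_iter=5000, burn=1000):
    """Simulate data, sample each shard's subposterior, and combine (illustrative only)."""
    rng = np.random.default_rng(seed)
    sigma = 1.0
    y = rng.normal(1.5, sigma, size=n)
    shards = np.array_split(y, n_shards)
    # Proposal scale tuned to the subposterior standard deviation (an assumption).
    step = 2.4 * sigma / np.sqrt(n / n_shards)
    # Each chain touches only its own shard, so this loop could be parallelized.
    sub = [rw_metropolis(s, sigma, n_iter, step, rng)[burn:] for s in shards]
    return y, consensus_combine(sub)

if __name__ == "__main__":
    y, combined = ep_mcmc_demo()
    print("full-data posterior mean (exact):", y.mean())
    print("combined draws mean:", combined.mean())
    print("full-data posterior sd (exact):", 1.0 / np.sqrt(len(y)))
    print("combined draws sd:", combined.std())
```

In this Gaussian toy case the weighted average recovers the full-data posterior up to Monte Carlo error; for non-Gaussian posteriors the same rule is only an approximation, which is part of why accuracy guarantees matter.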
David Dunson (Duke University)
David Dunson is Arts & Sciences Distinguished Professor of Statistical Science and Mathematics at Duke University. He is an international authority on statistical methodology motivated by complex and high-dimensional data, with a particular emphasis on Bayesian and probability modeling approaches. His work is directly motivated by challenging applications in human reproduction, neuroscience, environmental health, ecology, and genomics, among others. A focus is on developing fundamentally new frameworks for statistical inference in challenging settings, including improving robustness to modeling assumptions and scalability to large datasets. He has won numerous awards, including the 2010 COPSS Presidents’ Award, which is widely viewed as the most prestigious award in statistics and is often described as the statistics counterpart of the Fields Medal; it is given each year to one outstanding researcher internationally under the age of 41. His work has had substantial impact, with ~40,000 citations on Google Scholar and an h-index of 67.