Spotlight
Distributed Inference for Latent Dirichlet Allocation
David Newman · Arthur Asuncion · Padhraic Smyth · Max Welling
We investigate the problem of learning a widely used latent-variable model, Latent Dirichlet Allocation (LDA, the topic model), using distributed computation, where each of P processors sees only 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme runs local Gibbs sampling on each processor with periodic global updates; it is simple to implement and can be viewed as an approximation to single-processor Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using three real-world text corpora, we show that distributed learning works very well for LDA models: perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors and speedup experiments learning topics in a 100-million-word corpus on 16 processors.
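To make the first scheme concrete, here is a minimal sketch of approximate distributed collapsed Gibbs sampling for LDA, with the P processors simulated sequentially in plain Python. Each "processor" sweeps over its document partition using a local snapshot of the global topic-word counts, and the count changes are merged back after every sweep (the periodic global update). The toy corpus, hyperparameters, and all variable names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

# Illustrative toy settings (not from the paper).
V, K = 50, 4            # vocabulary size, number of topics
alpha, beta = 0.1, 0.01 # Dirichlet hyperparameters
P = 4                   # number of simulated processors

rng = np.random.default_rng(0)
corpus = [rng.integers(0, V, size=rng.integers(20, 60)).tolist() for _ in range(40)]
partitions = [corpus[p::P] for p in range(P)]   # each processor sees 1/P of the data

# Global topic-word counts and topic totals (the quantities that must be shared).
nkw = np.zeros((K, V), dtype=np.int64)
nk = np.zeros(K, dtype=np.int64)

# Per-processor state: documents, topic assignments, document-topic counts.
states = []
for docs in partitions:
    z = [rng.integers(0, K, size=len(doc)).tolist() for doc in docs]
    ndk = np.zeros((len(docs), K), dtype=np.int64)
    for d, (doc, zd) in enumerate(zip(docs, z)):
        for w, k in zip(doc, zd):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    states.append((docs, z, ndk))

def local_gibbs_sweep(docs, z, ndk, nkw_local, nk_local):
    """One collapsed Gibbs sweep over a processor's documents,
    updating only its local copy of the global counts."""
    for d, (doc, zd) in enumerate(zip(docs, z)):
        for i, w in enumerate(doc):
            k_old = zd[i]
            ndk[d, k_old] -= 1; nkw_local[k_old, w] -= 1; nk_local[k_old] -= 1
            # Standard collapsed Gibbs conditional for LDA.
            p = (ndk[d] + alpha) * (nkw_local[:, w] + beta) / (nk_local + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            zd[i] = k_new
            ndk[d, k_new] += 1; nkw_local[k_new, w] += 1; nk_local[k_new] += 1

for it in range(50):
    # Each processor sweeps against a snapshot of the global counts ...
    deltas = []
    for docs, z, ndk in states:
        nkw_local, nk_local = nkw.copy(), nk.copy()
        local_gibbs_sweep(docs, z, ndk, nkw_local, nk_local)
        deltas.append((nkw_local - nkw, nk_local - nk))
    # ... then the local count changes are merged into the global counts.
    for d_nkw, d_nk in deltas:
        nkw += d_nkw
        nk += d_nk
```

The approximation lies in the fact that processors sample concurrently against slightly stale global counts between merges; in a real deployment each partition would run on its own machine and the merge step would be a reduce over the count deltas.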