Spotlight
A Spectral Algorithm for Latent Dirichlet Allocation
Anima Anandkumar · Dean P Foster · Daniel Hsu · Sham M Kakade · Yi-Kai Liu

Wed Dec 5th 05:56 -- 06:00 PM @ Harveys Convention Center Floor, CC
Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by \emph{multiple} latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (\emph{i.e.}, third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on $k \times k$ matrices, where $k$ is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.

Author Information

Anima Anandkumar (Caltech)
Dean P Foster (University of Pennsylvania)
Daniel Hsu (Columbia University)
Sham M Kakade (Microsoft Research)
Yi-Kai Liu (National Institute of Standards and Technology)

More from the Same Authors