Topic modeling is useful for knowledge discovery, relevance ranking in search, and document classification. Recent years have seen significant progress on topic modeling technologies in machine learning, information retrieval, natural language processing, and other related fields. Given a collection of text documents, a topic model represents the relationship between terms and documents through latent topics. A topic is defined as a probability distribution over terms or as a cluster of weighted terms. A document is then viewed as a bag of terms generated from a mixture of latent topics.
Studies on topic modeling fall into two categories: probabilistic approaches and non-probabilistic (matrix factorization) approaches. In the probabilistic approaches, a topic is defined as a probability distribution over terms, and documents are defined as data generated from mixtures of topics. To generate a document, one chooses a distribution over topics. Then, for each term in that document, one chooses a topic according to the topic distribution and draws a term from the topic according to its term distribution. For example, Probabilistic Latent Semantic Indexing (PLSI) and Latent Dirichlet Allocation (LDA) are two widely used generative models. (See T. Hofmann, Probabilistic Latent Semantic Indexing, SIGIR, pages 50-57, 1999; and D. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet Allocation, JMLR, 3:993-1022, 2003.) In non-probabilistic approaches, a term-document matrix is projected into a K-dimensional topic space in which each axis corresponds to a topic. In the topic space, each document is represented as a linear combination of the K topics. Latent Semantic Indexing (LSI) is a representative non-probabilistic model. (See S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing By Latent Semantic Analysis, J AM SOC INFORM SCI, 41:391-407, 1990.) LSI decomposes the term-document matrix with singular value decomposition (SVD) under the assumption that topics are orthogonal. See also Non-negative Matrix Factorization (NMF) methods (e.g., D. D. Lee and H. S. Seung, Learning The Parts Of Objects By Non-Negative Matrix Factorization, Nature, 401:788-791, 1999; and D. D. Lee and H. S. Seung, Algorithms For Non-Negative Matrix Factorization, NIPS 13, pages 556-562, 2001) and Sparse Coding methods (e.g., H. Lee, A. Battle, R. Raina, and A. Y. Ng, Efficient Sparse Coding Algorithms, NIPS, pages 801-808, 2007; and B. A. Olshausen and D. J. Field, Sparse Coding With An Overcomplete Basis Set: A Strategy Employed By V1, VISION RES, 37:3311-3325, 1997).
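The generative process described above for the probabilistic approaches can be illustrated with a minimal sketch. It assumes the topic distribution of the document and the term distribution of each topic are already given; the function name `generate_document`, its parameters, and the toy vocabulary are hypothetical and are not taken from PLSI, LDA, or any particular library.

```python
import random


def generate_document(topic_term_probs, doc_topic_probs, vocabulary, length, rng=None):
    """Sample a bag of terms from a mixture of topics (illustrative only).

    topic_term_probs[k][v] is the probability of term v under topic k.
    doc_topic_probs[k] is the weight of topic k in this document.
    """
    rng = rng or random.Random()
    topic_ids = range(len(doc_topic_probs))
    terms = []
    for _ in range(length):
        # Choose a topic according to the document's topic distribution.
        k = rng.choices(topic_ids, weights=doc_topic_probs, k=1)[0]
        # Draw a term from that topic according to its term distribution.
        term = rng.choices(vocabulary, weights=topic_term_probs[k], k=1)[0]
        terms.append(term)
    return terms


# Hypothetical toy data: two topics over a four-term vocabulary.
vocabulary = ["gene", "dna", "ball", "team"]
topic_term_probs = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits "gene" and "dna"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "ball" and "team"
]
document = generate_document(topic_term_probs, [0.7, 0.3], vocabulary, 10)
```

The order of the sampled terms carries no meaning, which matches the bag-of-terms view of a document.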
One of the main challenges in topic modeling is scaling to millions or even billions of documents while maintaining a representative vocabulary of terms, which is necessary in many applications such as web search. A typical approach is to approximate the learning processes of an existing topic model.
Probabilistic topic models such as LDA and PLSI do not scale well. Their scalability challenge comes mainly from the need to update the term-topic matrix simultaneously so that it continues to satisfy the probability distribution assumptions. When the number of terms is large, which is inevitable in real applications, this problem becomes particularly severe. For LSI, the scalability challenge is due to the orthogonality assumption in the formulation; as a result, the problem must be solved by Singular Value Decomposition (SVD) and is therefore hard to parallelize.
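One way to read the coupling imposed by the probability distribution assumptions is sketched below. The example assumes that a topic is stored as non-negative term weights that must be rescaled to sum to one; the helper `normalize_topic` and the numbers are hypothetical and only illustrate why a change to one term touches every term of the topic.

```python
def normalize_topic(term_weights):
    """Rescale non-negative term weights so that they sum to 1."""
    total = sum(term_weights)
    return [w / total for w in term_weights]


# Hypothetical unnormalized weights for one topic over a four-term vocabulary.
weights = [2.0, 1.0, 1.0, 0.0]
before = normalize_topic(weights)

# Change the weight of a single term ...
weights[3] = 4.0
after = normalize_topic(weights)

# ... and the probability of every term in the topic changes with it.
changed = [i for i, (b, a) in enumerate(zip(before, after)) if b != a]
```

With a vocabulary of millions of terms, the same renormalization spans the whole column of the term-topic matrix, so the terms cannot be updated independently of one another.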
Most efforts to improve topic modeling scalability have modified existing learning methods, such as LDA. Newman et al. proposed Approximate Distributed LDA (AD-LDA), in which each processor performs a local Gibbs sampling iteration followed by a global update. (D. Newman, A. Asuncion, P. Smyth, and M. Welling, Distributed Inference For Latent Dirichlet Allocation, NIPS, 2008.) Two recent papers implemented AD-LDA as PLDA and modified AD-LDA as PLDA+, using MPI and MapReduce. (See Y. Wang, H. Bai, M. Stanton, W. Yen Chen, and E. Y. Chang, PLDA: Parallel Latent Dirichlet Allocation For Large-Scale Applications, AAIM, pages 301-314, 2009; Z. Liu, Y. Zhang, and E. Y. Chang, PLDA+: Parallel Latent Dirichlet Allocation With Data Placement And Pipeline Processing, TIST, 2010; R. Thakur and R. Rabenseifner, Optimization Of Collective Communication Operations In MPICH, INT J HIGH PERFORM C, 19:49-66, 2005; and J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing On Large Clusters, OSDI, 2004.) L. AlSumait, D. Barbara, and C. Domeniconi, the authors of On-Line LDA: Adaptive Topic Models For Mining Text Streams With Applications To Topic Detection And Tracking, ICDM, 2008, proposed purely asynchronous distributed LDA algorithms based on Gibbs Sampling or Bayesian inference, called Async-CGB or Async-CVB, respectively. In Async-CGB and Async-CVB, each processor performs a local computation step followed by a step of communicating with other processors. In all of these methods, the local processors need to maintain and update a dense term-topic matrix, usually in memory, which becomes a bottleneck for improving scalability. Similarly, online versions of stochastic LDA have been proposed. (See L. AlSumait, D. Barbara, and C. Domeniconi, On-Line LDA: Adaptive Topic Models For Mining Text Streams With Applications To Topic Detection And Tracking, ICDM, 2008; and M. D. Hoffman, D. M. Blei, and F. Bach, Online Learning For Latent Dirichlet Allocation, NIPS, 2010.)
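The pattern of a local step followed by a global update, described above for AD-LDA, can be sketched as follows. The sketch covers only the global update and omits the local Gibbs sampling sweep. It assumes the commonly described count-merging rule in which the change made by each processor is added to the previous global term-topic counts; the function name `global_update` and the toy counts are hypothetical, and the text above does not spell out the exact rule.

```python
def global_update(global_counts, local_counts_per_processor):
    """Merge per-processor term-topic counts into new global counts.

    Assumption: every processor starts its local sweep from a full copy of
    global_counts and returns its modified copy. The merge adds each
    processor's change (local minus old global) to the old global counts.
    """
    n_terms = len(global_counts)
    n_topics = len(global_counts[0])
    merged = [row[:] for row in global_counts]
    for local in local_counts_per_processor:
        for w in range(n_terms):
            for k in range(n_topics):
                merged[w][k] += local[w][k] - global_counts[w][k]
    return merged


# Hypothetical counts for two terms and two topics, held by two processors.
old_global = [[2, 0], [1, 1]]
processor_1 = [[1, 1], [1, 1]]  # moved one token of term 0 to topic 1
processor_2 = [[2, 0], [0, 2]]  # moved one token of term 1 to topic 1
new_global = global_update(old_global, [processor_1, processor_2])
```

Every processor in this sketch holds a full dense copy of the term-topic counts, which is the memory cost identified above as the bottleneck.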
Sparse methods have recently received a lot of attention in the machine learning community. These methods aim to learn sparse representations (simple models) hidden in the input data by using l1-norm regularization. Sparse Coding algorithms have been proposed for discovering basis functions that capture meta-level features in the input data. See, for example, H. Lee, A. Battle, R. Raina, and A. Y. Ng, Efficient Sparse Coding Algorithms, NIPS, pages 801-808, 2007; and B. A. Olshausen and D. J. Field, Sparse Coding With An Overcomplete Basis Set: A Strategy Employed By V1, VISION RES, 37:3311-3325, 1997. One justification for sparse methods is that the human brain has a similar sparse mechanism for information processing. For example, when Sparse Coding algorithms are applied to natural images, the learned bases resemble the receptive fields of neurons in the visual cortex. (B. A. Olshausen and D. J. Field, Sparse Coding With An Overcomplete Basis Set: A Strategy Employed By V1, VISION RES, 37:3311-3325, 1997.) Previous work on sparse methods has mainly focused on image processing. (R. Rubinstein, M. Zibulevsky, and M. Elad, Double Sparsity: Learning Sparse Dictionaries For Sparse Signal Approximation, IEEE T SIGNAL PROCES, pages 1553-1564, 2008.) The use of sparse methods for topic modeling was also proposed very recently by Chen et al. (Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell, Sparse Latent Semantic Analysis, NIPS Workshop, 2010.) Their motivation was not to improve scalability, and they made an orthogonality assumption (requiring an SVD). C. Wang and D. M. Blei have proposed to discover sparse topics based on a modified version of LDA. (C. Wang and D. M. Blei, Decoupling Sparsity And Smoothness In The Discrete Hierarchical Dirichlet Process, NIPS, 2009.)
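The way l1-norm regularization produces sparse representations can be illustrated with the soft-thresholding operator, the standard closed-form minimizer of 0.5 * (z - x)^2 + lam * |z| for a single coefficient. This is a minimal sketch of that one building block, not the algorithm of any of the cited works; the function name `soft_threshold` and the sample values are hypothetical.

```python
def soft_threshold(values, lam):
    """Apply soft-thresholding to each value.

    Each value is shrunk toward zero by lam, and any value whose magnitude
    is at most lam becomes exactly zero, which is the source of sparsity.
    """
    result = []
    for v in values:
        magnitude = max(abs(v) - lam, 0.0)
        if magnitude == 0.0:
            result.append(0.0)
        elif v > 0:
            result.append(magnitude)
        else:
            result.append(-magnitude)
    return result


# Hypothetical dense coefficients, e.g. the weights of one document on topics.
dense = [3.0, -0.5, 0.2, -2.0]
sparse = soft_threshold(dense, 1.0)
nonzero = sum(1 for v in sparse if v != 0.0)
```

A larger `lam` zeroes more coefficients, so the regularization weight directly trades reconstruction accuracy against sparsity.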