A topic model (TM) is a popular and important machine learning technique that has been widely used in text mining, network analysis, genetics, and many other domains. In a TM, a document is assumed to be characterized by a particular set of topics. In general, a topic is identified on the basis of the likelihood of term co-occurrence, and it defines a probability distribution over the words it can generate. For example, a TM may have topics “CAT” and “DOG.” Words such as milk, meow, and kitten can be assigned to the topic “CAT,” while words such as puppy, bark, and bone can be assigned to the topic “DOG.”
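As a minimal sketch of this idea, each topic can be represented as a probability distribution over words, from which individual words are then drawn. The topics, words, and probability values below are invented purely for illustration:

```python
import random

# Toy illustration (not a trained model): each topic is a probability
# distribution over words, mirroring the "CAT"/"DOG" example above.
# All probability values here are made up for illustration.
topics = {
    "CAT": {"milk": 0.40, "meow": 0.35, "kitten": 0.25},
    "DOG": {"puppy": 0.40, "bark": 0.35, "bone": 0.25},
}

def sample_word(topic_name, rng=random):
    """Draw one word from the given topic's word distribution."""
    words = list(topics[topic_name])
    weights = [topics[topic_name][w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(sample_word("CAT"))  # one of: milk, meow, kitten
```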
The Latent Dirichlet Allocation (LDA) model is a well-known example of a topic model. LDA is a generative model that allows sets of observations to be explained by unobserved groups, which account for why some parts of the data are similar. In LDA, each document is a mixture of a small number of topics, and each word's creation is attributable to one of the document's topics. Moreover, in LDA the topic distribution is assumed to have a Dirichlet prior.
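The text names no particular inference algorithm, so the following sketch uses collapsed Gibbs sampling, one standard way to fit LDA, in plain Python. The toy corpus, the Dirichlet hyperparameters `alpha` and `beta`, and the iteration count are all illustrative assumptions, not details from the text:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: a sketch, not an optimized implementation."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # z[d][i] is the topic assigned to the i-th word of document d.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]               # per-document topic counts
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # per-topic word counts
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the current assignment, then resample
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # P(topic t | everything else) is proportional to
                # (doc_topic + alpha) * (topic_word + beta) / (topic_total + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word

docs = [
    ["milk", "meow", "kitten", "meow"],
    ["puppy", "bark", "bone", "bark"],
    ["kitten", "milk", "meow"],
    ["bone", "puppy", "bark"],
]
z, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(doc_topic)  # each row sums to the length of the corresponding document
```

The resampling weights combine the two Dirichlet priors: `alpha` smooths the per-document topic mixture and `beta` smooths the per-topic word distribution, which is the "Dirichlet prior on the topic distribution" mentioned above.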
Web-scale corpora are significantly more complex than smaller, well-curated document collections and thus require high-capacity topic parameter spaces featuring up to millions of topics and vocabulary words. Processing a web-scale corpus with an LDA model, however, incurs high computational costs.
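A back-of-envelope estimate shows why these costs explode. A naive inference step (e.g., unoptimized collapsed Gibbs sampling) evaluates every topic for every token, so the work per sweep scales with the product of corpus size and topic count. The corpus and topic figures below are illustrative assumptions, not numbers from the text:

```python
# Back-of-envelope cost of one naive inference sweep over a web-scale corpus.
# Both quantities below are hypothetical, order-of-magnitude assumptions.
tokens = 10**11   # tokens in a hypothetical web-scale corpus
topics = 10**6    # topics, at the upper end mentioned above

# A naive sampler touches every topic for every token, so one full sweep
# costs on the order of tokens * topics elementary operations.
ops_per_sweep = tokens * topics
print(f"{ops_per_sweep:.1e} operations per sweep")  # 1.0e+17
```

Estimates like this are why web-scale LDA systems rely on sparsity-aware samplers, distributed training, or online variational inference rather than the naive per-token loop.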