The automatic and unsupervised discovery of topics in unlabeled data may be used to improve the performance of various kinds of classifiers (such as sentiment analysis) and natural language processing applications. Being unsupervised is both a blessing and a curse. It is a blessing because good labeled data is a scarce resource, so improving tools that depend on labeled data by extracting knowledge from the vast amounts of unlabeled data is very useful. It is a curse because the methods used to discover topics are generally computationally intensive.
A topic model—which is a probabilistic model for unlabeled data—may be used for the automatic and unsupervised discovery of topics in unlabeled data, such as a set of textual documents. Such a topic model is designed with the underlying assumption that words belong to sets of topics, where a topic is a set of words. For example, given a set of scientific papers, a topic model can be used to discover words that occur together (and therefore form a topic). One topic could include words such as “neuroscience” and “synapse”, while another topic could include words such as “graviton” and “boson”.
Topic models have many applications in natural language processing. For example, topic modeling can be a key part of text analytics such as Name Entity Recognition, Part-of-Speech Tagging, retrieval of information for search engines, etc. Topic modeling, and latent Dirichlet allocation (LDA) in particular, has become a must-have of analytics platforms and consequently it needs to be applied to larger and larger datasets.
Many times, applying methods of topic modeling to very large data sets, such as billions of documents, takes a prohibitive amount of time. As such, it would be beneficial to implement a topic modeling algorithm that produces good topic modeling results in less time.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.