The automatic and unsupervised discovery of topics in unlabeled data may be used to improve the performance of various kinds of classifiers (such as those used for sentiment analysis) and of natural language processing applications. Being unsupervised is both a blessing and a curse. It is a blessing because good labeled data is a scarce resource, so improving tools that depend on labeled data by extracting knowledge from the vast amounts of unlabeled data is very useful. It is a curse because the methods used to discover topics are generally computationally intensive.
A topic model—which is a probabilistic model for unlabeled data—may be used for the automatic and unsupervised discovery of topics in unlabeled data, such as a set of textual documents. Such a topic model is designed with the underlying assumption that words belong to sets of topics, where a topic is a set of words. For example, given a set of scientific papers, a topic model can be used to discover words that occur together (and therefore form a topic). One topic could include words such as “neuroscience” and “synapse”, while another topic could include words such as “graviton” and “boson”.
Topic models have many applications in natural language processing. For example, topic modeling can be a key part of text analytics tasks such as Named Entity Recognition, Part-of-Speech Tagging, and retrieval of information for search engines. Unfortunately, topic modeling is generally computationally expensive, and it often needs to be applied to significant amounts of data, sometimes under time constraints.
Some prior industry solutions are based on running a so-called collapsed Gibbs sampler on a statistical model called Latent Dirichlet Allocation (LDA). This algorithm is inherently sequential. Distributed and parallel solutions based on the collapsed Gibbs sampler are generally created by approximating the algorithm; however, this approach only works for coarse-grained parallel architectures, and fails to make use of highly data-parallel architectures such as Graphics Processing Units (GPUs).
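To illustrate why the collapsed Gibbs sampler is inherently sequential, the following is a minimal Python sketch of one common textbook formulation, not the implementation of any particular prior solution. The function name `collapsed_gibbs_lda`, the hyperparameters `alpha` and `beta`, and the count-array names are assumptions introduced here for illustration; documents are assumed to be lists of integer word identifiers.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper)."""
    rng = random.Random(seed)

    # Count statistics: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic,
                # computed from the counts of all *other* tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Draw a new topic and put the token back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assignments, doc_topic, topic_word
```

In this sketch, each draw reads counts that the immediately preceding draw has just updated, so the token loop cannot be executed in parallel without changing (that is, approximating) the algorithm.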
Recent generations of GPUs offer considerable computational power, and that power is expected to keep growing. However, running topic modeling tasks on a GPU is challenging because GPUs expose a computational model that is very different from that of ordinary CPUs (e.g., single-core and multicore processors). As such, algorithms that work well on ordinary CPUs need to be re-designed to be data-parallel for GPUs.
The lack of parallelism in implementations of the Gibbs sampler is often compensated by taking advantage of sparsity of the data. Specifically, a key characteristic of algorithms for topic modeling is that the matrices used to compute topic, word, and document statistics are typically sparse. Taking advantage of data sparsity in a Gibbs sampler allows such an algorithm to process data quickly, notwithstanding the lack of parallelism in the algorithm.
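As a rough illustration of how sparsity can reduce the work per sampled token, the sketch below splits the normalizing sum of the conditional distribution into three parts, so that only topics with nonzero counts are visited for two of them. This bucket decomposition is one known technique used by some sparse samplers and is not necessarily what any given implementation does. The function names, the dictionary-based sparse rows (`doc_counts`, `word_counts`), and the hyperparameters `alpha` and `beta` are assumptions made for this example.

```python
def dense_mass(doc_counts, word_counts, topic_totals, alpha, beta, vocab_size):
    """Normalizing sum computed by visiting every topic."""
    return sum(
        (doc_counts.get(t, 0) + alpha)
        * (word_counts.get(t, 0) + beta)
        / (topic_totals[t] + vocab_size * beta)
        for t in range(len(topic_totals))
    )


def sparse_mass(doc_counts, word_counts, topic_totals, alpha, beta, vocab_size):
    """Same sum, split so that two of the three parts touch only nonzero counts.

    doc_counts and word_counts map topic index -> nonzero count (sparse rows).
    """
    denom = [n + vocab_size * beta for n in topic_totals]

    # Smoothing part: independent of the current document and word, so a real
    # sampler could cache it and update it incrementally (recomputed here).
    smoothing = sum(alpha * beta / d for d in denom)

    # Document part: only topics that occur in the current document.
    doc_part = sum(n_dt * beta / denom[t] for t, n_dt in doc_counts.items())

    # Word part: only topics to which the current word is assigned somewhere.
    word_part = sum(
        (alpha + doc_counts.get(t, 0)) * n_tw / denom[t]
        for t, n_tw in word_counts.items()
    )

    return smoothing + doc_part + word_part
```

Both functions return the same value up to floating-point rounding, because (n_dt + alpha)(n_tw + beta) expands to alpha·beta + n_dt·beta + (alpha + n_dt)·n_tw. When most counts are zero, the document and word parts iterate over only a few topics.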
It would be beneficial to implement a topic modeling algorithm that is highly data-parallel and that also takes advantage of data sparsity, in order to create topic models more efficiently.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.