In machine learning applications such as natural language processing, it is often beneficial to automatically discover topics in a corpus of documents, a technique known as topic modeling. Since the topics associated with each document in the corpus are unknown parameters or latent variables that cannot be directly sampled, some means of statistical inference is needed to approximate the topics for each document in the corpus.
One such means of statistical inference is a collapsed Gibbs sampler on a statistical model called Latent Dirichlet Allocation (LDA). However, the collapsed Gibbs sampler has the drawback of being an inherently sequential algorithm. Thus, the collapsed Gibbs sampler does not scale well for topic modeling on a large corpus, such as for enterprise databases, search engines, and other data-intensive, high-performance computing (HPC) applications.
U.S. patent application Ser. No. 14/599,272 describes a non-collapsed or partially collapsed Gibbs sampler. Thus, rather than using a sequential collapsed Gibbs sampler, a non-collapsed or partially collapsed Gibbs sampler can be utilized to provide a scalable, parallel implementation on a highly parallel architecture such as a graphics processing unit (GPU) or another single instruction, multiple data (SIMD) architecture. The parallel LDA Gibbs sampler may be supported as one suitable algorithm for a data-parallel probabilistic programming compiler, as described in U.S. patent application Ser. No. 14/316,186.
In one phase of the LDA Gibbs sampler, new z values are randomly drawn from an appropriate discrete probability distribution. Conceptually, the phase may be considered a multi-step process of: 1) fetching parameters from main memory to create a table of relative probabilities for the possible choices of the z value, 2) constructing a prefix-sum table from the table of relative probabilities, and 3) searching the prefix-sum table, for example by binary search, to randomly draw the z value. When implementing this phase on a parallel architecture, a straightforward approach is to compute each z value for each discrete probability distribution in a separate thread. However, this approach presents challenges with regard to memory fetch performance. For example, if each thread merely retrieves its respective parameters in successive read cycles, then the memory accesses will tend to be scattered across different physical memory locations. Even if a cache is available, the cache will be unable to coalesce reads from such disparate memory locations.
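The per-distribution sampling described above (steps 2 and 3) can be sketched sequentially as follows. This is a minimal illustrative sketch, not the parallel implementation described in the referenced applications; the helper name `draw_z` and its signature are assumptions for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def draw_z(rel_probs, rng=random):
    """Draw one z value from a table of relative (unnormalized) probabilities."""
    # Step 2: build the prefix-sum table from the table of relative probabilities.
    prefix = list(accumulate(rel_probs))
    # Step 3: draw u uniformly in [0, total mass) and binary-search the
    # prefix-sum table; the insertion point is the sampled z value.
    u = rng.random() * prefix[-1]
    return bisect_right(prefix, u)
```

For example, `draw_z([1.0, 1.0, 2.0])` returns z = 2 about half the time, since the third entry carries half of the total relative probability mass.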
To optimize memory locality, the read cycles may instead be organized such that all threads cooperate on each read cycle to read the needed parameters for one specific thread. In other words, access to the parameters is organized in transposed form, allowing the parameters to be read from contiguous memory locations. However, because each thread reads parameters that are needed by other threads, an exchange of information is necessary to finish computing the z values for each thread. For some parallel hardware architectures such as GPUs, this information exchange may require a complete matrix transposition of the fetched parameters, followed by a computation of all prefix sums to fill the prefix-sum table. Unfortunately, this processing may undermine the memory fetch performance gains from using transposed access in the first instance.
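The contrast between scattered and transposed access can be illustrated with a simplified model that counts how many distinct cache lines a group of cooperating threads touches in one read cycle. The thread count, row-major parameter layout, and 32-element cache line below are illustrative assumptions, not details taken from the referenced implementations.

```python
def lines_touched(addresses, line_size):
    """Number of distinct cache lines covering a set of element addresses."""
    return len({addr // line_size for addr in addresses})

NUM_THREADS = 32          # threads cooperating per read cycle (illustrative)
NUM_PARAMS = 32           # parameters per thread (illustrative)
ROW_STRIDE = NUM_PARAMS   # each thread's parameters stored as one contiguous row

# Scattered access: in cycle j, each thread t fetches its own j-th parameter,
# so the addresses are strided by a full row (shown here for j = 0).
scattered = [t * ROW_STRIDE + 0 for t in range(NUM_THREADS)]

# Transposed access: in cycle i, all threads cooperate to fetch the parameters
# of thread i, so the addresses are contiguous (shown here for i = 0).
transposed = [0 * ROW_STRIDE + t for t in range(NUM_THREADS)]
```

Under these assumptions, the scattered pattern touches one cache line per thread in a cycle, while the transposed pattern touches a single line, which is why the transposed form reads from contiguous memory more efficiently.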
Based on the foregoing, there is a need for a high performance Gibbs sampler that is suited for highly parallel architectures such as GPUs and other SIMD architectures.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.