The present application relates generally to data processing. It finds particular application in conjunction with distributed Gibbs sampling and will be described with particular reference thereto. However, it is to be understood that it also finds application in any iterative process which uses and updates globally aggregated statistics over distributed data and is not necessarily limited to the aforementioned application.
Gibbs sampling is a key method for computing a posterior distribution over latent variables given some evidence. For instance, one might have the hypothesis that within a population of consumers there are two distinct subgroups, perhaps gender based, that differ in their mean buying rates. When group membership is not explicitly labeled, it needs to be inferred from other features of the observed data. Gibbs sampling could be used to estimate the probability that a user belongs to a particular group and to infer the average buying rates for each group. Further, Gibbs sampling has been instrumental in opening up new classes of hierarchical models that encode a wide variety of assumptions.
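The two-subgroup example above can be sketched as a simple Gibbs sampler. The following is a minimal illustration, not the method of the present application: it assumes two groups with unknown mean buying rates, a known common spread `sigma`, and equal prior group probabilities, and alternately samples the latent group labels and the group means (all function and variable names are illustrative).

```python
import math
import random

def gibbs_two_groups(rates, iters=200, sigma=1.0):
    """Illustrative Gibbs sampler for a two-group mixture of buying rates.

    Alternates between (a) sampling each user's latent group label given the
    current group means and (b) sampling each group mean given the current
    labels (posterior mean under a flat prior).
    """
    n = len(rates)
    z = [random.randint(0, 1) for _ in range(n)]   # latent group labels
    mu = [min(rates), max(rates)]                  # initial group means
    for _ in range(iters):
        # Sample each label in proportion to the likelihood under each group.
        for i, r in enumerate(rates):
            w = [math.exp(-(r - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            z[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
        # Sample each group mean around the mean of its current members.
        for g in (0, 1):
            members = [r for r, zi in zip(rates, z) if zi == g]
            if members:
                m = sum(members) / len(members)
                mu[g] = random.gauss(m, sigma / math.sqrt(len(members)))
    return z, mu
```

Run on well-separated data, the sampled means converge near the two underlying buying rates, and the labels give the inferred group memberships.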
Now that data sets have become available for millions of users engaging in many activities over time, it is desirable to apply these models to such huge datasets. However, the data no longer fits in the main or primary memory (e.g., random access memory (RAM)) of a single computer. Virtual memory could be used to store the data partly on secondary memory (e.g., a hard drive) as needed. However, Gibbs sampling methods require iterating over all of the data frequently, which would lead to processes quickly becoming input/output (I/O) bound.
One solution is to distribute the computation over multiple compute nodes (e.g., computers), each of which keeps a portion of the data completely in main memory. Global statistics are calculated by sending messages between compute nodes, which aggregate the messages to get a local estimate of global statistics. Distributing Gibbs sampling requires that the compute nodes exchange statistics on a regular basis to create global statistics reflecting the current beliefs about the data as a whole.
In the most naïve approach for exchanging statistics, each processor p keeps an array of statistics (e.g., counts) Vp representing its beliefs about the global dataset. Further, each processor p can receive an update from processor q with new counts Vq′. On receiving new counts Vq′ from processor q, processor p subtracts out the old counts Vq previously received from processor q and adds in the new counts Vq′ to get updated counts Vp′ as follows: Vp′=Vp−Vq+Vq′. In this way, processor p maintains an estimate of the global statistic that is informed by all processors.
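The naïve exchange can be sketched as follows. This is an illustrative sketch, not the claimed system: each processor stores, per peer, the last counts received from that peer so that they can be subtracted out when an update arrives (the class and attribute names are assumptions for illustration).

```python
class NaiveProcessor:
    """Illustrative processor implementing the naive count exchange."""

    def __init__(self, num_bins):
        self.V = [0] * num_bins   # local estimate of the global counts
        self.last_from = {}       # peer id -> counts last received from that peer

    def receive(self, q, Vq_new):
        """Apply the update V_p' = V_p - V_q + V_q' for a message from peer q."""
        Vq_old = self.last_from.get(q, [0] * len(self.V))
        self.V = [v - old + new for v, old, new in zip(self.V, Vq_old, Vq_new)]
        self.last_from[q] = list(Vq_new)
```

Note that the `last_from` dictionary is exactly the per-peer storage whose cost is discussed next: it grows with the number of sending processors.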
A challenge with exchanging statistics as described above is that it requires each receiving processor to store the previous counts received from the sending processor (i.e., Vq) to know what should be subtracted. When there are a large number of processors, this requires storing thousands or tens of thousands of copies of V, which is untenable.
One solution, described in Asuncion et al., “Distributed Gibbs Sampling for Latent Variable Models”, Scaling Up Machine Learning, Cambridge University Press, 2012, is to use a sampling process to approximately remove the prior counts previously received from processor q. A sample the same size as the message from q is randomly drawn using processor p's current global generative model of the corpus and subtracted from p's global estimate. Processor q's counts are then added in to p's global estimate Vp. One can make a rough analogy here with exponential averaging in which a mean is updated incrementally with a new value: Vp=Vp−α*Vp+α*Vq.
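The exponential-averaging analogy can be written as a one-line update. The following is a sketch of the analogy only, not of the sampling-based removal itself: it blends the incoming counts into the local estimate with a mixing weight α, so no per-peer history is stored (the function name and parameters are illustrative).

```python
def exp_average_update(Vp, Vq, alpha):
    """Illustrative memoryless update: V_p <- V_p - alpha*V_p + alpha*V_q.

    Equivalent to (1 - alpha)*V_p + alpha*V_q, applied element-wise, so the
    receiver needs no record of what peer q sent previously.
    """
    return [(1 - alpha) * vp + alpha * vq for vp, vq in zip(Vp, Vq)]
```

As the following paragraph notes, this memorylessness is what makes the approach attractive, and also the source of its bias when processors hold data from different distributions.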
This solution is attractive as it does not require any additional memory beyond that required to store the statistics Vp. However, the exponential average assumes that all of the samples come from the same distribution, which is not necessarily true. The different processors may contain subsets of data drawn from different sources causing them to have very different distributions. This solution is therefore only approximate and can lead to significant statistical biases that slow or even prevent convergence.
The present application provides a new and improved system and method which overcome the above-referenced problems and others.