This invention relates generally to latent Dirichlet allocation (“LDA”) analysis of a dataset to discover themes or topics, and more particularly to parallel LDA analysis of a distributed dataset comprising a large collection of unstructured data, referred to herein as documents, in a shared-nothing massively parallel processing (MPP) database.
Documents of a dataset can be represented as random mixtures of latent topics, where each topic is characterized by a probability distribution over a vocabulary of data elements such as words. Each document comprises a collection of words and may comprise multiple topics. Given a large corpus of text, i.e., a dataset, LDA can infer a set of latent topics from the corpus, each topic being represented as a multinomial distribution over words, denoted P(w|z), and can infer the topic distribution for each document, represented as a multinomial distribution over topics, denoted P(z|d). All of the documents in a corpus share the same set of topics, but each document has a different mixture (distribution) of topics. Gibbs sampling has been widely used for LDA inference because it is simple, fast, has few adjustable parameters, and is easy to parallelize and scale.
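The collapsed Gibbs sampling inference described above may be sketched as follows. This is an illustrative, single-machine sketch only, not the parallel in-database method of the invention; the function name, corpus encoding (documents as lists of integer word ids), and hyperparameter values alpha and beta are assumptions chosen for the example:

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in
    [0, vocab_size). Returns (phi, theta), where phi[k][w] estimates
    P(w|z=k) and theta[d][k] estimates P(z=k|d).
    """
    rng = random.Random(seed)
    # Sufficient statistics: word-topic counts, document-topic counts,
    # and per-topic totals.
    n_wz = [[0] * num_topics for _ in range(vocab_size)]
    n_dz = [[0] * num_topics for _ in range(len(docs))]
    n_z = [0] * num_topics
    # Randomly initialize a topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            n_wz[w][z] += 1; n_dz[d][z] += 1; n_z[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                n_wz[w][z] -= 1; n_dz[d][z] -= 1; n_z[z] -= 1
                # Full conditional:
                # P(z=k | rest) ∝ (n_dz+alpha) * (n_wz+beta) / (n_z+V*beta)
                weights = [(n_dz[d][k] + alpha) * (n_wz[w][k] + beta)
                           / (n_z[k] + vocab_size * beta)
                           for k in range(num_topics)]
                r = rng.random() * sum(weights)
                for k, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        z = k
                        break
                assignments[d][i] = z
                n_wz[w][z] += 1; n_dz[d][z] += 1; n_z[z] += 1
    # Smoothed estimates of P(w|z) and P(z|d) from the final counts.
    phi = [[(n_wz[w][k] + beta) / (n_z[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    theta = [[(n_dz[d][k] + alpha) / (len(docs[d]) + num_topics * alpha)
              for k in range(num_topics)] for d in range(len(docs))]
    return phi, theta
```

The sampler keeps only count tables (the "memory requirement" discussed below); each resampling step touches one word's counts, which is what makes the algorithm amenable to parallelization across documents.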
Most existing LDA implementations are built upon MPI or Map/Reduce and read/write data from/to file systems, including local file systems, networked file systems, and distributed file systems such as the Hadoop distributed file system (HDFS). LDA has a large memory requirement, since it is necessary to aggregate results in memory for processing. MPI and Map/Reduce are batch processing systems, and, as such, they can manipulate memory to meet these memory requirements without disrupting other ongoing processing tasks. This is not true for relational databases. There are no in-database SQL-like implementations of LDA for relational databases (RDBMS), particularly not for large distributed shared-nothing MPP databases. In contrast to reading and writing data in file systems, databases read and write data in parallel in tables using queries, and such queries should not consume excessive memory. Furthermore, Hadoop and other batch processing systems have parallel mechanisms that differ from those of databases, so batch processing implementations of LDA for file systems are not readily adaptable to databases.
It is desirable to provide scalable, memory-efficient parallel LDA implementations in shared-nothing MPP databases to enable in-database topic modeling and topic-based data analytics, and it is to these ends that the present invention is directed.