The present invention relates to an apparatus, a program, and a method for clustering a plurality of documents.
LDA (Latent Dirichlet Allocation) is known as an algorithm for analyzing a set of documents (“a document set”) to cluster the documents.
When a document set is analyzed by LDA, a computer has to execute a number of processing steps that is equal to or greater than the square of the number of words included in the document set to be analyzed. This can make it difficult to perform LDA analysis on a document set including, for example, tens or hundreds of millions of documents existing on a network.
The processes of LDA may be executed in parallel in a distributed processing environment. However, even if the processes are parallelized in such a distributed processing environment, it would still be difficult to analyze a document set including tens or hundreds of millions of documents.
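The scale of the problem described above can be illustrated with a short arithmetic sketch. It assumes only the lower bound stated in this section (at least the square of the number of words) and an ideal, overhead-free split of the work across workers. The function names, the word count, and the worker count are hypothetical values chosen for illustration and do not come from the original text.

```python
def min_steps(num_words: int) -> int:
    """Lower bound on processing steps: the square of the word count."""
    return num_words ** 2


def min_steps_per_worker(num_words: int, num_workers: int) -> int:
    """Steps per worker under an ideal even split.

    Communication and synchronization costs are ignored, which is an
    optimistic assumption; a real distributed run would be slower.
    """
    return -(-min_steps(num_words) // num_workers)  # ceiling division


if __name__ == "__main__":
    words = 10**9      # hypothetical document set containing one billion words
    workers = 1_000    # hypothetical number of parallel workers
    print(min_steps(words))                      # total lower bound
    print(min_steps_per_worker(words, workers))  # lower bound per worker
```

Under these assumed figures, each of the 1,000 workers would still face on the order of 10^15 steps, which is consistent with the observation that parallelization alone does not make the analysis practical.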