The following relates to the document processing arts. It is described with example reference to embodiments employing probabilistic hierarchical clustering in which documents are represented by a bag-of-words format. However, the following is also applicable to non-hierarchical probabilistic clustering, to other types of clustering, and so forth.
In typical clustering systems, a set of documents is processed by a training algorithm that classifies the documents into various classes based on document similarities and differences. For example, in one approach the documents are represented by a bag-of-words format in which counts are stored for keywords or for words other than certain frequent and typically semantically uninteresting stop words (such as “the”, “an”, “and”, or so forth). Document similarities and differences are measured in terms of the word counts, ratios, or frequencies, and the training partitions documents into various classes based on such similarities and differences. The training further generates probabilistic model parameters indicative of word counts, ratios, or frequencies characterizing the classes. For example, a ratio of the count of each word in the documents of a class respective to the total count of words in the documents of the class provides a word probability or word frequency modeling parameter. Optionally, the classes are organized into a hierarchy of classes, in which the documents are associated with leaf classes and ancestor classes identify or associate semantically or logically related groupings of leaf classes. Once the training is complete, the clustering system can be used to provide a convenient and intuitive interface for user access to the clustered documents.
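As a minimal sketch of the word probability parameter described above, the ratio of a word's count in a class to the total word count of that class could be computed as follows. The stop-word list, the whitespace tokenization, and the names `bag_of_words` and `class_word_probabilities` are assumptions made for illustration and do not come from any particular clustering system.

```python
from collections import Counter

# Assumed, illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "an", "and", "a", "of"}

def bag_of_words(text):
    """Count the non-stop words of one document (whitespace tokenization assumed)."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def class_word_probabilities(documents):
    """Ratio of each word's count in the class to the class's total word count."""
    totals = Counter()
    for doc in documents:
        totals.update(bag_of_words(doc))
    n = sum(totals.values())
    return {word: count / n for word, count in totals.items()} if n else {}

# Two toy documents assigned to the same class.
probs = class_word_probabilities(["the cat sat", "an angry cat and a dog"])
```

Here the class contains five counted words in total, two of which are "cat", so `probs["cat"]` is the ratio 2/5.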
A problem arises, however, in that the classification system generated by the initial cluster training is generally static. The probabilistic modeling parameters are computed during the initial training by counting words in documents and classes. If a document in the clustering system is moved from one class to another class, or if a class is split, or if existing classes are merged, or so forth, then the probabilistic modeling parameters computed during the training are no longer accurate.
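The staleness problem can be illustrated with a small sketch. It assumes a hypothetical two-class system in which each document is already reduced to a list of counted words; the class names, the documents, and the helper `word_probabilities` are invented for illustration only.

```python
from collections import Counter

def word_probabilities(class_docs):
    """Per-class word probabilities: word count divided by total class word count."""
    totals = Counter()
    for words in class_docs:
        totals.update(words)
    n = sum(totals.values())
    return {w: c / n for w, c in totals.items()} if n else {}

# Hypothetical clustering result; each document is a list of counted words.
classes = {
    "pets": [["cat", "dog"], ["cat", "fish"]],
    "cars": [["engine", "wheel"]],
}

# Parameters as computed once, during the initial training.
trained = {name: word_probabilities(docs) for name, docs in classes.items()}

# Move one document from "pets" to "cars" without retraining.
classes["cars"].append(classes["pets"].pop())

# Parameters that would describe the classes as they now stand.
fresh = {name: word_probabilities(docs) for name, docs in classes.items()}

# Classes whose stored parameters no longer match their contents.
stale = {name for name in classes if trained[name] != fresh[name]}
```

In this toy case both the source class and the destination class end up in `stale`, since the moved document's words leave one class's counts and enter the other's.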
To maintain up-to-date probabilistic modeling parameters, the clustering system can be retrained after each update (such as after each document or class move, after each class split or merge, or so forth). However, a large clustering system may contain tens or hundreds of thousands of documents, or more, with each document containing thousands, tens of thousands, or more words. Accordingly, retraining of the clustering system is typically relatively slow. For document bases of tens or hundreds of thousands of documents, each including thousands or tens of thousands of words, retraining can take several minutes or longer. Such long time frames are not conducive to performing real-time updates of the hierarchy of classes. Moreover, the effect of such retraining will generally not be localized to the documents or classes that have been moved, merged, split, or otherwise updated. Rather, retraining the clustering system to account for an update in one region of the hierarchy of classes may have unintended consequences for other regions that may be far away from the updated region.