The present invention relates generally to classifying information. More particularly, the invention provides a method and system for clustering, such as clustering of documents, for analysis purposes. In a specific aspect, the present invention provides a way of classifying that correctly associates items (e.g., documents) to be classified with one or more appropriate pre-defined categories, which define the items based upon aspects of an initial organization structure. Clustering can be used to group items into clusters, which serve as categories. Although the invention has been described in terms of documents, it has a much broader range of applicability. For example, the invention can be applied to images, DNA sequences, purchase transactions, financial records, and species descriptions.
Information should be organized to be useful. Such organization would often allow relevant information to be found when it is needed. Filing systems, such as card catalogues, are examples of information organization technology. Information is often classified by category and information about the same category is grouped together. The classification can also be recorded in an electronic system, rather than a card catalogue. Classification is valuable not only for books or other physical documents, as in the card catalogue case, but also for electronic documents, such as web pages and presentations, as well as for other kinds of items, such as images and data points. In these examples, determining the appropriate classification for information can be a challenge.
Automated classification technology can reduce the human effort otherwise required to classify items. Learning-based automatic classification systems take as input a set of categories and a set of training examples, that is, items that should belong to each category. They use the training data to build a model relating the features of an item to the categories it should belong to. They then use this model to automatically classify new items into the appropriate categories, often achieving high reliability. Techniques for performing such classification are described in Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York, N.Y.: Wiley.
For example, a company may have collected together thousands of documents pertinent to the company, for the purpose of sharing them among the employees of the company. But employees will be much more likely to find documents of interest to them if the documents are classified by category. In this case, the items to be classified are the documents, and the categories could be a variety of categories that the documents might be about. The features of a document could be the frequency of occurrence of each of the words that occur in the document. The model could describe, for each category, typical word frequencies of documents about that category. The system would classify a new document into the category or categories with the most similar word frequencies.
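By way of illustration only, the word-frequency classification described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the function names (word_frequencies, train, classify), the use of an averaged frequency profile per category, and the use of cosine similarity as the measure of "most similar word frequencies" are all assumptions introduced for the example.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each lowercase word in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse word-frequency dictionaries."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Build the model: one averaged word-frequency profile per category.

    `examples` maps each category name to a list of training texts
    (an assumed input format chosen for this sketch).
    """
    model = {}
    for category, texts in examples.items():
        profile = Counter()
        for text in texts:
            for word, freq in word_frequencies(text).items():
                profile[word] += freq / len(texts)
        model[category] = dict(profile)
    return model


def classify(model, text):
    """Return the category whose profile is most similar to the text."""
    features = word_frequencies(text)
    return max(model, key=lambda category: cosine(features, model[category]))


# Hypothetical training data, invented for this example.
training = {
    "finance": ["quarterly revenue and profit report", "budget revenue forecast"],
    "engineering": ["software build and test pipeline", "code review and software release"],
}
model = train(training)
label = classify(model, "revenue forecast for the quarter")
```

In this sketch each new document is assigned to exactly one category; assigning a document to several categories, as the text also contemplates, could be approximated by returning every category whose similarity exceeds a chosen threshold.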
These classification systems are known as supervised learning systems, because they generally attempt to reproduce the results of the training examples. A disadvantage of such systems is that they require both categories and training examples as inputs, and it may require extensive human labor to provide these categories and training examples.
An alternative to relying on training data is the use of so-called “unsupervised” approaches. For example, clustering algorithms attempt to automatically organize items into groups or a hierarchy based only on similarities of the features of the items. For example, they would group together documents with similar words. Since they generally do not require training data, they require less human-supplied information than classification systems. On the other hand, since they are not supervised, the clusters they find may not correspond to meaningful groupings that humans would have made. Further human intervention is typically required to understand and name the resulting clusters, so that they can form the basis of a categorization useful to humans.
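By way of illustration only, grouping documents with similar words can be sketched in Python as follows. This is a minimal sketch assuming single-link agglomerative clustering over raw word counts with a cosine similarity cutoff; the function names and the threshold value of 0.3 are assumptions introduced for the example and are not drawn from any of the techniques discussed here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count occurrences of each lowercase word in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(value * b[word] for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Merge the two most similar clusters until no pair reaches the threshold.

    Returns a list of clusters, each a list of indices into `texts`.
    The threshold is an assumed tuning parameter.
    """
    vectors = [bag_of_words(text) for text in texts]
    clusters = [[i] for i in range(len(texts))]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, (x, y) = max(
            (
                (link(clusters[x], clusters[y]), (x, y))
                for x in range(len(clusters))
                for y in range(x + 1, len(clusters))
            ),
            key=lambda pair: pair[0],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical documents, invented for this example.
docs = [
    "revenue profit budget",
    "budget revenue forecast",
    "software build test",
    "software test release",
]
groups = cluster(docs)
```

The result is only a list of index groups with no names attached, which reflects the limitation noted above: a person must still inspect each cluster to decide what it means and whether it is a useful category.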
Related art for clustering can be found in Fraley, C. and A. E. Raftery, How many clusters? Which clustering method? Answers via model-based cluster analysis, Computer Journal, 41, 578-588, 1998; M. Iwayama and T. Tokunaga. Hierarchical bayesian clustering for automatic text classification. In Proceedings of the International Joint Conference on Artificial Intelligence, 1995; C. Fraley. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20:270-281, 1999; Jain & Dubes, Algorithms for Clustering Data. Prentice Hall, 1988; P. Willett, Document Clustering Using an Inverted File Approach, Journal of Information Science, Vol. 2 (1980), pp. 223-31; Hofmann, T. and Puzicha, J. Statistical models for co-occurrence data. AI-MEMO 1625, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (1998); U.S. Pat. No. 5,832,182, Zhang, et al., Method and system for data clustering for very large databases; U.S. Pat. No. 5,864,855, Ruocco, et al., Parallel document clustering process; and U.S. Pat. No. 5,857,179, Vaithyanathan, et al., Computer method and apparatus for clustering documents and automatic generation of cluster keywords.
Another example of a conventional technique is described in “Learning to Classify Text from Labeled and Unlabeled Documents”, Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98). This paper shows how to combine labeled documents from a training set with unlabeled documents to create a superior training set. But the approach in the paper does not alter the starting taxonomy. It adds new documents to the starting taxonomy, but does not create new categories. In other words, it is concerned generally with improving classification by adding unlabeled items, which is limiting.
From the above, it is seen that an improved way for organizing information is highly desirable.