1. Field of the Invention
The present invention generally relates to information retrieval using text mining and, more particularly, to a lightweight document clustering method that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters or, by varying a single parameter, into a few dozen clusters.
2. Background Description
The objective of document clustering is to group similar documents together, assigning them to the same implicit topic. Document clustering was originally of interest because of its ability to improve the effectiveness of information retrieval. Standard information retrieval techniques, such as nearest neighbor methods using cosine distance, can be very efficient when combined with an inverted list of word to document mappings. These same techniques for information retrieval perform a variant of dynamic clustering, matching a query or a full document to their most similar neighbors in the document database. Thus, standard information retrieval techniques are efficient and dynamically find similarity among documents, reducing the value for information retrieval purposes of finding static clusters of large numbers of similar documents. See, for example, Chapter 6, “Techniques”, pp. 305-312, Readings in Information Retrieval, K. Sparck-Jones and P. Willet, editors, Morgan Kaufmann, 1997.
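The retrieval approach described above, cosine similarity combined with an inverted list of word to document mappings, can be sketched as follows. This is only an illustrative sketch: the function names build_index and most_similar, the whitespace tokenization, and the use of raw term counts as weights are assumptions made for the example and are not taken from the cited reference.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Return (inverted index, per-document term counts, vector norms)."""
    index = defaultdict(set)  # word -> ids of the documents that contain it
    vectors, norms = [], []
    for doc_id, text in enumerate(docs):
        counts = Counter(text.lower().split())  # assumed: whitespace tokens
        vectors.append(counts)
        norms.append(math.sqrt(sum(c * c for c in counts.values())))
        for word in counts:
            index[word].add(doc_id)
    return index, vectors, norms

def most_similar(query, index, vectors, norms, top_n=3):
    """Rank documents sharing at least one word with the query by cosine similarity."""
    q = Counter(query.lower().split())
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    # Only documents reachable through the inverted lists are ever scored.
    candidates = set()
    for word in q:
        candidates |= index.get(word, set())
    scored = []
    for doc_id in candidates:
        dot = sum(q[w] * vectors[doc_id][w] for w in q)
        scored.append((dot / (q_norm * norms[doc_id]), doc_id))
    # Highest similarity first; ties broken by document id.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:top_n]

docs = [
    "printer will not print",
    "printer paper jam",
    "screen is blank",
]
index, vectors, norms = build_index(docs)
results = most_similar("printer jam", index, vectors, norms)
```

Because candidates are gathered from the inverted lists, the third document, which shares no word with the query, is never compared against it. This is the source of the efficiency noted above.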
The advent of the web has renewed interest in clustering documents in the context of information retrieval. Instead of pre-clustering all documents in a database, the results of a query search can be clustered, with documents appearing in multiple clusters. Instead of presenting a user with a linear list of related documents, the documents can be grouped in a small number of clusters, perhaps ten, so that the user has an overview of the different documents that have been found in the search and their relationship within similar groups of documents. One approach to this type of visualization and presentation is described in O. Zamir, O. Etzioni, O. Madani, and R. Karp, “Fast and Intuitive Clustering of Web Documents”, Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Morgan Kaufmann, 1997. Here again though, direct retrieval and the linear list remain effective, especially when the user is given a “more like this” option that finds a subgroup of documents representing the cluster of interest to the user.
Document clustering can be of great value for tasks other than immediate information retrieval. Among these tasks are
summarization and label assignment, or
dimension reduction and duplication elimination.
These concepts can be illustrated by way of a help-desk example, where users submit problems or queries online to the vendor of a product. Each submission can be considered a document. By clustering the documents, the vendor can obtain an overview of the types of problems the customers are having. For example, a computer vendor might discover that printer problems comprise a large percentage of customer complaints. If the clusters form natural problem types, they may be assigned labels or topics. New user problems may then be assigned a label and sent to the problem queue for appropriate response. Any number of methods can be used for document categorization once the appropriate clusters have been identified. Typically, the clusters or categories number no more than a few hundred and often fewer than a hundred.
Not all users of a product report unique problems to the help-desk. It can be expected that most problem reports are repeat problems, with many users experiencing the same difficulty. Given enough users who report the same problem, a FAQ (Frequently Asked Questions) report may be created. To reduce the number of documents in the database of problem reports, redundancies in the documents must be detected. Unlike the summary of problem types, many problems will be similar but still have distinctions that are critical. Thus, while the number of clusters needed to eliminate duplication of problem reports can be expected to be much smaller than the total number of problem reports, the number of clusters is necessarily relatively large, much larger than needed for summarization of problem types.
The classical k-means technique described by J. Hartigan and M. Wong in “A k-Means Clustering Algorithm”, Applied Statistics, 1979, can be applied to document clustering. Its weaknesses are well known. The number of clusters k must be specified prior to application. The summary statistic is a mean of the values for each cluster. The individual members of the cluster can have a high variance, and the mean may not be a good summary of the nearest neighbors that are typically found in a search procedure. As the number of clusters grows, for example to thousands of clusters, k-means clustering becomes untenable, approaching O(n^2) comparisons, where n is the number of documents.
More recent attention has been given to hierarchical agglomerative methods as described by A. Griffiths, H. Luckhurst and P. Willet in “Using Interdocument Similarity Information in Document Retrieval Systems”, Readings in Information Retrieval, pp. 365-373, K. Sparck-Jones and P. Willet, editors, Morgan Kaufmann, 1997. The documents are recursively merged bottom up, yielding a decision tree of recursively partitioned clusters. The distance measures used to find similarity vary from single-link to more computationally expensive ones, but they are closely tied to nearest-neighbor distance. The algorithm works by recursively merging the single best pair of documents or clusters, making the computational costs prohibitive for document collections numbering in the tens of thousands.
To cluster very large numbers of documents, possibly with a large number of clusters, some compromises must be made to reduce both the number of indexed words and the number of expected comparisons. In B. Larsen and C. Aone, “Fast and Effective Text Mining Using Linear-time Document Clustering”, Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, pp. 16-22, ACM, 1999, indexing of each document is reduced to the twenty-five highest scoring TF-IDF (term frequency and inverse document frequency) words (see G. Salton and C. Buckley, “Term-Weighting Approaches in Automatic Text Retrieval”, Readings in Information Retrieval, pp. 323-328, K. Sparck-Jones and P. Willet, editors, Morgan Kaufmann, 1997), and then k-means is applied recursively, for k=9. While efficient, this approach has the classical weaknesses associated with k-means document clustering. A hierarchical technique that also works in steps with a small, fixed number of clusters is described in D. Cutting, D. Karger, J. Pedersen, and J. Tukey, “Scatter/Gather: a Cluster-based Approach to Browsing Large Document Collections”, Proceedings of the 15th ACM SIGIR, 1992.
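The reduced indexing step mentioned above, keeping only the highest scoring TF-IDF words of each document, might be sketched as follows. The weighting used here (raw term frequency multiplied by log(N/df)) is one common TF-IDF variant chosen as an assumption for illustration; the cited works may use a different weighting. The function name top_keywords and the whitespace tokenization are likewise hypothetical.

```python
import math
from collections import Counter

def top_keywords(docs, k=25):
    """Keep, for each document, its k highest scoring words under an assumed TF-IDF weighting."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each word appears.
    df = Counter()
    for words in tokenized:
        df.update(set(words))
    reduced = []
    for words in tokenized:
        tf = Counter(words)
        # Assumed weighting: term frequency times log(N / document frequency).
        scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        # Highest score first; ties broken alphabetically for repeatable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        reduced.append(ranked[:k])
    return reduced

docs = ["printer jam printer", "printer offline", "screen blank"]
keywords = top_keywords(docs, k=1)
```

In this small example the word "printer" occurs in two of the three documents, so its score is lower than that of the rarer words "jam" and "offline", even where it appears twice in one document.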
It is therefore an object of the present invention to provide a lightweight document clustering method that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters.
It is another object of the invention to provide a document clustering method of the type described wherein by varying a single parameter, the documents can be grouped into a few dozen clusters.
According to the invention, the method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. An efficient procedure for clustering is specified in two parts: (a) compute the k most similar documents for each document in the collection, and (b) group the documents into clusters using these similarity scores. The method is intended to operate in high dimensions with tens of thousands of documents and is capable of clustering a database into the moderate number of clusters needed for summarization and label assignment or the very large number of clusters needed for the elimination of duplication. The method has been evaluated on a database of over 50,000 customer service problem reports that are reduced to 3,000 clusters and 5,000 exemplar documents. Results demonstrate efficient clustering performance with excellent group similarity measures.
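The two-part structure described above can be illustrated with the following sketch. It is not the claimed procedure: this summary does not specify the similarity score or the grouping rule, so the example assumes that similarity is the count of shared keywords and that two documents are grouped whenever one lists the other as a neighbor with a score of at least min_score. The names top_neighbors, group and min_score are hypothetical.

```python
from collections import defaultdict

def top_neighbors(keywords, k):
    """Part (a): for each document, the k documents sharing the most keywords (assumed score)."""
    index = defaultdict(set)  # keyword -> ids of the documents indexed under it
    for doc_id, words in enumerate(keywords):
        for w in words:
            index[w].add(doc_id)
    neighbors = []
    for doc_id, words in enumerate(keywords):
        shared = defaultdict(int)
        for w in words:
            for other in index[w]:
                if other != doc_id:
                    shared[other] += 1
        ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
        neighbors.append(ranked[:k])
    return neighbors

def group(neighbors, min_score):
    """Part (b), ASSUMED rule: merge documents linked by a score of at least min_score."""
    parent = list(range(len(neighbors)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for doc_id, ranked in enumerate(neighbors):
        for other, score in ranked:
            if score >= min_score:
                parent[find(doc_id)] = find(other)
    clusters = defaultdict(list)
    for doc_id in range(len(neighbors)):
        clusters[find(doc_id)].append(doc_id)
    return sorted(clusters.values())

# Each document is represented by a set of its best keywords (hypothetical data).
keywords = [
    {"printer", "jam", "paper"},
    {"printer", "jam", "tray"},
    {"screen", "blank", "power"},
    {"screen", "blank", "cable"},
    {"printer", "offline", "network"},
]
neighbors = top_neighbors(keywords, k=2)
tight = group(neighbors, min_score=2)
loose = group(neighbors, min_score=1)
```

In this sketch, raising min_score yields more, tighter clusters suited to detecting near-duplicates, while lowering it yields fewer, broader clusters suited to summarization. The number of clusters is never specified in advance; it follows from the neighbor scores.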
The lightweight procedure of the present invention operates efficiently in high dimensions and is effective in directly producing clusters that have objective similarity. Unlike k-means clustering, the number of clusters is dynamically determined, and similarity is based on nearest-neighbor distance, not mean feature distance. The document clustering method of the present invention thus maintains the key advantage of hierarchical clustering techniques, their compatibility with information retrieval methods, and maintains performance for large numbers of documents.