1. Field of Invention
The present invention relates generally to the field of data clustering. More specifically, the present invention is related to model selection for improving document clustering.
2. Discussion of Prior Art
Unsupervised learning is an attempt to determine the intrinsic structure in data and is often viewed as finding clusters in a data set. Clustering is an important tool in the analysis of data, with applications in several domains such as psychology, the humanities, clinical diagnosis, pattern recognition, and information retrieval. Model selection in clustering, that is, determining how to adjust a number of model parameters, has proven to be particularly challenging. There is therefore a need for a system that performs clustering in different feature spaces.
The following references describe prior art in the field of data clustering. The prior art described below does not, however, relate to the present invention's method of model selection via a unified objective function whose arguments include the feature space and the number of clusters.
U.S. Pat. No. 5,819,258 discloses a method and apparatus for automatically generating hierarchical categories from large document collections. Vaithyanathan et al. provide for a top-down document clustering approach wherein clustering is based on extracted features derived from one or more tokens. U.S. Pat. No. 5,857,179, also by Vaithyanathan et al., provides for a computer method and apparatus for clustering documents and automatically generating cluster keywords, and further teaches a document represented by an M-dimensional vector, wherein the vectors in turn are clustered.
U.S. Pat. No. 5,787,420 provides for a method of ordering document clusters without requiring knowledge of user interests. Tukey et al. teach a document cluster ordering based on similarity between clusters. U.S. Pat. No. 5,787,422, also by Tukey et al., provides for a method and apparatus for information access employing overlapping clusters and suggests document clustering based on a corpus of documents.
U.S. Pat. No. 5,864,855 provides for a parallel document clustering process wherein a document is converted to a vector and compared with clusters.
In addition, U.S. Pat. Nos. 5,873,056, 5,844,991, 5,442,778, 5,483,650, 5,625,767, and 5,808,615 provide general teachings relating to prior art document clustering methods.
An article by Rissanen et al. entitled “Unsupervised Classification With Stochastic Complexity”, published in the US/Japan Conference on the Frontiers of Statistical Modeling, 1992, discloses that postulating too many parameters leads to overfitting, thereby distorting the density of the underlying data.
An article by Kontkanen et al. entitled “Comparing Bayesian Model Class Selection Criteria by Discrete Finite Mixtures”, published in the Proceedings of the ISIS '96 Conference, suggests the difficulty in choosing an “optimal” order associated with clustering applications. An article by Smyth entitled “Clustering Using Monte Carlo Cross-Validation”, published in Knowledge Discovery in Databases, 1996, makes a similar point to the reference by Kontkanen et al.
An article by Ghosh-Roy et al. entitled “On-line Legal Aid: Markov Chain Model for Efficient Retrieval of Legal Documents”, published in Image and Vision Computing, 1998, teaches data clustering and clustered searching.
An article by Chang et al. entitled “Integrating Query Expansion and Conceptual Relevance Feedback for Personalized Web Information Retrieval” suggests key word extraction for cluster digesting and query expansion.
All the prior art discussed above has addressed model selection from the point of view of estimating the optimal number of clusters. This art fails to consider clustering within different feature spaces. Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention. They fail to provide for considering the interplay of both the number of clusters and the feature subset in evaluating clustering models. Without this consideration, the prior art also fails to provide an objective method of comparing two models in different feature spaces.
The present invention provides for a system for model selection in unsupervised learning with applications to document clustering. The system improves the determination of model structure by determining both the optimal number of clusters and the optimal feature set.
The problem of model selection to determine both the optimal number of clusters and the optimal feature set is analyzed in a Bayesian statistical estimation framework, and a solution is described via an objective function. Maximization of said objective function corresponds to an optimal model structure. A closed-form expression for a document clustering problem is also developed, together with heuristics that help find the optimum (or at least a sub-optimum) of the objective function in terms of feature sets and the number of clusters.
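By way of illustration only, the joint selection of a feature set and a number of clusters can be sketched as a search that maximizes a single objective over both arguments. The sketch below is an assumption-laden toy, not the closed-form expression or the heuristics of the invention: the names `select_model` and `toy_objective` are hypothetical, the search is a plain exhaustive enumeration (practical only for very small feature sets), and the toy objective is an arbitrary stand-in for the Bayesian objective function.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Sequence, Tuple

# Hypothetical signature: objective(feature_subset, number_of_clusters) -> score.
Objective = Callable[[FrozenSet[str], int], float]


def select_model(
    features: Sequence[str],
    cluster_counts: Sequence[int],
    objective: Objective,
) -> Tuple[float, FrozenSet[str], int]:
    """Return (score, feature_subset, k) maximizing the objective.

    Enumerates every non-empty feature subset and every candidate k.
    This is exponential in len(features) and is meant only to show that
    one objective is compared across different feature spaces.
    """
    best = None
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            candidate = frozenset(subset)
            for k in cluster_counts:
                score = objective(candidate, k)
                # Keep the first candidate that strictly improves the score.
                if best is None or score > best[0]:
                    best = (score, candidate, k)
    if best is None:
        raise ValueError("no candidate models to evaluate")
    return best


def toy_objective(subset: FrozenSet[str], k: int) -> float:
    """Arbitrary stand-in objective (assumption): peaks at {'a', 'b'}, k = 3."""
    target = frozenset({"a", "b"})
    # Penalize distance from k = 3 and any mismatch with the target subset.
    return -((k - 3) ** 2) - len(subset ^ target)


if __name__ == "__main__":
    score, subset, k = select_model(["a", "b", "c"], [2, 3, 4], toy_objective)
    print(sorted(subset), k, score)
```

In this sketch the same scoring function ranks models that live in different feature spaces, which is the property the objective function is intended to provide; a practical system would replace the exhaustive loops with heuristics of the kind referred to above.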