1. Field of the Invention
The present invention generally relates to computerized datasets and more particularly to a method and system for automatically categorizing, indexing, and classifying items within such datasets.
2. Description of the Related Art
K-means is a well-known algorithm for clustering (i.e., partitioning) a dataset of N numeric vectors, each of dimensionality M. The value K is an input parameter of the algorithm that determines the number of clusters (i.e., partitions) that the algorithm will produce at completion. In general, K-means, from a given starting point, finds a locally optimal way to cluster the dataset into K partitions so as to minimize the average distance between the mean of each cluster (the cluster centroid) and every member of that cluster. This distance is measured by some distance metric, such as Euclidean distance.
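The iteration described above can be illustrated with a minimal sketch. It assumes Euclidean distance, a random choice of K input vectors as the starting point, and the common alternation of assignment and centroid-update steps; the function names `euclidean` and `kmeans` and the parameters `max_iters` and `seed` are hypothetical and are used here for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two vectors of equal dimensionality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iters=100, seed=0):
    """Cluster N vectors of dimensionality M into k partitions.

    Returns (assignments, centroids), where assignments[i] is the index
    of the cluster that vectors[i] belongs to.
    """
    rng = random.Random(seed)
    # Starting point (an assumption of this sketch): k distinct input
    # vectors chosen at random serve as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each vector joins the nearest centroid's cluster.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no vector changed cluster, so a local optimum is reached
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

Because the result depends on the starting point, different values of `seed` may yield different local optima on less well-separated data.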
In the case of text datasets, the N vectors represent text documents, and each of the M dimensions corresponds to a keyword or phrase in a dictionary, with the value in that dimension reflecting the occurrence of the keyword or phrase in the document. The dictionary of keywords or phrases may be derived by counting the occurrences of all words and/or phrases in the text corpus and selecting those words and phrases that occur most often.
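As an illustration of this representation, the following sketch derives a dictionary of the M most frequent terms in a corpus and converts each document into a vector of occurrence counts. It is a simplified example under stated assumptions: only single words are considered (not phrases), tokens are lowercase runs of letters and digits, and the names `tokenize`, `build_dictionary`, and `vectorize` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(documents, m):
    """Return the m words that occur most often across the whole corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(m)]


def vectorize(documents, dictionary):
    """Map each document to a vector with one dimension per dictionary word.

    Each component holds the number of times the word occurs in the document.
    """
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in dictionary])
    return vectors
```

The resulting N vectors of dimensionality M are numeric and can therefore be passed to a clustering algorithm such as K-means.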