1. Field
Embodiments of the invention relate to clustering a document collection using an inverted index storing features.
2. Description of the Related Art
Clustering may be described as assignment of a set of observations into subsets, such that observations in the same subset are similar in some sense. Clustering may be performed bottom-up or top-down. With bottom-up clustering, each document in a set of documents is initially placed in its own cluster, and then two or more clusters are combined to form clusters with multiple documents. With top-down clustering, the documents in the set of documents are all initially placed into one cluster. Then, this cluster is broken up into smaller clusters, each having one or more documents.
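By way of illustration only, bottom-up clustering may be sketched as follows. The sketch assumes that each document has been reduced to a single numeric value and that the distance between two clusters is the smallest gap between any of their members (single linkage). The function name agglomerate and the parameter target_clusters are hypothetical and are not part of any particular embodiment.

```python
def agglomerate(values, target_clusters):
    """Bottom-up clustering sketch: merge the two closest clusters repeatedly.

    Assumes target_clusters >= 1 and that each "document" is one number.
    """
    # Start with every document in its own cluster.
    clusters = [[v] for v in values]
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (an assumption): smallest gap between members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Combine the two closest clusters into one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = agglomerate([1.0, 2.0, 10.0, 11.0, 50.0], target_clusters=3)
print(clusters)
```

Top-down clustering would proceed in the opposite direction, starting from one cluster that holds all five values and splitting it. Note that each merge step above examines every pair of clusters.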
Several different kinds of distance measures may be used to determine how similar two documents are and, therefore, whether the two documents should be placed into the same cluster. Examples of distance measures include Euclidean, Manhattan, and Mahalanobis distance.
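As a minimal sketch, two of the distance measures named above may be computed as follows. The sketch assumes that each document is represented as a fixed-length list of numeric feature values (for example, term counts); the vectors doc_a and doc_b are made-up illustrative data. Mahalanobis distance is omitted because it additionally requires a covariance matrix estimated from the collection.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical feature vectors for two documents.
doc_a = [3.0, 0.0, 1.0]
doc_b = [0.0, 4.0, 1.0]

print(euclidean(doc_a, doc_b))  # 5.0
print(manhattan(doc_a, doc_b))  # 7.0
```

A smaller distance indicates that the two documents are more similar under the chosen measure.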
K-means clustering may be described as cluster analysis that attempts to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Hierarchical agglomerative clustering may be described as merging of clusters based on their proximity to each other.
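For illustration, k-means clustering may be sketched as alternating between assigning each observation to the nearest mean and recomputing each mean. The sketch below makes several simplifying assumptions that are not drawn from any particular implementation: the first k points serve as the initial means, a fixed number of iterations is run, and squared Euclidean distance is used. The function name kmeans and the sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """K-means sketch: assign each point to the nearest mean, then update means."""
    # Assumption: seed the means with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: each mean becomes the average of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1]
print(centroids)   # [[0.0, 0.5], [10.0, 10.5]]
```

Each iteration compares every observation against every mean, so the cost of one pass grows with the product of n and k.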
Conventional clustering techniques are computationally expensive. Conventional clustering typically involves comparing every pair of documents in the collection. For a collection of n documents, this takes at least O(n^2) time to complete. Even for single-pass algorithms, such as leader clustering, the worst case takes O(n^2) time to complete. Thus, it is difficult to perform clustering for hundreds of thousands or even millions of documents in a reasonable amount of time. Even linear time clustering techniques can take an unreasonable amount of time to complete.
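The worst-case behavior of a single-pass technique may be illustrated with a sketch of leader clustering that counts comparisons. The sketch assumes one-dimensional numeric documents and a hypothetical similarity threshold; the function name leader_cluster and the test data are illustrative only. When no document falls within the threshold of any existing leader, every document becomes a new leader, and the number of comparisons grows as n(n-1)/2.

```python
def leader_cluster(values, threshold):
    """Single-pass leader clustering sketch that also counts comparisons."""
    leaders = []   # first member of each cluster
    clusters = []  # clusters[i] holds the members led by leaders[i]
    comparisons = 0
    for v in values:
        for idx, leader in enumerate(leaders):
            comparisons += 1
            if abs(v - leader) <= threshold:
                # Close enough to an existing leader: join that cluster.
                clusters[idx].append(v)
                break
        else:
            # No leader was close enough: this document starts a new cluster.
            leaders.append(v)
            clusters.append([v])
    return clusters, comparisons


# Worst case: 100 documents, none within the threshold of another.
spread = [float(i * 10) for i in range(100)]
_, worst = leader_cluster(spread, threshold=1.0)
print(worst)  # 4950, i.e. 100 * 99 / 2

# Favorable case: 100 identical documents all join the first leader.
tight = [0.0] * 100
_, favorable = leader_cluster(tight, threshold=1.0)
print(favorable)  # 99
```

At one million documents, the same worst case would require roughly 5 x 10^11 comparisons.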
Clustering is typically unsupervised. That is, the document clustering/grouping is performed without using any guidance/supervision (e.g., without some example documents that have been labeled with accurate group/cluster memberships). Another issue with clustering is generating a good description of the clusters (i.e., clearly describing what similarities put the documents into the same cluster).
Thus, there is a need for an improved clustering technique.