1. Field
This disclosure relates to analysis and evaluation of objects such as computer readable documents and files to determine whether and the extent to which they are related, and clustering or grouping the related objects, documents, and files.
2. Description of the Related Art
Clustering is the process of grouping together objects with similar features so that the similarity among the objects (for example, documents) within a group is greater than the similarity between objects in different groups. The generic term for something that may be analyzed for relatedness is an object. A group of objects may be analyzed and related objects may be grouped into clusters. Documents are an example of the kind of objects that may be clustered using the techniques described herein.
There are a variety of ways to define similarity, but one way is to count the number of words that overlap between each pair of documents. The more words they have in common, the more likely they are to be about the same thing.
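By way of illustration only, the word-overlap measure described above may be sketched as follows. The function names `tokenize` and `word_overlap`, the tokenization rule (lowercase runs of letters and digits), and the sample documents are assumptions made for this example and are not part of any particular clustering tool.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(doc_a: str, doc_b: str) -> int:
    """Count the distinct words that two documents have in common."""
    return len(tokenize(doc_a) & tokenize(doc_b))


# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

print(word_overlap(docs[0], docs[1]))  # 3 (the, cat, on)
print(word_overlap(docs[0], docs[2]))  # 0
```

In this sketch the first two documents share three words and are therefore treated as more likely to concern the same subject than the first and third, which share none.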
If the documents to be clustered are represented as vectors, then cosine similarity, dot product, or Euclidean distance metrics can be used. Distance is the inverse or complement of similarity: the more similar a pair of documents is, the smaller the distance between them. That is, they are closer. In the vector space model, each word in the vocabulary is represented by a position in the vector. If a word is present in the document, then the corresponding element of the vector is set to a nonzero value. If the word is absent from a particular document, then the corresponding element of the vector is set to 0.
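The vector space model described above may be sketched as follows. This is a minimal example under stated assumptions: documents are split on whitespace, present words are marked with the value 1 (any nonzero weight could be used instead), and the helper names `build_vocabulary`, `to_vector`, `cosine_similarity`, and `euclidean_distance` are hypothetical.

```python
import math


def build_vocabulary(docs: list[str]) -> dict[str, int]:
    """Assign each distinct word a fixed position in the vector."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: position for position, word in enumerate(words)}


def to_vector(doc: str, vocab: dict[str, int]) -> list[int]:
    """Set the element for each word present to 1; absent words stay 0."""
    vector = [0] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector


def cosine_similarity(u: list[int], v: list[int]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def euclidean_distance(u: list[int], v: list[int]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical sample documents.
docs = ["the cat sat on the mat", "the cat lay on the rug"]
vocab = build_vocabulary(docs)
u, v = (to_vector(doc, vocab) for doc in docs)

print(cosine_similarity(u, v))   # about 0.6: 3 shared words, 5 distinct words each
print(euclidean_distance(u, v))  # 2.0: the vectors differ in 4 positions
```

Note that the length of each vector equals the size of the whole vocabulary, so most elements are 0 for any one document.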
Many clustering tools build hierarchical clusters of documents. Some organize the documents into a fixed number of clusters (e.g., K-means clustering). In many cases, the clustering algorithms start with a randomly selected set of documents to serve as seeds for each cluster. Then each additional document is put into the same cluster as the most similar one already processed. The organization of the clusters may depend on exactly which documents were randomly chosen and may change from one run of the clustering algorithm to the next.
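The seeded assignment described above may be sketched as follows. This is a simplified illustration rather than a full K-means implementation. The use of Jaccard similarity over word sets, and the names `jaccard` and `seeded_clustering`, are assumptions made for the example.

```python
import random


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by total distinct words (assumed similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def seeded_clustering(docs: list[str], k: int, rng: random.Random) -> list[int]:
    """Pick k random seed documents, then place each remaining document
    in the cluster of the most similar document already processed."""
    word_sets = [set(doc.lower().split()) for doc in docs]
    order = list(range(len(docs)))
    rng.shuffle(order)

    labels: dict[int, int] = {}
    for cluster_id, index in enumerate(order[:k]):
        labels[index] = cluster_id  # each seed starts its own cluster

    for index in order[k:]:
        nearest = max(
            labels, key=lambda other: jaccard(word_sets[index], word_sets[other])
        )
        labels[index] = labels[nearest]

    return [labels[i] for i in range(len(docs))]


# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
    "the dog sat on the rug",
]

# Different random seeds may select different seed documents and
# therefore may produce a different organization of the clusters.
for seed in (1, 2, 3):
    print(seed, seeded_clustering(docs, k=2, rng=random.Random(seed)))
```

Because the seed documents and the processing order are chosen at random, the resulting cluster assignments are reproducible only when the random number generator is initialized identically.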
Commonly used procedures for constructing clusters can be categorized as either agglomerative or partitional. In the agglomerative approach, clusters start small, typically with only one document each. Clusters are then built by adding documents to existing clusters. The partitional approach typically starts with a single cluster containing all of the documents, and each cluster is thereafter subdivided to make new, smaller clusters.
Documents can be joined to a cluster based on single linkage, in which the distance to the cluster is measured as the distance to the closest element of the cluster; complete linkage, which measures the distance to the farthest element of the cluster; or average linkage, which measures the distance to the centroid or average member of the cluster.
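The three linkage criteria may be sketched as follows for a document represented as a point and a cluster represented as a list of points. The function names and the sample coordinates are assumptions made for this example, and Euclidean distance is assumed as the underlying metric.

```python
import math

Point = tuple[float, ...]


def single_linkage(point: Point, cluster: list[Point]) -> float:
    """Distance from the point to the closest element of the cluster."""
    return min(math.dist(point, member) for member in cluster)


def complete_linkage(point: Point, cluster: list[Point]) -> float:
    """Distance from the point to the farthest element of the cluster."""
    return max(math.dist(point, member) for member in cluster)


def average_linkage(point: Point, cluster: list[Point]) -> float:
    """Distance from the point to the centroid (average member) of the cluster."""
    dimensions = len(cluster[0])
    centroid = [
        sum(member[d] for member in cluster) / len(cluster)
        for d in range(dimensions)
    ]
    return math.dist(point, centroid)


# Hypothetical cluster of two points and a candidate point.
cluster = [(1.0, 0.0), (5.0, 0.0)]
candidate = (0.0, 0.0)

print(single_linkage(candidate, cluster))    # 1.0
print(complete_linkage(candidate, cluster))  # 5.0
print(average_linkage(candidate, cluster))   # 3.0
```

The same candidate is thus at a different distance from the same cluster depending on which linkage criterion is chosen, which in turn can change the cluster it joins.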
The most commonly used clustering algorithms are limited in their usefulness by their computational complexity. Many clustering algorithms take time O(n^2) or O(n^3), with many iterations through the data to cluster the n documents. This time requirement makes them impractical for use in large data sets.
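The quadratic growth may be illustrated by counting the document pairs that must be compared when every document is compared with every other document once. The function name `pairwise_comparisons` is hypothetical, and the sketch counts comparisons only; it does not measure the running time of any particular algorithm.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Count the unordered pairs among n documents, which is n * (n - 1) / 2."""
    return sum(1 for _ in combinations(range(n), 2))


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n))
# 10 45
# 100 4950
# 1000 499500
```

Each tenfold increase in the number of documents increases the number of comparisons roughly one hundredfold, and an algorithm that repeats such a pass many times grows faster still.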
Another limitation of most clustering algorithms is that they are designed to put every document into a cluster. The goal is to put every document into the nearest, most similar cluster, but the more distant a document is from the cluster's center, the less like the cluster it is. This is most obvious in cluster schemes that start with all documents in a single cluster. At that point cluster membership provides almost no information about the content of a document. K-means clustering is designed to put all documents into a set of k clusters, where k is determined before clustering begins. It is often not obvious what the right number of clusters should be.
The two biggest contributors to the limitations of traditional clustering algorithms are the representation of documents as vectors and the need to iterate repeatedly over a document set to handle the cluster assignments. The clustering methods described herein overcome the limitations imposed by these two factors.