Document clustering is the partitioning of a given collection of documents into homogeneous groups, each of which shares one or more identifiable characteristics, such as a common topic. Unsupervised clustering is required when these characteristics are not explicitly annotated, which is the case in the majority of practical situations. This type of document grouping is of interest in many applications, from search by content to language modeling for speech recognition.
Cluster analysis is a fundamental tool in pattern recognition, and many clustering algorithms are available. They fall roughly into two categories: 1) hierarchical clustering; and 2) K-means clustering and self-organizing maps. Hierarchical clustering methods are popular because of their simplicity. Both top-down and bottom-up (also referred to as agglomerative) variants are available. Top-down approaches start with a single cluster encompassing the entire collection, and recursively split the data into increasingly smaller sub-clusters. In contrast, bottom-up methods start with each observation in its own cluster and iteratively merge the two closest clusters into a larger one. In both cases, once the underlying tree structure is constructed, the data can be partitioned into any number of clusters by cutting the tree at the appropriate level. Three common options for hierarchical clustering are single linkage, average linkage, and complete linkage. These options differ in their definition of the distance between two clusters: single linkage uses the smallest distance between any pair of members drawn from the two clusters, complete linkage uses the largest, and average linkage uses the mean over all such pairs.
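The bottom-up procedure and the three linkage rules can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `cluster_distance` and `agglomerative`, the use of Euclidean distance, and the toy points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def cluster_distance(a, b, linkage="single"):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [dist(p, q) for p in a for q in b]
    if linkage == "single":      # smallest pairwise distance
        return min(pairwise)
    if linkage == "complete":    # largest pairwise distance
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # each observation starts in its own cluster
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with i < j
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # a merge is never undone
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, n_clusters=3, linkage="average")
```

Stopping the loop at `n_clusters` plays the role of cutting the tree at a chosen level. The sketch recomputes all pairwise cluster distances at every step, which is simple but far too slow for large collections.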
The K-means method starts with a random assignment of K points that function as cluster centers. Each data point is then assigned to one of these centers in a way that minimizes the sum of distances between all points and their centers. Improved positions for the cluster centers are then sought, and the algorithm iterates, alternating between assignment and repositioning until the centers stabilize. The algorithm converges quickly for good initial choices of the cluster centers. Self-organizing maps (SOM) are closely related to the K-means procedure. The K clusters resulting from the SOM method correspond to K representative points in a prespecified geometrical configuration, such as a rectangular grid. Data points are mapped onto the grid, and the positions of the representative points are iteratively updated in a manner that eventually places each one at a cluster center. Clusters that are close to each other in the initial arrangement tend to be more similar to each other than those that are farther apart.
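The K-means iteration described above can be sketched as follows. The sketch assumes the standard Lloyd-style variant: initial centers are drawn at random from the data, each point goes to its nearest center under Euclidean distance, and each center moves to the mean of its points (a choice that minimizes squared distances). The function name `kmeans`, its parameters, and the toy data are hypothetical.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, groups) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are data points
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group;
        # an empty group keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # no center moved, so the iteration has converged
            break
        centers = new_centers
    return centers, groups


# Hypothetical toy data: two well-separated groups on a line.
points = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
centers, groups = kmeans(points, k=2)
```

On data this well separated, the result does not depend on which points are drawn as initial centers. On realistic collections it generally does, which is why the choice of initial centers matters.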
Because each of the above techniques comes with its own caveats, clustering results vary greatly, even on the same collection. Hierarchical clustering methods share two inherent problems. First, decisions to join two elements are based solely on the distance between those elements, and once elements are joined they cannot be separated. This is a local decision-making scheme that does not consider the data as a whole, and it may lead to mistakes in the overall clustering. Second, for large data sets, the hierarchical tree is complex, and the choice of location for cutting the tree is unclear.
As for K-means clustering, the main issue is that the number of clusters, K, must be specified before the algorithm is run. For the vast majority of document collections, the number of clusters is not known in advance, and the final clustering depends heavily on the choice of K. Furthermore, K-means offers no guarantee on the quality of the clusters it forms. The SOM method likewise assumes that K is specified a priori. In addition, it requires the choice of an underlying geometry. Finally, all of the above techniques typically operate on continuous data, whereas in the case of document clustering the data is inherently discrete. Efficient methods for clustering documents have consequently been lacking.
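To make the discreteness of document data concrete, the following sketch builds integer term-count vectors for two short texts. It assumes a simple bag-of-words representation with whitespace tokenization, which is only one common choice and is not prescribed by the text. The helper `term_counts` and the example sentences are hypothetical.

```python
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


# Hypothetical two-document collection.
docs = ["stocks fell as markets closed", "markets rallied as stocks rose"]

# Shared vocabulary, sorted so that vector positions are stable.
vocab = sorted(set().union(*(term_counts(d) for d in docs)))

# One vector of non-negative integer counts per document.
vectors = [[term_counts(d)[t] for t in vocab] for d in docs]
```

Every entry of `vectors` is a non-negative integer count, and most entries are zero once the vocabulary grows, which is quite different from the continuous measurements that the methods above typically assume.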