1. Introduction
Clustering is a fundamental tool in unsupervised learning that is used to group together similar objects [2], and has practical importance in a wide variety of applications. Recent research on data clustering increasingly focuses on cluster ensembles [15, 16, 17, 6], which seek to combine multiple clusterings of a given data set to generate a final superior clustering. It is well known that different clustering algorithms or the same clustering algorithm with different parameter settings may generate very different partitions of the same data due to the exploratory nature of the clustering task. Therefore, combining multiple clusterings to benefit from the strengths of individual clusterings offers better solutions in terms of robustness, novelty, and stability [17, 8, 15].
Distributed data mining also demands efficient methods to integrate clusterings from multiple distributed sources of features or data. For example, a cluster ensemble can be employed in privacy-preserving scenarios where it is not possible to centrally collect all the features for clustering analysis because different data sources have different sets of features and cannot share that information with each other.
Clustering ensembles also have great potential in several recently emerged data mining fields, such as relational data clustering. Relational data typically have multi-type features. For example, a Web document has many different types of features, including content, anchor text, URL, and hyperlinks. It is difficult to cluster relational data using all multi-type features together. Clustering ensembles offer a solution: each feature type can be clustered separately and the resulting clusterings then combined.
Combining multiple clusterings is a more challenging task than combining multiple supervised classifications, since patterns are unlabeled and one must therefore solve a correspondence problem. This problem is difficult because the number and shape of the clusters provided by the individual solutions may vary with the clustering methods as well as with the particular view of the data presented to each method. Most approaches [15, 16, 17, 6] to combining clustering ensembles do not explicitly solve the correspondence problem. The re-labeling approach [14, 7] is an exception. However, it is not generally applicable, since it makes the simplistic assumption of a one-to-one correspondence between clusters.
Some early works on combining multiple clusterings were based on co-association analysis, which measures the similarity between each pair of objects by the frequency with which they appear in the same cluster across an ensemble. Kellam et al. [13] used the co-association matrix to find a set of so-called robust clusters with the highest value of support based on object co-occurrences. Fred [9] applied a voting-type algorithm to the co-association matrix to find the final clustering. Further work by Fred and Jain [8] determined the final clustering by applying a hierarchical (single-link) clustering algorithm to the co-association matrix. Strehl and Ghosh proposed Cluster-Based Similarity Partitioning (CSPA) in [15], which induces a graph from a co-association matrix and clusters it using the METIS algorithm [11]. The main problem with co-association based methods is their high computational complexity, which is quadratic in the number of data items, i.e., O(N^2).
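As a concrete illustration of the co-association idea, the following minimal Python sketch builds the matrix for a toy ensemble. The function name `co_association` and the toy label vectors are assumptions made for illustration only and are not taken from any of the cited works.

```python
def co_association(ensemble):
    """Return the N x N co-association matrix for a list of label vectors.

    ensemble: list of clusterings, each a list of N cluster labels.
    Entry [i][j] is the fraction of clusterings that place objects i and j
    in the same cluster.
    """
    n = len(ensemble[0])
    h = len(ensemble)
    counts = [[0] * n for _ in range(n)]
    for labels in ensemble:
        # The double loop over object pairs is the source of the
        # quadratic cost in the number of data items.
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / h for c in row] for row in counts]


# Toy ensemble (made up for illustration): 3 clusterings of 4 objects.
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
coassoc = co_association(ensemble)
```

Note that the first two clusterings use opposite label names for the same partition, yet they contribute identically to the matrix: co-association depends only on whether two objects share a cluster, which is how these methods sidestep the correspondence problem.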
Re-labeling approaches seek to directly solve the correspondence problem, which is exactly what makes combining multiple clusterings difficult. Dudoit [14] applied the Hungarian algorithm to re-label each clustering from a given ensemble with respect to a reference clustering. After an overall consistent re-labeling, voting can be applied to determine the cluster membership of each data item. Dimitriadou et al. [5] proposed a voting/merging procedure that combines clusterings pair-wise and iteratively. The correspondence problem is solved at each iteration, and fuzzy membership decisions are accumulated during the course of merging. The final clustering is obtained by assigning each object to the derived cluster with the highest membership value. A re-labeling approach is not generally applicable, since it assumes that the number of clusters in every given clustering is the same as in the target clustering.
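The re-label-then-vote idea can be sketched as follows. This is only an illustrative sketch under stated assumptions: every clustering uses the same k labels 0..k-1 (the very assumption criticized above), the first clustering serves as the reference, and an exhaustive search over label permutations stands in for the Hungarian algorithm, which would find the same optimal one-to-one mapping in polynomial time. The names `relabel` and `vote` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def relabel(labels, reference, k):
    """Rename the labels 0..k-1 in `labels` to best agree with `reference`."""
    # overlap[a][b] = number of objects labelled a here and b in the reference.
    overlap = [[0] * k for _ in range(k)]
    for a, b in zip(labels, reference):
        overlap[a][b] += 1
    # Brute-force search over one-to-one label mappings (stand-in for the
    # Hungarian algorithm; only practical for small k).
    best = max(permutations(range(k)),
               key=lambda p: sum(overlap[a][p[a]] for a in range(k)))
    return [best[a] for a in labels]


def vote(ensemble, k):
    """Align every clustering to the first one, then take a majority vote."""
    reference = ensemble[0]
    aligned = [relabel(labels, reference, k) for labels in ensemble]
    return [Counter(column).most_common(1)[0][0] for column in zip(*aligned)]


# Toy ensemble (made up for illustration): 3 clusterings of 6 objects, k = 2.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition as the first, labels swapped
    [0, 0, 1, 1, 1, 1],   # disagrees on the third object
]
consensus = vote(ensemble, k=2)
```

In this toy case the second clustering is mapped back onto the reference labels before voting, so the single disagreement in the third clustering is outvoted.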
Graph partitioning techniques have been used to solve the clustering combination problem under different formulations. The Meta-CLustering Algorithm (MCLA) [15] formulates each cluster in a given ensemble as a vertex and the similarity between two clusters as an edge weight. The induced graph is partitioned to obtain metaclusters, and the weights of the data items associated with the metaclusters are used to determine the final clustering. Strehl and Ghosh [15] also introduced the HyperGraph Partitioning Algorithm (HGPA), which represents each cluster as a hyperedge in a hypergraph whose vertices correspond to data items. A hypergraph partitioning algorithm, such as HMETIS [10], is then applied to generate the final clustering. Fern et al. [6] proposed the Hybrid Bipartite Graph Formulation (HBGF), which formulates both the data items and the clusters of the ensemble as vertices in a bipartite graph. A partition of this bipartite graph partitions the data item vertices and the cluster vertices simultaneously, and the partition of the data items is given as the final clustering.
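To make the graph formulations more concrete, the sketch below constructs only the graph structures for a toy ensemble; the partitioning step itself (e.g., with METIS or HMETIS) is deliberately omitted. The vertex numbering scheme and the function names `bipartite_edges` and `hyperedges` are assumptions chosen for illustration, not the construction used in the cited implementations.

```python
def bipartite_edges(ensemble):
    """Return (n_items, n_cluster_vertices, edges) for a list of label vectors.

    Vertices 0..n_items-1 are data items; every cluster of every clustering
    gets its own vertex, numbered from n_items upward. An edge joins an item
    to each cluster that contains it.
    """
    n_items = len(ensemble[0])
    cluster_vertex = {}   # (clustering index, label) -> vertex id
    edges = []
    for r, labels in enumerate(ensemble):
        for i, lab in enumerate(labels):
            key = (r, lab)
            if key not in cluster_vertex:
                cluster_vertex[key] = n_items + len(cluster_vertex)
            edges.append((i, cluster_vertex[key]))
    return n_items, len(cluster_vertex), edges


def hyperedges(edges):
    """Group the same incidences by cluster: one hyperedge (item set) each."""
    members = {}
    for item, cluster in edges:
        members.setdefault(cluster, set()).add(item)
    return members


# Toy ensemble (made up for illustration): 2 clusterings of 4 objects.
ensemble = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
n_items, n_clusters, edges = bipartite_edges(ensemble)
hyper = hyperedges(edges)
```

The same item-to-cluster incidences serve both views: read as edges they give a bipartite graph over items and clusters, and grouped by cluster they give one hyperedge per cluster over the item vertices. Clusterings with different numbers of clusters pose no difficulty here, since each cluster simply becomes its own vertex or hyperedge.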
Another common method for solving the clustering combination problem is to transform it into a standard clustering task by representing the given ensemble as a new set of features and then using a clustering algorithm to produce the final clustering. Topchy et al. [16] applied the k-means algorithm in a new binary feature space that is specially transformed from the cluster labels of a given ensemble. They also showed that this procedure is equivalent to maximizing the quadratic mutual information between the empirical probability distribution of labels in the consensus clustering and the labels in the ensemble. In [17], a mixture model of multinomial distributions is used to cluster the data in the feature space induced by the cluster labels of a given ensemble. The final clustering is found as a solution to the corresponding maximum likelihood problem using the EM algorithm.
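The feature-based view can be illustrated with the short sketch below. It uses a plain indicator (one-hot) encoding of the cluster labels followed by basic Lloyd-style k-means iterations; this encoding is an assumption made for illustration and may differ from the specific transformation used in [16]. The function names, the deterministic choice of starting centers, and the toy ensemble are likewise hypothetical.

```python
def one_hot_features(ensemble):
    """Encode each object as concatenated indicator vectors, one per clustering."""
    columns = []  # one (clustering index, label) pair per binary feature
    for r, labels in enumerate(ensemble):
        for lab in sorted(set(labels)):
            columns.append((r, lab))
    n = len(ensemble[0])
    return [[1.0 if ensemble[r][i] == lab else 0.0 for r, lab in columns]
            for i in range(n)]


def kmeans(points, init_centers, n_iter=20):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in init_centers]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [min(range(len(centers)),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(p, centers[c])))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy ensemble (made up for illustration): 3 clusterings of 6 objects.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
features = one_hot_features(ensemble)
# Starting centers are chosen by hand (first and last object) for determinism.
consensus = kmeans(features, init_centers=[features[0], features[-1]])
```

Each object becomes a binary vector with exactly one 1 per clustering, so no label correspondence between clusterings is needed; the price is that the final result inherits the usual sensitivities of the chosen clustering algorithm, such as the dependence of k-means on initialization.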
To summarize, the problem of combining multiple clusterings has been approached from combinatorial, graph-based, and statistical perspectives. However, there has been insufficient research on the core problem of combining multiple clusterings, namely the general correspondence problem. The main trend in recent research is to reduce the original problem to a new clustering task that can be solved by an existing clustering algorithm, such as hierarchical clustering, graph partitioning, k-means, or model-based clustering. However, this procedure brings back the problems resulting from the exploratory nature of the clustering task, such as the problem of robustness. Moreover, the heuristic nature of this procedure makes it difficult to develop a unified and solid theoretical framework for ensemble clustering [3].