1. Introduction
Most clustering approaches in the literature focus on “flat” data in which each data object is represented as a fixed-length attribute vector [38]. However, many real-world data sets are much richer in structure, involving objects of multiple types that are related to each other, such as documents and words in a text corpus, Web pages, search queries and Web users in a Web search system, and shops, customers, suppliers, shareholders and advertisement media in a marketing system.
In general, relational data contain three types of information, attributes for individual objects, homogeneous relations between objects of the same type, heterogeneous relations between objects of different types. For example, for a scientific publication relational data set of papers and authors, the personal information such as affiliation for authors are attributes; the citation relations among papers are homogeneous relations; the authorship relations between papers and authors are heterogeneous relations. Such data violate the classic independently and identically distributed (IID) assumption in machine learning and statistics and present huge challenges to traditional clustering approaches. An intuitive solution is that we transform relational data into flat data and then cluster each type of objects independently. However, this may not work well due to the following reasons.
First, the transformation causes the loss of relation and structure information [14]. Second, traditional clustering approaches are unable to tackle influence propagation in clustering relational data, i.e., the hidden patterns of different types of objects could affect each other both directly and indirectly (pass along relation chains). Third, in some data mining applications, users are not only interested in the hidden structure for each type of objects, but also interaction patterns involving multi-types of objects. For example, in document clustering, in addition to document clusters and word clusters, the relationship between document clusters and word clusters is also useful information. It is difficult to discover such interaction patterns by clustering each type of objects individually.
Moreover, a number of important clustering problems, which have been of intensive interest in the literature, can be viewed as special cases of relational clustering. For example, graph clustering (partitioning) [7, 42, 13, 6, 20, 28] can be viewed as clustering on singly-type relational data consisting of only homogeneous relations (represented as a graph affinity matrix); co-clustering [12, 2] which arises in important applications such as document clustering and micro-array data clustering, can be formulated as clustering on bi-type relational data consisting of only heterogeneous relations.
Recently, semi-supervised clustering [46, 4] has attracted significant attention, which is a special type of clustering using both labeled and unlabeled data. Therefore, relational data present not only huge challenges to traditional unsupervised clustering approaches, but also great need for theoretical unification of various clustering tasks.
2. Related Work
Clustering on a special case of relational data, bi-type relational data consisting of only heterogeneous relations, such as the word-document data, is called co-clustering or bi-clustering. Several previous efforts related to co-clustering are model based [22, 23]. Spectral graph partitioning has also been applied to bi-type relational data [11, 25]. These algorithms formulate the data matrix as a bipartite graph and seek to find the optimal normalized cut for the graph.
Due to the nature of a bipartite graph, these algorithms have the restriction that the clusters from different types of objects must have one-to-one associations. Information-theory based co-clustering has also attracted attention in the literature. [12] proposes a co-clustering algorithm to maximize the mutual information between the clustered random variables subject to the constraints on the number of row and column clusters. A more generalized co-clustering framework is presented by [2] wherein any Bregman divergence can be used in the objective function. Recently, co-clustering has been addressed based on matrix factorization. [35] proposes an EM-like algorithm based on multiplicative updating rules.
Graph clustering (partitioning) clusters homogeneous data objects based on pairwise similarities, which can be viewed as homogeneous relations. Graph partitioning has been studied for decades and a number of different approaches, such as spectral approaches [7, 42, 13] and multilevel approaches [6, 20, 28], have been proposed. Some efforts [17, 43, 21, 21, 1] based on stochastic block modeling also focus on homogeneous relations. Compared with co-clustering and homogeneous-relation-based clustering, clustering on general relational data, which may consist of more than two types of data objects with various structures, has not been well studied in the literature. Several noticeable efforts are discussed as follows. [45, 19] extend the probabilistic relational model to the clustering scenario by introducing latent variables into the model; these models focus on using attribute information for clustering. [18] formulates star-structured relational data as a star-structured m-partite graph and develops an algorithm based on semi-definite programming to partition the graph. [34] formulates multi-type relational data as K-partite graphs and proposes a family of algorithms to identify the hidden structures of a k-partite graph by constructing a relation summary network to approximate the original k-partite graph under a broad range of distortion measures.
The above graph-based algorithms do not consider attribute information. Some efforts on relational clustering are based on inductive logic programming [37, 24, 31]. Based on the idea of mutual reinforcement clustering, [51] proposes a framework for clustering heterogeneous Web objects and [47] presents an approach to improve the cluster quality of interrelated data objects through an iterative reinforcement clustering process. There are no sound objective function and theoretical proof on the effectiveness and correctness (convergence) of the mutual reinforcement clustering. Some efforts [26, 50, 49, 5] in the literature focus on how to measure the similarities or choosing cross-relational attributes.
To summarize, the research on relational data clustering has attracted substantial attention, especially in the special cases of relational data. However, there is still limited and preliminary work on general relational data clustering.