A. Field of the Invention
The invention relates to the field of clustering data objects stored in a data processing system, particularly in a relational database.
B. Related Art
Clustering on multi-type relational data has attracted increasing attention in recent years because of its impact on important applications such as Web mining, e-commerce and bioinformatics. Clustering of objects in a relational database is useful in that it makes searching more efficient. Bo Long, Zhongfei (Mark) Zhang and Philip S. Yu, “Co-clustering by Block Value Decomposition,” KDD 2005, discusses a method for clustering data objects referred to as block value decomposition. This method has some limitations when applied to relationships between multiple types of data, because it focuses on only one matrix relating data objects.
Most clustering approaches in the literature focus on “flat” data in which each data object is represented as a fixed-length feature vector (R. O. Duda et al., 2000). However, many real-world data sets are much richer in structure, involving objects of multiple types that are related to each other, such as Web pages, search queries and Web users in a Web search system, and papers, key words, authors and conferences in a scientific publication domain. In such scenarios, using traditional methods to cluster each type of object independently may not work well, for the following reasons.
First, to make use of relation information under the traditional clustering framework, the relation information needs to be transformed into features. In general, this transformation causes information loss and/or very high dimensional and sparse data. For example, if the relations of Web pages to Web users and to search queries are represented as features of the Web pages, each Web page receives a huge number of features with sparse values. Second, traditional clustering approaches are unable to tackle the interactions among the hidden structures of different types of objects, since they cluster data of a single type based on static features. Note that the interactions could pass along the relations, i.e., there exists influence propagation in multi-type relational data. Third, in some machine learning applications, users are interested not only in the hidden structure for each type of object, but also in the global structure involving multiple types of objects. For example, in document clustering, in addition to the document clusters and word clusters themselves, the relationship between document clusters and word clusters is also useful information. It is difficult to discover such global structures by clustering each type of object individually.
Spectral clustering (Ng et al., 2001; Bach & Jordan, 2004) has been well studied in the literature. Spectral clustering methods based on graph partitioning theory focus on finding the best cuts of a graph, that is, the cuts that optimize certain predefined criterion functions. The optimization of the criterion functions usually leads to the computation of singular vectors or eigenvectors of certain graph affinity matrices. Many criterion functions, such as the average cut (Chan et al., 1993), the average association (Shi & Malik, 2000), the normalized cut (Shi & Malik, 2000), and the min-max cut (Ding et al., 2001), have been proposed.
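To make the notion of a criterion function concrete, the following minimal sketch evaluates the normalized cut of a two-way partition, using the usual definition Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V). The affinity matrix `W`, the helper names and the four-node graph are illustrative assumptions and do not come from the cited works. The sketch only scores given partitions; it does not perform the eigenvector computation that spectral methods use to search for a good partition.

```python
def cut_weight(W, A, B):
    """Total affinity on edges from node set A to node set B."""
    return sum(W[i][j] for i in A for j in B)


def normalized_cut(W, A, B):
    """Normalized cut value of the two-way partition (A, B)."""
    all_nodes = range(len(W))
    cut = cut_weight(W, A, B)
    assoc_a = cut_weight(W, A, all_nodes)  # affinity from A to every node
    assoc_b = cut_weight(W, B, all_nodes)  # affinity from B to every node
    return cut / assoc_a + cut / assoc_b


# Hypothetical symmetric affinity matrix: nodes 0-1 and nodes 2-3 are
# strongly linked pairs, joined to each other only by weak edges.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]

good = normalized_cut(W, [0, 1], [2, 3])  # separates only weak edges
bad = normalized_cut(W, [0, 2], [1, 3])   # separates the strong edges
print(good < bad)
```

A partition that severs only the weak edges receives the smaller value, which is the behavior a cut-based criterion is designed to reward.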
Spectral graph partitioning has also been applied to a special case of multi-type relational data, namely bi-type relational data such as word-document data (Dhillon, 2001; H. Zha & H. Simon, 2001). These algorithms formulate the data matrix as a bipartite graph and seek the optimal normalized cut for the graph. Due to the nature of a bipartite graph, these algorithms have the restriction that the clusters from different types of objects must have one-to-one associations.
Clustering on bi-type relational data is called co-clustering or bi-clustering. Recently, co-clustering has been addressed based on matrix factorization. Both Long et al. (2005) and Li (2005) model co-clustering as an optimization problem involving a triple matrix factorization. Long et al. (2005) propose an EM-like algorithm based on multiplicative updating rules and Li (2005) proposes a hard clustering algorithm for binary data. Ding et al. (2005) extend non-negative matrix factorization to symmetric matrices and show that it gives the same results as kernel K-means and Laplacian-based spectral clustering. Several previous efforts related to co-clustering are model based. PLSA (Hofmann, 1999) is a method based on a mixture decomposition derived from a latent class model. A two-sided clustering model is proposed for collaborative filtering by Hofmann and Puzicha (1999). Information-theory based co-clustering has also attracted attention in the literature. El-Yaniv and Souroujon (2001) extend the information bottleneck (IB) framework (Tishby et al., 1999) to repeatedly cluster documents and then words. Dhillon et al. (2003) propose a co-clustering algorithm to maximize the mutual information between the clustered random variables subject to the constraints on the number of row and column clusters. A more generalized co-clustering framework is presented by Banerjee et al. (2004) wherein any Bregman divergence can be used in the objective function.
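As an illustration of the triple matrix factorization view, a data matrix X can be approximated by a product R B C, where R assigns rows to row clusters, C assigns columns to column clusters, and B holds one value per pair of row cluster and column cluster. The sketch below covers only the hard-clustering special case in which R and C are cluster indicators, so that B reduces to block means. It is not the multiplicative-update algorithm of Long et al. (2005) nor the algorithm of Li (2005); the function names and the toy matrix are assumptions made for illustration.

```python
def block_values(X, row_labels, col_labels, k, l):
    """Mean of X over each (row cluster, column cluster) block: the matrix B."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            p, q = row_labels[i], col_labels[j]
            sums[p][q] += x
            counts[p][q] += 1
    return [
        [sums[p][q] / counts[p][q] if counts[p][q] else 0.0 for q in range(l)]
        for p in range(k)
    ]


def reconstruction_error(X, row_labels, col_labels, B):
    """Squared error between X and its block approximation R B C."""
    return sum(
        (x - B[row_labels[i]][col_labels[j]]) ** 2
        for i, row in enumerate(X)
        for j, x in enumerate(row)
    )


# Hypothetical document-by-word count matrix with two clear blocks.
X = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
]
cols = [0, 0, 1, 1]

B_good = block_values(X, [0, 0, 1, 1], cols, 2, 2)
err_good = reconstruction_error(X, [0, 0, 1, 1], cols, B_good)

B_bad = block_values(X, [0, 1, 0, 1], cols, 2, 2)
err_bad = reconstruction_error(X, [0, 1, 0, 1], cols, B_bad)
print(err_good < err_bad)
```

Because B may have non-zero entries off its diagonal, this formulation can express associations between any row cluster and any column cluster, in contrast to the one-to-one restriction of bipartite graph partitioning noted above.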
Compared with co-clustering, clustering on general relational data, which may consist of more than two types of data objects, has not been well studied in the literature. Several notable efforts are discussed as follows. Taskar et al. (2001) extend the probabilistic relational model to the clustering scenario by introducing latent variables into the model. Gao et al. (2005) formulate star-structured relational data as a star-structured m-partite graph and develop an algorithm based on semi-definite programming to partition the graph. Like bipartite graph partitioning, this approach has the limitation that the clusters from different types of objects must have one-to-one associations, and it fails to consider the feature information.
An intuitive idea for clustering multiple types of interrelated objects is mutual reinforcement clustering. The idea works as follows: start with initial cluster structures of the data; derive the new reduced features from the clusters of the related objects for each type of object; based on the new features, cluster each type of object with a traditional clustering algorithm; go back to the second step until the algorithm converges. Based on this idea, Zeng et al. (2002) propose a framework for clustering heterogeneous Web objects and Wang et al. (2003) present an approach to improve the cluster quality of interrelated data objects through an iterative reinforcement clustering process. However, mutual reinforcement clustering has no sound objective function and no theoretical proof of its effectiveness and correctness (convergence).
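The four steps of mutual reinforcement clustering can be sketched for the bi-type case as follows. This is a minimal illustration under stated assumptions, not the framework of Zeng et al. (2002) or Wang et al. (2003): a simple K-means with deterministic seeding stands in for the "traditional clustering algorithm", the reduced feature of an object is taken to be the sum of its relation values over each cluster of the other type, and all function names and the toy matrix are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain K-means; seeded with the first k distinct points (an assumption)."""
    centers = []
    for p in points:
        if list(p) not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def reduced_features(rows, other_labels, k):
    """Step 2: summarize each object by its total relation to each cluster
    of the related object type."""
    feats = []
    for row in rows:
        f = [0.0] * k
        for x, lab in zip(row, other_labels):
            f[lab] += x
        feats.append(f)
    return feats


def mutual_reinforcement(X, k_rows, k_cols, init_col_labels, max_rounds=10):
    Xt = [list(col) for col in zip(*X)]
    row_labels, col_labels = None, list(init_col_labels)  # step 1
    for _ in range(max_rounds):
        # Steps 2 and 3 for rows, then for columns.
        new_rows = kmeans(reduced_features(X, col_labels, k_cols), k_rows)
        new_cols = kmeans(reduced_features(Xt, new_rows, k_rows), k_cols)
        if new_rows == row_labels and new_cols == col_labels:
            break  # step 4: labels stopped changing
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels


# Hypothetical document-by-word matrix and a deliberately poor initial
# word clustering.
X = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
]
doc_labels, word_labels = mutual_reinforcement(X, 2, 2, [0, 0, 0, 1])
print(doc_labels, word_labels)
```

The loop stops when the labels no longer change or when `max_rounds` is reached. Nothing in the procedure guarantees that it reaches a good solution, which reflects the absence of an objective function noted above; a different initial clustering can leave the iteration stuck at a poor fixed point.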
To summarize, research on multi-type relational data clustering has attracted substantial attention, especially for special cases of relational data. However, work on general relational data remains limited and preliminary.
See also, U.S. Published Patent Application Nos. 20070067281; 20060270918; 20060235812; 20060200431; 20060179021; 20060080059; 20060050984; 20060045353; 20060015263; 20050286774; 20050285937; 20050278352; 20050270285; 20050251532; 20050246321; 20050149230; 20050141769; 20050110679; 20040267686; 20040054542; 20030081833; 20020116196, each of which is expressly incorporated herein by reference. See also, U.S. Pat. Nos. 7,006,944; 6,895,115; 6,070,140; 5,806,029; 5,794,192; 5,664,059; 5,590,242; 5,479,572; 5,274,737, each of which is expressly incorporated herein by reference. See e.g., Spectral Clustering, ICML 2004 Tutorial by Chris Ding crd.lbl.gov/˜cding/Spectral/, crd.lbl.gov/˜cding/Spectral/notes.html, Spectral Clustering www.ms.washington.edu/˜spectral/ (and papers cited therein), A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” In Advances in Neural Information Processing Systems 14: Proceedings of the 2001”, citeseer.ist.psu.edu/ng01spectral.html; Francis R. Bach, Michael I. Jordan. Learning spectral clustering, Advances in Neural Information Processing Systems (NIPS) 16, 2004, cmm.ensmp.fr/˜bach/nips03_cluster.pdf (See also cmm.ensmp.fr/˜bach/); Ulrike von Luxburg, Max Planck Institute for Biological Cybernetics Technical Report No. TR-149, “A Tutorial on Spectral Clustering,” August 2006 www.kyb.mpg.de/publications/attachments/Luxburg06_TR_%5B0%5D.pdf.