The present invention relates to methods and systems for partitioning, or clustering, data into subsets of related data and, in particular, to a method and system for projecting a graph onto a clique graph that has applications in clustering gene expression patterns, methylation profiles, hybridization signals, and other types of experimental and observational data in scientific and technical fields and in economics.
A frequently encountered problem in processing data generated by scientific experimentation and by scientific and economic observation is that of partitioning data into related data subsets. For example, stock market analysts attempt to identify groups of stocks that rise and fall in price together in response to various cycles and trends. The observed data are the prices of each stock over a period of time, and the partitioning, or clustering, problem is one of grouping the stocks into related subsets of stocks that exhibit similar price behaviors. As another example, molecular biologists use large molecular arrays to monitor the expression of genes in organisms over time and in response to various biological perturbations. One object in such studies is to identify groups, or clusters, of genes that all have similar expression patterns. Often, indications of the function of a gene product can be gleaned from determining that the expression of the corresponding gene is similar to the expression of a known gene. For example, the observation that an unknown gene is always expressed immediately following expression of the p53 gene may indicate that the unknown gene product is somehow related to apoptosis.
The general class of problems exemplified in the previous paragraph is referred to as cluster analysis. The goal of cluster analysis is to partition entities into groups, called clusters, so that clusters are homogeneous and well-separated. There is an extensive literature on cluster analysis going back over two decades, including the following three titles: (1) R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, NY, 1973; (2) B. Everitt, Cluster Analysis, Edward Arnold, London, Third Edition, 1993; and (3) B. Mirkin, Mathematical Classification and Clustering, Kluwer Academic Publishers, 1996. There are many different approaches to defining desirable solutions to cluster analysis and to interpreting those solutions, and there are many different types of clustering that may be identified by cluster analysis. Most formulations of the problem yield NP-hard problems. Therefore, many of the approaches emphasize heuristics and approximation. Many of the approaches to cluster analysis, particularly in the field of clustering gene expression patterns, utilize hierarchical methods in which phylogenetic trees are constructed using Euclidean distance metrics for evaluating the relatedness of the different expression patterns of various genes. Euclidean distance metrics are but a small subset of the relatedness metrics that might be employed in clustering data, but clustering methods often depend on using a particular type of metric. In many of these approaches, prior assumptions concerning the nature of the underlying clustering within the data are required in order to constrain a search for clusters. Many of these methods often converge on local minima, rather than identifying the optimal clustering patterns within the data according to some predefined measure of optimality.
Scientists, economists, and data analysts have therefore recognized the need for a method and system that can be applied to data in order to partition the data into related subsets, where the relatedness of the data can be specified by arbitrary methods. In addition, the need for an efficient method for identifying clustering within data that does not rely on prior assumptions about the data, including such things as the maximum number of clusters, a preferred cluster size, and other such constraints, has been recognized. Moreover, scientists, economists, and data analysts have recognized the need for an algorithm that has a high probability of determining an optimal or near-optimal partitioning of data into related data subsets, rather than converging too quickly on less-than-optimal partitionings.
The present invention provides a method and system for partitioning data into related data subsets. In one embodiment of the present invention, the method and system takes, as inputs, a data set, a similarity matrix that specifies the relatedness of the data points, or entities, within the data set, and a cutoff value that divides relatedness values into low affinity relatedness values and high affinity relatedness values. The method and system iteratively constructs successive clusters of related data subsets until all data points, or entities, are assigned to a cluster. Initially, all the data points, or entities, are unassigned. Unassigned data points are candidates for assignment to the cluster that is being constructed in a given iterative step. During each iterative step, data points assigned to the currently constructed cluster may be removed from the cluster and returned to the pool of candidates. During each iterative step, the method and system may alternate between choosing high affinity candidates and assigning them to the currently constructed cluster, and removing low affinity data points from the currently constructed cluster and returning the removed data points to the candidate pool.
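The iterative add-and-remove procedure described above can be illustrated with a minimal Python sketch. The sketch is not a definitive implementation of the embodiment; several details are assumptions made only for illustration: the name cluster_by_affinity is hypothetical, the affinity of a data point is taken to be the sum of its similarities to the current cluster members, a point is treated as high affinity when that sum is at least the cutoff multiplied by the number of members it is compared against, each cluster is seeded with the lowest-numbered unassigned point, and a max_steps cap guards against endless alternation.

```python
def affinity(point, cluster, similarity):
    """Sum of similarities between `point` and the other cluster members (assumed definition)."""
    return sum(similarity[point][member] for member in cluster if member != point)


def cluster_by_affinity(similarity, cutoff, max_steps=10_000):
    """Partition points 0..n-1 into clusters using a similarity matrix and a cutoff.

    Hypothetical sketch: seeding rule, affinity definition, and max_steps are assumptions.
    """
    unassigned = set(range(len(similarity)))
    clusters = []
    while unassigned:
        seed = min(unassigned)              # assumed seeding rule
        cluster = {seed}
        candidates = unassigned - cluster   # pool of unassigned candidates
        for _ in range(max_steps):
            changed = False
            # Add step: move the highest-affinity candidate into the cluster.
            if candidates:
                best = max(sorted(candidates),
                           key=lambda p: affinity(p, cluster, similarity))
                if affinity(best, cluster, similarity) >= cutoff * len(cluster):
                    cluster.add(best)
                    candidates.remove(best)
                    changed = True
            # Remove step: return the lowest-affinity member to the candidate pool.
            if len(cluster) > 1:
                worst = min(sorted(cluster),
                            key=lambda p: affinity(p, cluster, similarity))
                if affinity(worst, cluster, similarity) < cutoff * (len(cluster) - 1):
                    cluster.remove(worst)
                    candidates.add(worst)
                    changed = True
            if not changed:
                break                       # cluster is stable; close it
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters


# Illustrative similarity matrix (made-up values) with two evident groups.
example_similarity = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.90],
    [0.20, 0.10, 0.10, 0.90, 1.00],
]
example_clusters = cluster_by_affinity(example_similarity, cutoff=0.5)
print(example_clusters)
```

In this sketch the only inputs are the similarity matrix and the cutoff; no cluster count or cluster size is supplied, and every point ends up in exactly one cluster because each pass closes a cluster that contains at least one point.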
Because the relatedness criteria are input as a similarity matrix, the method and system representing this embodiment of the present invention makes no assumptions about, and does not rely on, the metrics employed to generate the relatedness criteria. This embodiment of the present invention does not require specification of any additional constraints, such as preferred cluster sizes or a preferred number of clusters, in order to efficiently and effectively partition the data. Finally, because data points, or entities, may be alternately added and removed during the construction of a given cluster, the method and system is far less prone to converge on sub-optimal partitionings than currently available systems.
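The independence from any particular metric can be illustrated by computing a similarity matrix for the same data with two different relatedness measures. The following Python sketch is offered only as an example under stated assumptions: the function names pearson, inverse_euclidean, and similarity_matrix are hypothetical, the two measures are merely examples of metrics that might be chosen, and the expression profiles are made-up values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def inverse_euclidean(x, y):
    """Similarity in (0, 1] derived from Euclidean distance (an assumed transform)."""
    return 1.0 / (1.0 + math.dist(x, y))


def similarity_matrix(profiles, measure):
    """Apply any pairwise measure to every pair of profiles."""
    return [[measure(p, q) for q in profiles] for p in profiles]


# Made-up profiles: the second has the same shape as the first at twice the scale,
# and the third is the reverse of the first.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
by_correlation = similarity_matrix(profiles, pearson)
by_distance = similarity_matrix(profiles, inverse_euclidean)
```

Either matrix could be supplied as the similarity matrix input; a suitable cutoff value would depend on the range of values that the chosen measure produces.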