Knowledge discovery is the most desirable end product of data collection. Recent advancements in database technology have led to an explosive growth in systems and methods for generating, collecting and storing vast amounts of data. While database technology enables efficient collection and storage of large data sets, facilitating human comprehension of the information in this data is becoming ever more difficult. With many existing techniques the problem has become unapproachable. Thus, there remains a need for a new generation of automated knowledge discovery tools.
As a specific example, the Human Genome Project has completed sequencing of the human genome. The complete sequence contains a staggering amount of data, with approximately 31,500 genes in the whole genome. The amount of data relevant to the genome must then be multiplied when considering the comparative and other analyses that are needed in order to make use of the sequence data. As an illustration, human chromosome 20 alone comprises nearly 60 million base pairs. Several disease-causing genes have been mapped to chromosome 20, including genes associated with various autoimmune diseases, certain neurological diseases, type 2 diabetes, several forms of cancer, and more, such that considerable information can be associated with this sequence alone.
One of the more recent advances in determining the functioning parameters of biological systems is the correlation of genomic information with protein function to elucidate the relationship between gene expression, protein function and interaction, and disease states or progression. Proteomics is the study of the group of proteins encoded and regulated by a genome. Genomic activation or expression does not always translate directly into changes in protein production levels or activity. Alternative processing of mRNA, or post-transcriptional or post-translational regulatory mechanisms, may cause the activity of one gene to result in multiple proteins, each slightly different, with different migration patterns and biological activities. The human proteome is believed to be 50 to 100 times larger than the human genome. Currently, there are no methods, systems or devices for adequately analyzing the data generated by such biological investigations into the genome and proteome.
Clustering is a widely used approach for exploratory data analysis. Clustering analysis is unsupervised learning: it is done without suggestion from an external supervisor, and classes and training examples are not given a priori. The objective of clustering is to group data points into “meaningful” subsets. There is no agreed-upon definition of the clustering problem, and various definitions appear in the literature. For example, clustering has been defined as a search for some “natural” or “inherent” grouping of the data. However, most clustering algorithms do not address this problem. The vast majority of clustering algorithms produce as their output either a dendrogram or a partition into a number of clusters, where the number of clusters is either an input or is controlled by one or more other parameters. In either case, a model selection technique is required in order to choose the model parameter or, in the case of hierarchical algorithms, to determine which level of the dendrogram represents the “inherent” structure of the data.
A few examples of applications of clustering include (1) analysis of microarray data, where co-expressed genes are found, and the assumption is that co-expression might be a sign of co-regulation; (2) analysis of medical datasets (gene expression data, clinical data, etc.), where patients are divided into categories; (3) analysis of any set of measurements to detect trends or artifacts in the measurement protocol; and (4) information retrieval, where text is partitioned according to categories.
Most clustering algorithms either produce a hierarchical partitioning of the data into smaller and smaller clusters, or produce a partition of a dataset into a number of clusters that depends on some input parameter (the number of clusters or one or more other parameters). The question remains, however, of how to set the input parameter, or how to determine which level of the tree representation of the data to look at. Clustering algorithms are unsophisticated in that they provide no insight into the level of granularity at which the “meaningful” clusters might be found. Occasionally, there may be prior knowledge about the domain that facilitates making such a choice. However, even in such cases, a method for determining the granularity at which to look at the data is required. This is seen as the problem of finding the optimal number of clusters in the data, relative to some clustering algorithm.
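By way of illustration only, the following minimal sketch shows why a hierarchical algorithm leaves the choice of granularity open. It assumes a plain single-linkage agglomerative procedure on a small set of two-dimensional points; the function names `single_linkage_levels` and `cut`, and the sample data, are hypothetical and are not taken from any particular prior art system. The hierarchy contains one partition for every possible number of clusters, and nothing in its output indicates which level is the “meaningful” one.

```python
from itertools import combinations
from math import dist


def single_linkage_levels(points):
    """Return every level of a single-linkage hierarchy.

    levels[i] is a partition (a list of sorted index lists) containing
    len(points) - i clusters, from all singletons down to one cluster.
    """
    clusters = [[i] for i in range(len(points))]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Pick the two clusters whose closest members are nearest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        levels.append([c[:] for c in clusters])
    return levels


def cut(levels, k):
    """Select the level of the hierarchy that has exactly k clusters."""
    return levels[len(levels) - k]


# Hypothetical data: two tight, well-separated groups of three points.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
levels = single_linkage_levels(POINTS)

# The algorithm yields a partition for every k; the caller must choose k.
for k in range(1, len(POINTS) + 1):
    print(k, sorted(cut(levels, k)))
```

In this sketch the call `cut(levels, 2)` recovers the two groups only because the caller supplied the value 2; the procedure itself offers no reason to prefer that level over any other.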
E. Levine and E. Domany in “Resampling Method for Unsupervised Estimation of Cluster Validity”, Neural Comp. 13, 2573-2593 (2001), assign a figure of merit to a clustering solution according to its similarity to clusterings of subsamples of the data. The “temperature” parameter of their clustering algorithm is selected according to a maximum of the similarity measure. However, in real data, such a maximum does not often occur. Other model selection techniques have difficulty detecting the absence of structure in the data, i.e., that there is a single cluster. Further, many algorithms make assumptions as to cluster shape, and do not perform well on real data, where the cluster shape is generally not known. Accordingly, other methods for clustering are needed. The present invention is directed to such a method.
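By way of illustration only, the general idea of scoring a clustering by its similarity to clusterings of subsamples can be sketched as follows. This is not the figure of merit or the clustering algorithm of Levine and Domany. As stated assumptions, the sketch substitutes a simple distance-threshold clustering (connected components of points lying within `radius` of one another) for their algorithm, with `radius` playing the role of the resolution parameter, and it uses plain pairwise co-membership agreement as the similarity measure. All function names and data are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def threshold_clusters(points, radius):
    """Label points by connected components of the 'within radius' graph."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def resampling_score(points, radius, n_resamples=20, fraction=0.7, seed=0):
    """Average agreement between the full clustering and subsample clusterings.

    For each random subsample, count the fraction of point pairs whose
    'same cluster / different cluster' status matches the full clustering.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    full = threshold_clusters(points, radius)
    size = max(2, int(fraction * len(points)))
    scores = []
    for _ in range(n_resamples):
        idx = rng.sample(range(len(points)), size)
        sub = threshold_clusters([points[i] for i in idx], radius)
        pairs = list(combinations(range(size), 2))
        agree = sum(
            (sub[a] == sub[b]) == (full[idx[a]] == full[idx[b]])
            for a, b in pairs
        )
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores)


# Hypothetical data: two well-separated unit squares.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

for radius in (1.0, 2.0, 20.0):
    print(radius, resampling_score(POINTS, radius))
```

In this sketch a radius of 2.0 separates the two groups and scores 1.0, but a radius of 20.0, which merges every point into a single cluster, also scores 1.0. A raw similarity score of this kind therefore need not exhibit a distinct maximum at the meaningful granularity.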