In the year 2001 the first draft of the sequence of the human genome was completed. As of November 2003, over 164 complete genomes had been published, including mouse, fruit fly, pufferfish and yeast. This wealth of knowledge provides researchers with fundamental tools for preventing or treating diseases, which in many cases are considered to be caused or exacerbated by the simultaneous action of different genes.
In traditional genomic study, a single gene was studied at a time. However, genomic research is nowadays directed towards the development of technologies that allow a parallel analysis of thousands of genes at a time.
The so-called “gene-chips” or “DNA-chips” are extraordinary tools for studying patterns of gene expression. Gene-chips are large arrays of nucleic acid probes, arranged in matrix format on a surface such as a microscope slide. Gene-chips can contain from hundreds to hundreds of thousands of such probes. Thus, the devices are called “arrays” or “microarrays,” and with future advances in array printing even “nanoarrays” will become practical.
These recent advances allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis is the detection of groups of genes that have similar expression patterns. The corresponding algorithmic problem is to group or “cluster” gene expression patterns and correlate these with a variety of different parameters, such as time, drug response, disease status, patient, and the like.
Modern data mining technology can handle all three primary learning tasks: classification, regression, and clustering. Of these, clustering is used most commonly in the data mining of genomic information, and several clustering algorithms are available.
In the general parlance of bioinformatics, a clustering problem consists of n elements and a characteristic vector for each element. A measure of similarity is defined between pairs of such vectors. In gene expression, the elements are genes, the vector of each gene contains its expression levels under each of the conditions, and similarity can be measured, for example, by the correlation coefficient between vectors. The goal is to partition the elements into subsets, which are called clusters, so that two criteria are satisfied: 1) homogeneity, whereby elements inside a cluster are highly similar to each other; and 2) separation, whereby elements from different clusters have low similarity to each other.
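The two criteria above can be illustrated with a minimal sketch, assuming the Pearson correlation coefficient as the similarity measure and a partition that is already given. The gene names, the expression values, and the helper functions `pearson` and `homogeneity_and_separation` are hypothetical and chosen only for illustration; they are not taken from any of the packages named in this section.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def homogeneity_and_separation(profiles, clusters):
    """Return (mean within-cluster similarity, mean between-cluster similarity)."""
    # Map each gene to the index of the cluster that contains it.
    label = {gene: i for i, members in enumerate(clusters) for gene in members}
    within, between = [], []
    for g1, g2 in combinations(profiles, 2):
        s = pearson(profiles[g1], profiles[g2])
        (within if label[g1] == label[g2] else between).append(s)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical expression levels of four genes under five conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [9.8, 8.1, 6.0, 4.2, 1.9],
}
# Assumed partition: rising profiles in one cluster, falling ones in the other.
clusters = [{"geneA", "geneB"}, {"geneC", "geneD"}]

homogeneity, separation = homogeneity_and_separation(profiles, clusters)
print(f"homogeneity={homogeneity:.3f}, separation={separation:.3f}")
```

A good partition gives a high first value and a low second value. In this toy data the between-cluster similarity is strongly negative because the two groups of profiles move in opposite directions; whether anticorrelated genes should count as dissimilar is itself a modeling choice.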
There is a very rich literature on cluster analysis going back over three decades. Several algorithmic techniques have been used in clustering gene expression data, including hierarchical clustering, such as Cluster Identification via Connectivity Kernels (CLICK) or the divisive hierarchical algorithm called DIANA, model-based approaches such as the Bayesian Infinite Mixture Model (IMM), and mixed approaches such as a finite Gaussian mixture model-based hierarchical clustering algorithm from MCLUST. There are also iterative approaches such as k-means and Cluster Affinity Search Technique (CAST). There are other approaches such as simulated annealing, self-organizing maps (SOM), and graph theoretic approaches. There are even several publicly available software packages for cluster analysis, including MCLUST, Vera and SAM, KNNimpute, dCHIP and the BioConductor project.
However, a limitation of the known data mining techniques is that they cannot identify groups or sequences of genes by simultaneously applying a plurality of properly weighted criteria for grouping according to gene expression over time and a variety of specific properties of particular interest.