Clustering and gene selection from microarray gene expression data have gained tremendous importance, since they help in identifying the genes that play pivotal roles in specified biological conditions that lead to diseased states.
In microarray sampling, RNA samples are hybridized to known cDNA or oligonucleotide probes on the arrays. Normally either spotted microarrays or oligonucleotide microarrays are used, and the user can choose the probes to be spotted according to the specific experimental needs. In oligonucleotide microarrays, short DNA oligonucleotides that match parts of the sequence of known mRNAs are spotted onto the array. They give an estimate of the absolute values of gene expression.
Clustering algorithms operating on microarray gene expression data can help in identifying co-regulated and co-expressed genes under certain specified conditions, and thereby help in identifying genes that can classify test samples into diseased and normal classes. Many clustering algorithms have been developed, including the K-means algorithm, the self-organizing map (SOM) algorithm, hierarchical cluster algorithms, bi-cluster algorithms, etc. Clustering algorithms use the differences between the gene expression values to cluster the genes. The difference is expressed in terms of a distance measure, and conventionally the Euclidean distance and Pearson's correlation coefficient are used to calculate the similarity between two genes. However, these types of distance measures have some limitations relating to similarities in profile shape and sensitivity to outliers; moreover, the number of clusters has to be specified initially.
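The two conventional measures named above can be compared on small expression profiles. The following is a minimal sketch in Python, not taken from any cited work; the function names and the three example profiles are hypothetical and chosen only to illustrate the profile-shape and outlier limitations.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values over four samples (illustrative only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same profile shape as gene_a, ten times the scale
gene_c = [1.0, 2.0, 3.0, 40.0]     # gene_a with a single outlying measurement

# Same shape, different scale: large Euclidean distance, correlation near 1.
dist_ab = euclidean_distance(gene_a, gene_b)
corr_ab = pearson_correlation(gene_a, gene_b)

# A single outlier changes both measures noticeably.
dist_ac = euclidean_distance(gene_a, gene_c)
corr_ac = pearson_correlation(gene_a, gene_c)

print(f"a vs b: distance={dist_ab:.2f}, correlation={corr_ab:.2f}")
print(f"a vs c: distance={dist_ac:.2f}, correlation={corr_ac:.2f}")
```

In this sketch the Euclidean distance treats gene_a and gene_b as far apart although their profiles have the same shape, while one outlying value in gene_c is enough to pull the correlation with gene_a well below 1.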
At least some of these limitations are addressed by the Attribute Clustering Algorithm (ACA), as published by Au et al. in IEEE/ACM Transactions on Computational Biology and Bioinformatics 2(2), p. 83-101 (2005). The ACA essentially uses the K-means algorithm concept. However, the distance measure employed is an information-theoretic measure, a so-called interdependence redundancy measure, that takes into account the interdependence between genes.
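An information-theoretic measure of this kind can be sketched as follows. The sketch assumes that expression values have already been discretized into levels, and that the interdependence redundancy between two genes is the mutual information normalized by the joint entropy; this definition and all names in the code are assumptions made for illustration, not a reproduction of the published ACA.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in bits) of a distribution given as a Counter of occurrences."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def interdependence_redundancy(x, y):
    """Assumed form: R(x, y) = I(x; y) / H(x, y) for two discretized profiles."""
    n = len(x)
    h_x = entropy(Counter(x), n)
    h_y = entropy(Counter(y), n)
    h_xy = entropy(Counter(zip(x, y)), n)
    if h_xy == 0.0:
        # Both profiles are constant, so there is no information to share.
        return 0.0
    mutual_information = h_x + h_y - h_xy
    return mutual_information / h_xy


# Hypothetical discretized expression levels over four samples.
gene_x = ["low", "low", "high", "high"]
gene_y = ["high", "high", "low", "low"]  # exact opposite of gene_x
gene_z = ["low", "high", "low", "high"]  # statistically independent of gene_x

r_xy = interdependence_redundancy(gene_x, gene_y)
r_xz = interdependence_redundancy(gene_x, gene_z)

print(f"R(x, y) = {r_xy:.2f}")
print(f"R(x, z) = {r_xz:.2f}")
```

Under these assumptions the measure lies between 0 and 1. It scores gene_x and gene_y as fully interdependent even though their profiles are opposite, which a plain distance between expression values would not capture.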
Another type of algorithm used in bioinformatics is the Genetic Algorithm (GA), which is a search technique used to find true or approximate solutions to optimization problems. The genetic algorithm starts from a population of randomly generated individuals, i.e. possible solutions, and proceeds in successive generations to find better solutions. In each generation, every individual in the population is modified to form a new individual. The algorithm is iterative and terminates after a maximum number of generations or when a generation fulfils a given fitness criterion.
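The generational loop described above can be sketched on a toy problem. The sketch below is illustrative only: the bit-string encoding, the fitness function (number of ones), the mutation rate, and the survivor selection among parents and offspring are all assumptions, and crossover, which many genetic algorithms also use, is omitted for brevity.

```python
import random


def fitness(individual):
    """Toy fitness (assumption): the number of ones in the bit string."""
    return sum(individual)


def mutate(individual, rate, rng):
    """Form a new individual by flipping each bit with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def run_ga(pop_size=20, length=16, max_generations=200, target=16, rate=0.05, seed=0):
    """Return the best individual and the generation at which the loop stopped."""
    rng = random.Random(seed)
    # Start from a population of randomly generated individuals.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for generation in range(max_generations + 1):
        best = max(population, key=fitness)
        # Terminate when the fitness criterion is met or the generation limit is reached.
        if fitness(best) >= target or generation == max_generations:
            return best, generation
        # Every individual in the population is modified to form a new individual.
        offspring = [mutate(ind, rate, rng) for ind in population]
        # Assumed selection step: keep the fittest among parents and offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]


best, generations = run_ga()
print(f"best fitness {fitness(best)} after {generations} generations")
```

Because the fittest individuals among parents and offspring are retained, the best fitness in this sketch never decreases from one generation to the next; other selection schemes do not give that guarantee.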
While a number of methods have been found to help in identifying candidate genes that can be used as classifiers for given biological conditions, there is still a need in the art for alternative solutions that give further insight into the complexity of understanding biological conditions based on gene data.