DNA microarray technologies have made it possible to monitor the expression levels of thousands of genes simultaneously during various cellular processes [1, 2]. In a typical experiment, the expression levels of thousands of genes are recorded over a few tens of different samples [3, 5, 6]. By "sample" we mean any kind of living matter that is being tested, such as different tissues [3], cell populations collected at different times [4] and so forth. Hence arrays that contain 10^5 to 10^6 measurements must be analyzed, giving rise to a new computational challenge: to make sense of such massive amounts of expression data [7, 8].
The aims of such analyses are typically to (a) identify cellular processes that affect the gene expression pattern; (b) search for different phases of these processes, by grouping the samples into clusters that share an expression pattern; (c) find genes that differentiate between these clusters, and hence take part in the relevant biological process; and (d) explain the role these genes play in the process.
The size and complexity of the datasets call for multivariate clustering techniques, which are essential for extracting correlated patterns from the swarm of data points in multidimensional space. The aim of clustering is to identify the natural classes present in a set of N data points, or objects, each represented by D different measured features. That is, the data can be viewed as N points in a D-dimensional space. Clustering algorithms seek to reveal the structure of this cloud of points: for example, to determine whether the data consist of a single cloud or several clouds, or whether the constituent components have internal structure that is revealed when the data are viewed at higher resolution. Under most circumstances it is the data points that must be partitioned into clusters; it makes no sense to try to divide the features that characterize the data points into classes.
The situation with gene microarray data is different, in that clustering analysis can be performed in two ways. The first views the n_s samples as the N objects to be clustered, with the expression levels of the n_g genes in a particular sample playing the role of the features that represent that sample as a point in a D = n_g dimensional space. The different phases of a cellular process emerge from grouping together samples with similar or related expression profiles. The other, no less natural, way looks for clusters of genes that act correlatively on the different samples. This view considers the N = n_g genes as the objects to be clustered, each represented by its expression profile, as measured over all the samples, as a point in a D = n_s dimensional space.
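The two views can be made concrete with a minimal sketch in Python. The small expression matrix below is invented purely for illustration, and the helper names `genes_as_objects` and `samples_as_objects` are hypothetical; the only point is that the same matrix yields two different point clouds depending on which axis is treated as the objects.

```python
# Toy expression matrix (made-up values): rows are genes, columns are samples.
expression = [
    [2.1, 1.9, 0.2, 0.1],  # gene 0
    [2.0, 2.2, 0.3, 0.2],  # gene 1
    [0.1, 0.3, 1.8, 2.0],  # gene 2
]

n_genes = len(expression)       # number of genes
n_samples = len(expression[0])  # number of samples


def genes_as_objects(matrix):
    """Return N = n_genes points, each living in D = n_samples dimensions."""
    return [list(row) for row in matrix]


def samples_as_objects(matrix):
    """Return N = n_samples points, each living in D = n_genes dimensions."""
    # zip(*matrix) transposes the matrix, turning columns into rows.
    return [list(column) for column in zip(*matrix)]


gene_points = genes_as_objects(expression)
sample_points = samples_as_objects(expression)

print(len(gene_points), "gene points of dimension", len(gene_points[0]))
print(len(sample_points), "sample points of dimension", len(sample_points[0]))
```

Either list of points can then be handed to any clustering routine; which one is used determines whether the resulting clusters are groups of genes or groups of samples.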
Gene microarray data are special in that both ways of looking at them have meaning and are of interest. Having realized this, Eisen et al. and Alon et al. applied such two-way clustering to data from experiments on the yeast cell cycle [4] and colon cancer [3]. However, they first clustered the samples and then the genes completely independently, with no coupling at all between the two clustering procedures. In principle, the two clustering operations could have been carried out in different places at different times; the results of one operation were not allowed to affect the other.
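The uncoupled character of this two-way procedure can be sketched as follows. This is only an illustration under stated assumptions: the data are invented, and the clustering routine is a simple distance-threshold (single-linkage style) grouping chosen for brevity, not the algorithm used in [3] or [4]. The function name `threshold_clusters` and the threshold value are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def threshold_clusters(points, threshold):
    """Label points so that any two points closer than `threshold`
    (directly or through a chain of close neighbors) share a label."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if euclidean(points[i], points[j]) < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Relabel representatives as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(points))]


# Toy expression matrix (made-up values): rows are genes, columns are samples.
expression = [
    [2.1, 1.9, 0.2, 0.1],
    [2.0, 2.2, 0.3, 0.2],
    [0.1, 0.3, 1.8, 2.0],
]

# The two operations share the input matrix but nothing else:
# neither result is used to inform the other.
gene_labels = threshold_clusters([list(r) for r in expression], threshold=1.0)
sample_labels = threshold_clusters([list(c) for c in zip(*expression)], threshold=1.0)

print("gene clusters:  ", gene_labels)
print("sample clusters:", sample_labels)
```

Note that the two calls could be run in either order, or on different machines, with identical results; this independence is exactly the lack of coupling described above.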
The current approach in the literature is to cluster the samples on the basis of as many genes as possible (the number used is usually limited by eliminating the genes with the weakest signals). Similarly, when clustering genes, there is a tendency to rely on features accumulated from as many samples as possible (even samples taken from different experiments [4]). The philosophy behind this approach may be termed "holistic", as it attempts to extract information from the larger, overall, complete picture.
However, this approach has a number of disadvantages. First, large amounts of data must be analyzed, which may require extensive resources, whether in human work hours, computational power or experimental procedures. Second, the signal-to-noise ratio may be quite poor, given the emphasis on analyzing the overall picture. Third, the actual points of interest may be obscured within the larger body of data. All of these drawbacks render currently available clustering techniques both less effective and less robust.