The use of computers in the modern world has provided people with an excess of data. Spotting patterns and trends in this data is important if much use is to be made of it, but is made difficult by the sheer quantity of data to be analyzed. Seeking such useful information amongst the data is often referred to as data mining, and it is performed usefully in such disparate areas as biotechnology (e.g. DNA experiments), chemical reaction and chemical process development, and the finance industry (e.g. consumer spending, foreign exchange rates, and stock market data).
The present invention was particularly developed with microarray data analysis in mind, but is also applicable in searching for patterns in other types of data.
In DNA experiments, a number of genes are exposed to a series of experimental conditions, or to one set of experimental conditions over a length of time, with gene expression data derived for each experimental condition or time. FIG. 1 schematically shows a typical approach, which is to use M different microarrays Array 1, Array 2, . . . , Array M, each of the same set of N genes, with each microarray representing the set of gene expression data for a particular experimental condition 1, 2, . . . , M, time period t1, t2, . . . , tM, or other condition. These different conditions give rise to different samples 1 to M.
Results from sets of microarrays are often provided in N×M data matrices of standardized expression levels, for instance as shown in FIG. 2(a). The rows represent the results for the individual genes and the columns the results for the individual samples. The entry eij is the standardized expression level of gene i in sample j.
The standardized expression levels are determined from the actual expression levels in the samples in any one of a number of known ways, e.g. using the ratio of the data to the expression level for the same gene in a control, using the log of such a ratio, using the ratio of the data to the sum of the data and the expression level for the same gene in a control, or using the difference between the data and the expression level for the same gene in a control. In the examples presented herein, the standardization that has been used is:
$$e_{ij} = \log \frac{\bar{R}_{ij}^{\text{feature}} - \bar{R}_{ij}^{\text{background}}}{\bar{G}_{ij}^{\text{feature}} - \bar{G}_{ij}^{\text{background}}},$$
where:
$\bar{R}_{ij}^{\text{feature}}$ and $\bar{G}_{ij}^{\text{feature}}$ are, respectively, the average red (cy5 dyes) and green (cy3 dyes) intensity levels of the data at point ij in a number of nominally identical and identically processed arrays; and
$\bar{R}_{ij}^{\text{background}}$ and $\bar{G}_{ij}^{\text{background}}$ are, respectively, the average red (cy5 dyes) and green (cy3 dyes) intensity levels at the same point ij computed from a background area or from a number of nominally identical and identically processed control arrays after the same processing.
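The standardization above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `standardized_expression`, the use of NumPy, the choice to return NaN where a background-corrected intensity is not positive, and the selectable logarithm are all assumptions made here, since the text does not state the base of the logarithm or how such points are handled.

```python
import numpy as np

def standardized_expression(r_feature, r_background, g_feature, g_background,
                            log=np.log):
    """Return e_ij = log((R_feature - R_background) / (G_feature - G_background)).

    All four inputs are arrays of averaged intensities with one entry per
    point ij. The base of the logarithm is an assumption: pass np.log2 or
    np.log10 to change it.
    """
    red = np.asarray(r_feature, dtype=float) - np.asarray(r_background, dtype=float)
    green = np.asarray(g_feature, dtype=float) - np.asarray(g_background, dtype=float)
    red, green = np.broadcast_arrays(red, green)

    # The log ratio is only defined when both corrected intensities are
    # positive; other points are marked NaN (an assumption of this sketch).
    valid = (red > 0) & (green > 0)
    e = np.full(red.shape, np.nan)
    e[valid] = log(red[valid] / green[valid])
    return e
```

With a base-2 logarithm, a point whose corrected red intensity is twice its corrected green intensity gives eij = 1, and the reverse case gives eij = -1.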
The expression level matrix of FIG. 2(a) is often converted into a visual array of varying levels of red (for larger eij) and green (for smaller eij) and mixtures thereof. A black and white printout of such an array is shown in FIG. 2(b).
A key step in the analysis of gene expression data is to discover groups of genes that share similar transcriptional characteristics. Clustering gene expression data into homogeneous groups is instrumental in functional annotation, tissue classification and motif identification. However, standard clustering methods, such as:
k-means (for instance as described in Tavazoie S, Hughes J D, Campbell M J, Cho R J, Church G M: Systematic determination of genetic network architecture. Nat Genet 1999, 22:281-285);
hierarchical clustering algorithms (for instance as described in Eisen M B, Spellman P T, Brown P O, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868); and
self-organizing maps (for instance as described in Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E S, Golub T R: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96:2907-2912),
have their limitations: they require that the related genes behave similarly across all measured samples. In many situations, however, an interesting cellular process is active only in a subset of the samples, or a single gene may participate in multiple pathways that may or may not be co-active across all samples. Also, when the data to be analyzed include many heterogeneous samples from many experiments, a clustering algorithm often cannot produce a satisfactory solution. To overcome such difficulties, biclustering is often used.
In gene expression data, a bicluster is a subset of genes exhibiting a consistent pattern over a subset of samples [Cheng, Y. and Church, G. M. (2000) Biclustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 93-103]. This means that biclustering performs clustering in the row and column dimensions simultaneously (when applied to a matrix such as the expression level matrix of FIG. 2(a)). There are a number of different bicluster patterns that are useful for gene expression data analysis, such as constant values, constant rows or columns, and coherent values.
Most existing biclustering algorithms work by making permutations of the data matrix and detecting sub-matrices within the data matrix, such that a merit function is optimized. A comprehensive survey [Madeira, S. C., and Oliveira, A. L. (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Computational Biology Bioinformatics, 1, 24-45] points out that different biclustering algorithms iteratively search for the best possible subgrouping of the data using data mining techniques. The general strategy in all these algorithms can be described as adding or deleting rows and/or columns in the data matrix in some optimal way such that an appropriate merit function is improved by the action.
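The general strategy can be illustrated with a minimal Python sketch. The merit function used here is the mean squared residue, which is one well-known choice and is zero for a perfectly additive sub-matrix; it is picked purely for illustration and is not stated in the text above to be the function used by any particular algorithm. The helper names `squared_residues`, `mean_squared_residue` and `greedy_delete`, the stopping threshold, and the minimum sizes are likewise hypothetical, and the loop is a simplified deletion-only search rather than a faithful reproduction of a published method.

```python
import numpy as np

def squared_residues(a):
    """Squared residue of every entry: (a_ij - row mean - column mean + overall mean)^2."""
    a = np.asarray(a, dtype=float)
    residue = (a - a.mean(axis=1, keepdims=True)
                 - a.mean(axis=0, keepdims=True) + a.mean())
    return residue ** 2

def mean_squared_residue(a):
    """Merit function: 0 for a perfectly additive sub-matrix, larger when less coherent."""
    return float(squared_residues(a).mean())

def greedy_delete(data, threshold, min_rows=2, min_cols=2):
    """Delete the worst-fitting row or column until the merit function is <= threshold.

    Returns the indices of the rows and columns that remain.
    """
    data = np.asarray(data, dtype=float)
    rows = np.arange(data.shape[0])
    cols = np.arange(data.shape[1])
    while True:
        sub = data[np.ix_(rows, cols)]
        if mean_squared_residue(sub) <= threshold:
            break
        can_drop_row = len(rows) > min_rows
        can_drop_col = len(cols) > min_cols
        if not (can_drop_row or can_drop_col):
            break  # nothing left to delete without going below the minimum size
        sq = squared_residues(sub)
        row_score = sq.mean(axis=1)  # average squared residue of each row
        col_score = sq.mean(axis=0)  # average squared residue of each column
        if can_drop_row and (not can_drop_col or row_score.max() >= col_score.max()):
            rows = np.delete(rows, row_score.argmax())
        else:
            cols = np.delete(cols, col_score.argmax())
    return rows, cols
```

Applied to a matrix whose first three rows follow an additive pattern and whose fourth row does not, the loop is expected to discard the fourth row and keep the rest.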
The above-mentioned Madeira and Oliveira review of recent literature on biclustering indicates that there are several classes of biclusters. Three major classes of these are:
(i) biclusters with constant values;
(ii) biclusters with constant values in rows or columns; and
(iii) biclusters with coherent values in rows or columns.
FIGS. 3(a) to 3(f) show several different types of biclusters:
FIG. 3(a) constant bicluster;
FIG. 3(b) constant rows;
FIG. 3(c) constant columns;
FIG. 3(d) coherent values with additive model, where each row or column can be obtained by adding a constant to another row or column;
FIG. 3(e) coherent values with multiplicative model, where each row or column can be obtained by multiplying another row or column by a constant value; and
FIG. 3(f) coherent values on columns with linear model, where each column can be obtained by multiplying another column by a constant value and then adding a constant.
The pattern in FIG. 3(f) is the most general of these, and all the other patterns, those of FIGS. 3(a) to 3(e), can be regarded as special cases of this general pattern.
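The relationship between the patterns can be made concrete with a short Python sketch in which every column of a bicluster is written as a scale factor times a common base column plus an offset. The function name `linear_bicluster` and all numeric values are illustrative assumptions and are not taken from FIGS. 3(a) to 3(f).

```python
import numpy as np

def linear_bicluster(base, scale, offset):
    """Build a bicluster whose column j equals scale[j] * base + offset[j]."""
    base = np.asarray(base, dtype=float).reshape(-1, 1)       # one value per row
    scale = np.asarray(scale, dtype=float).reshape(1, -1)     # one value per column
    offset = np.asarray(offset, dtype=float).reshape(1, -1)   # one value per column
    return base * scale + offset

base = [1, 2, 3]  # illustrative base column
patterns = {
    "constant":         linear_bicluster(base, [0, 0], [5, 5]),   # scale 0, equal offsets
    "constant rows":    linear_bicluster(base, [1, 1], [0, 0]),   # scale 1, offset 0
    "constant columns": linear_bicluster(base, [0, 0], [4, 7]),   # scale 0, free offsets
    "additive":         linear_bicluster(base, [1, 1], [0, 10]),  # scale 1, free offsets
    "multiplicative":   linear_bicluster(base, [1, 2], [0, 0]),   # free scales, offset 0
    "linear":           linear_bicluster(base, [1, 2], [0, 1]),   # free scales and offsets
}
```

Each of the first five entries is obtained from the linear model simply by fixing the scale factors, the offsets, or both, which is the sense in which they are special cases.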
The Madeira and Oliveira review classified existing biclustering algorithms according to the specific patterns the algorithms can detect. For example, the Double Conjugated Clustering (DCC) and block clustering algorithms are designed to detect constant values (FIG. 3(a)). The Coupled Two-Way Clustering (CTWC) and Gibbs algorithms focus on biclusters with constant rows or columns (FIG. 3(b) or 3(c)). The algorithm of Segal, E., Taskar, B., Gasch, A., Friedman, N. and Koller, D. (2001) Rich probabilistic models for gene expression. Bioinformatics, 17, 243-252, assumes the additive model (FIG. 3(d)), and Kluger, Y., Basri, R., Chang, J. T., and Gerstein, M. (2003) Spectral biclustering of microarray data: co-clustering genes and samples. Genome Research, 13, 703-716, develop an algorithm for the multiplicative model (FIG. 3(e)).
In these methods, the type of pattern that can be detected depends on the merit function used. Although it is possible to transform a bicluster of a different type (e.g., constant rows) into a reference type (such as a constant values bicluster), the necessary transformation is not known a priori for that bicluster. Determination of the appropriate transformation is further complicated by the presence of noise in the data.
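The difficulty can be illustrated with a small Python sketch. Row centering is used here as one possible transformation from a constant rows bicluster to a constant values bicluster; the function name `center_rows`, the data values and the assumed bicluster location are all hypothetical and chosen only for illustration.

```python
import numpy as np

def center_rows(block):
    """Subtract each row's mean; a noise-free constant rows block becomes all zeros."""
    block = np.asarray(block, dtype=float)
    return block - block.mean(axis=1, keepdims=True)

# Illustrative data: rows 0-1 and columns 0-2 hold a constant rows bicluster.
data = np.array([[4, 4, 4, 9, 1],
                 [7, 7, 7, 0, 2],
                 [3, 8, 1, 5, 6]], dtype=float)
rows, cols = [0, 1], [0, 1, 2]  # hypothetical bicluster location

# If the bicluster is already known, centering over its own columns works:
known = center_rows(data[np.ix_(rows, cols)])

# Without that knowledge, centering over all columns leaves the two rows at
# different levels, so the block is still not a constant values bicluster:
blind = center_rows(data)[np.ix_(rows, cols)]
```

The transformation only yields the reference type when it is computed over the bicluster's own columns, which is exactly the information that is not available before the bicluster has been found.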