Many data analysis applications require clustering, and in particular the clustering of data that exhibit the same or a similar trend. Some applications call for the clustering of aligned, relatively short units of data (a sequence of 3-8 values). Noise within the data is often problematic, and ever greater volumes of data must be analyzed. In general, an efficient and computationally inexpensive algorithm for trend-based clustering of similar data units is desired.
One such application for clustering data analysis comes from microarray technology, which has met with substantial commercial success over the past decade, in part because of its ability to quantify samples with high throughput. Thousands of genes can be examined concurrently under the same conditions. This allows for the identification of groups of co-expressed genes, which may be co-regulated. Genes exhibiting similar responses to triggers are more likely to be controlled by similar regulatory mechanisms. This is often referred to as the “guilt by association” principle [1]. Identifying coherent expression responses is important for identifying co-regulation and for understanding the underlying machinery driving the co-expression. One critical problem is to verify the common regulatory mechanism [1]. Therefore, a commonly repeated step in the analysis of gene expression is to identify those measured transcripts that appear to be correlated to each other. From a computational perspective, this is a clustering problem.
Clustering of co-expressed genes is an active data mining topic that has advanced in parallel with the development of microarray technology. There is a vast literature on clustering algorithms developed for microarray data analysis [1-16]. Pioneering works in this area identify full space clusters, but, in many applications, subspace clusters are more meaningful [8]. Biclustering algorithms were more recently proposed to find subgroups of genes that exhibit the same behavior across subsets of samples, experimental conditions, or time points [8-12]. It is now possible to collect expression levels of a set of genes under a set of biological samples during a series of time points. Such data have three dimensions, gene-sample-time (GST), and thus are called 3D microarray gene expression data [13]. The full space clustering and biclustering concepts do not satisfactorily take advantage of the 3D data collected, and do not fully extract the biological information hidden within the GST data. Triclustering has been proposed to improve data mining of 3D microarray gene expression data.
The prior art recognizes the value of identifying order-preserving clusters as opposed to other trends that are not order-preserving. For example, [12] proposes a technique for identifying order-preserving submatrices (OPSMs) within an n-by-m matrix, where each row corresponds to a gene and each column to an experiment. Their method effectively produces an n-by-m rank matrix, in which the m values in each row are the numbers from 1 to m. The (i, j) entry of the rank matrix is the rank of the readout of gene i in experiment j, out of the m readouts of this gene. Each row is therefore an example of an order-preserving abbreviated characterization of the m-vector of expression values. As is noted in [12], the OPSM problem is NP-complete. Identifying all of the columns for which some subset of the rows exhibits an order-preserving trend becomes computationally very expensive as the size of the matrix grows. Accordingly, [12] discloses a probabilistic model for uncovering a hidden OPSM with a reportedly “very high success rate”.
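By way of illustration only, the rank matrix described above may be sketched as follows. This is a minimal sketch and not the method of [12]: the function names, the sample values, and the rule that ties are broken by column order are all assumptions made for the example. It computes only the per-row ranks and does not perform the OPSM search over subsets of columns.

```python
def rank_row(values):
    """Return the 1-based rank of each value among the values of its own row.

    Ties are broken by column order (an assumption; [12] is not quoted here).
    """
    # Column indices sorted by the value they hold (sorted() is stable).
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0] * len(values)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def rank_matrix(matrix):
    """Replace every row of an n-by-m matrix by its row-wise ranks 1..m."""
    return [rank_row(row) for row in matrix]


# Hypothetical expression values: 3 genes (rows) by 3 experiments (columns).
expression = [
    [0.2, 1.5, 0.9],
    [10.0, 40.0, 25.0],
    [3.0, 1.0, 2.0],
]
ranks = rank_matrix(expression)
# The first two genes differ greatly in magnitude but share one rank vector,
# i.e. they follow the same order-preserving trend across the experiments.
same_trend = ranks[0] == ranks[1]
```

In this sketch the first two rows map to the same rank vector even though their magnitudes differ by more than an order of magnitude, which is the sense in which a row of the rank matrix is an abbreviated, order-preserving characterization of the expression values.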
A 3D cluster consists of a subset of genes that are coherent on a subset of samples along an interval of a time series. Coherent clusters may contain information used to identify phenotypes, associate genes with phenotypes, and identify expression rules. Triclustering was first introduced in [14], and a similar idea was mentioned in [13]. Pioneering works on triclustering algorithms relied on graph-based approaches to mine triclusters. Unfortunately, those methodologies introduce approximations in their design, and these approximations create a risk that significant triclusters will be missed, especially when the 3D microarray gene expression data cover only a short time series. For example, the algorithm described in [14] mines the maximal triclusters satisfying a constant multiplicative or additive relationship. Such a strict constraint considerably limits the capability of an algorithm to identify some useful patterns, and the algorithm may not be able to fully cope with noise when dealing with short time-series, or even general time-series, gene expression data. The triclustering algorithms developed in [15-16] are equally problematic in this regard.
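The strictness of a constant multiplicative or additive relationship can be illustrated with a short sketch. The code below is not the algorithm of [14]; the function names, the tolerance parameter, and the sample profiles are hypothetical and serve only to show how a modest amount of noise at a single time point defeats such a test on a short series.

```python
import math


def is_additive(x, y, tol=1e-9):
    """True if y[j] - x[j] is the same constant for every position j."""
    diffs = [b - a for a, b in zip(x, y)]
    return all(math.isclose(d, diffs[0], abs_tol=tol) for d in diffs)


def is_multiplicative(x, y, tol=1e-9):
    """True if y[j] / x[j] is the same constant for every position j.

    Assumes x contains no zeros (an assumption made for this sketch).
    """
    ratios = [b / a for a, b in zip(x, y)]
    return all(math.isclose(r, ratios[0], rel_tol=tol) for r in ratios)


# Hypothetical short time series (3 time points) for three profiles.
base = [1.0, 2.0, 4.0]
scaled = [2.0, 4.0, 8.0]   # exactly 2 x base
noisy = [2.0, 4.6, 8.0]    # same rising trend, one noisy time point

clean_match = is_multiplicative(base, scaled)
noisy_match = is_multiplicative(base, noisy) or is_additive(base, noisy)
```

Under these assumptions the exactly scaled profile passes the test, while the noisy profile fails both tests even though all three profiles rise monotonically across the time points.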
Applicant is aware of documents directed to methods and algorithms for analysis of microarray gene expression data, including: U.S. Pat. Nos. 6,965,831; 2005/0240357; 7,043,500; 2003/0129660; 2003/0224344; 2005/0240563; 2008/0027954; WO01/73602; U.S. Pat. No. 7,174,344; and U.S. Pat. No. 7,386,523. None of these deals with the identification of order-preserving patterns, and none of them deals with the clustering of 3D short time-series gene expression data. US2003/0224344 uses probabilistic modeling of the data and graph theoretic techniques to identify subsets of genes that jointly respond across a subset of attributes. The clustering algorithms disclosed in U.S. Pat. Nos. 7,043,500; 2003/0129660; 2003/0212702; 2005/0240357; 2005/0240563; 2008/0027954; 7,174,344; 7,386,523, and WO01/073602 are closer to traditional clustering algorithms, which are very similar to the classical K-means clustering algorithm. Some of these (U.S. Pat. No. 7,174,344; and U.S. Pat. No. 7,386,523) do not appear to apply specifically to analysis of gene expression data.
A coupled two-way clustering (CTWC) algorithm is described in US 2005/0240357 and U.S. Pat. No. 6,965,831. The CTWC algorithm defines a generic scheme for transforming a one-dimensional clustering algorithm into a biclustering, or 2D clustering, algorithm. It relies on having a traditional one-dimensional clustering algorithm that discovers significant clusters. Given such an algorithm, the coupled two-way clustering procedure recursively applies the one-dimensional algorithm to sub-matrices, aiming to find subsets of genes giving rise to significant clusters of attributes and subsets of attributes giving rise to significant subsets of genes.
There are also a number of prior art documents related to methods of analysis of gene expression data including data clustering as one of the processing steps. These documents include: U.S. Pat. Nos. 6,263,287; 6,876,930; 6,996,476; 7,010,430; 7,031,844; 7,031,847; 7,127,354; 7,289,911; 2002/0052692; 2002/0115070; 2002/0169560; 2003/0036071; 2004/0128080; 2005/0027460; 2005/10100929; 2005/0130187; 2006/0074566; 2006/0084075; WO 03/072701; WO 2006/087240; and WO 2008/102825.
There are also a number of documents related to applications of clustering of gene expression data. These documents illustrate various, mostly diagnostic, applications of gene clustering based on analysis of gene expression data, usually using DNA microarrays for measuring gene expression levels. These documents include: U.S. Pat. Nos. 7,257,562; 7,308,364; 2004/0009489; 2004/0077020; 2004/0101878; 2004/0162679; 2005/0048535; 2005/0202421; 2006/0078941; 2006/0282916; WO 01/30973; WO 02/059367; and JP 2008/225689.
However, there is still a need for a data analysis technique that is minimally affected by noise and that is able to cluster short sequences of values (e.g. 3-20, more preferably 3-8 values) according to a trend of the values.