Recent advances in the biological sciences have made it possible to measure thousands of gene expression levels in cells using microarrays, and to quantify the cellular content of thousands of types of proteins using, e.g., mass spectrometry, in single experiments with one sample of tissue or serum. Gene expression molecules, i.e., messenger RNA (mRNA), and the proteins encoded by the mRNA comprise the “transcriptome” of the cell, the components of the cell that are the direct products of the transcription (the genome) and translation (the proteome) of DNA. Analysis of the transcriptome offers the potential to identify “bad actors” in the genetic expression process that are responsible for disease, and may offer both a signature for disease diagnosis and targets for disease amelioration, whether through drugs or other means. Measuring thousands of components at once can greatly accelerate the pace of biomarker discovery, and allows the interactions between multiple components to be analyzed for complex disease mechanisms.
Genomic expression data, representing semi-quantitatively the amount of mRNA present in a cell on a gene-by-gene basis, may reveal markers for disease as well as patterns of expression that effectively differentiate or classify disease states and progression. Similarly, quantification of the count of proteins in the cell by protein molecule type provides further insight into the overall regulation of the cell from DNA to protein product, and provides key insight into “post-translational modifications” whereby protein structures are modified after translation from RNA. Pattern matching unknown patient samples against known microarray expression level patterns of disease state may provide diagnosis or prognosis. Disease differentiation is key to applying the appropriate treatment. Cancers may manifest as seemingly similar phenotypes in identical tissue, yet have very different prognoses and treatment outcomes, because the biological pathway of disease is in fact different at the genetic level. It is therefore essential to be able to identify what form of disease a patient has. Another useful application of this kind of data is classifying patients as to whether or not they are candidates for a drug which may, in a small subpopulation, have adverse effects related to the genetic makeup of the individual.
This new wellspring of genomic and proteomic data presents many challenges to discovery of the important information buried therein. One challenge is the high dimensionality of the data and the relative scarcity of patient observations. Often, thousands of gene expression levels are measured by a single microarray applied to patient tissue or a cell culture, but the total count of patients may be fewer than 20. Each such patient represents an observation of a vector comprising these thousands of expression levels. Another challenge is that the data can be extremely noisy, confounding complex analysis approaches that are sensitive to bad input. Yet another challenge is that the new measurement technologies are very sensitive to methodology, yielding vastly different results when samples are prepared by different researchers. Microarrays themselves have yet to be standardized, and microarrays from different manufacturers often have different oligonucleotide probes for the same gene, introducing variability due to oligonucleotide length or affinity. Yet another challenge is that the data often represents a mere snapshot of transcriptomic content, which is in a constant state of flux, as cellular genetic expression changes to adjust to the cell's needs and environment.
A number of methods are known for analysis of genomic and proteomic data. According to one known method, expression levels for genes are compared across samples of normal tissue and diseased tissue, and where differential expression is sufficiently large, the gene is identified as a possible biomarker of the disease. This method is intended to identify gross amplification of a gene, or the complete silencing of a gene. According to known clustering methods, pair-wise correlations of the expression levels of pairs of genes are calculated, and genes are arranged in clusters according to a ranking of their pair-wise correlation across a number of samples. According to an approach that uses Support Vector Machines (SVMs), an optimal separation boundary with maximum margin (or maximum soft margin) is generated (possibly in a higher-dimensional space) for multivariate expression levels from two classes (normal and diseased), thus providing a classifier for future patterns.
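The first two known methods described above can be illustrated with a minimal sketch. It is not drawn from any particular prior-art implementation: the gene names, expression values, log2 fold-change measure, and threshold of 1.0 are all hypothetical choices made for illustration, and the SVM approach is omitted because it would require a numerical optimization library.

```python
import math
from statistics import mean

# Hypothetical toy data: expression levels per gene across three samples each.
normal = {"GENE_A": [100, 110, 90], "GENE_B": [50, 55, 45], "GENE_C": [200, 190, 210]}
diseased = {"GENE_A": [400, 420, 380], "GENE_B": [52, 48, 50], "GENE_C": [50, 45, 55]}


def log2_fold_change(normal_levels, diseased_levels):
    """Log2 ratio of mean diseased expression to mean normal expression."""
    return math.log2(mean(diseased_levels) / mean(normal_levels))


def flag_differential(expr_normal, expr_diseased, threshold=1.0):
    """Return genes whose absolute log2 fold change meets the (assumed) threshold."""
    flagged = {}
    for gene in expr_normal:
        lfc = log2_fold_change(expr_normal[gene], expr_diseased[gene])
        if abs(lfc) >= threshold:
            flagged[gene] = lfc
    return flagged


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_gene_pairs(expr):
    """Rank all gene pairs by absolute pair-wise correlation across samples."""
    genes = sorted(expr)
    pairs = [
        (g1, g2, pearson(expr[g1], expr[g2]))
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Gross differential expression: GENE_A is amplified, GENE_C is suppressed.
flagged = flag_differential(normal, diseased)

# Pair-wise correlation ranking over all samples (normal followed by diseased).
combined = {gene: normal[gene] + diseased[gene] for gene in normal}
ranked = rank_gene_pairs(combined)
```

In this toy data, `flagged` contains only the two genes whose mean expression changes fourfold, and the top-ranked pair in `ranked` is the one whose levels move in opposite directions between the two tissue classes. A clustering method would then group genes based on such a ranking.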
However, these methods have their shortcomings. Gross differential expression analysis assumes static expression levels, and is easily confounded by expression dynamics. Furthermore, many diseases may be the result of deviations in expression that are more subtle than differential expression studies can typically detect. Clustering only takes advantage of correlative information between pairs of genes, and thus misses more complex interactions and multi-gene dynamics. SVMs can be very computationally intensive, and produce optimal solutions that may overfit the typically small sample size, and as a consequence do not generalize well.
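The overfitting risk that comes with few samples and thousands of measured variables can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not an SVM: it generates pure noise for 10 hypothetical samples and 5000 hypothetical features, assigns arbitrary class labels, and counts the features for which a single threshold nonetheless separates the two classes perfectly.

```python
import random


def perfectly_separating_features(samples, labels):
    """Indices of features where one threshold splits class 0 from class 1."""
    hits = []
    n_features = len(samples[0])
    for j in range(n_features):
        class0 = [s[j] for s, y in zip(samples, labels) if y == 0]
        class1 = [s[j] for s, y in zip(samples, labels) if y == 1]
        # The classes are separable on feature j if their value ranges do not overlap.
        if max(class0) < min(class1) or max(class1) < min(class0):
            hits.append(j)
    return hits


# Assumed sizes, chosen to resemble "thousands of genes, fewer than 20 patients".
rng = random.Random(0)
n_samples, n_features = 10, 5000
labels = [0] * 5 + [1] * 5  # arbitrary labels, unrelated to the data

# Every value is independent Gaussian noise, so no feature carries real signal.
samples = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]

spurious = perfectly_separating_features(samples, labels)
```

With five samples per class, a given noise feature separates the classes by chance with probability 2/252, so dozens of the 5000 features are expected to appear in `spurious`. Any classifier flexible enough to exploit such features fits the training samples perfectly while carrying no information about new patients.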
What are needed are improved analytic methods capable of tapping into the dynamic information in multivariate transcriptomic data, to reveal more subtle signatures for classification and data mining of biomarkers. Further, what are needed are faster methods that can be used interactively by discovery personnel, allowing them to leverage their extensive expertise in the exploration and discovery process. Further, methods are needed that are robust to the noisy nature of transcriptome data, so that information can be derived from data at today's level of accuracy in microarray and mass spectrometry measurement technologies.