Living things have unique genomes that are characterized by differences in the copy number of genes or in levels of transcription of protein agents from genes within their cellular environments. By detection and analysis of such genomic data, insight can be gained into the unique genetic makeup of such living things. In particular, the detection and analysis of differential gene expression in human subjects as indicators of the causes of or propensities toward certain diseases can lead to the discovery of genetic markers for those diseases and the development of pharmaceuticals and treatments to cure such diseases.
Although in principle the human genome contains all the information one would need to determine the genetic makeup of a person and the causes of, or that person's predisposition to, certain diseases, in actuality the genetic information is encoded in complex ways and is subject to complex stages of transcription and change into final protein agents, which are themselves subject to intermediation by multiple physical and biological constructs that have yet to be fully understood. It is generally understood that the human genome consists of some 22,000 to 24,000 genes, which, through complex transcription sequences and interactions, can result in a currently estimated 2,000,000 different protein agents in the body. The combined effect of genome-level variations and physical and biological interactions makes it difficult to predict the expression of any particular gene or gene grouping, and even more difficult to predict the result of a final transcription of a protein agent and its effect in the body. The genome-centric view of molecular biology of the recent past is thus evolving into a more comprehensive “systems” view in which information at the various transcriptional levels, from genes to proteins to effects in the body, is gathered in dispersed databases and harnessed together in a pipeline from research to discovery. The ideal vision for developing and harnessing such comprehensive knowledge is ultimately “personalized medicine”, in which a complete description of a person's clinical symptoms, medical history, gene-expression data, metabolic parameters, and treatment results can lead to finding the causes and the cures for any diseases experienced by that person. For a more complete overview of this subject, reference is made to Augen, J., “Bioinformatics in the Post-Genomic Era”, Addison Wesley, 2004.
The development of a complete molecular-level picture of health and disease involves understanding metabolic processes at four distinct levels: genome (DNA sequence), transcriptome (messenger RNA profile), proteome (protein structure and interactions), and metabolic pathway (action and effect in the body). Messenger RNA (or mRNA) profiling technologies are emerging as standard tools for research and clinical use at all these levels. It has become apparent that the transcription expression of large numbers of genes (if not the entire genome) needs to be determined in parallel to achieve an understanding of complex metabolic events. For example, it is estimated that as many as 10% of the 10,000 to 20,000 mRNA species in a typical mammalian cell are differentially expressed between cancerous and normal tissues.
In simple terms, a gene impacts biological function by transcribing its DNA sequence into messenger RNA (mRNA), which in turn is translated into a corresponding protein. Proteins are the molecular components that directly control all biological systems. A gene is “expressed” if it is actively transcribing mRNA. The DNA hybridization array (also referred to as a microarray, expression array, or gene chip) has become the dominant technology for quantifying the expression of many genes in parallel because of its low cost and flexibility. DNA microarrays allow a biologist to obtain a snapshot of the expression levels of a large set, or all, of the genes of a subject genome in a given tissue sample at a given point in time. Using DNA chips, patterns of global gene expression can be compared between normal and abnormal tissue samples to detect genes whose expression is significantly changed in the abnormal condition. Gene expression data analysis can also be used to determine how genes change their expression in a single tissue sample over time. Detecting the set of all “differentially expressed” genes over space and time is an essential first step toward a comprehensive understanding of the pathobiology of many common diseases. Such understanding can lead to new diagnostic and prognostic applications and novel therapeutic interventions and drugs.
A DNA microarray is constructed with thousands of gene sequence fragments encoded as spots or points on a substrate, including known or predicted variants and potential polymorphisms to support a large number of cross comparisons of expression data. mRNA is harvested from selected cells in treated or symptomatic subjects as well as from control or untreated subjects, and reverse transcribed into more stable complementary DNA (cDNA) tagged with fluorescent labels: green for cDNA derived from treated cells, and red for cDNA from untreated cells. The samples of fluorescently labeled cDNA are applied to the microarray and exposed to every spot. A sequence match results in binding between the cDNA test sequence and a complementary DNA sequence on the array (hybridization), resulting in fluorescent labeling of the spot. A laser fluorescent scanner is used to detect the hybridization signals from both fluorophores, and the resulting pattern of colored spots is stored in a database: green for strongly expressed genes in the treated sample, red for strongly expressed genes in the untreated sample, and black for sequences that are not expressed in either sample. Because the sequence of every spot in the chip is known, the identity of each expressed cDNA sequence can be determined, and the relative amount and source (treated or untreated sample) can be inferred from the color and intensity of the spot.
Microarrays with probes of various types can be employed for testing different expression patterns in a study's subjects. One type of microarray is the oligonucleotide microarray, such as the Gene Chip™ microarray offered by Affymetrix Corporation, of Palo Alto, Calif. For measuring gene expression levels using oligonucleotide microarrays, expected RNA transcripts of target genes can be measured by probes which are perfectly complementary to the target sequences (referred to as perfect match PM probes). Probes may also be provided whose sequences are deliberately selected not to match the target sequences (referred to as mismatched MM control probes). Since sequences that are different from the target sequences may also bind to the PM probes that correspond to particular target sequences, the fluorescence signals from such sequences would appear as noise. Signal-to-noise ratio can be improved by calculating the difference between signals from sequences that bind to PM probes and signals from sequences that bind to MM probes.
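The PM-minus-MM correction described above can be illustrated with a minimal sketch. The intensities and the function name `background_corrected_signal` are hypothetical, and averaging the pair differences is only one simple way to summarize a probe set; it is an assumption made for illustration, not the algorithm of any particular microarray vendor.

```python
from statistics import mean

def background_corrected_signal(pm, mm):
    """Average the PM - MM differences over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical fluorescence intensities for the probe pairs of a single gene.
pm_intensities = [1200.0, 980.0, 1500.0, 1100.0]  # perfect match probes
mm_intensities = [300.0, 280.0, 400.0, 320.0]     # mismatch control probes

signal = background_corrected_signal(pm_intensities, mm_intensities)
print(signal)
```

Subtracting the MM intensity from each PM intensity removes the portion of the signal attributable to non-specific binding, under the assumption that the MM probe captures roughly the same noise as its PM partner.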
Due to the large amounts of data generated for entire genomic sets, for example, about 22,000 to 24,000 genes in the case of human beings, advanced quantitative methods are needed to determine whether detected differences in gene expression in microarray probes are experimentally significant. To point the way toward possible discovery and treatment of diseases indicated in small groups of human subjects, perhaps even an individual test subject, it would be desirable to have a method which can analyze a relatively small number of samples and provide a measure of acceptable statistical confidence in the detection of a particular gene expression pattern in the small test group.
Prior methods of small-group detection have been based on conventional t-tests to provide a probabilistic assessment that a detected gene expression pattern is significant, as opposed to a false positive detected randomly in noise. In conventional t-tests, a false positive probability of 1% (p=0.01) may be deemed significant in experiments involving a small number of genes. However, in a microarray experiment covering 10,000 genes, a 1% false positive probability would be expected to flag about 100 genes as significant by chance alone. Moreover, the amount and type of unwanted variation in DNA chip data make discriminating true differential expression from noise a difficult task. Finally, the small number of DNA chip samples in any given study relative to the large number of genes renders classical statistical methods ineffective and error-prone.
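The arithmetic behind this multiple-testing concern can be sketched as follows. The function names are hypothetical, and the second function assumes that the per-gene tests are independent and that no gene is truly differentially expressed, which is a simplification.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Compare a small experiment with a microarray-scale experiment at p = 0.01.
for n_genes in (10, 10_000):
    count = expected_false_positives(n_genes, 0.01)
    risk = prob_at_least_one_false_positive(n_genes, 0.01)
    print(n_genes, count, round(risk, 4))
```

With 10 genes the expected number of chance findings is well below one, whereas with 10,000 genes it is about 100, which is why a per-gene threshold that is reasonable for small experiments becomes unreliable at microarray scale.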
One approach for further differentiating statistical significance of microarray data from false positives is known as the “fold change” method. In this method, a number of genetic samples are deliberately subjected to a physical change, such as a chemical reaction or physical manipulation or exposure (e.g., radiation), and their gene expression is compared to that of other samples that have not been subjected to such physical change. The “fold change” method is used to identify genes whose expression differs, by more than a determined threshold, between the samples subjected to the physical change and the samples not subjected to it. However, the “fold change” method is limited by the types of physical changes that can be employed corresponding to particular diseases or risk propensities being tested, and can also yield unacceptably high false discovery rates. Some attempts to improve on the “fold change” method, such as observing a fold change consistently between paired samples, are still limited and can yield an unacceptably high false discovery rate. See Quackenbush, J., Microarray data normalization and transformation, (2002) Nature Genetics Supplement, Vol. 32, 496-501.
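A minimal sketch of a fold-change call is given below. The gene names, the expression values, and the two-fold threshold are all hypothetical choices made for illustration; the source does not specify a particular threshold.

```python
def fold_change_calls(treated, control, threshold=2.0):
    """Label genes whose treated/control ratio crosses an assumed threshold."""
    calls = {}
    for gene, treated_value in treated.items():
        control_value = control[gene]
        if control_value <= 0:
            continue  # a ratio is undefined for non-positive control values
        ratio = treated_value / control_value
        if ratio >= threshold:
            calls[gene] = "up"
        elif ratio <= 1.0 / threshold:
            calls[gene] = "down"
    return calls

# Hypothetical mean expression levels per gene in each condition.
treated = {"geneA": 800.0, "geneB": 510.0, "geneC": 100.0}
control = {"geneA": 200.0, "geneB": 500.0, "geneC": 450.0}

calls = fold_change_calls(treated, control)
print(calls)
```

Because the call depends only on a ratio of averages, it ignores how much the measurements vary between replicates, which is one reason a fixed threshold can admit many false discoveries.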
Conventional techniques analyze differences in gene expression levels that are both positive (up-regulated) and negative (down-regulated), so that negative expression values are possible during analysis. A standard method of “visualizing” both up-regulated and down-regulated genes between two (2) biological conditions plots fold change against the geometric average of expression on a log-log scale. The resulting Ratio-Intensity (RI) plot displays differentially expressed genes on the periphery of the data cloud. Moreover, a typical RI scatter plot shows that the variation of fold change is dependent on intensity, a situation which complicates statistical analysis and interpretation of results.
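The coordinates of a point on such a plot can be sketched as follows. The use of base-2 logarithms on both axes and the function name `ri_coordinates` are assumptions; other log bases are also used in practice.

```python
import math

def ri_coordinates(treated, control):
    """Return (log2 geometric-mean intensity, log2 fold change) for one gene."""
    log_ratio = math.log2(treated / control)            # vertical axis
    log_intensity = 0.5 * math.log2(treated * control)  # horizontal axis
    return log_intensity, log_ratio

# Hypothetical intensities: the same gene pair with the conditions swapped.
print(ri_coordinates(1024.0, 256.0))  # up-regulated: positive log ratio
print(ri_coordinates(256.0, 1024.0))  # down-regulated: negative log ratio
```

On the log scale, a four-fold increase and a four-fold decrease sit symmetrically above and below zero, which is how negative values arise for down-regulated genes.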
Another method of statistical analysis of gene expression data is the so-called “Significance Analysis of Microarrays” (SAM), for example, as disclosed in U.S. patent application 2002/019,704 of Tusher, V., Tibshirani, R., and Chu, C., published on Feb. 14, 2002. This method identifies genes with statistically significant differences in expression by assigning each gene a modified t-score representing differences in gene expression relative to a standard deviation of repeated measurements for that gene. Genes with absolute scores greater than an adjustable threshold are deemed potentially significant. A smoothing factor incorporated into the modified t-score renders the resulting analysis substantially independent of the ranges of values that characterize the genes. A confidence measure known as the false discovery rate (FDR) is used to assess the statistical significance of the collection of genes called significant by SAM. FDR is defined as the expected proportion of false positives among all genes called significant. The goal is to obtain a reasonably large list of significant genes with acceptably small FDR. A major feature of FDR as implemented in SAM is the automatic accounting for bias introduced by multiple testing of thousands of genes at once (i.e., multiple testing problem). Unlike the standard Bonferroni adjustment for multiple testing, FDR maintains sensitivity for differential expression without sacrificing specificity. FDR also allows genomic researchers to assess the risk of allocating more time and resources to a specific gene or group of genes. Finally, SAM uses permutation testing to estimate FDR for a given set of significant genes, thus precluding the need for distributional assumptions about the data under the null hypothesis of no differential expression. 
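The modified t-score can be illustrated with a short sketch. It follows the commonly published two-class form of the SAM statistic, in which a constant `s0` is added to the pooled standard error in the denominator. The data values and the value of `s0` are hypothetical, and the procedure SAM uses to select `s0` is not reproduced here.

```python
import math
from statistics import mean

def modified_t_score(treated, control, s0):
    """Difference in group means divided by (pooled standard error + s0)."""
    n1, n2 = len(treated), len(control)
    m1, m2 = mean(treated), mean(control)
    sum_sq = (sum((x - m1) ** 2 for x in treated)
              + sum((x - m2) ** 2 for x in control))
    scale = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    pooled_se = math.sqrt(scale * sum_sq)
    return (m1 - m2) / (pooled_se + s0)

# Hypothetical repeated measurements of one gene in each condition.
treated = [10.0, 12.0, 14.0]
control = [4.0, 6.0, 8.0]

print(modified_t_score(treated, control, s0=0.0))  # ordinary t-statistic
print(modified_t_score(treated, control, s0=1.0))  # damped by the constant
```

Adding `s0` keeps genes with very small measured variance from receiving extreme scores, which is the smoothing effect described above.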
As a result of this reliance on permutation testing, SAM analysis of microarray experiments involving small numbers of DNA chips is problematic, since the number of possible permutations will be insufficient to accurately estimate the true FDR.
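The scale of this limitation can be seen by counting the distinct ways of reassigning group labels. The sketch below assumes a two-class, unpaired design, and the function name is hypothetical.

```python
import math

def distinct_label_permutations(n_treated, n_control):
    """Number of distinct ways to split the chips into the two groups."""
    return math.comb(n_treated + n_control, n_treated)

# Hypothetical study sizes, given as chips per group.
for n in (2, 3, 8):
    print(n, distinct_label_permutations(n, n))
```

With three chips per group there are only 20 distinct relabelings, so the permutation-based null distribution is very coarse; with eight chips per group there are 12,870.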