DNA array technologies have made it possible to monitor the expression level of a large number of genetic transcripts at any one time (see, e.g., Schena et al., 1995, Science 270:467-470; Lockhart et al., 1996, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Nature Biotechnology 14:1649; Ashby et al., U.S. Pat. No. 5,569,588, issued Oct. 29, 1996). Of the two main formats of DNA arrays, spotted cDNA arrays are prepared by depositing PCR products of cDNA fragments with sizes ranging from about 0.6 to 2.4 kb, from full length cDNAs, ESTs, etc., onto a suitable surface (see, e.g., DeRisi et al., 1996, Nature Genetics 14:457-460; Shalon et al., 1996, Genome Res. 6: 639-645; Schena et al., 1996, Proc. Natl. Acad. Sci. U.S.A. 93: 10614-10619; and Duggan et al., Nature Genetics Supplement 21:10-14). Alternatively, high-density oligonucleotide arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface are synthesized in situ on the surface by, for example, photolithographic techniques (see, e.g., Fodor et al., 1991, Science 251:767-773; Pease et al., 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al., 1996, Nature Biotechnology 14:1675; McGall et al., 1996, Proc. Natl. Acad. Sci. U.S.A. 93:13555-13560; U.S. Pat. Nos. 5,578,832; 5,556,752; 5,510,270; and 6,040,138). Methods for generating arrays using inkjet technology for in situ oligonucleotide synthesis are also known in the art (see, e.g., Blanchard, International Patent Publication WO 98/41531, published Sep. 24, 1998; Blanchard et al., 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J. K. Setlow, Ed., Plenum Press, New York at pages 111-123). 
Efforts to further increase the information capacity of DNA arrays range from reducing feature size, so as to increase the number of probes in a given surface area, to sensitivity- and specificity-based probe design and selection aimed at reducing the number of redundant probes needed to detect each target nucleic acid, thereby increasing the number of target nucleic acids monitored without increasing probe density (see, e.g., Friend et al., International Publication No. WO 01/05935, published Jan. 25, 2001).
By simultaneously monitoring tens of thousands of genes, DNA array technologies have allowed, inter alia, genome-wide analysis of mRNA expression in a cell or a cell type or any biological sample. Aided by sophisticated data management and analysis methodologies, the transcriptional state of a cell or cell type as well as changes of the transcriptional state in response to external perturbations, including but not limited to drug perturbations, can be characterized on the mRNA level (see, e.g., Stoughton et al., International Publication No. WO 00/39336, published Jul. 6, 2000; Friend et al., International Publication No. WO 00/24936, published May 4, 2000). Applications of such technologies include, for example, identification of genes which are up regulated or down regulated in various physiological states, particularly diseased states. Additional exemplary uses for DNA arrays include the analyses of members of signaling pathways, and the identification of targets for various drugs. See, e.g., Friend and Hartwell, International Publication No. WO 98/38329 (published Sep. 3, 1998); Stoughton, International Publication No. WO 99/66067 (published Dec. 23, 1999); Stoughton and Friend, International Publication No. WO 99/58708 (published Nov. 18, 1999); Friend and Stoughton, International Publication No. WO 99/59037 (published Nov. 18, 1999); Friend et al., U.S. Pat. No. 6,218,122.
Protein microarrays are used to monitor the genome-wide protein expression in cells (i.e., the “proteome,” Goffeau et al., 1996, Science 274:546-567; Gygi et al., 1999, Nature Biotechnology 17:994-999). Protein microarrays contain binding sites comprising immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome (see, e.g., Zhu et al., 2001, Science 293:2101-2105; MacBeath et al., 2000, Science 289:1760-63; de Wildt et al., 2000, Nature Biotechnology 18:989-994). Proteins expressed in a cell can also be separated and measured by two-dimensional gel electrophoresis techniques. Two-dimensional gel electrophoresis is well-known in the art and typically involves iso-electric focusing along a first dimension followed by SDS-PAGE along a second dimension. See, e.g., Hames et al., 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et al., 1996, Proc. Natl. Acad. Sci. USA 93: 14440-14445; Sagliocco et al., 1996, Yeast 12:1519-1533; Lander, 1996, Science 274:536-539; and Beaumont et al., Life Science News 7, 2001, Amersham Pharmacia Biotech. The resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, Western blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and internal and N-terminal micro-sequencing. Using these techniques, it is possible to identify a substantial fraction of all the proteins produced under given physiological conditions, including in cells (e.g., in yeast) exposed to a drug, or in cells modified by, e.g., deletion or over-expression of a specific gene.
Analysis of Variance (ANOVA) methods (see, e.g., Statistics For Experimenters, by Box, Hunter and Hunter, John Wiley & Sons, 1978) are used in gene or protein expression data analysis to determine differential expression under different treatment conditions. In a one-way ANOVA, there is one experimental factor under investigation. For example, the factor may be the effect of several different compounds relative to the vehicle, which serves as the baseline reference. When the effects of a set of drugs are under investigation, the number of compounds is the number of levels of the factor. The goal is to determine from the measured data whether a gene or protein is affected by the compounds. If the expression level of the gene or protein is increased or decreased after the treatments, the gene or protein is said to be differentially expressed. In a two-way ANOVA, there are two factors under investigation, for example, the drug effect and the dosage effect. Each factor may have multiple levels. Interaction between the two factors is also included in the analysis.
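By way of illustration only, a two-way ANOVA with interaction is conventionally written as the additive model below. The symbols are assumptions introduced here for exposition and do not appear in the cited references: y_{ijk} denotes the k-th replicate measurement at level i of the first factor (e.g., the drug) and level j of the second factor (e.g., the dosage).

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
```

Here \mu is the overall mean, \alpha_i and \beta_j are the main effects of the two factors, (\alpha\beta)_{ij} is their interaction, and \varepsilon_{ijk} is the measurement error. A one-way ANOVA corresponds to the special case in which only \mu, \alpha_i and the error term are present.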
ANOVA is often used for determining whether there are statistical differences among the means of measurements in different measurement groups. As an example, the different measurement groups may contain measurements of expression levels of a gene or protein under different drug treatments. In each group, there may be several replicate measurements under the same treatment. First, one finds the within-group variance and the between-group variance. The within-group variance is the variance of the measurements within a treatment group. The between-group variance is the variance of the means of the different treatment groups. The within-group variance reflects the measurement error of the measurement technology, and the between-group variance includes both the measurement error of the measurement technology and changes caused by different treatments. Then the between-group variance is compared to the within-group variance. If the between-group variance is significantly larger than the within-group variance, it may be concluded that the different treatments have produced statistically significant changes in gene expression levels. In ANOVA, the underlying null hypothesis is that all treatment groups have the same mean. With the estimated mean squares and degrees of freedom, an F-statistic and its p-value can be calculated. The p-value is the probability, assuming the null hypothesis is true, of obtaining an F-statistic at least as large as the one observed. When the p-value is lower than a given threshold, for example p-value<0.01, the null hypothesis can be rejected and the alternative hypothesis, namely that at least some of the treatment groups have different means, can be accepted. In other words, some treatments have produced changes in the expression level of the gene.
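The one-way ANOVA computation described above can be sketched, by way of example only, as follows. The function name `one_way_anova` and the expression values are hypothetical and chosen purely for illustration. The sketch returns the F-statistic and the two degrees of freedom. The p-value would then be the upper-tail probability of the F distribution with those degrees of freedom, which may be obtained from a statistics library (for example `scipy.stats.f.sf`, assuming SciPy is available).

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    k = len(groups)                          # number of treatment groups
    n_total = sum(len(g) for g in groups)    # total number of measurements
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: spread of the group means
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of replicates around their own mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between     # between-group mean square
    ms_within = ss_within / df_within        # within-group mean square
    return ms_between / ms_within, df_between, df_within


# Hypothetical log-expression values for one gene: vehicle and two compounds
vehicle = [1.0, 2.0, 3.0]
drug_a = [2.0, 3.0, 4.0]
drug_b = [6.0, 7.0, 8.0]

f_stat, df_b, df_w = one_way_anova([vehicle, drug_a, drug_b])
print(f_stat, df_b, df_w)  # prints 21.0 2 6
```

In this example the group means are 2, 3 and 7, the between-group mean square is 21 and the within-group mean square is 1, so the F-statistic is 21 with 2 and 6 degrees of freedom.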
In a traditional ANOVA method, only measurement quantities, for example, the gene expression intensity or ratio, are used to determine the mean squares and degrees of freedom. The traditional ANOVA relies on a large number of measurement replicates to get a reliable estimation of the within-group variance. However, in gene or protein expression studies, limited by the small quantity of samples and the high cost of carrying out the measurements (such as DNA microarray measurements), the number of replicates is often small. Consequently, the degrees of freedom are often small. By random chance, due to the small number of replicates, the estimated within-group variance of expression levels of some genes or proteins can be very small, often much smaller than the actual measurement error inherent in the measurement technology. As a result, the between-group variance can be much larger than the underestimated within-group variance, leading to a small p-value, which in turn incorrectly indicates a statistically significant difference in the expression levels of the gene or protein. Such incorrect identification of a gene or protein is called a “false positive.” High false positive rates are a severe problem in gene expression analysis when the traditional ANOVA method is used. A large number of false positives often requires follow-up validation using other expression profiling technologies. Low degrees of freedom also reduce the detection sensitivity. As a result, small changes in expression may not be detected. Such missed or undetected differentially expressed genes or proteins are called “false negatives.”
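The effect of a small number of replicates on the estimated within-group variance can be illustrated with a short simulation. This is a sketch under stated assumptions, not a description of any method in the cited references: measurements are assumed to be pure Gaussian noise with a known standard deviation and no treatment effect, and the function names, the threshold `factor`, and the numbers of groups, replicates and simulated genes are all hypothetical.

```python
import random


def pooled_within_variance(groups):
    """Pooled within-group variance: within sum of squares / within df."""
    ss = 0.0
    df = 0
    for g in groups:
        m = sum(g) / len(g)
        ss += sum((x - m) ** 2 for x in g)
        df += len(g) - 1
    return ss / df


def fraction_underestimated(n_groups, n_reps, true_sd=1.0,
                            n_genes=5000, factor=0.25, seed=0):
    """Fraction of simulated genes whose estimated within-group variance
    falls below `factor` times the true measurement variance."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_genes):
        # No treatment effect: every group is drawn from the same distribution
        groups = [
            [rng.gauss(0.0, true_sd) for _ in range(n_reps)]
            for _ in range(n_groups)
        ]
        if pooled_within_variance(groups) < factor * true_sd ** 2:
            count += 1
    return count / n_genes


few = fraction_underestimated(n_groups=3, n_reps=2)    # 3 degrees of freedom
many = fraction_underestimated(n_groups=3, n_reps=11)  # 30 degrees of freedom
print(few, many)
```

Under these assumptions, with two replicates per group a non-negligible fraction of the simulated genes (on the order of ten percent) has an estimated within-group variance below one quarter of the true variance, whereas with eleven replicates per group this essentially never occurs. Genes in the former category are the ones most prone to spuriously large F-statistics.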
Measurement errors in microarray experiments are often described by error models (see, e.g., Supplementary material to Roberts et al., 2000, Science, 287:873-880; and Rocke et al., 2001, J. Computational Biology 8:557-569). Measurement errors can also be described as a sum of a common error and a scatter error (see, e.g., Stoughton et al., U.S. Pat. No. 6,351,712). An error-weighted average has been used in combining ratio profiles (see, e.g., Stoughton et al., U.S. Pat. No. 6,351,712).
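One common form of error-weighted average is inverse-variance weighting, in which each measurement is weighted by the reciprocal of its squared error. The sketch below illustrates that general technique only. It is not asserted to reproduce the specific formula of the cited patent, and the function name `error_weighted_mean` and the example values are hypothetical.

```python
import math


def error_weighted_mean(values, errors):
    """Inverse-variance weighted mean and its propagated standard error.

    values: replicate measurements (e.g., log ratios for one gene)
    errors: the standard error assigned to each measurement
    """
    weights = [1.0 / (e * e) for e in errors]   # weight = 1 / sigma^2
    total = sum(weights)
    avg = sum(w * v for w, v in zip(weights, values)) / total
    return avg, math.sqrt(1.0 / total)


# Two hypothetical replicates; the second has half the error of the first,
# so it receives four times the weight.
avg, err = error_weighted_mean([1.0, 3.0], [1.0, 0.5])
print(avg, err)
```

With weights of 1 and 4, the weighted mean is (1 + 12) / 5 = 2.6, closer to the more precise measurement than the unweighted mean of 2.0.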
Various ANOVA models have been described for analyzing microarray data (see, e.g., Kerr et al., 2000, J. Computational Biol. 7:819; Kerr et al., 2001, Genetical Research 77:123; Wolfinger et al., 2001, J. Computational Biol. 8:625; Pritchard et al., 2001, Proc. Natl. Acad. Sci. USA 98:13266; Lonnstedt et al. 2002, Statistica Sinica 12:31; and Wu et al., “MAANOVA: A software package for the Analysis of Spotted cDNA Microarray Experiments,” published on the web). These methods are normally applied to transformed microarray data, e.g., logarithmic transformed data, to decompose microarray data into different terms according to sources of variations, e.g., variations due to arrays, dyes, genes, and interactions thereof. Measured expression level changes due to one or more of such sources, e.g., expression level changes as a result of real changes in gene expression in the cells, can then be determined.
It is therefore desirable to have methods that are more accurate in determining differences in measured data among different perturbation groups. It is also desirable to have methods for analyzing gene or protein expression with improved false-positive and/or false-negative rates.
Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.