This invention relates in general to statistical analysis of gene-related data and, in particular, to analysis of microarray data for identifying genes that exhibit statistically significant behavior.
Different biological systems are characterized by differences in the copy number of genes or in the levels of transcription of particular genes. Measuring such biological phenomena may yield insight into human diseases and possible treatments for them.
Microarrays of various types have been employed for measuring the expression levels of large numbers of genes. One type of microarray is the oligonucleotide microarray, one example of which is the Gene Chip® microarray manufactured by Affymetrix corporation of California. International Patent Application PCT/US96/14839, which is incorporated herein in its entirety, describes a method for measuring gene expression levels using oligonucleotide microarrays. In the method described, a nucleic acid sample is hybridized to a high density array of oligonucleotide probes immobilized to a surface, where the high density array contains oligonucleotide-type probes complementary to sequences of the target nucleic acids in the nucleic acid sample. For example, RNA transcripts of one or more target genes may be hybridized to an array of oligonucleotide probes immobilized on a surface such as that of a semiconductor chip. Some of the probes on the surface have sequences that are perfectly complementary to particular target sequences and are referred to herein as perfect match (PM) probes. Also present on the chip are probes whose sequences are deliberately selected not to be perfectly complementary to a target sequence. Such probes are referred to as mismatched (MM) control probes, where for each PM probe, there is an MM control probe for the same particular target sequence. The mismatch may comprise one or more bases. Thus, a biological sample such as an mRNA sample can be analyzed for gene expression by hybridization to the above-described microarray on a chip. The presence of RNA sequences that bind to the oligonucleotide probes on the chip is then detected by methods such as tagging with a fluorescent material and then detecting the fluorescence. Since sequences that are different from the target sequences may also bind to the PM probes that correspond to such target sequences, the fluorescence signals from such sequences would appear as noise.
The signal-to-noise ratio is improved by calculating the difference between the signals from sequences that bind to the PM probes and the signals from sequences that bind to the MM probes.
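The PM minus MM correction can be illustrated with a minimal Python sketch. The function name `average_difference`, the averaging across probe pairs, and the intensity values are assumptions made for illustration only; the text above specifies only that a difference between PM and MM signals is calculated.

```python
from statistics import mean

def average_difference(pm_signals, mm_signals):
    """Mean of (PM - MM) over the probe pairs of one gene.

    Averaging across probe pairs is an assumption of this sketch;
    the description only states that a PM/MM difference is taken.
    """
    if len(pm_signals) != len(mm_signals):
        raise ValueError("each PM probe needs a matching MM probe")
    if not pm_signals:
        raise ValueError("at least one probe pair is required")
    return mean(pm - mm for pm, mm in zip(pm_signals, mm_signals))

# Hypothetical fluorescence intensities for three probe pairs.
pm = [120.0, 95.0, 210.0]
mm = [40.0, 100.0, 60.0]
print(average_difference(pm, mm))
```

Note that an individual difference can be negative (the second pair above gives 95 - 100 = -5), and so can the average when the MM signals dominate.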
Another type of microarray that has been used for analyzing gene expression utilizes cDNA probes. Although massive amounts of data are generated using oligonucleotide or cDNA probes, quantitative methods are needed to determine whether differences in gene expression are experimentally significant. Previous work on microarrays has utilized cluster analysis to find coherent expression patterns among genes or in cells. See, for example, the following three articles:
    1. Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwal, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., Marti, G., Moore, T., J, H., Lu, L., Lewis, D., Tibshirani, R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, K., Levy, R., Wilson, W., Greve, M., Byrd, J., Botstein, D., Brown, P. & Staudt, L. (2000) Nature 403, 503-511.
    2. Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868.
    3. Weinstein, J., Myers, T., O'Connor, P., Friend, S., Fornace, A., Kohn, K., Fojo, T., Bates, S., Rubinstein, L., Anderson, N., Buolamwini, J., van Osdol, W., Monks, A., Scudiero, D., Sausville, E., Zaharevitz, D., Bunow, B., Viswanadhan, V., Johnson, G., Wittes, R. & Paull, K. (1997) Science 275, 343-349.
Cluster analysis works best for a large number of samples. Moreover, cluster analysis provides little information about statistical significance. To answer biologically important questions, a method is needed which can analyze a relatively small number of samples and provide a measure of statistical certainty. Methods based on conventional t-tests provide the probability (p) that a difference in gene expression occurred by chance. See, for example, the following articles:
    4. Roberts, C., Nelson, B., Marton, M., Stoughton, R., Meyer, M., Bennett, H., He, Y., Dai, H., Walker, W., Hughes, T., Tyers, M., Boone, C. & Friend, S. (2000) Science 287, 873-880.
    5. Galitski, T., Saldanha, A., Styles, C., Lander, E. & Fink, G. (1999) Science 285, 251-254.
In conventional t-tests, p=0.01 may be significant in the context of experiments designed to evaluate small numbers of genes. However, at that threshold, a microarray experiment covering 10,000 genes would be expected to identify about 100 genes by chance alone (10,000 × 0.01 = 100).
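The arithmetic behind this multiple-testing concern can be sketched as follows. The function names are hypothetical, and the sketch assumes one test per gene, that no gene is truly changed, and (for the second function only) that the tests are independent; none of these assumptions is stated in the text above.

```python
def expected_false_positives(n_genes, p_threshold):
    """Expected number of genes called significant by chance,
    assuming one test per gene and no gene truly changed."""
    return n_genes * p_threshold

def prob_at_least_one_false_positive(n_genes, p_threshold):
    """Chance of one or more false calls, additionally assuming
    that the per-gene tests are independent."""
    return 1.0 - (1.0 - p_threshold) ** n_genes

print(expected_false_positives(10_000, 0.01))      # about 100 genes
print(prob_at_least_one_false_positive(10, 0.01))  # small panel of genes
print(prob_at_least_one_false_positive(10_000, 0.01))
```

Under these assumptions, a threshold that is acceptable for a panel of ten genes makes at least one false call nearly certain when applied to 10,000 genes.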
One approach for ascertaining the statistical significance of microarray data is known as the “fold change” method. In this approach, if one were interested in measuring the effects of radiation on gene expression, a number of biological samples would be subjected to radiation and their gene expression then measured. Other biological samples would be measured without being subjected to radiation. The “fold change” method identifies a gene as having been changed significantly by the radiation if the ratio of its average expression measured after exposure to radiation to that measured without exposure is greater than a certain threshold or less than another threshold. As further explained below, the “fold change” method, in some instances, yields unacceptably high false discovery rates.
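A minimal Python sketch of the “fold change” rule described above follows. The function name `fold_change_calls`, the thresholds of 2.0 and 0.5, the gene names, and the expression values are all assumptions for illustration; the text does not specify thresholds. The sketch averages both the irradiated and the unirradiated measurements, and it assumes strictly positive expression values so that the ratio is meaningful.

```python
from statistics import mean

def fold_change_calls(irradiated, unirradiated, upper=2.0, lower=0.5):
    """Flag each gene whose ratio of mean expression (irradiated over
    unirradiated) is above `upper` or below `lower`.

    Both arguments map a gene name to a list of positive expression
    values. The default thresholds are illustrative assumptions.
    """
    calls = {}
    for gene, values in irradiated.items():
        ratio = mean(values) / mean(unirradiated[gene])
        calls[gene] = ratio > upper or ratio < lower
    return calls

# Hypothetical expression values for two genes, two samples per condition.
irradiated = {"geneA": [200.0, 220.0], "geneB": [100.0, 110.0]}
unirradiated = {"geneA": [90.0, 110.0], "geneB": [95.0, 105.0]}
print(fold_change_calls(irradiated, unirradiated))
```

The rule looks only at the ratio of averages and ignores the spread of the individual measurements, which is one way to see why it offers no measure of statistical certainty.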
In one attempt to improve on the “fold change” method, genes are identified as significantly changed if a certain fold change is observed consistently between paired samples. While this yields a moderate improvement over the “fold change” method, this improved “pair wise fold change” method still yields a rather high false discovery rate.
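The pairwise variant can be sketched in the same way. The function name, the threshold of 2.0, the `min_fraction` parameter used to express "consistently", and the sample values are assumptions for illustration; the text does not define how many pairs must agree.

```python
def pairwise_fold_change_call(treated, control, threshold=2.0, min_fraction=1.0):
    """Flag a gene if at least `min_fraction` of the paired samples show
    a fold change in the same direction beyond `threshold`.

    `treated[i]` and `control[i]` form one pair of positive expression
    values. The threshold and fraction are illustrative assumptions.
    """
    if len(treated) != len(control) or not treated:
        raise ValueError("need a non-empty, equal number of paired samples")
    ratios = [t / c for t, c in zip(treated, control)]
    up = sum(r > threshold for r in ratios)          # pairs increased
    down = sum(r < 1.0 / threshold for r in ratios)  # pairs decreased
    needed = min_fraction * len(ratios)
    return up >= needed or down >= needed

# Hypothetical paired measurements for one gene (ratios 2.1, 2.5, 1.9).
treated = [210.0, 250.0, 190.0]
control = [100.0, 100.0, 100.0]
print(pairwise_fold_change_call(treated, control))                    # all pairs must agree
print(pairwise_fold_change_call(treated, control, min_fraction=0.6))  # two of three suffice
```

Requiring agreement across pairs filters out genes whose average is driven by a single sample, but the rule still attaches no probability to a call.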
As noted above, conventional techniques analyze differences in gene expression levels, such as PM-MM, so that negative expression values are possible during analysis. Conventional methods of calculation and graphical representation employ log-log plots, which do not permit negative values. Where linear plots are used instead to represent such possible negative values, most of the values in the plots tend to congregate in a small area, so that it is difficult to resolve them visually. It is, therefore, desirable to provide improved techniques for calculation and representation of data.
It is, therefore, desirable to provide an improved system for analyzing and representing data obtained from microarrays whereby the above-described difficulties are alleviated.