Many biological functions are carried out by regulating the expression levels of various genes, either through changes in the copy number of the genetic DNA, through changes in levels of transcription (e.g., through control of initiation, provision of RNA precursors, RNA processing, etc.) of particular genes, or through changes in protein synthesis. For example, control of the cell cycle and cell differentiation, as well as diseases, are characterized by variations in the transcription levels of a group of genes.
Recently, massive parallel gene expression monitoring methods have been developed to monitor the expression of a large number of genes using nucleic acid array technology, which is described in detail in, for example, U.S. Pat. No. 5,871,928; de Saizieu et al., 1998, Bacterial Transcript Imaging by Hybridization of Total RNA to Oligonucleotide Arrays, Nature Biotechnology 16:45-48; Wodicka et al., 1997, Genome-wide Expression Monitoring in Saccharomyces cerevisiae, Nature Biotechnology 15:1359-1367; Lockhart et al., 1996, Expression Monitoring by Hybridization to High Density Oligonucleotide Arrays, Nature Biotechnology 14:1675-1680; Lander, 1999, Array of Hope, Nature Genetics 21(suppl.), at 3.
Massive parallel gene expression monitoring experiments generate unprecedented amounts of information. For example, a commercially available GeneChip® array set is capable of monitoring the expression levels of approximately 6,500 murine genes and expressed sequence tags (ESTs) (Affymetrix, Inc., Santa Clara, Calif., USA). Effective analysis of the large amount of data may lead to the development of new drugs and new diagnostic tools. Therefore, there is a great demand in the art for methods for organizing, accessing and analyzing the vast amount of information collected using massive parallel gene expression monitoring methods.
Accordingly, the current invention provides methods and computer software products for analyzing data from gene expression monitoring experiments that employ multiple probes against a single target.
In one aspect of the invention, methods, preferably implemented using a digital computer, for determining the relative level of a biological molecule in a plurality of experiments are provided. In some embodiments, a plurality of signals is determined, where each of the signals reflects the level of the biological molecule in one of the experiments. The relative level of the molecule is then determined by calculating a principal component. In preferred embodiments, the biological molecule is a nucleic acid such as a transcript of a gene. The signals reflect the hybridization of nucleic acid probes, at least 3 probes, preferably at least 5 probes, more preferably at least 10 probes, even more preferably at least 15 probes and in some instances at least 20 probes, with the target nucleic acid. Preferably, the probes are immobilized on a solid substrate. In a particularly preferred embodiment, the signals are derived from hybridization between perfect match probes (PM) designed to be complementary against the target nucleic acid and mismatch probes (MM) designed to contain at least one mismatch against the target nucleic acid. In one embodiment, the signals are the hybridization intensity difference (PM−MM). A matrix T (T=S·S̃) is calculated to determine the principal components. The matrix S contains the measurements of n probes in m experiments. It may be represented as:

$$
S = \begin{bmatrix}
S_{11} & \cdots & S_{1j} & \cdots & S_{1n} \\
\vdots &        & \vdots &        & \vdots \\
S_{m1} & \cdots & S_{mj} & \cdots & S_{mn}
\end{bmatrix}
$$
where Sij is the signal of the jth probe, which reflects the level of the molecule in the ith experiment. Eigenvectors, ei, and their corresponding eigenvalues, λ, of the matrix T are calculated. The relative level of the molecule is indicated with emax, the eigenvector associated with the largest eigenvalue.
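The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation contemplated by the invention: it assumes S is stored as a list of m rows (one per experiment), and it finds emax by power iteration, a standard numerical technique chosen here only to keep the sketch free of external libraries. The function names `gram_matrix` and `dominant_eigenpair`, the starting vector, and the example data are all hypothetical.

```python
import math


def gram_matrix(S):
    """Return T = S * S^T for an m-by-n matrix S stored as a list of m rows."""
    return [[sum(a * b for a, b in zip(row_i, row_k)) for row_k in S]
            for row_i in S]


def dominant_eigenpair(T, max_iterations=1000, tolerance=1e-12):
    """Approximate the largest eigenvalue of T and its unit eigenvector.

    Uses power iteration (an assumption of this sketch). T is symmetric and
    positive semidefinite because it has the form S * S^T. The uniform
    starting vector must not be orthogonal to the dominant eigenvector.
    """
    size = len(T)
    vector = [1.0 / math.sqrt(size)] * size
    eigenvalue = 0.0
    for _ in range(max_iterations):
        product = [sum(T[i][k] * vector[k] for k in range(size))
                   for i in range(size)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # T maps the current vector to zero; nothing to refine
        updated = [x / norm for x in product]
        change = max(abs(a - b) for a, b in zip(updated, vector))
        vector, eigenvalue = updated, norm
        if change < tolerance:
            break
    return eigenvalue, vector


# Hypothetical data: 3 experiments (rows) by 3 probes (columns). Every row is
# a multiple of (1, 2, 3), with underlying levels 1, 2 and 2.
S = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [2.0, 4.0, 6.0]]
largest_eigenvalue, e_max = dominant_eigenpair(gram_matrix(S))
print(largest_eigenvalue, e_max)  # e_max is proportional to the levels (1, 2, 2)
```

Because emax has one component per experiment, its entries can be read as the relative levels of the molecule across the m experiments. With noisy real data, a library eigensolver would normally be preferred over this hand-written iteration.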
In some embodiments, the angles (θj) between the vector emax and each of the signal vectors (Sj) are calculated. The vector Sj may be represented by:

$$
S_j = \begin{bmatrix} S_{1j} \\ \vdots \\ S_{ij} \\ \vdots \\ S_{mj} \end{bmatrix}.
$$
If any θj is substantially different from the others, the probes may have detected a sequence variation from the reference sequence used to design the probes. The sequence variation may be in the target region of the probe (j) associated with the θj that differs from the others.
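The angle test can be illustrated with a short Python sketch. It assumes emax has already been computed and is oriented so that it points in the same general direction as the signal vectors (an eigenvector is only defined up to sign). The text does not say how "substantially different" is decided, so the rule used here (an angle more than 15 degrees above the median) is purely a hypothetical placeholder, as are the function names and the example data.

```python
import math
from statistics import median


def angle_degrees(e_max, signal_vector):
    """Angle in degrees between e_max and one probe's signal vector S_j."""
    dot = sum(a * b for a, b in zip(e_max, signal_vector))
    norm_e = math.sqrt(sum(a * a for a in e_max))
    norm_s = math.sqrt(sum(b * b for b in signal_vector))
    if norm_e == 0.0 or norm_s == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_e * norm_s)))
    return math.degrees(math.acos(cosine))


def probe_angles(e_max, S):
    """Angles theta_j for every probe, i.e. every column of the m-by-n matrix S."""
    return [angle_degrees(e_max, column) for column in zip(*S)]


def flag_outliers(angles, threshold_degrees=15.0):
    """Indices of probes whose angle exceeds the median angle by more than
    threshold_degrees (a hypothetical rule, not taken from the text)."""
    typical = median(angles)
    return [j for j, theta in enumerate(angles) if theta - typical > threshold_degrees]


# Hypothetical data: e_max is assumed precomputed; rows of S are experiments.
e_max = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
S = [[1.0, 2.0, 2.0],
     [2.0, 4.0, 1.0],
     [2.0, 4.0, 2.0]]
angles = probe_angles(e_max, S)
print(angles, flag_outliers(angles))
```

In this example the first two probes track emax closely, while the third probe's signal vector (2, 1, 2) sits at a clearly larger angle and is the one flagged.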
In another aspect of the invention, methods for selecting nucleic acid probes from a pool of candidate nucleic acid probes are provided. In some embodiments, hybridization intensities between each of the candidate probes and the target nucleic acid are measured in a plurality of experiments. For each candidate probe, the inner product of the normalized eigenvector associated with the largest eigenvalue and the normalized experimental hybridization intensity is calculated. The probes with the highest inner product values are selected. The nucleic acid probes and the candidate nucleic acid probes may be oligonucleotide probes immobilized on a substrate.
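A minimal Python sketch of this selection step follows. It assumes the eigenvector associated with the largest eigenvalue is already available, that each candidate probe's intensities across the experiments form one column of S, and that "normalized" means scaled to unit Euclidean length. The names `rank_probes` and `select_probes` and the example data are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def rank_probes(e_max, S):
    """Return (probe_index, score) pairs sorted from highest to lowest score.

    The score is the inner product of the normalized eigenvector and the
    normalized intensity vector of the probe (one column of S).
    """
    unit_e = normalize(e_max)
    scores = []
    for j, column in enumerate(zip(*S)):
        unit_s = normalize(column)
        scores.append((j, sum(a * b for a, b in zip(unit_e, unit_s))))
    return sorted(scores, key=lambda item: item[1], reverse=True)


def select_probes(e_max, S, keep):
    """Indices of the `keep` candidate probes with the highest scores."""
    return [j for j, _ in rank_probes(e_max, S)[:keep]]


# Hypothetical data: 3 experiments (rows) by 4 candidate probes (columns).
e_max = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
S = [[1.0, 2.0, 2.0, 5.0],
     [2.0, 4.0, 1.0, 5.0],
     [2.0, 4.0, 2.0, 5.0]]
ranking = rank_probes(e_max, S)
print(ranking)
print(select_probes(e_max, S, keep=3))
```

Since both vectors are normalized, the score is the cosine of the angle between them, so a probe scores close to 1 when its intensities rise and fall across experiments in proportion to emax.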
In another aspect of the invention, computer software products are provided for analyzing the level of a biological molecule, preferably a transcript of a gene. The computer software product contains computer program code that inputs a plurality of signals. Each signal reflects the level of the biological molecule in one of a plurality of experiments. The computer software product also contains computer program code that determines the relative level of the biological molecule by calculating at least one principal component. The computer program code is stored on a computer-readable medium. The biological molecule is preferably a nucleic acid, such as a transcript of a gene, and the plurality of signals reflect the hybridization of a plurality of nucleic acid probes with the nucleic acid. In some embodiments, the signals are derived from hybridization between perfect match probes (PM) designed to be complementary against a target nucleic acid and mismatch probes (MM) designed to contain at least one mismatch against the target nucleic acid. The signals may be the intensity difference (PM−MM).
In some embodiments, the computer software product calculates a matrix T=S·S̃ where:

$$
S = \begin{bmatrix}
S_{11} & \cdots & S_{1j} & \cdots & S_{1n} \\
\vdots &        & \vdots &        & \vdots \\
S_{m1} & \cdots & S_{mj} & \cdots & S_{mn}
\end{bmatrix}
$$
where Sij is the signal of the jth probe, which reflects the level of the target nucleic acid in the ith experiment. The computer software product also calculates eigenvectors, ei, and their corresponding eigenvalues, λ, of said matrix T; and indicates the relative level with emax, the eigenvector associated with the largest eigenvalue. In some embodiments, the computer software product also contains computer program code that computes the angles (θj) between said emax and each of the signal vectors (Sj), where

$$
S_j = \begin{bmatrix} S_{1j} \\ \vdots \\ S_{ij} \\ \vdots \\ S_{mj} \end{bmatrix};
$$
and computer program code that indicates that a sequence variation has been detected if any θj is substantially different from the others. The sequence variation is indicated as being in the target region of the probe (j) associated with said θj.
In another aspect of the invention, methods for determining a canonical vector (C) or analyzing multiple probe nucleic acid hybridization are provided. A canonical vector is used to calculate a gene expression index (GEI) or other measurement of gene expression from intensity data obtained from multiple probes. The GEI may be calculated as follows:

$$
\mathrm{GEI} = C \cdot \begin{bmatrix} S_1 \\ \vdots \\ S_j \\ \vdots \\ S_n \end{bmatrix}
= \begin{bmatrix} c_1 & \cdots & c_j & \cdots & c_n \end{bmatrix} \cdot \begin{bmatrix} S_1 \\ \vdots \\ S_j \\ \vdots \\ S_n \end{bmatrix}
$$
where Sj is the hybridization intensity for the jth probe and cj is the value of the canonical vector for the jth probe. The GEI may then be used as a relative level of expression, for calculating the absolute amount of the transcript (with appropriate controls), and for making qualitative or semi-qualitative calls (present, absent, etc.).
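As a worked illustration of the formula, the GEI is simply the dot product of the canonical vector with the probe intensities from one sample. The sketch below assumes a canonical vector is already in hand; the function name, the vector values, and the intensities are hypothetical example data.

```python
def gene_expression_index(canonical_vector, intensities):
    """Return GEI = sum over probes j of c_j * S_j."""
    if len(canonical_vector) != len(intensities):
        raise ValueError("one canonical value is required per probe")
    return sum(c * s for c, s in zip(canonical_vector, intensities))


# Hypothetical canonical vector for a gene measured with four probes.
C = [0.5, 0.5, 0.5, 0.5]

# Hypothetical probe intensities for the same gene in two samples.
sample_a = [100.0, 120.0, 80.0, 100.0]
sample_b = [200.0, 240.0, 160.0, 200.0]

gei_a = gene_expression_index(C, sample_a)
gei_b = gene_expression_index(C, sample_b)
print(gei_a, gei_b, gei_b / gei_a)
```

Here every probe intensity doubles from the first sample to the second, so the GEI doubles as well, which is the sense in which it can serve as a relative level of expression.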
In a preferred embodiment, the probes for a large number of genes are synthesized or deposited on a substrate to make a gene expression monitoring chip. The probes (preferably immobilized on a chip) are tested on various samples. The samples may represent various states of the expression of the target gene. The hybridization intensity values obtained constitute the matrix S of equation 1 for each target gene. The matrix is of the size m×n, where m is the number of samples tested and n is the number of probes for a target gene (the number of probes may be different for different target genes). A matrix P may be calculated by multiplying the transposed S with S:
P = S̃·S    (Equation 7)
P has the dimension of n×n.
The eigenvector of matrix P associated with the largest eigenvalue may be used as a canonical vector.
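The canonical vector obtained this way is closely tied to the eigenvector emax of T = S·S̃ discussed earlier. The short derivation below is offered only as a reading aid. It assumes that S̃ denotes the transpose of S, that the largest eigenvalue of T is positive, that emax has unit length, and that C is normalized to unit length with its sign chosen to match.

```latex
% Assumptions: \tilde{S} is the transpose of S, T = S\tilde{S}, P = \tilde{S}S,
% \lambda_{\max} > 0 is the largest eigenvalue of T, and \lVert e_{\max} \rVert = 1.
\begin{align*}
T\, e_{\max} &= \lambda_{\max}\, e_{\max} \\
P\,(\tilde{S} e_{\max}) &= \tilde{S} S \tilde{S}\, e_{\max}
  = \tilde{S}\, T\, e_{\max}
  = \lambda_{\max}\,(\tilde{S} e_{\max}) \\
\lVert \tilde{S} e_{\max} \rVert^{2} &= \tilde{e}_{\max}\, S \tilde{S}\, e_{\max} = \lambda_{\max} \\
C &= \frac{\tilde{S}\, e_{\max}}{\sqrt{\lambda_{\max}}} \\
S\, C &= \frac{S \tilde{S}\, e_{\max}}{\sqrt{\lambda_{\max}}} = \sqrt{\lambda_{\max}}\; e_{\max}
\end{align*}
```

The second line shows that S̃·emax is an eigenvector of P with the same eigenvalue, and because T and P share their nonzero eigenvalues, that eigenvalue is also the largest one of P. The last line shows that applying C to each row of S (that is, computing the GEI for each of the m samples used to build S) reproduces emax up to a constant factor.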