Differential Gene Expression.
The pathology of many diseases involves differences in gene expression; indeed, normal tissue and diseased tissue can often be distinguished by the types of active genes and their expression levels. For example, cancer cells evolve from normal cells to highly invasive, metastatic malignancies, which frequently are induced by activation of oncogenes, or inactivation of tumor suppressor genes. See, The National Cancer Institute, "The Nation's Investment In Cancer Research: A Budget Proposal For Fiscal Years 1997/98", Prepared by the Director, National Cancer Institute, pp. 55-77. Altered expression patterns of oncogenes and tumor suppressor genes in turn effect dramatic changes in the expression profiles of numerous other genes. Differentially expressed sequences can serve as markers of the transformed state and are, therefore, of potential value in the diagnosis and classification of tumors. Differences in gene expression that are not the cause but rather the effect of transformation may be used as markers for the tumor stage. Thus, the assessment of the expression profiles of known tumor-associated genes has the potential to provide meaningful information with respect to tumor type and stage, treatment methods, and prognosis. Furthermore, new tumor-associated genes may be identified by systematically comparing the expression of genes in tumor specimens with their expression in control tissue. Genes whose levels are increased in tumors relative to normal cells are candidates for genes encoding growth-promoting products, e.g., oncogenes. In contrast, genes whose expression is reduced in tumors are candidates for genes encoding growth-inhibiting products, e.g., tumor suppressor genes or genes encoding apoptosis-inducing products. Generally, the underlying premise is that the profiles of gene expression may point to the physiological function or malfunction of the gene product in the organism.
Pathological gene expression differences are not confined to cancer. Autoimmune disorders, restenosis, atherosclerosis, neurodegenerative diseases, and numerous others can be expected to involve aberrant expression of particular genes. Significant resources have been expended in recent years to identify and isolate genes relevant to these diseases. Accordingly, an efficient method allowing the comparative assessment of the relative amounts of nucleic acids in complex mixtures, and the retrieval of specific nucleic acids from those complex mixtures, would be an extremely valuable tool for genetic and medical research.
In the past, the comparison of the expression levels of specific transcripts among different cell or tissue types, tissues or cells derived from different disease or developmental stages, or cells exposed to different stimuli has provided meaningful information with respect to a gene's function or its role in the development of a disease. Approaches based on the determination of differences in the expression profiles of genes have facilitated the identification of novel genes encoding products having a function of interest. For example, such approaches have permitted the identification of several genes, including T cell receptor genes (Yanagi et al., 1984, Nature 308:145-149), and a number of tumor suppressor genes, including p21 (el-Deiry et al., 1993, Cell 75:817-825; Noda et al., 1994, Exp. Cell. Res. 211:90-98). Further, comparative assessment of relative amounts of nucleic acids has the potential to provide a valuable parameter for the organization of sequence information obtained through large scale sequencing approaches.
Genetics.
Methods that permit the rapid enrichment and subsequent identification of sequences that cause specific changes in cell behavior are highly desirable. With these methods, specific functions may be assigned to genes or gene fragments based on their activity in cells. Traditional genetics involves isolation of mutants that have particular phenotypes. In combination with modern molecular methods, it is possible to isolate the mutant genes responsible for a specific phenotype. See, e.g., Kamb et al., 1987, Cell 50:405-410. In general, however, the process of positional gene cloning, i.e., cloning a gene based on its genetic location, is laborious. It is also possible to clone genes by expression. For example, several oncogenes have been identified based on their ability to cause cell proliferation when introduced into cells. Der et al., 1982, Proc. Natl. Acad. Sci. U.S.A. 79:3637-3640; Parada et al., 1982, Nature 297:474-478. It is especially valuable to use methods that can identify not only sequences that enhance cell proliferation, but also sequences that inhibit cell growth. Even more valuable are methods that can identify sequences whose effects are specific to certain cell types (e.g., a sequence that inhibits growth of tumor cells but not normal cells). The method described herein is capable of achieving such results.
Differences In Genomic DNA.
Differences in genomic DNA are the underlying basis for differences between species and for much of the individual variation within a species. Furthermore, many pathological disorders, i.e., genetic disorders, are driven by chromosomal mutations. Rowley, 1990, Cancer Res. 50:3816-3825. Identification of differences in the genome and understanding of their effect on the phenotype of the organism provides valuable insight into the development of inherited diseases.
Many methods have been used to characterize variation between different DNA samples. These involve crude methods of analysis such as overall DNA base composition, melting curves, solution hybridization at different stringencies, and measurements of percentages of modified bases and genome size. Progressively more refined methods have been applied over the years including restriction mapping and DNA sequence analysis. Botstein et al., 1980, Am. J. Hum. Genet. 32:314-331; Lipshutz et al., 1995, Biotechniques 19:442-447. Ultimately, the DNA sequence gives the most detailed and reliable information. However, sequencing, as a systematic approach for genomic analysis, is slow and expensive. Indeed, genomic sequencing has been limited to a few particularly interesting genes or genetic intervals.
Thus, there is an unmet need for an efficient method that allows direct screening of genomic DNA to detect differences in DNA sequence, ploidy (copy number), and/or promoter activity in a high-throughput manner.
Current Means For The Quantitative Determination Of Relative Amounts Of Specific Nucleic Acids.
The technical hurdles associated with the quantitative determination of relative amounts of nucleic acids, e.g., the determination of mRNA profiles or the determination of sequence ploidy, are daunting. Often, only a few copies of a particular nucleic acid may be present within complex mixtures. For example, many transcripts are present only at a very low abundance. Thus, a highly sensitive method is required to detect as little as one mRNA molecule per cell. In the case of genomic DNA, it might be desired to detect deletions or amplifications against a background of 3 × 10^9 base pairs in the human genome. Furthermore, the availability of sample mRNA/cDNA/genomic DNA may be rather limited. Thus, the absolute number of nucleic acid molecules in a sample may be very small. Moreover, the expression levels of genes vary greatly, ranging from a single mRNA molecule per cell up to about 5,000 mRNA molecules per cell. Given 10,000 different mRNA types per cell on average, and a total of 500,000 mRNA molecules per cell, the required detection range is tremendous. Additionally, the level of each specific nucleic acid molecule (mRNA, cDNA, genomic DNA fragment) must be determined separately with a corresponding specific probe, which may be labor- and resource-intensive.
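The arithmetic implied by these figures can be made explicit with a short Python sketch. The constants are the per-cell values quoted above; the function names are illustrative assumptions and do not belong to any described method.

```python
import math

# Per-cell figures quoted in the text above.
MRNA_TYPES_PER_CELL = 10_000
TOTAL_MRNA_PER_CELL = 500_000
MIN_COPIES = 1
MAX_COPIES = 5_000

def mean_copies_per_type(total=TOTAL_MRNA_PER_CELL, types=MRNA_TYPES_PER_CELL):
    """Average number of molecules per mRNA type."""
    return total / types

def fractional_abundance(copies, total=TOTAL_MRNA_PER_CELL):
    """Share of the total mRNA pool contributed by one transcript type."""
    return copies / total

def dynamic_range_log10(low=MIN_COPIES, high=MAX_COPIES):
    """Orders of magnitude between the rarest and most abundant transcript."""
    return math.log10(high / low)
```

Under these figures the average transcript type is present at 50 copies per cell, a single-copy transcript makes up 1 part in 500,000 of the mRNA pool, and the rarest and most abundant transcripts differ by a factor of 5,000, i.e., between three and four orders of magnitude.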
To date, a number of general methods have been developed to quantify nucleic acid molecules. Many of the available methods are suited to assess presence or absence, or relative amounts of specific nucleic acids, in particular mRNA, expressed in different cell or tissue types. However, each of these methods has problems, especially when it is an objective to analyze large numbers of targets and the available amounts of sample nucleic acids are a limiting factor.
A traditional method for the assessment of mRNA expression profiles is Northern blot analysis. Crude RNA or mRNA derived from different sources is separated by gel electrophoresis, and transferred to a nitrocellulose or nylon filter. Immobilized on the filter, the mRNA is hybridized with a probe corresponding to sequences of the gene of interest. See, Sambrook et al., 1990, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, New York. Northern blot analysis is a highly sensitive approach for determining the expression profile of small numbers of sequences of interest. However, this type of assay is not suited for analysis of large numbers of probes.
A second approach for the determination of mRNA expression profiles based on identification of differentially expressed sequences employs DNA probe hybridization to filters. Palazzolo et al., 1989, Neuron 3:527-539; Tavtigian et al., 1994, Mol Biol Cell 5:375-388. In this method, phage or plasmid DNA libraries, typically cDNA libraries, are plated at high density on duplicate filters. The two filter sets are screened independently with cDNA prepared from two sources. The signal intensities of the various individual clones are compared between the two duplicate filter sets to determine which clones hybridize preferentially to cDNA from one source compared to the other. These clones are isolated and tested to verify that they represent sequences that are preferentially present in one of the two original samples. The major drawback of this approach is its lack of sensitivity. It is typically impossible to identify differentially expressed sequences that are present at a frequency of less than one in 1,000 to 10,000 sequences. In addition, detection requires a relatively large disparity in expression of a particular sequence.
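The comparison of duplicate filters described above can be pictured as a per-clone fold-change test. The following Python sketch is only an illustration: the function name, the fold-change threshold, and the background floor are assumptions made for the example and are not taken from the cited work.

```python
def flag_differential_clones(filter_a, filter_b, min_fold=5.0, background=1.0):
    """Flag clones whose signal differs by at least `min_fold` between filters.

    filter_a, filter_b: dicts mapping clone ID -> hybridization signal.
    min_fold and background are hypothetical tuning values.
    Returns a dict mapping clone ID -> (favored source, fold difference).
    """
    flagged = {}
    # Only clones scored on both duplicate filters can be compared.
    for clone_id in filter_a.keys() & filter_b.keys():
        # Clamp to a background floor so weak signals do not inflate ratios.
        a = max(filter_a[clone_id], background)
        b = max(filter_b[clone_id], background)
        if a / b >= min_fold:
            flagged[clone_id] = ("A", a / b)
        elif b / a >= min_fold:
            flagged[clone_id] = ("B", b / a)
    return flagged
```

A high threshold mirrors the limitation noted above: clones whose signals differ only modestly between the two filters fall below the cut-off and go undetected.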
A third approach involves the screening of cDNA libraries derived from subtracted mRNA populations. Hedrick et al., 1984, Nature 308:149-153. The method is closely related to the method of differential hybridization described above, but the cDNA library is prepared so as to favor clones from one mRNA sample over another. This is typically accomplished by a subtractive step prior to cloning in which the first strand of the cDNA from the first sample is hybridized to an excess of mRNA from the second sample, whereby the DNA/RNA heteroduplexes are removed. The remaining single stranded cDNA is converted into double-stranded cDNA and cloned into a phage or plasmid vector. The subtracted library so generated is depleted for sequences that are shared between the two sources of mRNA, and enriched for those that are uniquely present in the first sample. Clones from the subtracted library can be characterized directly. Alternatively, they can be screened by a subtracted cDNA probe, or on duplicate filters using two different probes as above. The advantage of this method is that the number of clones which need to be screened and analyzed is small. However, differential hybridization is technically very difficult. Furthermore, it lacks sensitivity, and is only suited for identification of differentially expressed sequences that are present in relative amounts higher than about one in 1 × 10^4.
A fourth approach involves Expressed Sequence Tag (EST) sequencing. Lennon et al., 1996, Genomics 33:151-152. This method involves the direct analysis of individual clones from cDNA libraries by DNA sequencing. Libraries are generated from two sources that are the objects of comparison, and individual inserts of the libraries are sequenced. The frequency of particular sequences reflecting the relative abundance of specific sequences is recorded for each library. The most significant drawback of EST sequencing is its extreme time and resource inefficiency. In order to provide a reasonable sampling of each library, many thousands of individual insert sequences must be analyzed.
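The sampling burden can be estimated with a simple probability model. The Python sketch below assumes that clones are drawn independently and that the library faithfully reflects transcript abundance; both are simplifying assumptions, and the function names are illustrative.

```python
import math

def detection_probability(fraction, clones_sequenced):
    """Chance of seeing a transcript at least once among the sequenced clones."""
    return 1.0 - (1.0 - fraction) ** clones_sequenced

def clones_needed(fraction, confidence=0.95):
    """Smallest number of clones giving at least `confidence` chance of one hit."""
    if not (0.0 < fraction < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("fraction and confidence must lie strictly between 0 and 1")
    # Solve 1 - (1 - f)^n >= p for n; log1p keeps precision when f is tiny.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-fraction))
```

Under this model, a transcript making up 1% of the library is likely to be seen after roughly 300 clones, whereas a transcript present at 1 in 500,000 requires on the order of 1.5 million clones for the same 95% chance of a single observation.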
A fifth approach is Serial Analysis of Gene Expression (SAGE). Velculescu et al., 1995, Science 270:484-487. SAGE is closely related to the above method of EST sequencing. However, the libraries are constructed in such a way that small portions of many individual cDNAs are ligated together in tandem in a single vector. This has, compared to the EST approach, the advantage that multiple cDNAs are analyzed with each sequencing run which greatly reduces the amount of sequencing that must be carried out to achieve a similar level of completeness. Since a stretch of roughly a dozen nucleotides is sufficient in general to determine the identity of a particular transcript, this method is much faster. Each sequencing run can sample up to about fifty transcripts, rather than a single transcript as in the EST sequencing method. Nevertheless, the process is largely serial and necessitates sampling of all cDNAs that are present in equal amounts between the two samples, as well as those that are differentially expressed. This produces significant redundancy.
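Two of the numerical claims above can be checked with a few lines of Python: that a tag of about a dozen nucleotides offers enough distinct sequences to label transcripts, and that sampling about fifty tags per run reduces the number of sequencing runs accordingly. The function names and the example sample size of 10,000 transcripts are assumptions made for illustration.

```python
import math

def distinct_tags(tag_length, alphabet_size=4):
    """Number of different sequences possible for a tag of the given length."""
    return alphabet_size ** tag_length

def runs_needed(transcripts_to_sample, tags_per_run):
    """Sequencing runs required to sample the given number of transcripts."""
    return math.ceil(transcripts_to_sample / tags_per_run)

# A 12-nucleotide tag can take 4**12 different values.
possible_tags = distinct_tags(12)

# Sampling 10,000 transcripts: one per run (EST) versus about fifty per run (SAGE).
est_runs = runs_needed(10_000, 1)
sage_runs = runs_needed(10_000, 50)
```

With these inputs, a 12-nucleotide tag allows 16,777,216 distinct sequences, and sampling 10,000 transcripts takes 10,000 runs at one transcript per run but 200 runs at fifty tags per run. The count of possible tags is an upper bound only; real transcripts do not use tag sequences uniformly, so some tags may be ambiguous.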
A sixth approach involves the differential display of mRNA. Liang et al., 1995, Methods Enzymol 254:304-321. PCR primers of arbitrary sequence, or designed to optimize the desired pseudo-random amplification, are used to amplify sequences from two mRNA samples by reverse transcription, followed by PCR. The products of these amplification reactions are run side by side, i.e., pairs of lanes contain the same primers but different mRNA samples, on DNA sequencing gels. Differences in the extent of amplification can be detected by eye. Bands that appear to be differentially amplified between the two samples can be excised from the gel and reamplified for characterization. If the collection of primers is suitably large, it is generally possible to identify at least one fragment that is differentially amplified in one sample compared with the second. The disadvantage of the method is its explicit reliance on random events, and the vagaries of PCR, which strongly bias the subset of sequences that can be detected by the method.
Yet another approach is Representational Difference Analysis (RDA) of nucleic acid populations from different samples. Lisitsyn et al., 1995, Methods Enzymol 254:291-304. RDA uses PCR to amplify fragments that are not shared between two samples. A hybridization step is followed by restriction digests to remove fragments that are shared from participation as templates in amplification. An amplification step allows retrieval of fragments that are present in higher amounts in one sample compared to the other. Again, the method is subject to the limitations of PCR and DNA hybridization which tend to bias the results strongly toward certain fragments and away from others. Furthermore, the final products of RDA are not representative of the differences that exist between the two input samples. RDA can be used with cDNA or with genomic DNA fragments to identify differences.
An eighth approach for the identification of differentially expressed sequences involves hybridization of labeled mRNA or cDNA in solution to DNA fragments or oligonucleotides attached to a solid support in high density arrays. Schena et al., 1995, Science 270:467-470. Since the arrays contain known sequences placed in defined locations, the hybridization signal intensities permit an assignment of the relative amount of target nucleic acid capable of hybridizing to a particular probe sequence. The method is parallel, rapid, and sensitive. Disadvantages are that the sequences in the array must be known beforehand, and that the hybridizing sequences cannot easily be recovered from the surface of the array.
While some of the above methods permit the determination of expression profiles of genes and the identification of sequences that have particular expression patterns, most are not sufficiently efficient and sensitive for comparative assessment of nucleic acids on a large scale. All existing methods have defects in either sensitivity, speed, comprehensiveness, or the ability to recover specific sequences, e.g., from a genetic library.
Therefore, the methods of the present invention, which allow the simultaneous and efficient assessment of relative amounts of multiple mRNA species in two or more samples and the recovery of sequences that have particular effects on cell phenotypes, provide a long-desired improvement over currently available methods. The methods of the invention also provide other advantages, such as increasing the throughput of probes, boosting the generation of valuable data, and significantly lowering the time and cost of analysis. Solid supports, specifically beads and microspheres, have been used to bind nucleic acid in solution, but not for the applications described for the invention herein (e.g., Bush et al., 1992, Anal. Biochem. 202:146-151; Meszaros and Morton, 1996, BioTechniques 20:413-419).