Nucleic acids (DNA and RNA) carry within their sequence the hereditary information and are therefore the prime molecules of life. Nucleic acids are found in all living organisms including bacteria, fungi, viruses, plants and animals. It is of interest to determine the relative abundance of nucleic acids in different cells, tissues and organisms over time under various conditions, treatments and regimes.
All dividing cells in the human body contain the same set of 23 pairs of chromosomes. It is estimated that the 22 autosomal and the sex chromosomes encode approximately 100,000 genes. The differences among different types of cells are believed to reflect the differential expression of the 100,000 or so genes. Fundamental questions of biology could be answered by understanding which genes are transcribed and what the relative abundance of transcripts is in different cells.
Previously, the art has only provided for the analysis of a few known genes at a time by standard molecular biology techniques such as PCR, northern blot analysis, or other types of DNA probe analysis such as in situ hybridization. Each of these methods allows one to analyze the transcription of only known genes and/or small numbers of genes at a time. Nucl. Acids Res. 19, 7097-7104 (1991); Nucl. Acids Res. 18, 4833-4842 (1990); Nucl. Acids Res. 18, 2789-2792 (1989); European J. Neuroscience 2, 1063-1073 (1990); Analytical Biochem. 187, 364-373 (1990); Genet. Annal Techn. Appl. 7, 64-70 (1990); GATA 8(4), 129-133 (1991); Proc. Natl. Acad. Sci. USA 85, 1696-1700 (1988); Nucl. Acids Res. 19, 1954 (1991); Proc. Natl. Acad. Sci. USA 88, 1943-1947 (1991); Nucl. Acids Res. 19, 6123-6127 (1991); Proc. Natl. Acad. Sci. USA 85, 5738-5742 (1988); Nucl. Acids Res. 16, 10937 (1988).
Studies of the number and types of genes whose transcription is induced or otherwise regulated during cell processes such as activation, differentiation, aging, viral transformation, morphogenesis, and mitosis have been pursued for many years, using a variety of methodologies. One of the earliest methods was to isolate and analyze levels of the proteins in a cell, tissue, organ system, or even organism both before and after the process of interest. One method of analyzing multiple proteins in a sample is 2-dimensional gel electrophoresis, wherein proteins can, in principle, be identified and quantified as individual bands, and ultimately reduced to a discrete signal. To positively identify each band, the band must be excised from the membrane and subjected to protein sequence analysis using Edman degradation. Unfortunately, most of the bands are present in quantities too small to yield a reliable sequence, and many of the bands contain more than one discrete protein. An additional difficulty is that many of the proteins are blocked at the amino-terminus, further complicating the sequencing process.
Analyzing differentiation at the gene transcription level has overcome many of these disadvantages and drawbacks, since the power of recombinant DNA technology allows amplification of signals from very small amounts of material. The most common method, called "hybridization subtraction", involves isolating mRNA from the biological sample before (B) and after (A) the developmental process of interest, transcribing one set of mRNA into cDNA, subtracting sample B from sample A (mRNA from cDNA) by hybridization, and constructing a cDNA library from the non-hybridizing mRNA fraction. Many different groups have used this strategy successfully, and a variety of procedures have been published and improved upon using this same basic scheme (Nucl. Acids Res. 19, 7097-7104 (1991); Nucl. Acids Res. 18, 4833-4842 (1990); Nucl. Acids Res. 18, 2789-2792 (1989); European J. Neuroscience 2, 1063-1073 (1990); Analytical Biochem. 187, 364-373 (1990); Genet. Annal Techn. Appl. 7, 64-70 (1990); GATA 8(4), 129-133 (1991); Proc. Natl. Acad. Sci. USA 85, 1696-1700 (1988); Nucl. Acids Res. 19, 1954 (1991); Proc. Natl. Acad. Sci. USA 88, 1943-1947 (1991); Nucl. Acids Res. 19, 6123-6127 (1991); Proc. Natl. Acad. Sci. USA 85, 5738-5742 (1988); Nucl. Acids Res. 16, 10937 (1988)).
Each of these techniques has particular strengths, but several limitations and undesirable aspects remain. First, the time and effort required to construct such libraries are considerable. Typically, a trained molecular biologist might expect construction and characterization of such a library to require 3 to 6 months, depending on the level of skill, experience, and luck. Second, the resulting subtraction libraries are typically inferior to the libraries constructed by standard methodology. A typical conventional cDNA library should have a clone complexity of at least 10^6 clones, and an average insert size of 1-3 kb. In contrast, subtracted libraries can have complexities of 10^2 or 10^3 and average insert sizes of 0.2 kb. Therefore, there can be a significant loss of clone and sequence information associated with such libraries. Third, this approach allows the researcher to capture only the genes induced in sample A relative to sample B, not vice versa, nor does it easily allow comparison to a third sample of interest (C). Fourth, this approach requires very large amounts (hundreds of micrograms) of "driver" mRNA (sample A), which significantly limits the number and type of subtractions that are possible, since many tissues and cells are very difficult to obtain in large quantities.
Fifth, the resolution of the subtraction is dependent upon the physical properties of DNA:DNA or RNA:DNA hybridization. The ability of a given sequence to find a hybridization match is dependent on its unique CoT value. The CoT value is the product of the number of copies (concentration) of the particular sequence and the time of hybridization. It follows that for sequences which are abundant, hybridization events will occur very rapidly (low CoT value), while rare sequences will form duplexes only at very high CoT values. Unfortunately, the rare genes, those present at abundances of 10^-4 to 10^-7 and likely those of greatest interest to an investigator, are lost. CoT values which allow such rare sequences to form duplexes are difficult to achieve in a convenient time frame. Therefore, hybridization subtraction is simply not a useful technique with which to study relative levels of rare mRNA species. Sixth, this problem is further complicated by the fact that duplex formation is also dependent on the nucleotide base composition of a given sequence. Those sequences rich in G+C form stronger duplexes than those with high contents of A+T. Therefore, the former sequences will tend to be removed selectively by hybridization subtraction. Seventh, it is possible that hybridization between nonexact matches can occur. When this happens, the expression of a homologous gene may "mask" expression of a gene of interest, artificially skewing the results for that particular gene.
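The dependence of duplex formation on CoT described above can be illustrated with the idealized second-order reassociation model, in which the fraction of a sequence remaining single-stranded is 1/(1 + k*C0*t). The following Python sketch is offered only under that assumption. The rate constant k, the arbitrary concentration units, and the function names are hypothetical choices made for illustration and are not taken from the procedures cited above.

```python
def fraction_single_stranded(c0: float, t: float, k: float = 1.0) -> float:
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0 * t)."""
    return 1.0 / (1.0 + k * c0 * t)


def time_to_half_reassociation(c0: float, k: float = 1.0) -> float:
    """Time at which half of the sequence is in duplex form (k * C0 * t = 1)."""
    return 1.0 / (k * c0)


# Assumed total nucleic acid concentration, in arbitrary units.
TOTAL_CONCENTRATION = 1.0

# Each transcript's own concentration scales with its fractional abundance,
# so the time needed to reach a given CoT scales inversely with abundance.
for abundance in (1e-2, 1e-4, 1e-7):
    c0 = TOTAL_CONCENTRATION * abundance
    t_half = time_to_half_reassociation(c0)
    print(f"abundance {abundance:g}: half-reassociation time {t_half:g}")
```

Under this simplified model, a sequence present at an abundance of 10^-7 requires 10^5 times as long to reach half-reassociation as a sequence present at 10^-2, which is consistent with the difficulty noted above of reaching suitable CoT values for rare sequences in a convenient time frame.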
Matsubara and Okubo proposed using partial cDNA sequences to establish expression profiles of genes which could be used in functional analyses of the human genome. Matsubara and Okubo warned against using random priming, as it creates multiple unique DNA fragments from individual mRNAs and may thus skew the analysis of the number of particular mRNAs per library. They sequenced randomly selected members from a 3'-directed cDNA library and established the frequency of appearance of the various ESTs. They proposed comparing lists of ESTs from various cell types to classify genes. Genes expressed in many different cell types were labeled housekeepers and those selectively expressed in certain cells were labeled cell-specific genes, even in the absence of the full sequence of the gene or the biological activity of the gene product.
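The frequency tabulation and classification described above can be sketched as follows. This is a minimal Python illustration, not the procedure of Matsubara and Okubo. The library names, the EST identifiers, and the threshold `housekeeping_min_libraries` are hypothetical assumptions introduced solely for the example.

```python
from collections import Counter


def est_frequencies(sequenced_clones):
    """Return {EST id: fraction of sequenced clones} for one library."""
    counts = Counter(sequenced_clones)
    total = sum(counts.values())
    return {est: n / total for est, n in counts.items()}


def classify_ests(libraries, housekeeping_min_libraries):
    """Label each EST by the number of libraries in which it was observed."""
    seen_in = Counter()
    for clones in libraries.values():
        seen_in.update(set(clones))  # count each library at most once per EST
    labels = {}
    for est, n_libs in seen_in.items():
        if n_libs >= housekeeping_min_libraries:
            labels[est] = "housekeeping"
        elif n_libs == 1:
            labels[est] = "cell-specific"
        else:
            labels[est] = "intermediate"
    return labels


# Hypothetical sequencing results: one EST id per randomly selected clone.
libraries = {
    "liver": ["EST_A", "EST_A", "EST_B", "EST_C"],
    "brain": ["EST_A", "EST_D", "EST_D", "EST_C"],
    "muscle": ["EST_A", "EST_E", "EST_C", "EST_C"],
}

profiles = {name: est_frequencies(clones) for name, clones in libraries.items()}
labels = classify_ests(libraries, housekeeping_min_libraries=3)

for est in sorted(labels):
    print(est, labels[est])
```

The sketch classifies on identifiers alone, which mirrors the point made above that a gene can be labeled without its full sequence or any knowledge of the biological activity of its product.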
The present invention avoids the drawbacks of the prior art by providing a method to quantify the relative abundance of multiple gene transcripts in a given biological sample by the use of high-throughput sequence-specific analysis of individual RNAs and/or their corresponding cDNAs.
The present invention offers several advantages over current protein discovery methods which attempt to isolate individual proteins based upon biological effects. The method of the instant invention provides for detailed diagnostic comparisons of cell profiles revealing numerous changes in the expression of individual transcripts.
The instant invention provides several advantages over previous subtraction methods, including a more complex library analysis (10^6 to 10^7 clones as compared to 10^3 clones), which allows identification of low abundance messages as well as of messages which either increase or decrease in abundance. These large libraries are routine to make, in contrast to the libraries of previous methods. In addition, homologues can easily be distinguished with the method of the instant invention.
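The relationship between library complexity and the recovery of low abundance messages can be illustrated with a simple sampling calculation. The Python sketch below assumes, purely for illustration, that each clone in a library is an independent random draw from the mRNA population, so that a message of fractional abundance f is represented at least once among N clones with probability 1 - (1 - f)^N. The function name and the chosen abundances are hypothetical.

```python
import math


def probability_represented(abundance: float, library_size: int) -> float:
    """P(at least one clone) = 1 - (1 - abundance) ** library_size.

    Computed with log1p/expm1 so that very small abundances do not lose
    precision.
    """
    return -math.expm1(library_size * math.log1p(-abundance))


for library_size in (10**3, 10**6, 10**7):
    for abundance in (1e-4, 1e-6, 1e-7):
        p = probability_represented(abundance, library_size)
        print(f"N = {library_size:>8}  f = {abundance:g}  P = {p:.4f}")
```

Under this assumption, a message at an abundance of 10^-4 has less than a 10% chance of appearing in a library of 10^3 clones, whereas it is almost certain to appear in a library of 10^6 clones.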
This method is very convenient because it organizes a large quantity of data into a comprehensible, digestible format. The most significant differences are highlighted by electronic subtraction. In-depth analyses are made more convenient.
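One way to picture electronic subtraction is as a comparison of transcript abundance tables from two samples, sorted so that the largest differences appear first. The Python sketch below is an assumed illustration only: it takes the difference in relative abundance as the measure of change, and the transcript identifiers, the clone counts, and the function names are hypothetical.

```python
def to_relative_abundance(clone_counts):
    """Convert {transcript: number of clones} to {transcript: fraction}."""
    total = sum(clone_counts.values())
    return {transcript: n / total for transcript, n in clone_counts.items()}


def electronic_subtraction(counts_a, counts_b):
    """Return (transcript, abundance in A minus abundance in B) pairs.

    Pairs are sorted with the largest absolute difference first, so both
    increases (positive) and decreases (negative) are reported.
    """
    a = to_relative_abundance(counts_a)
    b = to_relative_abundance(counts_b)
    differences = {
        transcript: a.get(transcript, 0.0) - b.get(transcript, 0.0)
        for transcript in set(a) | set(b)
    }
    return sorted(differences.items(), key=lambda item: (-abs(item[1]), item[0]))


# Hypothetical clone counts per transcript for two samples.
sample_b = {"T1": 5, "T2": 4, "T3": 1}  # before the process of interest
sample_a = {"T1": 5, "T2": 1, "T4": 4}  # after the process of interest

result = electronic_subtraction(sample_a, sample_b)
for transcript, difference in result:
    print(f"{transcript}: {difference:+.2f}")
```

Because the comparison is made on tabulated counts rather than by physical hybridization, the same tables can be compared in either direction or against a third sample without constructing another library.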
The present invention provides several advantages over previous methods of electronic analysis of cDNA. The method is particularly powerful when more than 100 and preferably more than 1,000 gene transcripts are analyzed. In such a case, new low-frequency transcripts are discovered and tissue typed.
High resolution analysis of gene expression can be used directly as a diagnostic profile or to identify disease-specific genes for the development of more classic diagnostic approaches.
This process is defined as gene transcript frequency analysis. The resulting quantitative analysis of the gene transcripts is defined as comparative gene transcript analysis.