As research in molecular biology and genetics has advanced, it has become increasingly clear that the temporal and spatial expression of genes plays a vital role in processes occurring in both health and disease. Moreover, the field of biology has progressed from an understanding of how single genetic defects cause the traditionally recognized hereditary disorders (e.g., the thalassemias) to a realization of the importance of the interaction of multiple genetic defects, in concert with various environmental factors, in the etiology of the majority of the more complex disorders, such as neoplasia.
For example, in the case of neoplasia, recent experimental evidence has demonstrated the key causative roles of multiple defects in several pivotal genes, which alter the expression of those genes. Other complex diseases have been shown to have a similar etiology. Therefore, the more complete and reliable the correlation that can be established between gene expression and disease states, the better diseases can be recognized, diagnosed and treated. This important correlation may be established by the quantitative determination and classification of DNA expression in tissue samples.
Genomic DNA ("gDNA") sequences are those naturally occurring DNA sequences constituting the genome of a cell. The overall state of gene expression within gDNA at any given time is represented by the composition of cellular messenger RNA ("mRNA"), which is synthesized by the regulated transcription of gDNA. Complementary DNA ("cDNA") sequences may be synthesized by reverse transcription of mRNA using viral reverse transcriptase. cDNA derived from cellular mRNA also represents, albeit approximately, gDNA expression within a cell at a given time. Accordingly, a methodology which would allow the rapid, economical and highly quantitative detection of all the DNA sequences within particular cDNA or gDNA samples is extremely desirable.
Heretofore, gene-specific DNA analysis methodologies have not been directed to the determination or classification of substantially all genes within a DNA sample representing the total transcribed cellular mRNA population, and they have universally required some degree of nucleic acid sequencing to be performed. As a result, existing cDNA and gDNA analysis techniques have been directed to the determination and analysis of only one or two known or unknown genetic sequences at a single time. These techniques have typically utilized probes which are synthesized to specifically recognize (by the process of hybridization) only one particular DNA sequence or gene. See, e.g., Watson, J. 1992. Recombinant DNA, chap. 7 (W. H. Freeman, New York). Furthermore, the adaptation of these methods to the recognition of all sequences within a sample would be, at best, highly cumbersome and uneconomical.
One existing method for detecting, isolating and sequencing unknown genes utilizes an arrayed cDNA library. From a particular tissue or specimen, mRNA is isolated, reverse transcribed into cDNA, and cloned into an appropriate vector, which is introduced into bacteria (e.g., E. coli) through the process of transformation. The transformed bacteria are then plated in a manner such that the progeny of individual vectors bearing the clone of a single cDNA sequence can be separately identified. A filter "replica" of such a plate is then probed (often with a labeled DNA oligomer selected to hybridize with the cDNA representing the gene of interest), and those bacterial colonies bearing the cDNA of interest are identified and isolated. The cDNA is then extracted and the inserts contained therein are subjected to sequencing via protocols which include, but are not limited to, the dideoxynucleotide chain termination method. See Sanger, F., et al. 1977. DNA Sequencing with Chain Terminating Inhibitors. Proc. Natl. Acad. Sci. USA 74(12):5463-5467.
The oligonucleotide probes utilized in colony selection protocols for unknown gene(s) are synthesized to hybridize, preferably, only with the cDNA for the gene of interest. One method of achieving this specificity is to start with the protein product of the gene of interest. If a partial sequence (i.e., from a peptide fragment containing 5 to 10 amino acid residues) from an active region of the protein of interest can be determined, a corresponding 15 to 30 nucleotide (nt.) degenerate oligonucleotide can be synthesized which would code for this peptide fragment. Thus, a collection of degenerate oligonucleotides will typically be sufficient to uniquely identify the corresponding gene. Similarly, any information leading to 15-30 nt. subsequences can be used to create a single gene probe.
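The relationship between peptide length, probe length and probe degeneracy described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and one-letter amino acid codes; the function names and the example peptides are hypothetical and are not drawn from any cited method.

```python
from math import prod

# Number of codons encoding each amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def probe_length(peptide: str) -> int:
    """Length in nucleotides of an oligonucleotide coding for the peptide."""
    return 3 * len(peptide)


def degeneracy(peptide: str) -> int:
    """Number of distinct oligonucleotides that could code for the peptide."""
    return prod(CODON_COUNTS[aa] for aa in peptide)


# Hypothetical 5-residue fragment: a 15 nt probe with 8 possible sequences.
print(probe_length("MWKDF"), degeneracy("MWKDF"))
```

Fragments rich in methionine or tryptophan keep the degenerate collection small, whereas residues such as leucine, arginine and serine (six codons each) enlarge it rapidly.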
Another existing method, which searches for a known gene in cDNA or gDNA prepared from a tissue sample, also uses single-gene or single-sequence oligonucleotide probes which are complementary to unique subsequences of the already known gene sequences. For example, the expression of a particular oncogene in a sample can be determined by probing tissue-derived cDNA with a probe which is derived from a subsequence of the oncogene's expressed sequence tag. The presence of a rare or difficult-to-culture pathogen (e.g., the TB bacillus) can also be determined by probing gDNA with a hybridization probe specific to a gene possessed by the pathogen. Similarly, the heterozygous presence of a mutant allele in a phenotypically normal individual, or its homozygous presence in a fetus, may be determined by the utilization of an allele-specific probe which is complementary only to the mutant allele. See, e.g., Guo, N. C., et al. 1994. Nucleic Acids Research 22:5456-5465.
Currently, all of the existing methodologies which utilize single-gene probes, if applied to determine all of the genes expressed within a given tissue sample, would require many thousands to tens of thousands of individual probes. It has been estimated that a single human cell typically expresses approximately 5,000 to 15,000 genes simultaneously, and that the most complex types of tissues (e.g., brain tissue) can express up to one-half of the total genes contained within the human genome. See Liang, et al. 1992. Differential Display of Eukaryotic Messenger RNA by Means of the Polymerase Chain Reaction. Science 257:967-971. A screening methodology which requires such a large number of probes is clearly far too cumbersome to be economical or even practical.
In contrast, another class of existing methods, known as sequencing-by-hybridization ("SBH"), utilizes combinatorial probes which are not gene specific. See, e.g., Drmanac, et al. 1993. Science 260:1649-1652; U.S. Pat. No. 5,202,231 to Drmanac, et al. An exemplary implementation of SBH for the determination of an unknown gene requires that a single cDNA clone be probed with all DNA oligomers of a given length, for example, all 6 nt. oligomers. A set of oligomers of a given length which are synthesized without any type of selection is called a combinatorial probe library. A partial DNA sequence for the cDNA clone can be reconstructed by algorithmic manipulations from the hybridization results for a given combinatorial library (e.g., the hybridization results for the 4096 oligomer probes having a length of 6 nt.). However, complete nucleotide sequences are not determinable, because repeated subsequences cannot be fully ascertained in a quantitative manner.
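The size of a combinatorial probe library, and the reason repeated subsequences defeat complete reconstruction, can be illustrated with a short Python sketch. It is an idealization, not the cited SBH procedure: hybridization is assumed to be perfect-match only, each probe is labeled by the target subsequence it detects (the complementary strand is ignored), and a "hit" records presence rather than copy number. The function names and example sequences are hypothetical.

```python
from itertools import product


def combinatorial_library(k: int) -> list[str]:
    """All 4**k oligomers of length k, synthesized without selection."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def hybridization_spectrum(target: str, k: int) -> set[str]:
    """Probes of length k that score a hit on the target (idealized).

    Only presence or absence is recorded, so a subsequence occurring
    several times is indistinguishable from one occurring once.
    """
    return {target[i:i + k] for i in range(len(target) - k + 1)}


# The 6 nt. library has 4**6 members.
print(len(combinatorial_library(6)))

# Two hypothetical targets that differ only in the number of tandem repeats
# produce identical hit patterns (3 nt. probes are used to keep this small).
short_repeat = "ACGACGT"
long_repeat = "ACGACGACGT"
print(hybridization_spectrum(short_repeat, 3) == hybridization_spectrum(long_repeat, 3))
```

Because both targets yield the same set of hits, no algorithm operating on the hit pattern alone can tell them apart, which is the limitation noted above.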
SBH adapted to the identification of known genes is called oligomer sequence signatures ("OSS"). See, e.g., Lennon, et al. 1991. Trends In Genetics 7(10):314-317. OSS classifies a single clone based upon the pattern of probe "hits" (i.e., hybridizations) against an entire combinatorial library, or a significant sub-library. This methodology requires that the tissue sample library be arrayed into clones, wherein each clone comprises only a single sequence from the library. This technique cannot be applied to mixtures of sequences.
These previous, exemplary methodologies are all directed to finding one sequence in an array of clones, with each clone expressing a single sequence from a given tissue sample. Accordingly, they are not directed to rapid, economical, quantitative, and precise characterization of all the DNA sequences in a mixture of sequences, such as a particular total cellular cDNA or gDNA sample, and their adaptation to such a task would be prohibitively costly. Determination by sequencing the DNA of a single clone, much less an entire sample of thousands of genomic sequences, is not rapid or inexpensive enough for economical and useful diagnostics. As previously discussed, existing probe-based techniques of gene determination or classification, whether the genes are known or unknown, require many thousands of probes, each specific to one possible gene to be observed, or at least thousands or even tens of thousands of probes in a combinatorial library. Further, all of these aforementioned methods require that the sample be arrayed into clones, each expressing a single gene of the sample.
In contrast to the prior exemplary gene determination and classification techniques, another methodology, known as differential display, attempts to "fingerprint" a mixture of expressed genes, as is found in a pooled cDNA library. This "fingerprint," however, seeks merely to establish whether two samples are the same or different. No attempt is made to determine the quantitative, or even qualitative, expression of particular genes. See, e.g., Liang, et al. 1995. Curr. Opin. Immunol. 7:274-280; Liang, et al. 1992. Science 257:967-971; Welsh, et al. 1992. Nucl. Acids Res. 20:4965-4970; McClelland, et al. 1993. Exs. 67:103-115 and Lisitsyn, 1993. Science 259:946-950. Differential display uses the polymerase chain reaction ("PCR") to amplify DNA subsequences of various lengths, each defined as the region lying between the annealing sites of arbitrarily selected primers. Polymerase chain reaction methods and apparatus are well known. See, e.g., U.S. Pat. Nos. 4,683,202; 4,683,195; 4,965,188; 5,333,675; each herein fully incorporated by reference. Ideally, the pattern of the lengths observed is characteristic of the specific tissue from which the library was originally prepared. Typically, one of the primers utilized in differential display is oligo(dT) and the other is one or more arbitrary oligonucleotides which are designed to hybridize within a few hundred base pairs (bp.) of the homopolymeric poly-dA tail of a cDNA within the library. Thereby, upon electrophoretic separation, the amplified fragments of lengths up to a few hundred base pairs should generate bands which are characteristic and distinctive of the sample. In addition, changes in gene expression within the tissue may be observed as changes in one or more of the cDNA bands.
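The idealized relationship between primer sites and band lengths in differential display can be sketched in Python as follows. The sketch assumes exact-match priming, a sense-strand cDNA ending in a poly-dA tail, and an oligo(dT) primer that anneals exactly at the start of that tail; real PCR departs from each of these assumptions. The function name, the default parameter values and the example sequence are hypothetical.

```python
def display_fragment_lengths(cdna: str, primer: str,
                             dt_len: int = 12, max_len: int = 500) -> list[int]:
    """Idealized band lengths for one cDNA and one arbitrary primer.

    Each band spans from an exact-match primer site to the end of an
    oligo(dT) primer of length dt_len annealed at the start of the tail.
    """
    body_end = len(cdna.rstrip("A"))      # index where the poly-dA tail begins
    if len(cdna) - body_end < dt_len:
        return []                         # tail too short for oligo(dT) to anneal
    lengths = []
    site = cdna.find(primer)
    while site != -1 and site + len(primer) <= body_end:
        length = body_end - site + dt_len
        if length <= max_len:             # only short products are amplified
            lengths.append(length)
        site = cdna.find(primer, site + 1)
    return sorted(lengths)


# Hypothetical cDNA: 2 nt. leader, an 8 nt. primer site, 20 nt. of sequence,
# then a 15 nt. poly-dA tail.
cdna = "GG" + "ACGTTGCA" + "C" * 20 + "A" * 15
print(display_fragment_lengths(cdna, "ACGTTGCA"))
```

Applying the function to every cDNA in a pooled library and collecting the resulting lengths would give the idealized banding pattern for that sample. The lengths alone do not identify which cDNA produced a given band.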
In the differential display methodology, although characteristic electrophoretic banding patterns develop, no attempt is made to quantitatively "link" these patterns to the expression of particular genes. Similarly, the second arbitrary primer cannot be traced to a particular gene, for the following reasons. First, the PCR process is less than ideally specific. One to several base pair mismatches are permitted by the lower stringency annealing step which is typically utilized in this methodology and are generally tolerated well enough that a new chain can actually be initiated by the Taq polymerase often used in PCR reactions. Second, the location of a single subsequence (or its absence) is insufficient to distinguish all expressed genes. Third, the resultant bp.-length information (i.e., from the arbitrary primer to the poly-dA tail) is generally not found to be characteristic of a sequence due to: (i) variations in the processing of the 3'-untranslated regions of genes, (ii) variation in the poly-adenylation process and (iii) variability in priming to the repetitive sequence at a precise point. Therefore, even the bands which are produced are often smeared by numerous, non-specific background sequences.
Moreover, known PCR biases towards short sequences and towards nucleic acid sequences with high G+C content further limit the specificity of this methodology. Accordingly, this technique is generally limited to the "fingerprinting" of samples for a similarity or dissimilarity determination and is precluded from use in quantitative determination of the differential expression of identifiable genes.
In conclusion, the existing methodologies utilized for gene or DNA sequence classification and determination are in need of improvement with respect to their ability to perform a highly specific quantitative determination of the components of a cDNA mixture prepared from a tissue sample in a rapid, economical and reproducible manner.