Over the past ten years, as biological and genomic research has revolutionized our understanding of the molecular basis of life, it has become increasingly clear that the temporal and spatial expression of genes is responsible for all of life's processes, both in health and in disease. Science has progressed from an understanding of how single genetic defects cause the traditionally recognized hereditary disorders, such as the thalassemias, to a realization of the importance of the interaction of multiple genetic defects along with environmental factors in the etiology of the majority of more complex disorders, such as cancer. In the case of cancer, current scientific evidence demonstrates the key causative roles of altered expression of and multiple defects in several pivotal genes. Other complex diseases have similar etiology. Thus the more complete and reliable the correlation that can be established between gene expression and health or disease states, the better diseases can be recognized, diagnosed and treated.
This important correlation is established by the quantitative determination and classification of DNA expression in tissue samples, and a rapid and economical method for doing so would be of considerable value. Genomic DNA ("gDNA") sequences are those naturally occurring DNA sequences constituting the genome of a cell. The state of gene, or gDNA, expression at any time is represented by the composition of total cellular messenger RNA ("mRNA"), which is synthesized by the regulated transcription of gDNA. Complementary DNA ("cDNA") sequences are synthesized by reverse transcription from mRNA. cDNA from total cellular mRNA also represents, albeit approximately, gDNA expression in a cell at a given time. Consequently, detection of all the DNA sequences in particular cDNA or gDNA samples is desired, particularly if such detection is rapid, economical, precise, and quantitative.
Heretofore, gene specific DNA analysis techniques have not been directed to the determination or classification of substantially all genes in a DNA sample representing total cellular mRNA and have required some degree of sequencing. Generally, existing cDNA, and also gDNA, analysis techniques have been directed to the determination and analysis of one or two known or unknown genetic sequences at one time. These techniques have used probes synthesized to specifically recognize by hybridization only one particular DNA sequence or gene. (See, e.g., Watson et al., 1992, Recombinant DNA, chap 7, W. H. Freeman, New York.) Further, adaptation of these methods to the problem of recognizing all sequences in a sample would be cumbersome and uneconomical.
One existing method for finding and sequencing unknown genes starts from an arrayed cDNA library. From a particular tissue or specimen, mRNA is isolated, and cDNA prepared from it is cloned into an appropriate vector, which is then plated in a manner so that the progeny of individual vectors bearing the clone of one cDNA sequence can be separately identified. A replica of such a plate is then probed, often with a labeled DNA oligomer selected to hybridize with the cDNA representing the gene of interest. Thereby, those colonies bearing the cDNA of interest are found and isolated, and the cDNA is harvested and subjected to sequencing. Sequencing can then be done by the Sanger dideoxy chain termination method (Sanger et al., 1977, "DNA sequencing with chain terminating inhibitors", Proc. Natl. Acad. Sci. USA 74(12):5463-5467) applied to inserts so isolated.
The DNA oligomer probes for the unknown gene used for colony selection are synthesized to hybridize, preferably, only with the cDNA for the gene of interest. One manner of achieving this specificity is to start with the protein product of the gene of interest. If a partial sequence of a 5 to 10-mer peptide fragment from an active region of this protein can be determined, the corresponding 15 to 30-mer degenerate oligonucleotides which code for this peptide can be synthesized. This collection of degenerate oligonucleotides will typically be sufficient to uniquely identify the corresponding gene. Similarly, any information leading to nucleotide subsequences 15 to 30 bases long can be used to create a single gene probe.
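The relationship between a short peptide and its collection of degenerate oligonucleotides can be illustrated with a minimal Python sketch. It assumes the standard genetic code and enumerates coding-strand sequences only; the function names and the example peptide "MWKFD" are hypothetical and chosen purely for illustration. A practical probe design would also consider strand orientation and hybridization conditions, which this sketch ignores.

```python
from itertools import product

# Standard genetic code: sense codons grouped by one-letter amino acid code.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"],
    "D": ["GAT", "GAC"],
    "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"],
    "M": ["ATG"],
    "F": ["TTT", "TTC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "W": ["TGG"],
    "Y": ["TAT", "TAC"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
}


def degeneracy(peptide):
    """Number of distinct oligonucleotides that encode the peptide."""
    count = 1
    for amino_acid in peptide:
        count *= len(CODONS[amino_acid])
    return count


def degenerate_oligos(peptide):
    """Enumerate every coding-strand oligonucleotide that encodes the peptide."""
    codon_choices = [CODONS[amino_acid] for amino_acid in peptide]
    return ["".join(codons) for codons in product(*codon_choices)]


# Hypothetical 5-mer peptide; M and W have one codon each, K, F and D have two.
peptide = "MWKFD"
oligos = degenerate_oligos(peptide)
print(degeneracy(peptide))   # 8 oligonucleotides in the degenerate collection
print(len(oligos[0]))        # each is 15 nucleotides long
```

Peptides rich in single-codon residues such as methionine and tryptophan keep the collection small, whereas residues with six codons (leucine, serine, arginine) enlarge it quickly.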
Another existing method, which searches for a known gene in a cDNA or gDNA prepared from a tissue sample, also uses single gene or single sequence probes which are complementary to unique subsequences of the already known gene sequences. For example, the expression of a particular oncogene in a sample can be determined by probing tissue derived cDNA with a probe derived from a subsequence of the oncogene's expressed sequence tag. Similarly, the presence of a rare or difficult to culture pathogen, such as the TB bacillus or HIV, can be determined by probing gDNA with a hybridization probe specific to a gene of the pathogen. The heterozygous presence of a mutant allele in a phenotypically normal individual, or its homozygous presence in a fetus, can be determined by probing with an allele specific probe complementary only to the mutant allele (See, e.g., Guo et al., 1994, Nucleic Acids Research, 22:5456-65).
All existing methods using single gene probes, of which the preceding examples are typical, if applied to determine all genes expressed in a given tissue sample, would require many thousands to tens of thousands of individual probes. It is estimated that a single human cell typically expresses approximately 15,000 genes simultaneously and that the most complex tissue, e.g., the brain, can express up to half the human genome (Liang et al., 1992, "Differential Display of Eukaryotic Messenger RNA by Means of the Polymerase Chain Reaction", Science 257:967-971). An application requiring such a number of probes is clearly too cumbersome to be economic or even practical.
Another class of existing methods, known as sequencing by hybridization ("SBH"), in contrast, uses combinatorial probes which are not gene specific (Drmanac et al., 1993, Science 260:1649-52; U.S. Pat. No. 5,202,231, Apr. 13, 1993, to Drmanac et al.). An exemplary implementation of SBH to determine an unknown gene requires that a single cDNA clone be probed with all DNA oligomers of a given length, for example, all 6-mers. Such a set of all oligomers of a given length synthesized without any selection is called a combinatorial probe library. From knowledge of all hybridization results for a combinatorial library, for example all 4096 6-mer probe results, a partial DNA sequence for the cDNA clone can be reconstructed by algorithmic manipulations. Complete sequences are not determinable because, at least, repeated subsequences cannot be fully determined. SBH adapted to the classification of known genes is called oligomer sequence signatures ("OSS") (Lennon et al., 1991, Trends In Genetics 7(10):314-317). This technique classifies a single clone based on the pattern of probe hits against an entire combinatorial library, or a significant sub-library. It requires that the tissue sample library be arrayed into clones, each clone comprising only one pure sequence from the library. It cannot be applied to mixtures.
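The size of a combinatorial probe library, and the reason repeated subsequences defeat complete reconstruction, can be illustrated with a minimal Python sketch. It assumes idealized, perfect-match hybridization on a single strand and records only whether each probe hits, not how many times; the function names and the two example sequences are hypothetical.

```python
from itertools import product


def combinatorial_library(k):
    """All 4**k DNA oligomers of length k, generated without any selection."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def hybridization_spectrum(sequence, k):
    """Set of k-mer probes with a perfect match in the sequence (idealized)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


print(len(combinatorial_library(6)))   # 4096 probes in the full 6-mer library

# Two hypothetical clones differing only in the number of "AC" repeats.
clone_a = "GGACACACTT"
clone_b = "GGACACACACTT"

# With 4-mer probes both clones give the same hit pattern, so the repeat
# count, and therefore the complete sequence, cannot be recovered.
print(hybridization_spectrum(clone_a, 4) == hybridization_spectrum(clone_b, 4))
```

The same hit pattern, compared against patterns computed for known genes, is the kind of signature on which an OSS-style classification of a single clone would rely.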
These exemplary existing methods are all directed to finding one sequence in an array of clones each expressing a single sequence from a tissue sample. They are not directed to rapid, economical, quantitative, and precise characterization of all the DNA sequences in a mixture of sequences, such as a particular total cellular cDNA or gDNA sample. Their adaptation to such a task would be prohibitive. Determination by sequencing the DNA of a clone, much less an entire sample of thousands of sequences, is not rapid or inexpensive enough for economical and useful diagnostics. Existing probe-based techniques of gene determination or classification, whether the genes are known or unknown, require many thousands of probes, each specific to one possible gene to be observed, or at least thousands or even tens of thousands of probes in a combinatorial library. Further, all of these methods require the sample be arrayed into clones each expressing a single gene of the sample.
In contrast to the prior exemplary existing gene determination and classification techniques, another existing technique, known as differential display, attempts to fingerprint a mixture of expressed genes, as is found in a pooled cDNA library. This fingerprint, however, seeks merely to establish whether two samples are the same or different. No attempt is made to determine the quantitative, or even qualitative, expression of particular, determined genes (Liang et al., 1995, Current Opinion in Immunology 7:274-280; Liang et al., 1992, Science 257:967-71; Welsh et al., 1992, Nucleic Acids Res. 20:4965-70; McClelland et al., 1993, Exs 67:103-15; Lisitsyn, 1993, Science 259:946-50). Differential display uses the polymerase chain reaction ("PCR") to amplify DNA subsequences of various lengths, each defined as lying between the hybridization sites of arbitrarily selected primers. Ideally, the pattern of lengths observed is characteristic of the tissue from which the library was prepared. Typically, one primer used in differential display is oligo(dT) and the other is one or more arbitrary oligonucleotides designed to hybridize within a few hundred base pairs of the poly-dA tail of a cDNA in the library. Thereby, on electrophoretic separation, the amplified fragments of lengths up to a few hundred base pairs should generate bands characteristic and distinctive of the sample. Changes in tissue gene expression may be observed as changes in one or more bands.
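The way a banding pattern arises from primer positions can be sketched in Python under strong simplifying assumptions: the arbitrary primer hybridizes only at exact matches, the oligo(dT) primer always anchors at the first base of the poly-dA tail, and each amplified fragment contributes one band whose length runs from the arbitrary primer site through the oligo(dT) primer. Actual PCR departs from each of these assumptions. The function names, primer, and sequences below are hypothetical.

```python
def fragment_lengths(cdna, arbitrary_primer, dt_len=12, max_len=500):
    """Idealized product lengths between arbitrary-primer sites and the poly-dA tail."""
    tail_start = len(cdna.rstrip("A"))        # index of first base of the 3' A run
    if len(cdna) - tail_start < dt_len:
        return []                             # tail too short for oligo(dT) priming
    lengths = []
    pos = cdna.find(arbitrary_primer)
    while pos != -1 and pos + len(arbitrary_primer) <= tail_start:
        length = tail_start - pos + dt_len    # primer site through oligo(dT) primer
        if length <= max_len:
            lengths.append(length)
        pos = cdna.find(arbitrary_primer, pos + 1)
    return sorted(lengths)


def band_pattern(sample, arbitrary_primer):
    """Distinct band lengths expected from a mixture of cDNAs."""
    bands = set()
    for cdna in sample.values():
        bands.update(fragment_lengths(cdna, arbitrary_primer))
    return sorted(bands)


primer = "GATCCA"                                   # hypothetical arbitrary primer
gene_x = "CC" + primer + "T" * 20 + "A" * 15        # primer site 26 bases before tail
gene_y = primer + "CG" * 15 + "A" * 15              # primer site 36 bases before tail
gene_z = "TTTGGGCCC" + "A" * 15                     # no primer site, so no band

sample_1 = {"x": gene_x, "z": gene_z}
sample_2 = {"x": gene_x, "y": gene_y, "z": gene_z}

print(band_pattern(sample_1, primer))   # [38]
print(band_pattern(sample_2, primer))   # [38, 48]
```

The two samples differ by one band, yet the band length alone does not say which gene produced it, and a gene lacking a primer site contributes no band at all.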
Although characteristic banding patterns develop, no attempt is made to link these patterns to the expression of particular genes. The second arbitrary primer cannot be traced to a particular gene. First, the PCR process is less than ideally specific. One to a few base pair ("bp") mismatches ("bubbles") are permitted by the lower stringency annealing step typically used and are tolerated well enough so that a new chain can be initiated by the Taq polymerase, often used in PCR reactions. Second, the location of a single subsequence or its absence is insufficient information to distinguish all expressed genes. Third, length information from the arbitrary primer to the poly-dA tail is generally not found to be characteristic of a sequence due to variations in the processing of the 3' untranslated regions of genes, the variation in the poly-adenylation process and variability in priming to the repetitive sequence at a precise point. Thus, even the bands that are produced often are smeared by the non-specific background sequences present. Also, known PCR biases toward high G+C content and short sequences further limit the specificity of this method. Thus this technique is generally limited to "fingerprinting" samples for a similarity or dissimilarity determination and is precluded from use in quantitative determination of the differential expression of identifiable genes.
Existing methods for gene or DNA sequence classification or determination are in need of improvement in their ability to perform rapid and economical as well as quantitative and specific determination of the components of a cDNA mixture prepared from a tissue sample. The preceding background review identifies the deficiencies of several exemplary existing methods.