The ability to detect specific target nucleic acid analytes using nucleic acid probe hybridization and nucleic acid amplification methods has many applications. These applications include nucleic acid sequencing; diagnosis of infectious or genetic diseases or cancers in humans or other animals; identification of viral or microbial contamination in cosmetics, foods, pharmaceuticals or water; and identification or characterization of, or genetic discrimination between, individuals for diagnosis of disease and genetic predisposition to disease, forensic or paternity testing, and genetic analyses, for example breeding or engineering stock improvements in plants and animals.
The basis of nucleic acid probe hybridization methods and applications is the specific hybridization of an oligonucleotide or a nucleic acid fragment probe to form a stable, double-stranded hybrid through complementary base-pairing to particular nucleic acid sequence segments in an analyte molecule. Particular nucleic acid sequences may occur only in cells of a particular species, strain, individual or organism. Sequence-specific hybridization of oligonucleotides and their analogs is a fundamental biotechnological process employed in various research, medical, and industrial applications. Specific hybridization by base-pairing complementarity is utilized, for example, in identification of disease-related polynucleotides in diagnostic assays, screening of clones for polynucleotides containing a sequence of interest, identification of specific polynucleotides in mixtures of polynucleotides, amplification of specific target polynucleotides by, for example, polymerase chain reaction (PCR) and replicase enzyme mediated techniques, hybridization based histologic tissue staining, as in in situ PCR staining for histopathology, therapeutic blocking of expressed mRNA by anti-sense sequences, and DNA sequencing. For descriptions of these and other methods see, for example, Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual, 2nd Edition, Cold Spring Harbor Laboratory, New York; Keller and Manak, DNA Probes (1993) 2nd Edition, Stockton Press, New York; Milligan et al. (1993) J. Med. Chem. 36:1923-1937; Drmanac et al. (1993) Science 260:1649-52; Bains (1993) J. DNA Sequencing and Mapping 4:143-50; U.S. Pat. Nos. 4,683,195 and 4,683,202 to Mullis et al.; and U.S. Pat. Nos. 4,483,964 and 4,517,338 to Urdea et al.
Base pairing specific hybridization has been proposed as a method of tracking, retrieving, and identifying compounds labeled with oligonucleotide tags. For example, in multiplex DNA sequencing, oligonucleotide tags are used to identify electrophoretically separated bands on a gel that consist of DNA fragments generated in the same sequencing reaction. DNA fragments from multiple sequencing reactions are thus separated on the same lane of a gel that is then blotted with separate solid phase materials on which the fragment bands from individual sequencing reactions are separately visualized by use of oligonucleotide probes that hybridize to complementary tags specific to the individual reaction (Church et al. (1988) Science 240: 185-88). Other uses of oligonucleotide tags or labels identifiable by hybridization based amplification have been proposed for identifying explosives, potential pollutants, such as crude oil, and currency for prevention and detection of counterfeiting. Dollinger reviews these methods, pages 265-274, in Mullis et al., Ed. (1994) The Polymerase Chain Reaction Birkhauser, Boston. More recently, systems employing oligonucleotide tags have also been proposed as a means of labeling, manipulating and identifying individual molecules in complex combinatorial chemical libraries, for example, as an aid to screening such libraries for drug candidates, Brenner and Lerner (1992) Proc. Natl. Acad. Sci. 89:5381-83; Alper (1994) Science 264:1399-1401; and Needels et al. (1993) Proc. Natl. Acad. Sci. 90: 10700-704.
Recombinant DNA technology has permitted amplification and isolation of short fragments of genomic DNA (from 200 to 500 bp), yielding a sufficient quantity of material to determine the nucleotide sequence of a cloned fragment.
Distinguishing among the four nucleotides was historically achieved in two ways: (1) by specific chemical degradation of the DNA fragment at specific nucleotides, in accordance with the Maxam and Gilbert method (Maxam, A. M. and Gilbert, W. (1977) Proc. Natl. Acad. Sci. 74:560); or (2) utilizing the dideoxy sequencing method described by Sanger (Sanger, F., et al. (1977) Proc. Natl. Acad. Sci. 74:5463). The dideoxy sequencing method of Sanger results in termination of polymerization at polymer sequence positions that incorporate the specific dideoxy base instead of the corresponding deoxy base, a probabilistic event, which generates sequence segments of different length. The length of these dideoxy terminated sequence segments is determined by separation on polyacrylamide gels that separate DNA fragments in the range of 1 to 500 bp, differing in length by one nucleotide or more. The length of the terminated nucleotide sequence segments for a reaction employing the dideoxy analog of a given base indicates the positions in the sequence of interest occupied by that base.
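The way fragment lengths encode base positions in the dideoxy method can be illustrated with a short sketch. It is an idealized model only: it assumes every possible termination product is present and perfectly resolved, ignores the probabilistic nature of termination, and uses hypothetical function names and an arbitrary example sequence that are not part of the cited methods.

```python
def termination_lengths(synthesized: str) -> dict[str, list[int]]:
    """Return, for each base, the fragment lengths expected in the reaction
    that contains the dideoxy analog of that base (idealized: every
    occurrence of the base yields one terminated fragment)."""
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Reconstruct the sequence by reading the four lanes in order of
    increasing fragment length, as on a gel resolving single-base steps."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


# Hypothetical newly synthesized strand, chosen only for illustration.
strand = "GATTACA"
lanes = termination_lengths(strand)
print(lanes)            # fragment lengths per dideoxy reaction
print(read_gel(lanes))  # sequence recovered from the lengths
```

In this model the reaction containing the dideoxy analog of A yields fragments of lengths 2, 5 and 7, which are exactly the positions occupied by A in the example strand.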
Both preceding methods are laborious, with competent laboratories able to sequence approximately 100 bp per person per day. With the use of computers and robotics, sequencing can be accelerated by several orders of magnitude.
Sequencing the entire human genome has been widely discussed. It is generally appreciated that such a project is possible only in large organized centers, at a cost on the order of billions of dollars, and would require at least ten years. For accuracy, about three genome equivalents must be sequenced, because cloned fragments of about 500 bp are formed at random. At about a million base pairs per day, a single center could sequence 10 billion bp in approximately 30 years. Ten such centers would be required to sequence the entire human genome in several years.
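The throughput estimate above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (10 billion bp in total, about a million bp per day per center); the assumption of 365 sequencing days per year and the function name are illustrative and not taken from the source.

```python
TOTAL_BP = 10_000_000_000      # about three genome equivalents, per the text
BP_PER_DAY_PER_CENTER = 1_000_000
DAYS_PER_YEAR = 365            # assumption: sequencing runs every day


def years_to_sequence(total_bp: int, bp_per_day: int, centers: int = 1) -> float:
    """Years needed for `centers` identical centers to sequence `total_bp`."""
    return total_bp / (bp_per_day * centers * DAYS_PER_YEAR)


one_center = years_to_sequence(TOTAL_BP, BP_PER_DAY_PER_CENTER)
ten_centers = years_to_sequence(TOTAL_BP, BP_PER_DAY_PER_CENTER, centers=10)
print(f"one center:  {one_center:.1f} years")
print(f"ten centers: {ten_centers:.1f} years")
```

Under these assumptions a single center needs a little over 27 years, consistent with the figure of approximately 30 years, and ten centers need just under 3 years.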
A desire to understand the genetic basis of disease and a host of other physiological states associated with different gene expression patterns has motivated the development of several approaches to large-scale DNA analysis (Adams et al., Ed. (1994) Automated DNA Sequencing and Analysis, Academic Press, New York). Contemporary analysis techniques for patterns of gene expression include large-scale sequencing, differential display, indexing schemes, subtraction hybridization, hybridization with solid phase arrays of cDNAs or oligonucleotides, and numerous DNA fingerprinting techniques. See, e.g., Liang et al. (1992) Science 257:967-71; Erlander et al. PCT Pat. App. No. PCT/US94/13041; McClelland et al., U.S. Pat. No. 5,437,975; Unrau et al. (1994) Gene 145:163-69; Schena et al. (1995) Science 270:467-469; Velculescu et al. (1995) Science 270:484-86.
These methods may be grouped into sequencing by direct analysis of hybridization data per se, and methods that label or tag a sequence segment by hybridization. One important subclass of the tag or label group of techniques employs double stranded oligonucleotide adaptors to classify populations of polynucleotides and/or to identify nucleotides at the termini of polynucleotides, e.g. Unrau et al. (1994) supra and U.S. Pat. No. 5,508,169; Sibson, PCT Pat. App. Nos. PCT/GB93/01452 and PCT/GB95/00109; Cantor, U.S. Pat. No. 5,503,980; and Brenner, PCT Pat. App. No. PCT/US95/03678 and U.S. Pat. No. 5,552,278. Adaptors employed in the preceding techniques typically have protruding single strands that permit specific hybridization and ligation to polynucleotides having complementary single stranded ends (“sticky overhangs”). Identification or classification may be effected by carrying out the reactions in separate vessels, or by providing secondary labels which identify one or more nucleotides in the protruding strand of the ligated adaptor, for example by hybridization.
Successful implementation of such tagging schemes depends in large part on the success in achieving specific hybridization between analyte sequence and the adaptor-tag, and between a tag or primary probe and its complementary or secondary probe.
In techniques employing base pairing specific nucleic acid hybridization in general, including sequencing by hybridizing tags or labels, an oligonucleotide tag can successfully identify a substance only if the number of false positive and false negative signals is minimized. Unfortunately, such spurious signals are not uncommon because base pairing and stacking free energies vary widely among nucleotides in a duplex or triplex hybridized structure. A duplex consisting of a repeated sequence of deoxyadenosine (A) and deoxythymidine (T) (or the RNA analogs, adenosine and uridine) bound to its complementary nucleic acid sequence is typically less stable than an equal-length duplex consisting of a repeated sequence of deoxyguanosine (G) and deoxycytidine (C) bound to a complementary, or even partially complementary, target containing a mismatch. This is widely appreciated and explains the higher melting temperature (Tm) of GC-rich double stranded (DS) sequences compared to AT-rich DS sequences. Thus, if a desired compound from a large combinatorial chemical library were tagged with the former (AT-rich) oligonucleotide, a significant possibility would exist that under hybridization conditions designed to detect perfectly matched AT-rich duplexes, undesired compounds labeled with the GC-rich oligonucleotide, even in a mismatched duplex, would be detected along with the perfectly matched duplexes consisting of the AT-rich tag.
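The stability gap between AT-rich and GC-rich duplexes can be made concrete with the Wallace rule, a common rule of thumb for short oligonucleotides that assigns roughly 2 °C per A or T and 4 °C per G or C. The rule is not drawn from the references cited here, it does not model mismatches or salt conditions, and the function name and example sequences below are hypothetical; the sketch is meant only to show the direction and rough size of the effect.

```python
def wallace_tm(oligo: str) -> int:
    """Approximate melting temperature (degrees C) of a short DNA duplex
    using the Wallace rule: 2 per A/T pair plus 4 per G/C pair."""
    oligo = oligo.upper()
    invalid = set(oligo) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    at_pairs = oligo.count("A") + oligo.count("T")
    gc_pairs = oligo.count("G") + oligo.count("C")
    return 2 * at_pairs + 4 * gc_pairs


# Hypothetical 16-base tags of equal length but different composition.
at_rich_tag = "AT" * 8
gc_rich_tag = "GC" * 8
print(at_rich_tag, wallace_tm(at_rich_tag))
print(gc_rich_tag, wallace_tm(gc_rich_tag))
```

For these two equal-length tags the estimate differs by a factor of two, which suggests why wash conditions tuned to one composition can be too permissive or too stringent for the other.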
In the molecular tagging system proposed by Brenner et al. supra, the related problem of mis-hybridizations of closely related (i.e. sequentially homologous) tags was addressed by employing a so-called “comma-less” code, which ensures that a probe out of register (or frame shifted) with respect to its complementary tag would result in a duplex with one or more mismatches for each of its five or more three-base words, or “codons.” Although reagents, such as tetramethylammonium chloride, are available to negate base-specific stability differences of oligonucleotide duplexes, their effect is often limited and their presence may be incompatible with, or may practically complicate, further manipulations of the hybridized complexes, e.g. amplification by polymerase chain reaction (PCR), or the like.
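The comma-less property can be stated operationally: when any two words of the code are joined, no out-of-register window of one word length may itself be a word of the code. The sketch below checks that condition for a set of equal-length words. It is an illustration of the general definition rather than the specific code used by Brenner et al., and the function name and example word sets are hypothetical.

```python
from itertools import product


def is_comma_free(words: list[str]) -> bool:
    """Return True if no frame-shifted window across any two joined
    words is itself a word of the set (all words must share one length)."""
    word_set = set(words)
    lengths = {len(word) for word in word_set}
    if len(lengths) != 1:
        raise ValueError("all words must have the same, non-zero length")
    n = lengths.pop()
    for first, second in product(word_set, repeat=2):
        joined = first + second
        # Shifts 1..n-1 are the out-of-register reading frames.
        for shift in range(1, n):
            if joined[shift:shift + n] in word_set:
                return False
    return True


# Hypothetical three-base word sets, chosen only for illustration.
print(is_comma_free(["ACT", "AGT"]))  # no shifted window is a word
print(is_comma_free(["ACG", "CGA"]))  # "ACG" + "ACG" contains "CGA" out of frame
```

A tag built from words that pass this check cannot pair perfectly, word for word, with a probe that has slipped by one or two bases, which is the behavior the comma-less design relies on.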
Analogous problems have unduly complicated the simultaneous use of multiple hybridization probes, for example in analysis of multiple or complex genetic loci, e.g. via multiplex PCR, reverse dot blotting, or the like, or simply in “two-color” hybridization. Therefore, direct sequencing of certain loci, e.g. HLA genes, is advocated as a reliable alternative to indirect methods employing specific hybridization for the identification of genotypes, see, e.g., Gyllensten et al. (1988) Proc. Natl. Acad. Sci. 85:7652-56.
There remains a need in the art for methods for systematically employing a smaller number of hybridizing nucleic acid sequences, while obtaining the same amount of information from the hybridization. There also remains a need to reduce the differences in base pairing energies, especially at sequence positions of interest between different pairs of complementary nucleotide bases.
When the assay at hand is hybridization-based sequencing, regardless of the specific type, a larger number of hybridizing probes is required than in processes that employ hybridization for detection by amplification, such as PCR-based methods.
There remains a need for a method for streamlining the number of probes and experiments required for processes that involve hybridization, and especially for sequencing by hybridization methods, while maintaining these processes as determinate or sequence specific.