1. Field of the Invention
The present invention relates generally to the field of nucleic acid analysis. More particularly, it concerns the sequencing and mapping of double-stranded nucleic acid templates.
2. Description of Related Art
An aggressive research effort to sequence the entire human genome is proceeding in the laboratories of genetic researchers throughout the country. The project is called the Human Genome Project (HGP). It is a daunting task given that it involves the complete characterization of the archetypal human genome sequence, which comprises 3 × 10^9 DNA nucleotide base pairs. Early estimates for completing the task within fifteen years hinged on the expectation that new technology would be developed in response to the pressing need for faster methods of DNA sequencing and improved DNA mapping techniques.
Currently, physical mapping is used to identify overlapping clones of DNA so that all of the DNA in a particular region can be sequenced or otherwise studied. There are two basic techniques of physical mapping. First, all candidate overlapping clones can be restricted with a series of restriction enzymes and the restriction fragments separated by gel electrophoresis. Overlapping clones will share some DNA sequences and thus some common restriction fragments. By comparing the restriction fragment lengths from a number of clones, the extent of overlap between any two clones can be determined. This process is very tedious and can only evaluate a limited number of candidate clones. Second, if a large number of sequence tagged sites are known in the region studied, the DNA from those sequence tagged sites can be labeled and hybridized to the candidate clones. Clones that hybridize to the same sequence tagged sites are identified as overlapping. If many sequence tagged sites are shared between two clones, it is assumed that the overlap is extensive. Sequence tagged sites give a lot of information from a limited number of hybridization reactions; however, most regions of most genomes do not have extensive sequence tagged site resources. Both methods suffer from a lack of direct correspondence between the sequence and the restriction sites or sequence tagged site locations.
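The two overlap tests described above can be sketched as follows. This is a minimal illustration, not a method taken from the text: the function names, the greedy matching rule, the 2% sizing tolerance, and the fragment lengths and site names are all hypothetical.

```python
def shared_fragment_length(frags_a, frags_b, tolerance=0.02):
    """Sum the lengths of fragments in clone A that match a fragment in clone B.

    Two fragments "match" when their lengths differ by no more than
    `tolerance` (a fraction of the longer fragment). The tolerance stands in
    for gel sizing error and is an assumed value. Matching is greedy, so it
    is only a rough estimate of the shared DNA.
    """
    remaining = sorted(frags_b)
    shared = 0
    for length in sorted(frags_a):
        for i, other in enumerate(remaining):
            if abs(length - other) <= tolerance * max(length, other):
                shared += length
                del remaining[i]  # each fragment of B is matched at most once
                break
    return shared


def shared_sts(sts_a, sts_b):
    """Sequence tagged sites that hybridize to both clones."""
    return set(sts_a) & set(sts_b)


# Hypothetical restriction fragment lengths (base pairs) for two clones.
clone_a = [5200, 3100, 1800, 950, 400]
clone_b = [3120, 1790, 7400, 620]
print(shared_fragment_length(clone_a, clone_b))

# Hypothetical sequence tagged sites detected on each clone.
print(sorted(shared_sts({"STS1", "STS2", "STS3"}, {"STS2", "STS3", "STS4"})))
```

Neither score says where in the sequence the shared fragments or sites lie, which is the lack of direct correspondence noted above.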
Current DNA sequencing approaches generally incorporate the fundamentals of either the Sanger sequencing method or the Maxam and Gilbert sequencing method, two techniques that were first introduced in the 1970s (Sanger et al., 1977; Maxam and Gilbert, 1977). In the Sanger method, a short oligonucleotide or primer is annealed to a single-stranded template containing the DNA to be sequenced. The primer provides a 3' hydroxyl group which allows the polymerization of a chain of DNA when a polymerase enzyme and dNTPs are provided. The Sanger method is an enzymatic reaction that utilizes chain-terminating dideoxynucleotides (ddNTPs). ddNTPs are chain-terminating because they lack a 3'-hydroxyl residue, and the absence of that residue prevents formation of a phosphodiester bond with a succeeding deoxyribonucleotide (dNTP). A small amount of one ddNTP is included with the four conventional dNTPs in a polymerization reaction. Polymerization or DNA synthesis is catalyzed by a DNA polymerase. There is competition between extension of the chain by incorporation of the conventional dNTPs and termination of the chain by incorporation of a ddNTP.
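The competition between extension and termination can be illustrated with a small sketch. It assumes, for illustration only, one reaction per ddNTP and a fixed, independent termination probability p at every site where that base is incorporated. The function names and the example sequence are hypothetical, and ordering the products by length stands in for size separation on a gel.

```python
def termination_lengths(synthesized, dd_base):
    """Lengths (nucleotides added to the primer) of products ending in dd_base."""
    return [i + 1 for i, base in enumerate(synthesized) if base == dd_base]


def read_ladders(ladders):
    """Rebuild the synthesized sequence by ordering all products by length."""
    by_length = {
        length: base for base, lengths in ladders.items() for length in lengths
    }
    return "".join(by_length[n] for n in sorted(by_length))


def fraction_terminating_at(k, p):
    """Fraction of chains stopping at the k-th site for the ddNTP's base.

    Assumes each such site terminates independently with probability p,
    so a chain must extend past k - 1 sites and then terminate.
    """
    return p * (1 - p) ** (k - 1)


synthesized = "GATTACA"  # hypothetical newly synthesized strand, 5' to 3'
ladders = {base: termination_lengths(synthesized, base) for base in "ACGT"}
print(ladders)
print(read_ladders(ladders))
print(fraction_terminating_at(3, 0.1))
```

Under this assumption the fraction of chains reaching distant sites falls off geometrically, which is why only a small amount of ddNTP is used relative to the dNTPs.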
The original version of the Sanger method utilized the E. coli DNA polymerase I ("pol I"), which has a polymerization activity, a 3'-5' exonuclease proofreading activity, and a 5'-3' exonuclease activity. Later, an improvement to the method was made by using Klenow fragment instead of pol I; Klenow lacks the 5'-3' exonuclease activity that is detrimental to the sequencing reaction because it leads to partial degradation of template and product DNA. The Klenow fragment has several limitations when used for enzymatic sequencing. One limitation is the low processivity of the enzyme, which generates a high background of fragments that terminate by the random dissociation of the enzyme from the template rather than by the desired termination due to incorporation of a ddNTP. The low processivity also means that the enzyme cannot be used to sequence nucleotides that appear more than approximately 250 nucleotides from the 5' end of the primer. A second limitation is that Klenow cannot efficiently utilize templates which have homopolymer tracts or regions of high secondary structure. The problems caused by secondary structure in the template can be reduced by running the polymerization reaction at 55° C. (Gomer and Firtel, 1985).
Improvements to the original Sanger method include the use of polymerases other than the Klenow fragment. Reverse transcriptase has been used to sequence templates that have homopolymeric tracts (Karanthanasis, 1982; Graham et al., 1986). Reverse transcriptase is somewhat better than the Klenow enzyme at utilizing templates containing homopolymer tracts.
The use of a modified T7 DNA polymerase (Sequenase™) was a significant improvement to the Sanger method (Sambrook et al., 1989; Hunkapiller, 1991). T7 DNA polymerase does not have any inherent 5'-3' exonuclease activity and has a reduced selectivity against incorporation of ddNTP. However, the 3'-5' exonuclease activity leads to degradation of some of the oligonucleotide primers. Sequenase™ is a chemically-modified T7 DNA polymerase that has reduced 3' to 5' exonuclease activity (Tabor et al., 1987). Sequenase™ version 2.0 is a genetically engineered form of the T7 polymerase which completely lacks 3' to 5' exonuclease activity. Sequenase™ has a very high processivity and high rate of polymerization. It can efficiently incorporate nucleotide analogs such as dITP and 7-deaza-dGTP which are used to resolve regions of compression in sequencing gels. In regions of DNA containing a high G+C content, Hoogsteen bond formation can occur which leads to compressions in the DNA. These compressions result in aberrant migration patterns of oligonucleotide strands on sequencing gels. Because these base analogs pair weakly with conventional nucleotides, intrastrand secondary structures during electrophoresis are alleviated. In contrast, Klenow does not incorporate these analogs as efficiently.
The use of Taq DNA polymerase and mutants thereof is a more recent addition to the improvements of the Sanger method (U.S. Pat. No. 5,075,216). Taq polymerase is a thermostable enzyme which works efficiently at 70-75° C. The ability to catalyze DNA synthesis at elevated temperature makes Taq polymerase useful for sequencing templates which have extensive secondary structures at 37° C. (the standard temperature used for Klenow and Sequenase™ reactions). Taq polymerase, like Sequenase™, has a high degree of processivity and, like Sequenase™ 2.0, it lacks 3' to 5' nuclease activity. The thermal stability of Taq and related enzymes (such as Tth and Thermosequenase™) provides an advantage over T7 polymerase (and all mutants thereof) in that these thermally stable enzymes can be used for cycle sequencing, which amplifies the DNA during the sequencing reaction, thus allowing sequencing to be performed on smaller amounts of DNA. Optimization of the use of Taq in the standard Sanger method has focused on modifying Taq to eliminate the intrinsic 5'-3' exonuclease activity and to increase its ability to incorporate ddNTPs (EP 0 655 506 B1).
Both the Sanger and the Maxam-Gilbert methods produce populations of radiolabeled or fluorescently labeled polynucleotides of differing lengths which are separated according to size by polyacrylamide gel electrophoresis (PAGE). The nucleotide sequence is determined by analyzing the pattern of size-separated labeled polynucleotides in the gel. The Maxam-Gilbert method involves degrading DNA at a specific base using chemical reagents. The DNA strands terminating at a particular base are denatured and electrophoresed to determine the positions of the particular base. By combining the information from fragments terminating at different bases or combinations of bases, the entire DNA sequence can be reconstructed. However, the Maxam-Gilbert method involves dangerous chemicals, and is time- and labor-intensive. Thus, it is no longer used for most applications.
The current limitations to conventional applications of the Sanger method include 1) the limited resolving power of polyacrylamide gel electrophoresis, 2) the formation of intermolecular and intramolecular secondary structure of the denatured template in the reaction mixture, which can cause any of the polymerases to prematurely terminate synthesis at specific sites or misincorporate ddNTPs at inappropriate sites, 3) secondary structure of the DNA on the sequencing gels, which can give rise to compressions of the electrophoretic ladder at specific locations in the sequence, 4) cleavage of the template, primers and products by the 5'-3' or 3'-5' exonuclease activities of the polymerases, and 5) mispriming of synthesis due to hybridization of the oligonucleotide primers to multiple sites on the denatured template DNA. The formation of intermolecular and intramolecular secondary structure produces artificial terminations that are incorrectly "read" as the wrong base, gives rise to bands across four lanes (BAFLs) that produce ambiguities in base reading, and decreases the intensity and thus the signal-to-noise ratio of the bands. Secondary structure of the DNA on the gels can largely be solved by incorporation of dITP or 7-deaza-dGTP into the synthesized DNA; DNA containing such modified NTPs is less likely to form urea-resistant secondary structure during electrophoresis. Cleavage of the template, primers or products reduces the intensity of bands terminating at the correct positions and increases the background. Mispriming gives rise to background in the gel lanes.
The net result is that, although the inherent resolution of polyacrylamide gel electrophoresis alone is as much as 1000 nucleotides, it is common to be able to read correctly only 400-600 nucleotides of a sequence (and sometimes much less) using the conventional Sanger method, even when using optimized polymerase design and reaction conditions. Some sequences, such as repetitive DNA, strings of identical bases (especially guanines), GC-rich sequences and many unique sequences, cannot be sequenced without a high degree of error or uncertainty.
In the absence of any methods to consistently sequence DNA longer than about 1000 bases, investigators must subclone the DNA into small fragments and sequence these small fragments. The procedures for doing this in a logical way are very labor intensive, cannot be automated, and are therefore impractical. The most popular technique for large-scale sequencing, the "shotgun" method, involves cloning and sequencing of hundreds or thousands of overlapping DNA fragments. Many of these methods are automated, but require sequencing 5-10 times as many bases as minimally necessary, leave gaps in the sequence information that must be filled in manually, and have difficulty determining sequences with repetitive DNA.
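The trade-off between redundancy and gaps in shotgun sequencing can be made concrete with a rough estimate. The sketch below assumes that reads start at uniformly random positions, so that the expected fraction of bases left uncovered at redundancy c is approximately e^(-c) (a Poisson approximation that is not taken from the text above). The function names, the 40,000 base target and the 500 base read length are hypothetical.

```python
import math


def reads_needed(target_length, read_length, redundancy):
    """Number of reads required to sequence the target `redundancy` times over."""
    return math.ceil(redundancy * target_length / read_length)


def expected_uncovered_fraction(redundancy):
    """Expected fraction of bases not covered by any read.

    Assumes read start positions are uniformly random (Poisson approximation).
    """
    return math.exp(-redundancy)


target_length = 40_000  # hypothetical cloned insert, in bases
read_length = 500       # hypothetical read length, in bases

for redundancy in (1, 5, 10):
    n_reads = reads_needed(target_length, read_length, redundancy)
    missing = expected_uncovered_fraction(redundancy) * target_length
    print(redundancy, n_reads, round(missing, 1))
```

Under these assumptions, even five-fold redundancy is expected to leave a few hundred bases uncovered in a target of this size, which is consistent with the need to fill in gaps by other means.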
Thus, the goal of placing rapid sequencing techniques and improved mapping techniques in the hands of many researchers is yet to be achieved. New approaches are needed that eliminate the above-described limitations.