The fundamental role that determining DNA sequences has for the life sciences is evident. Its importance in the human genome project has been discussed and published widely [e.g. J. E. Bishop and M. Waldholz, 1991, Genome. The Story of the Most Astonishing Scientific Adventure of Our Time--The Attempt to Map All Genes in the Human Body, Simon & Schuster, New York].
The current state-of-the-art in DNA sequencing is summarized in recent review articles [e.g. B. Barrell, The FASEB Journal, 5, 40 (1991); G. L. Trainor, Anal. Chem. 62, 418 (1990), and references cited therein]. The most widely used DNA sequencing chemistry is the enzymatic chain termination method [F. Sanger et al., Proc. Natl. Acad. Sci. USA, 74, 5463 (1977)] which has been adopted for several different sequencing strategies. The sequencing reactions are either performed in solution with the use of different DNA polymerases, such as the thermophilic Taq DNA polymerase [M. A. Innes, Proc. Natl. Acad. Sci. USA, 85: 9436 (1988)] or specially modified T7 DNA polymerase ("SEQUENASE") [S. Tabor and C.C. Richardson, Proc. Natl. Acad. Sci. USA, 84, 4767 (1987)], or in conjunction with the use of polymer supports. See for example S. Stahl et al., Nucleic Acids Res., 16, 3025 (1988); M. Uhlen, PCT Application WO 89/09282; Cocuzza et al., PCT Application WO 91/11533; and Jones et al., PCT Application WO 92/03575, incorporated by reference herein.
A central, but at the same time limiting, part of almost all sequencing strategies used today is the separation of the base-specifically terminated nested fragment families by polyacrylamide gel electrophoresis (PAGE). This method is time-consuming and error-prone and can result in ambiguous sequence determinations. As a consequence, highly experienced personnel are often required to interpret the sequence ladders obtained by PAGE in order to get reliable results. Automatic sequence readers are very often unable to handle artefacts such as "smiling", compressions, faint ghost bands, etc. This is true for the standard detection methods employing radioactive labeling such as .sup.32 P, .sup.33 P or .sup.35 S, as well as for the so-called Automatic DNA Sequencers (e.g. Applied Biosystems, Millipore, DuPont, Pharmacia) using fluorescent dyes for the detection of the sequencing bands.
Apart from the time factor, however, the biggest limitation of all methods involving PAGE as an integral part is the generation of reliable sequence information, and the transformation of this information into a computer format to facilitate sophisticated analysis of the sequence data utilizing existing software and DNA sequence and protein sequence data banks.
With standard Sanger sequencing, 200 to 500 bases of unconfirmed sequence information can be obtained in about 24 hours; with automatic DNA sequencers this number can be multiplied by approximately a factor of 10 to 20 due to processing several samples simultaneously. A further increase in throughput can be achieved by employing multiplex DNA sequencing [G. Church et al., Science, 240, 185-188 (1988); Koster et al., Nucleic Acids Res. Symposium Ser. No. 24, 318-21 (1991)] in which, by using a unique tag sequence, several sequencing ladders can be detected, one after the other, from the same PAGE after blotting, UV-crosslinking to a membrane, and hybridizations with specific complementary tag probes. However, this approach is still very laborious, often requires highly skilled personnel and can be hampered by the use of PAGE as a key element of the whole process.
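The throughput figures above can be checked with a few lines of arithmetic. The following is only a sketch: the function names are hypothetical, the inputs are the ranges quoted in the text, and the one-megabase target is chosen for illustration with single-pass coverage assumed (no confirmation reads or overlaps).

```python
def throughput_range(bases_per_day=(200, 500), parallel_factor=(10, 20)):
    """Return (low, high) bases per day for an automated sequencer.

    Both tuples are the ranges quoted in the text: bases read per
    24 hours by manual Sanger sequencing, and the multiplier gained
    by processing several samples simultaneously.
    """
    low = bases_per_day[0] * parallel_factor[0]
    high = bases_per_day[1] * parallel_factor[1]
    return low, high


def days_needed(target_bases, bases_per_day):
    """Days to read target_bases once, ignoring overlaps and re-reads."""
    return target_bases / bases_per_day


low, high = throughput_range()
print(low, high)  # 2000 10000

# Illustrative target (an assumption): one megabase, read once.
print(days_needed(1_000_000, high), days_needed(1_000_000, low))  # 100.0 500.0
```

Even under these optimistic assumptions, a single instrument would need months for one megabase of unconfirmed sequence.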
A large scale sequencing project often starts with either a cDNA or genomic library of large DNA fragments inserted in suitable cloning vectors such as cosmid, plasmid (e.g. pUC), phagemid (e.g. pEMBL, pGEM) or single-stranded phage (e.g. M13) vectors [T. Maniatis, E. F. Fritsch and J. Sambrook (1982) Molecular Cloning. A Laboratory Manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.; Methods in Enzymology, Vol. 101 (1983), Recombinant DNA, Part C; Vol. 153 (1987), Recombinant DNA, Part D; Vol. 154 (1987), Recombinant DNA, Part E; Vol. 155 (1987), Recombinant DNA, Part F and Vol. 152 (1987), Guide to Molecular Cloning Techniques, Academic Press, New York]. Since large DNA fragments currently cannot be sequenced directly in one run because the Sanger sequencing chemistry allows only about 200 to 500 bases to be read at a time, the long DNA fragments have to be cut into shorter pieces which are separately sequenced. In one approach this is done in a fully random manner by using, for example, nonspecific DNAse I digestion, frequently cutting restriction enzymes, or sonication, and sorting by electrophoresis on agarose gels [Methods in Enzymology, supra]. However, this method is time-consuming and often not economical, as several sequences are sequenced many times until a contiguous DNA sequence is obtained. Very often the expenditure of work to close the gaps of the total sequence is enormous. Consequently, it is desirable to have a method which allows sequencing of a long DNA fragment in a non-random, i.e. direct, way from one end through to the other. Several strategies have been proposed to achieve this [Methods in Enzymology, supra; S. Henikoff, Gene, 28, 351-59 (1984); S. Henikoff, et al. U.S. Pat. No. 4,843,003; and PCT Application WO 91/12341]. However, none of the currently available sequencing methods provides an acceptable method of sequencing megabase DNA sequences in either a timely or economical manner.
The main reason for this is the use of PAGE as a central and key element of the overall process.
In PAGE, under denaturing conditions, the nested families of terminated DNA fragments are separated by the different mobilities of DNA chains of different length. A closer inspection, however, reveals that it is not the chain length alone which governs the mobility of DNA chains by PAGE, but there is a significant influence of base composition on the mobility [R. Frank and H. Koster, Nucleic Acids Res., 6, 2069 (1979)]. PAGE, therefore, is not only a very slow, but also an unreliable method for the determination of molecular weights, as DNA fragments of the same length but different sequence/base composition could have different mobilities. Likewise, DNA sequences which have the same mobility could have different sequence/base compositions.
The most reliable way for the determination of the sequence/base composition of a given DNA fragment would, therefore, be to correlate the sequence with its molecular weight. Mass spectrometry is capable of doing this. The enormous advantage of mass spectrometry compared to the above mentioned methods is the speed, which is in the range of seconds per analysis, and the accuracy of mass determination, as well as the possibility to directly read the collected mass data into a computer. The application of mass spectrometry for DNA sequencing has been investigated by several groups [e.g. Methods in Enzymology, Vol. 193: Mass Spectrometry, (J. A. McCloskey, editor), 1990, Academic Press, New York; K. H. Schramm Biomedical Applications of Mass Spectrometry, 34, 203-287 (1990); P. F. Crain Mass Spectrometry Reviews, 9, 505 (1990)].
Most of the attempts to use mass spectrometry to sequence DNA have used stable isotopes for base-specific labeling, as for instance the four sulfur isotopes .sup.32 S, .sup.33 S, .sup.34 S and .sup.36 S. See, for example, Brennan et al., PCT Application WO 89/12694, R. L. Mills U.S. Pat. No. 5,064,754, U.S. Pat. No. 5,002,868, Jacobson et al.; Haan European Patent Application No. A1 0360676. Most of these methods employed the Sanger sequencing chemistry and polyacrylamide gel electrophoresis with some variations, such as capillary zone electrophoresis (CZE), to separate the nested, terminated DNA fragments prior to mass spectrometric analysis, which jeopardizes, to some extent, the advantages of mass spectrometry.
One advantage of PAGE is that it is a parallel method, i.e. several samples can be analyzed simultaneously (though this is not true for CZE which is a serial method), whereas mass spectrometry allows, in general, only a serial handling of the samples. In U.S. Pat. No. 5,547,835, mass spectrometric DNA sequencing is proposed without the use of PAGE, employing desorption/ionization techniques applicable to larger biopolymers, such as electrospray (ES) [J. B. Fenn et al., J. Phys. Chem., 88, 4451-59 (1984); Fenn et al., PCT Application No. WO 90/14148; and B. Ardrey, Spectroscopy Europe, 4, 10-18 (1992)] and matrix-assisted laser desorption/ionization (MALDI) mass spectrometry [F. Hillenkamp et al., Laser Desorption Mass Spectrometry, Part I: Mechanisms and Techniques and Part II: Performance and Application of MALDI of Large Biomolecules, in Mass Spectrometry in the Biological Sciences: A Tutorial (M. L. Gross, editor), 165-197 (1992), Kluwer Academic Publishers, The Netherlands] which can facilitate determination of DNA sequences by direct measurement of the molecular masses in the mixture of base-specifically terminated nested DNA fragments. By integrating the concept of multiplexing through the use of mass-modified nucleoside triphosphate derivatives, the serial mode of analysis typical for current mass spectrometric methods can be changed to a parallel mode [H. Koster, U.S. Pat. No. 5,547,835, supra].
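The idea of reading a sequence directly from the masses of nested fragments can be illustrated with a simplified sketch. It is not the method of the cited patent: it assumes an idealized, merged ladder in which each fragment is exactly one nucleotide longer than the previous one, it ignores the dideoxy terminator, the primer and any mass modification, and it uses approximate average masses of unmodified DNA nucleotide residues. The function names and the tolerance value are hypothetical.

```python
# Approximate average masses (Da) of nucleotide residues within a DNA chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def ladder_masses(seq, offset=0.0):
    """Masses of the nested fragments seq[:1], seq[:2], ..., seq.

    offset stands in for constant end-group and primer contributions;
    it cancels out when mass differences are taken.
    """
    masses, total = [], offset
    for base in seq:
        total += RESIDUE_MASS[base]
        masses.append(round(total, 2))
    return masses


def read_ladder(masses, tolerance=1.0):
    """Call one base per mass difference between neighbouring fragments.

    Differences that match no residue within tolerance are called 'N'.
    """
    ordered = sorted(masses)
    calls = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        base, ref = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - delta))
        calls.append(base if abs(ref - delta) <= tolerance else "N")
    return "".join(calls)


ladder = ladder_masses("GATTACA")
print(read_ladder(ladder))  # ATTACA (the first base is not given by a difference)
```

Unlike gel mobility, the mass differences depend only on which residue was added, so fragments of equal length but different base composition remain distinguishable.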
MALDI and ES mass spectrometry are in some aspects complementary techniques. While ES, using an atmospheric pressure ionization interface (API), can accommodate continuous flow streams from high-performance liquid chromatographs (HPLC) [K. B. Tomer, et al. Biological Mass Spectrometry, 20, 783-88 (1991)] and capillary zone electrophoresis (CZE) [R. D. Smith et al., Anal. Chem., 60, 436-41 (1988)], this is currently not available for MALDI mass spectrometry. On the other hand, MALDI mass spectrometry is less sensitive to buffer salts and other low molecular weight components in the analysis of larger molecules with a TOF mass analyzer [Hillenkamp et al. (1992), supra]; in contrast, ES is very sensitive to by-products of low volatility. While the high mass range in ES mass spectrometry is accessible through the formation of multiply charged molecular ions, this is achieved in MALDI mass spectrometry by applying a time-of-flight (TOF) mass analyzer and the assistance of an appropriate matrix to volatilize the biomolecules. Similar to ES, a thermospray interface has been used to couple HPLC on-line with a mass analyzer. Nucleosides originating from enzymatic hydrolysates have been analyzed using such a configuration [C. G. Edmonds et al. Nucleic Acids Res., 13, 8197-8206 (1985)]. However, Edmonds et al. do not disclose a method for nucleic acid sequencing.
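The two routes to the high mass range can be made concrete with a short sketch. It rests on textbook relations rather than on anything stated in the cited references: the mass-to-charge ratio of an ion formed by gain or loss of protons, and the flight time of an ion in an idealized linear TOF analyzer, t = L * sqrt(m / (2 z e U)). The drift length, accelerating voltage, negative-ion default and function names are assumptions chosen for illustration.

```python
import math

PROTON_MASS = 1.007276                # Da
DALTON_KG = 1.66053907e-27            # kg per Da
ELEMENTARY_CHARGE = 1.602176634e-19   # C


def mz(neutral_mass, charges, negative=True):
    """m/z (Da per charge) of an ion formed by losing (negative mode)
    or gaining (positive mode) `charges` protons."""
    sign = -1 if negative else 1
    return (neutral_mass + sign * charges * PROTON_MASS) / charges


def tof_seconds(mz_value, drift_length_m=1.0, accel_voltage=20_000.0):
    """Flight time in an idealized linear TOF analyzer.

    From z*e*U = m*v**2/2 it follows that t = L*sqrt((m/z)/(2*e*U)),
    with m/z converted to kg per coulomb.
    """
    mass_per_charge = mz_value * DALTON_KG / ELEMENTARY_CHARGE
    return drift_length_m * math.sqrt(mass_per_charge / (2 * accel_voltage))


# ES: a 30,000 Da molecule carrying 20 charges appears near m/z 1499,
# within reach of an analyzer with a limited m/z range.
print(round(mz(30_000, 20), 3))

# MALDI-TOF: a singly charged ion of the same mass simply arrives later;
# flight time grows with the square root of m/z.
print(tof_seconds(mz(30_000, 1)))
```

Because flight time grows only with the square root of m/z, a TOF analyzer has no hard upper mass limit in this idealized picture, whereas ES relies on multiple charging to bring large molecules into range.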
A complementary and completely different approach to determine the DNA sequence of a long DNA fragment would be to progressively degrade the DNA strand using exonucleases from one side, nucleotide by nucleotide. This method has been proposed by Jett et al. See J. H. Jett et al. J. Biomolecular Structure & Dynamics, 7, 301-309 (1989); and J. H. Jett et al. PCT Application No. WO 89/03432. A single molecule of a DNA or RNA fragment is suspended in a moving flow stream and contacted with an exonuclease which cleaves off one nucleotide after the other. Detection of the released nucleotides is accomplished by specifically labeling the four nucleotides with four different fluorescent dyes and involving laser-induced flow cytometric techniques.
However, strategies which use a stepwise enzymatic degradation process can suffer from problems relating to synchronization, i.e. the enzymatic reaction soon comes out of phase. Jett et al., supra, have attempted to address this problem by degrading just one single DNA or RNA molecule by an exonuclease. However, this approach is very difficult to implement, as handling a single molecule, keeping it in a moving flow stream, and achieving a sensitivity of detection which clearly identifies one single nucleotide are only some of the very difficult technical problems to be solved. In addition, in using fluorescent tags, the physical detection process for a fluorescent signal involves a time factor that is difficult to control, and the necessity to employ excitation by lasers can cause photo-bleaching of the fluorescent signal. Another problem, which still needs to be resolved, is that DNA/RNA polymerases which are able to use the four fluorescently labeled NTPs instead of the unmodified counterparts have not been identified.
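The synchronization problem can be illustrated with a toy model. It is an assumption made for illustration, not a description of real exonuclease kinetics: degradation is treated as discrete cycles, and in each cycle every molecule of a population loses its terminal nucleotide with the same fixed probability. The function names and the threshold are hypothetical.

```python
def in_phase_fraction(step_efficiency, steps):
    """Fraction of molecules that have lost exactly `steps` nucleotides
    after `steps` cycles, if each cycle succeeds with probability
    `step_efficiency` independently for every molecule."""
    if not 0.0 <= step_efficiency <= 1.0:
        raise ValueError("step_efficiency must lie between 0 and 1")
    return step_efficiency ** steps


def max_steps(step_efficiency, min_fraction=0.5):
    """Largest number of cycles for which at least `min_fraction`
    of the population is still in phase."""
    if not 0.0 < step_efficiency < 1.0:
        raise ValueError("step_efficiency must lie strictly between 0 and 1")
    steps, fraction = 0, 1.0
    while fraction * step_efficiency >= min_fraction:
        fraction *= step_efficiency
        steps += 1
    return steps


print(round(in_phase_fraction(0.99, 100), 3))  # 0.366
print(max_steps(0.99))                         # 68
```

In this model even a 99 percent efficient step leaves barely a third of the population in phase after 100 nucleotides, which is one way to see why a single molecule, which cannot fall out of phase with itself, was proposed instead.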
The invention described herein addresses most of the problems described above, which are inherent to currently existing DNA sequencing processes, and provides chemistries and systems suitable for high-speed DNA sequencing, a prerequisite for tackling the human genome and other genome sequencing projects.