DNA sequencing is one of the cornerstone analytical techniques of modern molecular biology. The development of reliable methods for sequencing has led to great advances in the understanding of the organization of genetic information and has made possible the manipulation of genetic material (i.e., genetic engineering).
There are currently two general methods for sequencing DNA: the Maxam-Gilbert chemical degradation method [A. M. Maxam et al., Meth. in Enzym., Vol. 65, 499-559 (1980)] and the Sanger dideoxy chain termination method [F. Sanger, et al., Proc. Nat. Acad. Sci. USA, Vol. 74, 5463-5467 (1977)]. A common feature of these two techniques is the generation of a set of DNA fragments which are analyzed by electrophoresis. The techniques differ in the methods used to prepare these fragments.
With the Maxam-Gilbert technique, DNA fragments are prepared through base-specific, chemical cleavage of the piece of DNA to be sequenced. The piece of DNA to be sequenced is first 5'-end-labeled with ³²P and then divided into four portions. Each portion is subjected to a different set of chemical treatments designed to cleave DNA at positions adjacent to a given base (or bases). The result is that all labeled fragments will have the same 5'-terminus as the original piece of DNA and will have 3'-termini defined by the positions of cleavage. This treatment is done under conditions that generate DNA fragments of convenient lengths for separation by gel electrophoresis.
With Sanger's technique, DNA fragments are produced through partial enzymatic copying (i.e., synthesis) of the piece of DNA to be sequenced. In the most common version, the piece of DNA to be sequenced is inserted, using standard techniques, into a "sequencing vector", a large, circular, single-stranded piece of DNA such as the bacteriophage M13. This becomes the template for the copying process. A short piece of DNA with its sequence complementary to a region of the template just upstream from the insert is annealed to the template to serve as a primer for the synthesis. In the presence of the four natural deoxyribonucleoside triphosphates (dNTPs), a DNA polymerase will extend the primer from the 3'-end to produce a complementary copy of the template in the region of the insert. To produce a complete set of sequencing fragments, four reactions are run in parallel, each containing the four dNTPs along with a single dideoxyribonucleoside triphosphate (ddNTP) terminator, one for each base. (A ³²P-labeled or fluorophore-labeled dNTP is added to afford labeled fragments.) If a dNTP is incorporated by the polymerase, chain extension can continue. If the corresponding ddNTP is selected, the chain is terminated. The ratio of ddNTP to dNTPs is adjusted to generate DNA fragments of appropriate lengths. Each of the four reaction mixtures will thus contain a distribution of fragments with the same dideoxynucleoside residue at the 3'-terminus and a primer-defined 5'-terminus.
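The fragment sets produced by the four termination reactions can be illustrated with a short Python sketch. It is a toy enumeration, not a model of the actual chemistry: it assumes that every possible termination position is represented in each reaction, and the function names and the primer length of 17 bases are hypothetical values chosen only for illustration.

```python
"""Toy enumeration of Sanger chain-termination fragments (illustrative only)."""

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def copy_strand(template: str) -> str:
    """Return the strand a polymerase would synthesize, read 5'->3'.

    The template is given 5'->3', so the copy is its reverse complement.
    """
    return "".join(COMPLEMENT[base] for base in reversed(template))


def sanger_fragments(template: str, primer_len: int = 17) -> dict[str, list[int]]:
    """Map each ddNTP reaction (A, C, G, T) to the fragment lengths it can yield.

    Every fragment starts at the primer's 5'-end, so its length is the primer
    length (an assumed value) plus the number of bases added before termination.
    """
    new_strand = copy_strand(template)
    fragments: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for offset, base in enumerate(new_strand, start=1):
        # A ddNTP of this base can terminate the chain at this position.
        fragments[base].append(primer_len + offset)
    return fragments


if __name__ == "__main__":
    for terminator, lengths in sanger_fragments("TGTAATC").items():
        print(terminator, lengths)
```

In this sketch each fragment length appears in exactly one of the four reactions, which is the property that allows the base order to be recovered from chain lengths.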
In both the Sanger and Maxam-Gilbert methods, base sequence information which generally cannot be directly determined by physical methods has been converted into chain-length information which can be determined. This determination can be accomplished through electrophoretic separation. Under denaturing conditions (high temperature, urea present, etc.), short DNA fragments migrate as if they were stiff rods. If a gel matrix is employed for the electrophoresis, the DNA fragments will be sorted by size. The single-base resolution required for sequencing can usually be obtained for DNA fragments containing up to several hundred bases.
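The conversion of chain-length information back into base sequence can be sketched as follows. This is a minimal Python illustration under idealized assumptions: fragment lengths are known exactly, with single-base resolution and no noise. The function name and the input layout (a mapping from terminating base to fragment lengths) are hypothetical.

```python
def read_sequence(fragments: dict[str, list[int]]) -> str:
    """Recover the base order from four sets of fragment lengths.

    `fragments` maps each terminating base to the lengths of the fragments
    ending in that base. Reading the bases in order of increasing fragment
    length gives the synthesized strand 5'->3'.
    """
    ladder = sorted(
        (length, base)
        for base, lengths in fragments.items()
        for length in lengths
    )
    lengths_seen = [length for length, _ in ladder]
    if len(set(lengths_seen)) != len(lengths_seen):
        # Two bases claimed at one position: the idealized ladder is ambiguous.
        raise ValueError("two fragments share the same length")
    return "".join(base for _, base in ladder)


if __name__ == "__main__":
    example = {"A": [19, 22, 24], "C": [23], "G": [18], "T": [20, 21]}
    print(read_sequence(example))
```

In a real gel the lengths are not observed directly; they are inferred from migration order, which is why single-base resolution limits the readable length.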
To determine a full sequence, the four sets of fragments produced by either Maxam-Gilbert or Sanger methodology are subjected to electrophoresis. This results in the fragments being spatially resolved along the length of the gel. One method of discriminating the dyes (which replace the ³²P label) and using this information to determine DNA sequences is described in the Prober et al. application. This system is available in a commercial instrument known as the Genesis™ 2000, available from E. I. du Pont de Nemours and Company, Wilmington, Del. The Genesis™ system for sequencing DNA comprises a means for detecting the presence of radiant energy from closely related yet distinguishable reporters or labels, which are covalently attached to compounds that function as chain-terminating nucleotides in a modified Sanger DNA chain elongation method. Distinguishable fluorescent reporters are attached to each of the four dideoxynucleotide bases represented in Sanger DNA sequencing reactions, i.e., dideoxynucleotides of adenine, guanine, cytosine, and thymine. These reporter-labeled chain-terminating reagents are substituted for unlabeled chain terminators in the traditional Sanger method and are combined in reactions with the corresponding deoxynucleotides, an appropriate primer, template, and polymerase. The resulting mixture contains DNA fragments of varying length that differ from each other by one base and that terminate at the 3' end with uniquely labeled chain terminators corresponding to one of the four DNA bases. This new labeling method allows elimination of the customary radioactive label contained in one of the deoxynucleotides of the traditional Sanger method.
Detection of these reporter labels can be accomplished with two stationary photomultiplier tubes (PMT's) which receive differing wavelength bands of fluorescent emissions from laser-stimulated reporters attached to chain terminators on DNA fragments. These fragments can be electrophoretically separated in space and/or time to move along an axis perpendicular to the sensing area of the PMT's. The fluorescent emissions first pass through a dichroic or other wavelength selective filter or filters, placed so as to direct one characteristic wavelength to one PMT, and the other characteristic wavelength to the other PMT. In this manner, different digital signals are created in each PMT that can be ratioed to produce a third signal that is unique to a given fluorescent reporter, even if a series of fluorescent reporters have closely spaced emission wavelengths. This system is capable of detecting reporters which are all efficiently excited by a single laser line, such as 488 nm, and which have closely spaced emissions whose maxima usually are different from each other by only 5 to 7 nm. Therefore, the sequential base assignments in a DNA strand of interest can be made on the basis of the unique ratio derived for each of the four reporter-labeled chain terminators which correspond to each of the four bases in DNA.
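The ratiometric assignment described above can be sketched in Python. This is only an illustration under stated assumptions, not the method of the Prober et al. application: the four reference values are invented for the example, and the sketch uses the bounded fraction ch1 / (ch1 + ch2), a monotonic function of the two-channel ratio, to avoid division by zero. A measured peak is assigned to the base whose reference value is nearest.

```python
# Hypothetical calibration values, one per reporter; real values would be
# measured for the specific dyes, filters, and detectors in use.
REFERENCE_FRACTIONS = {"A": 0.2, "C": 0.4, "G": 0.6, "T": 0.8}


def channel_fraction(ch1: float, ch2: float) -> float:
    """Return ch1 / (ch1 + ch2), a bounded stand-in for the ratio ch1 / ch2."""
    total = ch1 + ch2
    if total <= 0:
        raise ValueError("no signal in either channel")
    return ch1 / total


def call_base(ch1: float, ch2: float) -> str:
    """Assign the base whose reference fraction is closest to the measurement."""
    measured = channel_fraction(ch1, ch2)
    return min(
        REFERENCE_FRACTIONS,
        key=lambda base: abs(REFERENCE_FRACTIONS[base] - measured),
    )


if __name__ == "__main__":
    # Peak intensities (arbitrary units) from the two photomultiplier tubes.
    for ch1, ch2 in [(21, 79), (41, 59), (6, 4), (80, 20)]:
        print(ch1, ch2, call_base(ch1, ch2))
```

Because the assignment depends only on the relative signal in the two channels, it is insensitive to the overall peak height, which varies from fragment to fragment.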
While the base information is contained in fluorescent labels in the Genesis™ 2000 unit, it is noted that the information could also be contained in a colorimetric label (S. Beck, Anal. Biochem. 164 (2) 514-520 (1987)), a chemiluminescent label (S. Beck, Nucleic Acids Res. 17 5115-5123 (1989)), or another type of signal.
The Genesis™ DNA sequencer is designed to take advantage of the dideoxy chain termination chemistry. In order to employ this chemistry, it was necessary to use four chemically similar dyes to distinguish the four bases A, C, G, and T. Unless the dyes are carefully chosen and exhaustively evaluated, their electrophoretic mobility may differ in some DNA sequences, leading to a scrambling of sequence information. The four dyes, chosen for similar electrophoretic mobility, had overlapping emission and excitation spectra. The need to distinguish these dyes without the excessive light loss of extremely narrow-band filters led to a two-channel detection scheme, in which the ratio of two signals is used to determine which base has passed the detector. When peaks are well resolved and noise-free, the ratiometric signals are easy to interpret (FIG. 1). However, to maximize the amount of sequence information that can be obtained from each run, it is necessary to accurately interpret the two-channel signal under conditions of poor peak resolution and significant noise.
The methods for analysis of two-channel data under these conditions differ from those used to process conventional electrophoretograms and chromatograms. The output of the analysis described here is a sequence of base identifications, A, C, G, or T, while in chromatography, the desired output is typically a list of peak positions and areas. Chromatographic processes generally do not involve two detector signals coupled by one of four ratios. This relationship between the two signals is a special property of the sequencer described in the Prober et al. patent application. Computational efficiency is a more important consideration for sequencing than for chromatography. In chromatography, useful results can be obtained by performing extensive computations on two or three peaks; in sequencing, it may be necessary to analyze 300 to 600 peaks.
The ratiometric scheme of Prober et al. also presents a signal interpretation problem different from that of other DNA sequencers. Sequencers employing primer chemistry are described by L. M. Smith et al. [Nucleic Acids Res. 13 2399-2412 (1985)] and W. Ansorge et al. [J. Biochem. Biophys. Meth. 13 315-323 (1986)]. These sequencers employ four signal channels, one for each base. Other sequencers, such as that described by Kambara et al. [H. Kambara et al., Biotechnology 6 816-821 (1988)], employ one signal in each of four electrophoresis lanes. These systems employ yet another class of data analysis methods, since the results from four separate lanes must be registered, or aligned, in the proper time sequence.
In these automated versions of DNA sequencing, the reporter may be fluorimetric, as described in the Prober et al. application, colorimetric (S. Beck, Anal. Biochem. 164 (2) 514-520 (1987)), chemiluminescent (S. Beck, Nucleic Acids Res. 17 5115-5123 (1989)), or of some other type.
Sequencers employing primer chemistry, such as that of Hunkapiller et al. (U.S. Pat. No. 4,811,218), are not so restricted in the selection of dyes that may be used to tag the DNA fragments. These sequencers can employ four signal channels, one for each base, and thus do not require the complex algorithms needed to interpret ratiometric signals. On the other hand, these sequencers cannot enjoy the advantages of terminator chemistry. In particular, primer chemistry requires four separate reaction tubes for each sample to be sequenced, while terminator chemistry requires only one. In addition, primer chemistry is susceptible to errors from "false stops", erroneous signals produced when a polymerase is unable to proceed past a certain point on a DNA strand.
Other sequencers, such as that described by Kambara et al. (H. Kambara et al., Biotechnology 6 816-821 (1988)), employ one signal in each of four electrophoresis lanes. This overcomes many of the difficulties encountered with the resolution problem of the prior art automated DNA sequencers. These systems employ yet another class of data analysis methods, since the results from four separate lanes must be registered, or aligned, in the proper time sequence. Once the lanes are registered, the data analysis methods for these sequencers can be identical to those of Hunkapiller et al. Proper registration of the lanes is crucial to correct sequence determination. If lanes are improperly registered, the corresponding bases are interpreted out of order. The problem of registration of four lanes is complex because it is combinatorial in nature. For a given pair of closely spaced bands in all four lanes, there are 4! = 2 × 3 × 4 = 24 possible orderings of the bands. Only one corresponds to the correct sequence. The registration process can introduce errors in sequence interpretation, and therefore sequencers of the type described by Kambara may produce a smaller amount of accurate sequence information than those of Hunkapiller, given equal resolution and signal-to-noise ratio. Note also that these sequencers require not only four reaction tubes, but also four electrophoresis lanes for each DNA sample to be analyzed.
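The count of 4! = 24 orderings can be checked by direct enumeration. The short Python sketch below assumes one band per lane and uses the lane labels A, C, G, and T purely for illustration; it lists every order in which four such bands could be read if lane registration were unknown.

```python
from itertools import permutations
from math import factorial

LANES = ("A", "C", "G", "T")  # one band per lane, labeled by its base


def band_orderings(lanes=LANES):
    """Return every possible read order for one band from each lane."""
    return list(permutations(lanes))


if __name__ == "__main__":
    orderings = band_orderings()
    print(len(orderings), "orderings; 4! =", factorial(len(LANES)))
    for ordering in orderings[:3]:
        print("".join(ordering))
```

Only one of the enumerated orderings matches the true sequence, which is why a registration error translates directly into a base-order error.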