DNA sequencing is one of the cornerstone analytical techniques of modern molecular biology. The development of reliable methods for sequencing has led to great advances in the understanding of the organization of genetic information and has made possible the manipulation of genetic material (i.e., genetic engineering).
There are currently two general methods for sequencing DNA: the Maxam-Gilbert chemical degradation method [A. M. Maxam et al., Meth. in Enzymol., Vol. 65, 499-559 (1980)] and the Sanger dideoxy chain termination method [F. Sanger et al., Proc. Nat. Acad. Sci. USA, Vol. 74, 5463-5467 (1977)]. A common feature of these two techniques is the generation of a set of DNA fragments which are analyzed by electrophoresis. The techniques differ in the methods used to prepare these fragments.
With Sanger's technique, DNA fragments are produced through partial enzymatic copying (i.e., synthesis) of the piece of DNA to be sequenced. In the most common version, the piece of DNA to be sequenced is inserted, using standard techniques, into a "sequencing vector", a large, circular, single-stranded piece of DNA such as the bacteriophage M13. This becomes the template for the copying process. A short piece of DNA with its sequence complementary to a region of the template just upstream from the insert is annealed to the template to serve as a primer for the synthesis. In the presence of the four natural deoxyribonucleoside triphosphates (dNTP's), a DNA polymerase will extend the primer from the 3'-end to produce a complementary copy of the template in the region of the insert. To produce a complete set of sequencing fragments, four reactions are run in parallel, each containing the four dNTP's along with a single dideoxyribonucleoside triphosphate (ddNTP) terminator, one for each base. (A ³²P-labeled or fluorophore-labeled dNTP is added to afford labeled fragments.) If a dNTP is incorporated by the polymerase, chain extension can continue. If the corresponding ddNTP is selected, the chain is terminated. The ratio of ddNTP to dNTP's is adjusted to generate DNA fragments of appropriate lengths. Each of the four reaction mixtures will, thus, contain a distribution of fragments with the same dideoxynucleoside residue at the 3'-terminus and a primer-defined 5'-terminus.
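The chain termination step described above can be illustrated with a small simulation. This is a minimal sketch, not a model of real polymerase kinetics: the function name, the `dd_fraction` parameter (the chance that a ddNTP rather than the matching dNTP is incorporated), and the number of simulated copies are all assumptions introduced for illustration.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_fragment_lengths(template, terminator, dd_fraction=0.2,
                            copies=2000, seed=0):
    """Simulate one of the four reactions and return the extension lengths seen.

    `template` lists the insert bases in the order the polymerase reads them,
    so the copy grows by one complementary base per template base.
    `dd_fraction` is a hypothetical probability that the ddNTP is chosen
    whenever the terminator base is the one required.
    """
    rng = random.Random(seed)
    lengths = set()
    for _ in range(copies):
        for position, template_base in enumerate(template, start=1):
            incoming = COMPLEMENT[template_base]
            if incoming == terminator and rng.random() < dd_fraction:
                lengths.add(position)  # ddNTP incorporated: chain terminated
                break                  # no further extension is possible
    return lengths


# One reaction per terminator, as in the four parallel reactions.
lanes = {base: sanger_fragment_lengths("TACGGATC", base) for base in "ACGT"}
```

Each reaction yields only fragment lengths at which the copied strand ends in that reaction's terminator base, which is the property the electrophoretic readout relies on.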
In both the Sanger and Maxam-Gilbert methods, base sequence information which generally cannot be directly determined by physical methods has been converted into chain-length information which can be determined. This determination can be accomplished through electrophoretic separation. Under denaturing conditions (high temperature, urea present, etc.), short DNA fragments migrate as if they were stiff rods. If a gel matrix is employed for the electrophoresis, the DNA fragments will be sorted by size. The single-base resolution required for sequencing can usually be obtained for DNA fragments containing up to several hundred bases.
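The conversion of base sequence into chain-length information can be made concrete with a short sketch. It assumes idealized, error-free data in which each of the four reactions reports the lengths of its fragments; the function name and the input layout are hypothetical.

```python
def read_sequence(lanes):
    """Recover a base sequence from four sets of fragment lengths.

    `lanes` maps each terminator base to the fragment lengths observed in
    that reaction. Sorting all fragments by length, as the gel does, and
    reading off the terminator of each gives the sequence of the copy.
    """
    calls = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in calls:
                # Two different bases at one length means the data conflict.
                raise ValueError(f"conflicting calls at length {length}")
            calls[length] = base
    return "".join(calls[length] for length in sorted(calls))


example = read_sequence({"A": {1, 7}, "T": {2, 6}, "G": {3, 8}, "C": {4, 5}})
```

Real data differ from this idealization in exactly the ways discussed later: neighboring lengths are not always resolved, and the signal carries noise.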
To determine a full sequence, the four sets of fragments produced by either Maxam-Gilbert or Sanger methodology are subjected to electrophoresis. This results in the fragments being spatially resolved along the length of the gel. One method of discriminating the dyes (which replace the ³²P label) and using this information to determine DNA sequences is described in the Prober et al. application and is available in a commercial instrument, the Genesis™ 2000, from E. I. du Pont de Nemours and Company, Wilmington, Delaware. The Genesis™ system for sequencing DNA comprises a means for detecting the presence of radiant energy from closely related yet distinguishable reporters or labels, which are covalently attached to compounds that function as chain-terminating nucleotides in a modified Sanger DNA chain elongation method. Distinguishable fluorescent reporters are attached to each of the four dideoxynucleotide bases represented in Sanger DNA sequencing reactions, i.e., dideoxynucleotides of adenine, guanine, cytosine, and thymine. These reporter-labeled chain-terminating reagents are substituted for unlabeled chain terminators in the traditional Sanger method and are combined in reactions with the corresponding deoxynucleotides, an appropriate primer, template, and polymerase. The resulting mixture contains DNA fragments of varying length that differ from each other by one base and that terminate at the 3' end with uniquely labeled chain terminators corresponding to one of the four DNA bases. This new labeling method allows elimination of the customary radioactive label contained in one of the deoxynucleotides of the traditional Sanger method.
Detection of these reporter labels can be accomplished with two stationary photomultiplier tubes (PMT's) which receive differing wavelength bands of fluorescent emissions from laser-stimulated reporters attached to chain terminators on DNA fragments. These fragments can be electrophoretically separated in space and/or time to move along an axis perpendicular to the sensing area of the PMT's. The fluorescent emissions first pass through a dichroic filter having both a transmission and reflection characteristic, placed so as to direct one characteristic (transmission) to one PMT, and the other characteristic (reflection) to the other PMT. In this manner, different digital signals are created in each PMT that can be ratioed to produce a third signal that is unique to a given fluorescent reporter, even if a series of fluorescent reporters have closely spaced emission wavelengths. This system is capable of detecting reporters which are all efficiently excited by a single laser line, such as 488 nm, and which have closely spaced emissions whose maxima usually are different from each other by only 5 to 7 nm. Therefore, the sequential base assignments in a DNA strand of interest can be made on the basis of the unique ratio derived for each of the four reporter-labeled chain terminators which correspond to each of the four bases in DNA.
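The ratiometric assignment described above can be sketched as a nearest-reference classification. The reference ratios below are hypothetical placeholders, not measured values for any instrument, and the normalized form transmitted / (transmitted + reflected) is an assumption chosen to keep the ratio bounded; the source states only that the two signals are ratioed.

```python
# Hypothetical reference ratios for the four reporters (illustrative only).
REFERENCE_RATIOS = {"A": 0.20, "C": 0.40, "G": 0.60, "T": 0.80}


def channel_ratio(transmitted, reflected):
    """Return the fraction of the total signal seen by the transmission PMT."""
    total = transmitted + reflected
    if total <= 0:
        raise ValueError("no signal above background")
    return transmitted / total


def call_base(transmitted, reflected, references=REFERENCE_RATIOS):
    """Assign the base whose reference ratio lies closest to the measured one."""
    ratio = channel_ratio(transmitted, reflected)
    return min(references, key=lambda base: abs(references[base] - ratio))
```

Because the ratio does not depend on the overall intensity, a weak peak and a strong peak from the same reporter map to the same base, provided both channels are measured above a correct baseline.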
While the base information is contained in fluorescent labels in the Genesis™ 2000 unit, it is noted that the information could also be contained in a colorimetric label (S. Beck, Anal. Biochem. 164 (2) 514-520 (1987)), a chemiluminescent label (S. Beck, Nucleic Acids Res. 17 5115-5123 (1989)), or another signal.
The Genesis™ DNA sequencer is designed to take advantage of the dideoxy chain termination chemistry. In order to employ this chemistry, it was necessary to use four chemically similar dyes to distinguish the four bases A, C, G, and T. This selection of dyes led to a two-channel detection scheme, in which the ratio of two signals is used to determine which base has passed the detector. When peaks are well resolved and noise-free, the ratiometric signals are easy to interpret (FIG. 1). However, to maximize the amount of sequence information that can be obtained from each run, it is necessary to accurately interpret the two-channel signal under conditions of poor peak resolution and significant noise.
The methods for analysis of two-channel data under these conditions differ from those used to process conventional electrophoretograms and chromatograms. The output of the analysis described here is a sequence of base identifications, A, C, G, or T, while in chromatography, the desired output is typically a list of peak positions and areas. Chromatographic processes generally do not involve two detector signals coupled by one of four ratios. This relationship between the two signals is a special property of the sequencer described in the Prober et al. patent application. Computational efficiency is a more important consideration for sequencing than for chromatography. In chromatography, useful results can be obtained by performing extensive computations on two or three peaks; in sequencing, it may be necessary to analyze 300 to 600 peaks.
The ratiometric scheme of Prober et al. also presents a signal interpretation problem different from that of other DNA sequencers. Sequencers employing primer chemistry are described in [L. M. Smith et al., Nucleic Acids Res. 2399-2412 (1985) and W. Ansorge et al., J. Biochem. Biophys. Meth. 13 315-323 (1986)]. These sequencers employ four signal channels, one for each base. Other sequencers, such as that described by Kambara et al. [H. Kambara et al., Biotechnology 6 816-821 (1988)], employ one signal in each of four electrophoresis lanes. These systems employ yet another class of data analysis methods, since the results from four separate lanes must be registered, or aligned, in the proper time sequence.
Although modifications to standard methods are necessary, analysis methods that are applicable to the processing of two-channel DNA sequencing data make use of results in the chromatography literature. These are reviewed below.
Digital smoothing [A. Savitzky et al., Anal. Chem. 36 1627-1639 (1964)] can be applied to remove noise from fluorescent signals. Digital differentiation, also described by Savitzky and Golay, can be used to aid in peak finding, but does not provide a means for interpreting ratiometric data under conditions of poor resolution.
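As an illustration of the smoothing step, the following minimal sketch applies a 9-point least-squares smoothing filter. The weights are the standard tabulated quadratic/cubic smoothing coefficients, which sum to 231; the function name and the choice to leave the four points at each end unsmoothed are assumptions made for brevity.

```python
# Tabulated 9-point quadratic/cubic least-squares smoothing weights.
SG9_WEIGHTS = (-21, 14, 39, 54, 59, 54, 39, 14, -21)
SG9_NORM = 231  # sum of the weights


def smooth9(signal):
    """Return a smoothed copy of `signal`.

    Each interior point is replaced by the value at the window center of a
    least-squares polynomial fitted to the 9 surrounding samples. The first
    and last four samples are copied unchanged (a simplifying assumption).
    """
    half = 4
    out = [float(y) for y in signal]
    for i in range(half, len(signal) - half):
        window = signal[i - half:i + half + 1]
        out[i] = sum(w * y for w, y in zip(SG9_WEIGHTS, window)) / SG9_NORM
    return out
```

A filter of this kind passes slowly varying features such as a parabolic peak top unchanged while attenuating point-to-point noise, which is why it is preferred to a plain moving average when peak heights matter.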
Digital filtering [L. C. Allen et al., J. Chem. Phys. 40 3135-3141 (1964)] can be used to improve peak-finding accuracy where there is a priori information regarding the peak shape.
Standard chromatographic baseline removal techniques [J. F. Muldoon et al., "On-Line Computer Methods for Area Allocation of Unresolved Chromatograph Peaks", Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, March 7, 1969, Cleveland, OH; K. J. Burkhardt, "General Purpose Chromatograph Peak Integration Program", IBM Contributed Program Library No. 1130-17.3.002, IBM Corporation (1968)] can give fair performance under some conditions. However, sequencing signals of interest have a larger dynamic range and poorer resolution than is generally accepted for chromatograms. When a small peak occurs next to a large one, baseline removal methods can introduce substantial error in ratio calculation, and thereby result in sequencing errors.
Digital filtering and deconvolution [P. Jansson, Deconvolution with Applications in Spectroscopy, Academic Press (1984)] are methods used to enhance the resolution of chromatograms. Both were unsuccessful in enhancing sequencing performance of the ratiometric scheme of Prober et al. Both methods tend to amplify noise in proportion to their ability to enhance resolution; significant resolution enhancement came along with an unacceptable signal-to-noise ratio. Both methods gave oscillating signals which produced peaks in the waveform where none were supposed to exist. These additional peaks can be erroneously interpreted as extra bases inserted into a sequence. Such frequent insertion errors are unacceptable, since they alter the entire biological meaning of a sequence that encodes a protein [C. I. Davern, Genetics: Readings from Scientific American, W. H. Freeman & Co., Inc. 142-149 (1986)]. Deconvolution introduced further error into the ratio-determining process, since the signal peak shape varies during the run, causing the deconvolving "kernel function" to become inaccurate.
The sequencer described in Prober et al. utilized a combination of the methods above for signal interpretation. Signal processing began with a 9-point Savitzky-Golay smoothing of both detector channels, to reduce noise. The sum of the two channels was then passed through a digital filter, which approximated a smoothing, second-derivative operator. A positive-going peak in the resultant data was interpreted as a peak in the original signals, corresponding to a DNA base. The zero-crossings of the resultant peak were interpreted as the inflection points of the corresponding peak in the original signals. A straight line between the inflection points was taken as the peak baseline, and the ratio of areas above the baseline in the two channels of resultant data was interpreted to determine the base sequence.
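The processing chain just described can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the filter actually used in the instrument: the negated 9-point least-squares second-derivative weights stand in for the unspecified smoothing, second-derivative operator, the areas are taken on the smoothed channels, the ratio is expressed as area_a / (area_a + area_b), and `threshold` (assumed positive), `find_peaks`, and `make_trace` are hypothetical names.

```python
import math

SMOOTH = (-21, 14, 39, 54, 59, 54, 39, 14, -21)   # divide by 231
NEG_D2 = (-28, -7, 8, 17, 20, 17, 8, -7, -28)     # divide by 462; sign flipped
                                                  # so a peak gives a positive lobe


def filter9(signal, weights, norm, keep_edges):
    """Apply a 9-point filter; edge samples are copied or zeroed."""
    out = [float(y) for y in signal] if keep_edges else [0.0] * len(signal)
    for i in range(4, len(signal) - 4):
        out[i] = sum(w * y for w, y in zip(weights, signal[i - 4:i + 5])) / norm
    return out


def area_above_baseline(channel, left, right):
    """Area between the channel and a straight line from `left` to `right`."""
    slope = (channel[right] - channel[left]) / (right - left)
    return sum(channel[i] - (channel[left] + slope * (i - left))
               for i in range(left, right + 1))


def find_peaks(ch_a, ch_b, threshold):
    """Return (position, ratio) for each peak found in the summed channels."""
    a = filter9(ch_a, SMOOTH, 231, keep_edges=True)
    b = filter9(ch_b, SMOOTH, 231, keep_edges=True)
    total = [x + y for x, y in zip(a, b)]
    curv = filter9(total, NEG_D2, 462, keep_edges=False)
    peaks = []
    for i in range(1, len(curv) - 1):
        is_max = curv[i] >= curv[i - 1] and curv[i] > curv[i + 1]
        if not (is_max and curv[i] > threshold):
            continue
        left = right = i
        while left > 0 and curv[left] > 0:               # walk to left zero-crossing
            left -= 1
        while right < len(curv) - 1 and curv[right] > 0:  # and to the right one
            right += 1
        area_a = area_above_baseline(a, left, right)
        area_b = area_above_baseline(b, left, right)
        if area_a + area_b > 0:
            peaks.append((i, area_a / (area_a + area_b)))
    return peaks


def make_trace(centers, heights, sigma=4.0, n=120):
    """Build a synthetic trace of Gaussian peaks (hypothetical test data)."""
    return [sum(h * math.exp(-((i - c) ** 2) / (2 * sigma ** 2))
                for c, h in zip(centers, heights)) for i in range(n)]


ch_a = make_trace([40, 80], [30, 70])
ch_b = make_trace([40, 80], [70, 30])
peaks = find_peaks(ch_a, ch_b, threshold=1.0)
```

On well-separated synthetic peaks the recovered ratios match the ratios used to build the traces. The sketch makes no attempt to handle overlapping peaks, where the straight-line baseline between inflection points is distorted by the neighboring peak.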
This method suffered from a number of disadvantages. The determination of baseline and peak area was subject to interference from adjacent peaks, which could be 10- to 100-fold greater in size than the peak of interest. Additionally, the method gave no reliable indication of such interference, so that there was no way to "flag" potentially erroneous base calls. These phenomena combined to limit the useful run length of the sequencer to approximately 300 bases, after which point limited resolution and signal-to-noise ratio led to an unacceptable sequence error rate. Furthermore, within the first 300 bases, there is a persistent phenomenon of anomalously high mobility of DNA fragments ending in a GC sequence. This causes C peaks to move closer to preceding G peaks, resulting in poor resolution of the pair and additional base calling errors.