A. Field of the Invention
This invention relates to deoxyribonucleic acid (DNA) sequencing. More specifically, this invention provides a method and system for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms.
B. General Description of the Area of Research
With the advent of the Human Genome Project and its massive undertaking to sequence the entire human genome, researchers have been turning to automated DNA sequencers to process vast amounts of DNA sequence information. DNA, or deoxyribonucleic acid, is one of the most important information-carrying molecules in cells. DNA is composed of four different types of monomers, called nucleotides, which are in turn composed of bases linked with a sugar and a phosphate group. The four bases are adenine (A), cytosine (C), guanine (G), and thymine (T). The original state of a DNA fragment is a double helix of two antiparallel chains with complementary nucleotide sequences. The coded information of a DNA sequence is determined by the order of the four bases in either of these chains.
A common approach to obtaining information from DNA is the Sanger method. In this method, single-stranded DNA fragments are used as templates from which a series of nested subfragment sets is generated. (F.Sanger, et al., xe2x80x9cDNA Sequencing With Chain-Terminating Inhibitorsxe2x80x9d, Proceedings of the National Academy of Sciences of the USA, vol. 74, pp.5463-5467 (1977)). The subfragments start at the same end of the template, and a fraction of the subfragments of each length are caused to terminate by incorporation of chemically modified bases, thereby forming subfragment sets in increments of one nucleotide. In the popular xe2x80x9cfour-colorxe2x80x9d method, the terminating bases are labeled by one of four fluorescent dyes specific to the terminating base type, A, C, G or T. (L. M. Smith, et al., xe2x80x9cFluorescence Detection In Automated DNA Sequence Analysisxe2x80x9d, Nature, vol. 321, pp. 674-679 (1986)). The resulting mixture of sets of subfragments represents all of the possible sublengths of the template, with each set of subfragments labeled by a fluorescent dye corresponding to its terminating base type. To determine the sequence of the template, the subfragments are sorted by length using electrophoresis. In this process shorter subfragments migrate faster than longer subfragments in an applied electric field. Because subfragments are created in increments of one nucleotide, they pass through an electrophoretic cell one at a time in the order of the nucleotides in the template. The terminating base types are identified by the wavelength at which they fluoresce. A real-time fluorescent detection of migrating bands of the subfragments is then performed as the subfragments pass through a detection zone. The light collected is processed with a set of spectral filters that attempt to isolate the signals from the four dyes.
In automated DNA sequencing, these raw signals are then analyzed by a signal processing software. The steps of signal processing may include downsampling of the data to 1 Hz if necessary, primer data removal, baseline adjustment, noise filtering, multicomponent transformation, dye mobility shift correction, signal normalization, etc. (see, e.g., M. C. Giddings, et al., xe2x80x9cA Software System For Data Analysis In Automated DNA Sequencingxe2x80x9d, Genome Research, vol. 8, pp. 644-665 (1998)). Processing the raw data produces analyzed electropherograms with clearly defined peaks. The analyzed data in the form of electropherograms are then processed using a base calling program. The base calling program infers a sequence of bases in the DNA fragment. This sequence of bases is also referred to as a read and is usually about 1,000 bases long. Not all of the called bases are used in subsequent processing. The statistically averaged error produced by any base calling program is usually low, i.e., below 1%, for bases located near middle of a read and increases significantly toward the beginning and, especially, toward the end of a read. To characterize a reliable, or high quality part of a read, a threshold of 1% base calling error is commonly accepted. That is, only that part of the read having an average base calling error of 1% or less will be subsequently used. Alternatively, this may be characterized in terms of the quality values assigned to bases, where the quality is the measure of reliability of the base call. According to a commonly used definition of quality values, a quality value of 20 or higher corresponds to a probability of error of 1% or less. In practice, when sequencing, the correct sequence is not known in advance, so reliable predictions of quality values for newly sequenced fragments based on previous training or calibration on a data set with a known correct sequence are desirable.
C. Prior Art
1. ABI Base Caller
The ABI Base Caller is a part of DNA Sequencing Analysis software produced by Applied Biosystems of Foster City, Calif. This program takes raw electropherograms as input, processes them to produce analyzed electropherograms having well defined and evenly spaced peaks, and then detects and classifies peaks in the analyzed electropherograms as a sequence of bases. The program outputs the results to a binary file called an ABI sample file. The output includes the raw and analyzed electropherograms for each of the four traces, the array of called bases and the array of locations assigned to the bases in an electropherogram. The output does not include an estimate of quality values, because the ABI Base Caller program does not estimate the reliability of base calls.
The ABI Base Caller was chronologically the first and is still one of the best base calling programs available. The base calls produced by the ABI Base Caller, however, are not very accurate toward the end of a read, where peaks in an analyzed electropherogram become wider and significantly overlap. In this part of the read, the ABI Base Caller produces a considerable amount of mismatch errors, unknown base calls that are denoted as N""s, and overlooks some base calls resulting in deletion errors.
2. Phred
Phred is the first base calling software program to achieve a lower error rate than the ABI software, and is especially effective at the end of a read. Phred takes analyzed electropherograms produced by the ABI Base Caller as input, calls the bases and assigns quality values to the called bases. (see B. Ewing, et al., xe2x80x9cBase Calling Of Automated Sequencer Traces Using Phred. I. Accuracy Assessmentxe2x80x9d, Genome Research, vol. 8(3), pp. 175-185 (1998); B. Ewing and P. Green, xe2x80x9cBase-Calling Of Automated Sequencer Traces Using Phred. II. Error Probabilitiesxe2x80x9d, Genome Research, vol. 8(3), pp. 186-194 (1998)).
The base calling procedure in Phred consists of four phases: locating the predicted peaks, locating the observed peaks, matching the observed and predicted peaks, and finding the missed peaks. In the first phase, Phred attempts to find the idealized locations that all of the base peaks that would have occurred in the absence of imperfections in the sequencing reactions, in the electrophoresis process, and in trace processing. The underlying premise of Phred is that under such idealized conditions, each trace consists of evenly spaced, non-overlapping peaks, corresponding to the labeled fragments that terminate at a particular base in the sequenced strand. To find the positions of predicted peaks, Phred first examines the four trace arrays that correspond to each of the four bases to detect the peaks. A detected predicted peak is identified as the location of the maximum value, or, if the maximum does not exist, the midpoint between the inflection points. The processed trace is then scanned to find the regions of uniform peak spacing and the average peak period. The average peak period corresponds to peak-to-peak spacing or inter-peak spacing. This is determined for each of the regions. Phred then uses Fourier methods to find the positions of the predicted peaks in between these regions.
In the second phase, Phred locates all of the observed peaks by scanning the four trace arrays for regions that are concave. According to Phred, the concave part of the trace located between two inflection points is the observed peak. In the third phase, Phred matches observed and predicted peaks by assigning each observed peak to each predicted peak. This is done via alignment of the two lists of peaks using a dynamic programming algorithm. If, upon the completion of the third phase, no suitable observed peak can be assigned to a predicted peak, the corresponding base call is defined to be N, meaning that it is unknown or that it is not assigned.
Once the base calling procedure is completed, Phred assigns quality values to the bases. The quality value is the measure of reliability of a given base call. If P is the probability that the base call is incorrect, then the quality value QV is defined by the expression QV=xe2x88x9210 log P, rounded to the nearest integer. Thus,
P=10xe2x88x921 corresponds to QV=10;
P=10xe2x88x922 corresponds to QV=20;
P=10xe2x88x923 corresponds to QV=30.
etc.
To assign quality values to called bases, Phred trains on established data for which the correct base sequence, referred to as a consensus, is already known. Phred""s training software compares the actual base calls with the consensus to determine the positions at which Phred makes base calling errors. Phred then stores the trace conditions under which a particular peak was incorrectly called. Specifically, for each called base or peak, Phred computes and stores four parameters, peak height ratio in a window of three peaks, peak height ratio in a window of seven peaks, peak spacing ratio in a window of seven peaks and peak resolution. These four parameters are referred to as the trace parameters. These parameters are useful in discriminating base calling errors from correct base calls. For each base, these four parameters are expressed as functions of characteristics of the corresponding peak, as well as the characteristics of several other peaks flanking the current peak. Smaller trace parameter values correspond to higher quality measurements.
Based on this stored information, a text file called a lookup table is generated by Phred""s training software. This file stores information about average base calling errors that corresponds to each given set of the four trace parameter values. The lookup table is provided to users of Phred as a part of the source code and is used when assigning quality values to called bases. Phred""s lookup table is calibrated on data generated by the ABI PRISM(copyright) 377 DNA Sequencer (available from Applied Biosystems of Foster City, Calif.). When Phred is run on data produced by the newer generation DNA sequencers (e.g., ABI PRISM(copyright) 3700 DNA Analyzer), the quality values produced are not very accurate. Specifically, the quality values determined by Phred are lower than those experimentally produced, causing false low quality determinations that result in unused base calls.
D. Inadequacies of the Prior Art
Currently used tools for automated DNA sequencing such as ABI""s Base Caller, Phred and others operate with peak characteristics, such as area, position, height, etc. (see, e.g., A. Berno, xe2x80x9cA Graph Theoretic Approach To The Analysis Of DNA Sequencing Dataxe2x80x9d, Genome Research, vol. 6, pp.80-91 (1996); M. C. Giddings, et al., xe2x80x9cA Software System For Data Analysis In Automated DNA Sequencingxe2x80x9d, Genome Research, vol. 8, pp. 644-665 (1998)). These characteristics are used to discern true peaks from noise before performing base calling, as well as to compute important trace parameters, which are used to calibrate quality values assigned to called bases. To compute the characteristics of a peak of a given dye color, the signal from the trace of the particular color is used. Because this signal is actually a sum of individual signals from all peaks of the particular color, the presence or absence of other peaks of the same color in close vicinity of a given peak significantly affects the characteristics computed for a given peak. For example, the apparent height of a given peak computed as the total signal at the peak""s position would generally overestimate the true, or intrinsic height of this peak. This occurs because other peaks of the same color which overlap with the current peak amplify the signal at the current peak""s position. These characteristics, therefore, are not very accurate and should be considered as xe2x80x9capparentxe2x80x9d. The phenomenon of peak overlapping is particularly noticeable toward the end of a trace, which corresponds to the end of a fragment. Toward the end of a trace, peaks become wider and overlap, so that the difference between the apparent and intrinsic peak characteristics becomes significant. And it is toward the end of a trace where the currently used, prior art base calling systems fail to provide accurate results.
FIG. 1 (prior art) illustrates a sample electropherogram showing inflection points, apparent peaks and apparent peak positions as identified by Phred. Phred uses inflection points to detect peaks which are more accurately described as apparent peaks. A peak is defined as the concave part of an electropherogram, E, located between two inflection points. Inflection points i1 through i8 are used to identify peaks P1, P2, P3 and P4. Only the shaded areas in FIG. 1 are regarded as the peak area and are used in subsequent processing. The rest of the electropherogram, where the signal may still be high, is ignored by Phred, therefore ignoring this useful information. This may be a considerable loss if two or more peaks overlap, such as with peaks P2, P3 and P4.
The apparent peak characteristics are computed in Phred in the following way. Apparent peak area is the area below the apparent peak. Apparent peak position is the position in an electropherogram which bisects the apparent peak""s area and is shown as POS1, POS2, POS3 and POS4. Apparent peak height is the signal at the apparent peak position. Apparent peak spacing is the distance between the positions of one apparent peak and a previous apparent peak in an electropherogram.
Partially due to inadequate processing of traces and peaks, toward the end of a trace, the commonly used base calling programs, such as Phred and the ABI Base Caller, produce a considerable number of unknown base calls designated as N. This happens, for example, when two or more good peak candidates for base calling are found at a given position in the trace or, conversely, when no peaks are found near the position where a base call should be made. Because Ns are not present in the correct sequence, they may be regarded as mismatch errors. From a practical point of view, it is desirable to re-call Ns to a best guess base call and express the uncertainty associated with the base through an assigned quality value.
Another form of inadequate trace processing results in missed bases, or deletion errors, produced, particularly, by the ABI Base Caller towards the end of read. This may occur when two or more overlapped peaks are mistakenly identified and called as one. To avoid this type of error, Phred splits excessively wide peaks into narrower peaks. However, Phred""s peak splitting procedure may be inaccurate and may result in undesired consequences which include the splitting and subsequent calling of noise peaks, such as dye blobs, the peaks corresponding to a pure dye, rather than to any DNA subfragment.
Inadequate processing of peaks not only affects base calling, but may also result in assigning inaccurate quality values to bases. The apparent peak characteristics and apparent trace parameters that are used by prior art systems such as Phred to assign a quality value to a called peak depend significantly on the peak""s environment. That is, whether there are other peaks of the same dye color in a close vicinity of the current peak. Because the environment of an average peak may vary significantly from one dataset to another, quality values assigned to bases of one dataset based on a quality value calibration derived from another dataset may produce inaccurate results. Specifically, Phred, which is calibrated using the data produced by ABI PRISM(copyright) 377 DNA Sequencers, may underestimate quality values of bases generated by the ABI PRISM(copyright) 3700 DNA Analyzer.
Even if proper calibration is used, Phred usually assigns low quality values to unresolved bases. Unresolved bases are typically present near the end of a read. A base is unresolved if it is called as N or if for at least one of its neighboring bases, there is no point between two corresponding peaks at which the signal is less than the signal at each peak. Thus, with an improved base calling and peak processing procedure that includes the re-calling of Ns and resolving of unresolved bases, higher and more accurate quality values may be assigned to bases. This improved processing may also produce more useful data by increasing the number of bases with sufficiently high quality values, such as 20 or better.
The present invention provides a novel approach to peak processing using intrinsic peak characteristics in computations related to base calling and quality value calibration instead of the apparent characteristics that are used in prior art systems. The use of intrinsic characteristics allows for more accurate base calls and quality value assignments. The method of the present invention is based on the application of a mathematical model of peak electrophoresis to discern true peaks from noise and to process rows of overlapping peaks of a given color. The processing of the present invention includes three steps: peak detection, peak expansion and peak resolution. The first step employs inflection points to find individual peaks. The peak boundaries are then expanded by scanning the trace to the left from the left boundary of each peak and scanning the trace to the right from the right boundary of each peak. Finally, the groups of overlapping peaks are resolved into individual intrinsic peaks. For single peaks, the resolution step consists of simply fitting the data peak with the model. For multiple peaks, resolution requires a more sophisticated iterative procedure.
The mathematical model of peak electrophoresis is further used to discern true peaks from noise. The expected shape of a current peak is computed as an average shape of a group of previous good peaks that were refined according to the model. By comparing an expected shape of a current peak with the shape of a current peak, a determination is made whether the current peak is a true peak or is noise so that only true peaks are called.
The present invention also resolves wide peaks into descendent peaks more accurately than prior art systems. If the width of a current peak is greater than a system specified expected width, the peak will be resolved into two or more intrinsic peaks having model peak characteristics. Certain adjustable parameters of these descendent peaks are selected so as to fit the observed data. In this way, the sum of the signals of the multiple superimposed intrinsic peaks are approximately equal to the apparent signal of the original parent peak.
The present invention increases the usefulness of DNA data by calling bases which are unassigned and unassignable by prior art systems. By re-calling originally unknown bases, inserting new bases and resolving overlapping peaks, the useful read length of bases with high quality values is extended.