The availability of large amounts of DNA sequence information has begun to influence the practice of biology. With the current output of large-scale sequencing, however, existing analysis methods cannot keep pace with the burgeoning data. To meet this growing demand, improved automation is needed, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect requires both more accurate data processing software and reliable accuracy measures, which reduce the need for human involvement in error correction and make human review more efficient.
At present, DNA sequencing is typically performed using the enzymatic dideoxy chain-termination method of Sanger (Sanger et al. 1977, Proc. Natl. Acad. Sci. 74:5463-5467) in automated sequencers such as the Applied Biosystems (ABI, Norwalk, Conn.) 3730xl DNA Analyzer, the 3730 DNA Analyzer, the ABI PRISM 3100 Genetic Analyzer, the 3100-Avant Capillary DNA sequencer/genotyper, or the 310 Capillary DNA Sequencer/Genotyper. Such sequencers can produce sequence data listing more than one thousand bases. One starts with a DNA template of interest and an oligonucleotide primer complementary to a specific site on the template strand. For each of the four bases (A, G, C, T), a reaction is carried out in which DNA polymerase synthesizes a population of labeled single-stranded fragments of varying lengths, each of which is complementary to a segment of the template strand and extends from the primer to an occurrence of that base. These fragments are then separated according to length by gel electrophoresis, whereupon their relative sizes, together with the identity of the final base of each fragment, allow the base sequence of the template to be inferred.
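The inference step described above can be illustrated with a minimal Python sketch. It assumes an idealized reaction in which exactly one fragment terminates at every position past the primer; the function names and the example strand are hypothetical and chosen only for illustration.

```python
def sanger_fragments(new_strand, primer_len):
    """Return (length, terminal_base) for every chain-terminated fragment.

    new_strand is the strand synthesized by the polymerase, primer included.
    Idealized assumption: one fragment ends at each position past the primer.
    """
    return [(i + 1, new_strand[i]) for i in range(primer_len, len(new_strand))]


def infer_sequence(fragments):
    """Sort fragments by length and read off each terminal base in order."""
    return "".join(base for _, base in sorted(fragments))


strand = "GATTACAGGCTA"            # hypothetical synthesized strand
frags = sanger_fragments(strand, primer_len=4)
frags.reverse()                    # input order is irrelevant: the gel sorts by length
read = infer_sequence(frags)       # bases that follow the 4-base primer
print(read)
```

Because the fragments are complementary to the template, the read obtained this way is the complement of the template strand rather than the template itself.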
In automated sequencing (Smith et al. 1986, Nature 321:674-679), the fragments are labeled with fluorescent dyes attached either to the primer (dye-primer chemistry) or to the dideoxy chain-terminating nucleotide (dye-terminator chemistry) (Prober et al. 1987, Science 238:336-341). Typically, a different dye is used for each of the four reactions, so that they can be combined and run in a single gel lane (in the case of dye-terminator chemistry, this also allows all four reactions to be carried out in a single tube). For example, one such application employs laser excitation and a cooled CCD (charge-coupled device) detector (Kostichka and Smith, U.S. Pat. No. 5,162,654) for the parallel detection of four fluorescently labeled DNA sequencing reactions during their electrophoretic separation in ultrathin (50-100 microns) denaturing polyacrylamide gels (Kostichka et al., Bio/Technology 10:78-81 (1992)). Weiss et al. (U.S. Pat. No. 5,470,710) describe another fluorescence-based sequencing application, using an enzyme-linked fluorescence method for the detection of nucleic acid molecules. See also U.S. Pat. No. 6,596,140, which is directed to a multi-channel capillary electrophoresis device and method.
Typically, multiple templates (e.g., 36 or more at a time) are analyzed in separate lanes on the same gel. At the bottom of the gel, a laser excites the fluorescent dyes in the fragments as they pass, and detectors collect the emission intensities at four different wavelengths. The laser and detectors scan the bottom of the gel continuously during electrophoresis in order to build a gel image in which each lane has a ladder-like pattern of bands of four different colors, each band corresponding to the fragments of a particular length.
Computer analysis is then used to convert the gel image to an inferred base sequence (or read) for each template. Typically, this analysis consists of four distinct steps: lane tracking, in which the gel lane boundaries are identified; lane profiling, in which each of the four signals is summed across the lane width to create a profile, or set of “traces”, providing a set of four arrays indicating signal intensities at several thousand uniformly spaced time points during the gel run; trace processing, in which signal processing methods are used to deconvolve and smooth the signal estimates, reduce noise, and correct for dye effects on fragment mobility and for long-range electrophoretic trends; and base-calling, in which the processed traces are translated into a sequence of bases.
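The lane profiling step can be sketched as follows. This is a minimal Python illustration that assumes the gel image is available as one two-dimensional intensity array per dye channel, indexed by scan time and horizontal pixel, and that lane tracking has already supplied the lane boundaries. The data layout and function names are assumptions made for the example, not a description of any particular instrument's software.

```python
def lane_profile(channel_image, lane_start, lane_end):
    """Sum one dye channel across the lane width at every time point.

    channel_image[t][x] is the intensity at scan time t and pixel column x;
    columns lane_start .. lane_end - 1 are taken to belong to the lane.
    """
    return [sum(row[lane_start:lane_end]) for row in channel_image]


def profile_lane(gel_image, lane_start, lane_end):
    """Build the set of four traces (one per base) for a single lane."""
    return {base: lane_profile(image, lane_start, lane_end)
            for base, image in gel_image.items()}


# Toy gel image: 3 time points, 4 pixel columns, lane occupies columns 1-2.
gel_image = {
    "A": [[0, 1, 2, 0], [0, 5, 6, 0], [0, 1, 1, 0]],
    "C": [[0, 0, 1, 0], [0, 1, 0, 0], [0, 7, 8, 0]],
    "G": [[0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]],
    "T": [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]],
}
traces = profile_lane(gel_image, lane_start=1, lane_end=3)
print(traces)
```

The result is four arrays of equal length, one per base, which are the raw traces passed on to trace processing.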
As used herein, the term “trace” refers to a time-resolved separation pattern obtained by chromatography for a particular compound, such as a nucleotide. This separation pattern is characterized by a plurality of datapoints in which each respective datapoint in the plurality of datapoints represents a signal amplitude at a position in the separation pattern corresponding to the respective datapoint. The value of a given datapoint is determined by a function of an amount of the compound corresponding to the trace that is sensed by the detector at the point in time represented by the datapoint. In typical nucleic acid sequences, for example, the abundance of a base represented by a trace at each datapoint will vary. Datapoints in which the compound represented by the trace is not present will typically be assigned relatively small signal amplitudes. Conversely, datapoints in which the compound represented by the trace is present will typically be assigned relatively large signal amplitudes. Thus, a pattern of datapoints having relatively small amplitudes and datapoints having relatively large amplitudes gives rise to “peaks” in the trace. In some embodiments, a trace has more than five datapoints, more than 100 datapoints, or more than 1000 datapoints. In some embodiments, a trace has between two and 100,000 or more datapoints.
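To make the notion of a peak concrete, the following minimal Python sketch treats a trace as a list of signal amplitudes and reports each local maximum whose amplitude is relatively large. The threshold `min_amplitude` and the example values are assumptions for illustration; practical peak detection would operate on processed traces and use more robust criteria.

```python
def find_peaks(trace, min_amplitude):
    """Return indices of local maxima whose amplitude is at least min_amplitude.

    A datapoint counts as a peak when it is strictly higher than its left
    neighbor and at least as high as its right neighbor, so a flat-topped
    peak is reported once, at its leftmost datapoint.
    """
    peaks = []
    for i in range(1, len(trace) - 1):
        if trace[i] >= min_amplitude and trace[i - 1] < trace[i] >= trace[i + 1]:
            peaks.append(i)
    return peaks


# Hypothetical trace for one base: two large peaks on a low baseline.
trace = [2, 3, 40, 90, 45, 4, 1, 3, 55, 120, 60, 5, 2]
print(find_peaks(trace, min_amplitude=30))
```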
Processed traces for nucleic acid sequences are usually displayed in the form of chromatograms consisting of four curves of different colors, each curve representing the signal for one of the four bases and drawn left to right in the direction of increasing time to detection (increasing fragment size). An idealized trace consists of evenly spaced, nonoverlapping peaks, each corresponding to the labeled fragments that terminate at a particular base in the sequenced strand. Thus, for nucleic acids, there will be four traces, with each trace representing a unique nucleotide. Real traces deviate from this ideal for a variety of reasons, including possible imperfections in the sequencing reactions, gel electrophoresis and trace processing. Due to the anomalous migration of very short fragments (caused by relatively greater effects of the dye and specific base sequence on mobility) and unreacted dye-primer or dye-terminator molecules, the first fifty or so peaks of a trace are often noisy and unevenly spaced. Toward the end of the trace, the peaks become progressively less evenly spaced as a result of less accurate trace processing, less resolved as diffusion effects increase and the relative mass difference between successive fragments decreases, and more difficult to distinguish from noise as the number of labeled fragment molecules of a given size decreases. In particular, poorly resolved peaks for the same base may yield a single broad, often lumpy peak.
In better resolved regions of the trace, the most commonly seen electrophoretic anomalies are compressions (Sanger and Coulson 1975, J. Mol. Biol. 94:441-448; Sanger et al., 1977, Proc. Natl. Acad. Sci. 74:5463-5467), which occur when bases near the end of a single-stranded fragment bind to a complementary upstream region, creating a hairpin-like structure that migrates through the gel more rapidly than expected from its length, thus causing a peak to be shifted away from its expected position. This can result in one peak being beneath another, or in two successive peaks for the same base being merged into one. Dye-terminator chemistry appears to resolve most compressions (Lee et al. 1992, Nucleic Acids Res. 20:2471-2483) but this chemistry has its own data quality problems caused by reduced polymerase affinity for the dye-labeled terminal nucleotide.
The goal of base-calling software is to produce a sequence as accurate as possible in the face of the above data problems. As used herein, the term “base-calling” refers to the process of determining the identity of a nucleotide base in a nucleic acid sequence.
Some of the earliest base-calling software was part of the processing software installed on the first ABI sequencing machines (Connell et al., 1987, BioTechniques 5:342-348). That ABI software is often used as a benchmark by which other methods are judged. Although full algorithmic details have not been published, according to an ABI description of its base-calling software (ABI 1996), the program uses mobility curves to predict the peak spacing, and identifies the most likely peak in intervals of the nominal peak spacing, assigning an N in the absence of a good choice. Subsequently, the ABI software adds and removes bases using a criterion involving the uniformity of peak spacing.
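The general idea of selecting the most likely peak in each interval of the nominal peak spacing, and assigning an N when there is no good choice, can be sketched in Python as shown below. This is not the ABI algorithm, whose details are unpublished. The sketch assumes a constant spacing in place of mobility curves, omits the later step of adding and removing bases, and uses two hypothetical criteria (`min_amplitude` and `min_ratio`) to decide whether a choice is "good".

```python
def call_bases(traces, spacing, min_amplitude=30, min_ratio=2.0):
    """Call one base per interval of `spacing` datapoints.

    traces maps each base to its list of amplitudes (all the same length).
    In each interval the base with the strongest signal is called, unless
    that signal is weak or not clearly above the runner-up, in which case
    an N is assigned.
    """
    n = len(next(iter(traces.values())))
    calls = []
    for start in range(0, n - spacing + 1, spacing):
        # Strongest signal of each base within this interval.
        heights = {base: max(t[start:start + spacing]) for base, t in traces.items()}
        ranked = sorted(heights, key=heights.get, reverse=True)
        best, runner_up = ranked[0], ranked[1]
        clear = (heights[best] >= min_amplitude
                 and heights[best] >= min_ratio * heights[runner_up])
        calls.append(best if clear else "N")
    return "".join(calls)


# Hypothetical processed traces: three intervals of four datapoints each.
example_traces = {
    "A": [0, 80, 10, 0, 0, 5, 3, 0, 0, 40, 5, 0],
    "C": [0, 5, 2, 0, 0, 90, 8, 0, 0, 45, 4, 0],
    "G": [0, 3, 1, 0, 0, 4, 2, 0, 0, 2, 1, 0],
    "T": [0, 2, 1, 0, 0, 6, 1, 0, 0, 3, 2, 0],
}
called = call_bases(example_traces, spacing=4)
print(called)
```

In the third interval the A and C signals are of comparable height, so neither is a clear choice and the sketch assigns an N.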
The advent of high volume sequencing has prompted development of other base-calling software programs (Giddings et al., 1993, Nucleic Acids Res 21:4530-4540; Golden et al., 1993, in Proceedings of the First International Conference on Intelligent Systems For Molecular Biology, Hunter et al. eds, pp. 136-144, AAAI Press, Menlo Park, Calif.; Golden et al., 1995 in Evolutionary programming IV. Proceedings of the Fourth Annual Conference on Evolutionary Programming, 579-601; Berno, 1996, Genome Research 6:80-91). These programs all perform multiple gel image processing steps including base-calling, and have the merit of allowing efficient centralized processing of the data on a computer independent of the sequencing machine. However, none of these software programs identifies heterozygous peaks in traces in a satisfactory manner.
The ABI base-calling software that is available for ABI sequencers can reliably identify approximately half of the available bases. The remaining half typically contains a large number of errors that are either discarded or must be manually corrected by an operator. Work has been done to try to improve upon the accuracy of the ABI base-calling software. For example, Ewing et al., 1998, Genome Res. 8:175-185 describe a base-calling program named phred for automated sequencer traces that achieves a lower error rate than the ABI software. Phred averages 40 to 50 percent fewer errors than the ABI software in the test data sets examined, independent of position in read, machine running conditions, or sequencing chemistry.
Although the above-identified software programs represent important accomplishments in their own right, they do not satisfactorily base-call heterozygous nucleic acid samples. Each sequencing trace taken from a heterozygous DNA sample (e.g., a human DNA sample) is the product of a sequencing reaction on two physical chromosomes, the maternally derived chromosome and the corresponding paternally derived chromosome. For example, consider the case in which a region of human chromosome IV is to be sequenced. Primers prepared for this sequencing reaction bind to both the maternally derived chromosome IV and the paternally derived chromosome IV. Thus, a mixture of the nucleic acid sequence from the maternally derived and the paternally derived chromosome IV is sequenced. Unlike the case of DNA samples from controlled mouse crosses (where both chromosomes are the same because the parents are inbred), the maternally derived human chromosome IV and the paternally derived human chromosome IV have points where they differ. That is, there are many heterozygous points (base positions) in the chromosomes where there are different alleles at corresponding positions on the maternally derived and paternally derived chromosomes. At each point where there is heterozygosity between corresponding base positions in the maternally derived and paternally derived chromosomes, two peaks will arise in the trace, one for each of the nucleotides. In humans, approximately one in every 500 to 1000 bases has this heterozygosity. Conventional base-calling software does not satisfactorily identify such double peaks. Rather, such peaks are typically designated as “not read.”
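The double-peak situation can be illustrated with a minimal Python sketch that examines the four peak heights at one base position. It is a hypothetical illustration only: the thresholds `min_amplitude` and `het_ratio` are assumed values, and reporting the pair of bases with a standard IUPAC ambiguity code is a convention chosen for the example rather than something prescribed above.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_position(heights, min_amplitude=30, het_ratio=0.5):
    """Call one position from the peak heights of the four bases.

    Returns a single base when one peak dominates, an IUPAC code when the
    second-highest peak is at least het_ratio times the highest (a double
    peak), and N when even the highest peak is below min_amplitude.
    """
    ranked = sorted(heights, key=heights.get, reverse=True)
    first, second = ranked[0], ranked[1]
    if heights[first] < min_amplitude:
        return "N"
    if heights[second] >= het_ratio * heights[first]:
        return IUPAC[frozenset((first, second))]
    return first


print(call_position({"A": 100, "C": 4, "G": 3, "T": 2}))   # single dominant peak
print(call_position({"A": 60, "C": 3, "G": 55, "T": 2}))   # A and G peaks together
```

At the rate stated above, one heterozygous position per 500 to 1000 bases, a read of about one thousand bases would be expected to contain roughly one or two such double peaks.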
The base-calling program TraceTuner (Paracel, Pasadena, Calif.) does have some capability for detecting and recognizing heterozygous bases. However, the heterozygous base recognition algorithms of TraceTuner are unsatisfactory because they require manual intervention. Accordingly, the art still needs improved systems and methods for automatically recognizing heterozygous base pairs in heterozygous nucleic acid samples.