Sanger sequencing is a widely used chemical process for DNA sequencing. In Sanger sequencing, a single strand of DNA is replicated using a chain-termination method, which typically involves a reaction of the single-stranded DNA with a DNA primer and DNA polymerase that together perform DNA replication. Fluorescently labeled nucleotides specific to each of the four nucleotide base types, adenine, cytosine, guanine and thymine (A, C, G, T), may be included in the reactions. The DNA sample may be divided into four separate sequencing reactions, each containing the four standard deoxynucleotides (dATP, dGTP, dCTP and dTTP) and the DNA polymerase. To each reaction is added one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP). These dideoxynucleotides are chain-terminating nucleotides: when one is incorporated into a growing strand, replication of that strand terminates, resulting in strand fragments of different lengths.
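As an illustration only, the outcome of the four reactions can be sketched in a few lines of Python. The sketch assumes an idealized case in which termination occurs at every possible position and primer length is ignored; the function name chain_termination_fragments and the example strand are hypothetical and do not come from any sequencing library.

```python
def chain_termination_fragments(new_strand):
    """Map each dideoxynucleotide reaction (A, C, G, T) to the fragment
    lengths it can produce for the given newly synthesized strand.

    Idealized assumption: a fragment of length n appears in reaction X
    whenever the n-th base of the synthesized strand is X.
    """
    lengths = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(new_strand, start=1):
        lengths[base].append(position)
    return lengths


# Hypothetical 7-base synthesized strand.
fragments = chain_termination_fragments("GATTACA")
print(fragments)
```

Each fragment length appears in exactly one of the four reactions, which is what later allows the terminal base at each position to be identified.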
The resulting fragments may be electrophoretically separated by length through gels or capillaries, with shorter fragments traveling faster than longer ones. As a result, the fragments move through the gel or capillary in order from shortest length to longest length. A laser or other excitation source may be positioned proximate to the capillary to excite the fluorescently labeled nucleotides. Optical detection equipment may likewise be positioned to detect the fluorescence from the excited chain-terminating nucleotides to categorize the nucleotides into the four base types. The optical detection equipment may capture the fluorescence emitted by the excited nucleotides as an image, such as a chromatogram. Because the fragments are ordered by length and pass the optical equipment in sequence from shortest fragments to longest fragments, the order of the base types of the DNA sequence is encoded as a function of time. Each of the four DNA synthesis reactions is run in one of four individual “lanes” corresponding to the four base types A, C, G and T, and the pattern of fluorescence of the excited nucleotides may be captured as an image and recorded.
There are many techniques using gel electrophoresis, capillary electrophoresis and other methods that are suitable for obtaining an image corresponding to a DNA sequence. For example, some techniques include adding four different dyes associated with respective ones of the four base types into a single reaction or chemical sequencing process. It should be appreciated that the base-calling techniques described herein may be used with any suitable method of obtaining one or more DNA sequence images, as the aspects of the invention are not limited to any particular chemical sequencing process or method of image acquisition.
FIG. 1 illustrates a schematic of an image captured using the above-described process. In image 100, the dark bands correspond to fragments of different lengths. A dark band in a lane indicates a fragment that is the result of chain termination after incorporation of the respective one of the chain-terminating nucleotides (ddATP, ddGTP, ddCTP, or ddTTP). The terminal nucleotide base can be identified according to which dideoxynucleotide was added in the reaction giving that band. The relative positions of the different bands among the four lanes are then used to read (from bottom to top) the DNA sequence as indicated. It should be appreciated that image 100 is schematic and the dark bands in actual images will vary in intensity.
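The bottom-to-top reading can be sketched as a merge of the band positions in the four lanes. This is a minimal illustration that assumes each band has already been reduced to a fragment length, that the shortest fragments sit at the bottom of the image, and that no two bands share a position; the function name read_lanes and the lane contents are hypothetical.

```python
def read_lanes(lanes):
    """Read a sequence from four lanes of bands.

    lanes maps a base type ("A", "C", "G", "T") to the fragment lengths
    of the bands seen in that lane. Reading bottom to top corresponds
    to visiting the bands from shortest to longest fragment.
    """
    bands = [(length, base)
             for base, lengths in lanes.items()
             for length in lengths]
    bands.sort()  # shortest fragment (bottom of the image) first
    return "".join(base for _, base in bands)


# Hypothetical band positions for a 7-base read.
sequence = read_lanes({"A": [2, 5, 7], "C": [6], "G": [1], "T": [3, 4]})
print(sequence)
```

The sequence obtained this way is that of the newly synthesized strand, since each band is labeled by the terminating nucleotide that was incorporated into it.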
The process of extracting the DNA sequence from the image is referred to herein as “base-calling.” Manual base-calling is tedious, time-consuming, and prone to human error. To expedite this process, many automatic methods have been developed to process the images and extract the sequence of bases. For example, Phred is a widely used algorithm for base-calling a single sequence captured in various standard image formats. Parametric deconvolution, Kalman prediction with dynamic programming, and Markov chain Monte Carlo methods have all been used for single-strand base-calling.
As discussed above, the intensity of the dark bands varies over the domain of the band. In general, the shape of the intensity variation (which corresponds to concentration) ramps up exponentially to a peak value and then decays in a similar fashion. Accordingly, processing an x-ray or gel image may include extracting four time-varying signals, one for each of the four base types. Signal 150 shown in FIG. 1 illustrates the four time-varying signals together, with the different base types (identified by the “lane” position of the corresponding band) denoted using different line patterns. From the time-varying signals, the DNA sequence may be determined. Each of the time-varying signals extracted from an image is referred to herein as a “trace.”
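To make the relationship between traces and sequence concrete, the following sketch builds four idealized traces from two-sided exponential pulses and recovers the sequence by simple peak picking. It is not Phred or any of the methods named above, and it only works on clean, well-separated pulses. The helper names (pulse, make_traces, call_bases), the pulse width, the peak spacing, and the threshold are all assumptions chosen for illustration.

```python
import math


def pulse(center, n_samples, width=1.5):
    """Two-sided exponential pulse: ramps up to 1.0 at center, then decays."""
    return [math.exp(-abs(t - center) / width) for t in range(n_samples)]


def make_traces(sequence, spacing=10):
    """Build one idealized trace per base type, with evenly spaced peaks."""
    n_samples = spacing * (len(sequence) + 1)
    traces = {base: [0.0] * n_samples for base in "ACGT"}
    for index, base in enumerate(sequence, start=1):
        added = pulse(index * spacing, n_samples)
        traces[base] = [a + b for a, b in zip(traces[base], added)]
    return traces


def call_bases(traces, threshold=0.5):
    """Naive base-calling: collect local maxima above threshold in each
    trace, then order them by time."""
    peaks = []
    for base, signal in traces.items():
        for t in range(1, len(signal) - 1):
            is_peak = signal[t - 1] < signal[t] >= signal[t + 1]
            if is_peak and signal[t] >= threshold:
                peaks.append((t, base))
    peaks.sort()  # earliest peak (shortest fragment) first
    return "".join(base for _, base in peaks)


called = call_bases(make_traces("GATTACA"))
print(called)
```

Because the peaks are ordered by arrival time, the ordering step is the same shortest-to-longest merge used when reading bands from an image; the only added work is locating the peak of each pulse.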
It should be appreciated that time-varying signal 150 is an idealized extraction. Actual signals are typically degraded by base-line noise, amplitude variation, increasing pulse widths that deteriorate peak resolution, jitter in peak spacing that contributes to inter-symbol interference (ISI), etc.