The biochemical processes used to build and maintain living organisms are controlled by chains of nucleic acids, such as deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). Each nucleic acid is made up of a sequence of nucleotides, each consisting of a sugar (e.g., deoxyribose, ribose), a nitrogen base and one or more phosphate groups. The free nucleotides used to build DNA carry a triphosphate group (abbreviated as dNTP, d=deoxyribose, N=nitrogen base, TP=triphosphate). The bases that make up DNA are adenine (A), cytosine (C), guanine (G) and thymine (T). RNA molecules have the base uracil (U) instead of thymine.
A molecule of DNA can exist as two nucleic acid strands linked together by hydrogen bonds between the bases of each strand to form a double-helical structure (double-stranded DNA (dsDNA)). The bases will only bind specifically to each other (adenine to thymine and cytosine to guanine) such that the strands of a dsDNA molecule are complementary. DNA also can exist as a single-stranded molecule (ssDNA), such as the DNA in the parvovirus. A molecule of RNA can be single-stranded (ssRNA), or in some organisms (e.g., rotavirus) it is double-stranded (dsRNA), with adenine binding to uracil.
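The pairing rule can be sketched in a few lines of Python. This is only an illustration: the dictionaries and the function name `complement` are hypothetical, and the pairings encoded are the standard ones (A with T, or with U in RNA, and C with G).

```python
# Standard base-pairing rules; names here are illustrative, not from any library.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}


def complement(strand, pairs=DNA_PAIRS):
    """Return the base-by-base complement of a strand.

    The result is written in the same left-to-right order as the input,
    so it is the partner strand read in the opposite chemical direction.
    """
    return "".join(pairs[base] for base in strand)


print(complement("ATGC"))             # TACG
print(complement("AUGC", RNA_PAIRS))  # UACG
```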
Determining the sequence of a nucleic acid strand is useful for a variety of research and commercial applications (e.g., basic science research, applied research, forensics, paternity testing, etc.). Thus, nucleic acid sequencing tools are some of the most important tools in biotechnology. One such tool is an automated fluorescent sequencer that sequences DNA by analyzing color signals emitted by fluorescently-labeled DNA fragments. Using the Sanger chain termination method, the DNA fragments are labeled with synthetic fluorescent nucleotides. Fluorescent nucleotides having different bases (A, C, G and T) are labeled with different fluorescent compounds so that each base emits a different color of light. The labeled fragments are then sorted by mass using polyacrylamide gel electrophoresis and the fluorescent signals emanating from the gel are detected. A software program (referred to as a base caller) identifies the base at a particular position in the sequence based on the color and intensity of the emissions.
Fluorescently labeled DNA fragments are produced using a polymerase chain reaction (PCR) performed with fluorescent dideoxynucleotides (ddNTPs: ddATP, ddCTP, ddGTP and ddTTP). PCR is a three-step technique (denaturation, primer annealing and extension) for copying (amplifying) DNA. In this technique, a dsDNA sequence under study is denatured to separate the two strands of the double-stranded DNA and incubated with DNA primers (synthesized DNA fragments), the four deoxyribonucleotide triphosphates (dNTPs: dATP, dCTP, dGTP and dTTP) and the polymerase enzyme. Since the primers will bind to a complementary sequence of DNA, the sequence of the primers is chosen to select for the particular sequence of DNA under study. The polymerase enzyme will extend the bound primers into complementary strands of the DNA under study using the dNTPs as substrates.
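As a rough illustration of how a primer selects its target, the sketch below looks for the place on a template strand where a primer would anneal, i.e., where the template contains the reverse complement of the primer. The function names and the example sequences are hypothetical, and real primer design also considers factors such as melting temperature and mismatches, which are ignored here.

```python
# Translation table for standard DNA base pairing.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_binding_site(template, primer):
    """Return the 0-based index on the template where the primer anneals.

    Both sequences are written 5'->3'. Only exact matches are found;
    -1 is returned when the primer has no complementary site.
    """
    return template.find(reverse_complement(primer))


template = "GGATCCTTAGCAATGC"  # made-up example sequence
primer = "GCATTG"              # reverse complement of "CAATGC"
print(primer_binding_site(template, primer))    # 10
print(primer_binding_site(template, "AAAAAA"))  # -1
```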
When enough of the target DNA fragment has been amplified through PCR, a final extension is performed using the four fluorescent ddNTPs. The four fluorescent ddNTPs are labeled with different fluorescent compounds and so each emits an identifying color. Since the ddNTPs do not have a 3'-hydroxyl group (3'-OH) on their sugar component to allow the next nucleotide to attach, the growing chain terminates once a ddNTP is incorporated. Because the length of the fragment depends on how soon the polymerase incorporated a ddNTP into the growing complementary strand (and blocked further growth of the strand), the resulting mixture contains DNA fragments of different lengths.
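Why chain termination yields a ladder of fragment lengths can be seen in a toy simulation. The sketch below is an assumption-laden model, not a description of real reaction kinetics: the function name, the number of molecules and the per-position termination probability `p_terminate` are all hypothetical.

```python
import random


def terminated_fragments(new_strand, n_molecules=1000, p_terminate=0.05, seed=0):
    """Simulate extension of many copies of the same strand.

    new_strand is the sequence the polymerase would synthesize if never
    interrupted. At each position a labeled ddNTP is incorporated with
    probability p_terminate, which ends that molecule. Returns a list of
    (fragment_length, terminal_base) pairs; molecules that never
    terminate carry no label and are omitted.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for length, base in enumerate(new_strand, start=1):
            if rng.random() < p_terminate:
                fragments.append((length, base))
                break
    return fragments


fragments = terminated_fragments("ATGCCGTA")
print(sorted({length for length, _ in fragments}))
```

Each fragment's label reports the base at its own last position, which is what lets the fragment lengths be read back as sequence positions.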
Polyacrylamide gel electrophoresis is used to sort the fragments by mass (i.e., length). To accomplish this, the mixture of fragments is placed in gel-filled capillaries and a voltage is applied across the capillaries so that the negatively charged DNA migrates through the gel, with shorter fragments moving faster than longer ones. After the fragments have sufficiently migrated through the gel, a laser is used to scan the gel in a particular order to excite the fluorescent molecules. A detector then detects the emissions and the raw data are corrected for known issues with the method (e.g., non-linear gel mobility effect) to produce a chromatogram.
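In idealized form, sorting by length and reading the terminal label of each fragment recovers the sequence. The following sketch assumes clean data in which every fragment of a given length carries the same label; the function name and the input format are hypothetical.

```python
def read_sequence(fragments):
    """Reconstruct a sequence from (length, terminal_label) pairs.

    Fragments are grouped by length, standing in for separation on the
    gel, and the label seen at each length is read in order from the
    shortest fragment to the longest.
    """
    by_length = {}
    for length, label in fragments:
        # Keep the first label seen for each length (idealized, noise-free data).
        by_length.setdefault(length, label)
    return "".join(by_length[n] for n in sorted(by_length))


fragments = [(3, "G"), (1, "A"), (2, "T"), (4, "C"), (2, "T")]
print(read_sequence(fragments))  # ATGC
```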
A base caller algorithm determines the sequence by analyzing the color, intensity and time (which corresponds to position) of the emissions. A schematic example of a chromatogram (traces) is shown in FIG. 1 and examples of raw traces and a processed chromatogram are shown in FIG. 2. In FIG. 1, different colors are represented by different types of lines (solid or broken, and of differing thickness), the y-axis indicates the intensity of the emission and the x-axis indicates the position of the base in the sequence.
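A deliberately simplified base caller can be written as follows. It assumes the peak positions are already known and simply picks the color channel with the highest intensity at each one; the function name, the trace layout and the numbers are made up for illustration, and production base callers do considerably more processing.

```python
def call_bases(traces, peak_positions):
    """Call one base per peak position.

    traces maps each base letter to a list of intensities sampled over
    time (one list per color channel). At every peak position the base
    whose channel is strongest is chosen.
    """
    return "".join(
        max(traces, key=lambda base: traces[base][i]) for i in peak_positions
    )


# Toy four-channel chromatogram with peaks at samples 1, 3 and 5.
traces = {
    "A": [0, 9, 1, 0, 0, 0, 0],
    "C": [0, 0, 0, 8, 1, 0, 0],
    "G": [0, 1, 0, 0, 0, 7, 0],
    "T": [0, 0, 0, 1, 0, 0, 0],
}
print(call_bases(traces, [1, 3, 5]))  # ACG
```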
The algorithms employed by base callers are imperfect. For instance, a base caller will assign the base to a sequence position by determining the emission with the largest amplitude at a given position (the peak). However, sampling errors due to the low sampling rate used to obtain the data can occur, leading the base caller to rely on a data point that is not a true peak (as shown in FIG. 3). In FIG. 3, an “x” represents a sampled data point and circles represent the data point chosen by the base caller to call the base.
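The effect of a low sampling rate can be reproduced numerically. The sketch below models a single emission peak as a Gaussian, which is an assumption made only for illustration, and compares the largest sampled value at a coarse and a fine sampling step. All names and parameter values are hypothetical.

```python
import math


def gaussian_peak(t, center=10.4, height=1.0, width=1.0):
    """Idealized emission peak; the Gaussian shape is an assumption."""
    return height * math.exp(-((t - center) ** 2) / (2 * width ** 2))


def sampled_maximum(step, start=0.0, stop=20.0):
    """Sample the peak every `step` time units and return (time, value)
    of the largest sample, which is what a naive base caller would use."""
    n = round((stop - start) / step) + 1
    times = [start + k * step for k in range(n)]
    return max(((t, gaussian_peak(t)) for t in times), key=lambda p: p[1])


t_coarse, y_coarse = sampled_maximum(step=2.0)
t_fine, y_fine = sampled_maximum(step=0.1)
print(f"coarse: t={t_coarse:.1f}, intensity={y_coarse:.3f}")
print(f"fine:   t={t_fine:.1f}, intensity={y_fine:.3f}")
```

With the coarse step, the largest sample falls at t=10.0 rather than at the true center of 10.4, and its intensity is below the true peak height, so both the position and the amplitude used for the call are off.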
Moreover, the base caller algorithm assumes that the sample being sequenced contains only a single version of the nucleic acid of interest. However, if the nucleic acid sample under study is, for instance, a DNA sequence taken from a population of organisms having polymorphic sequences (i.e., a genetic locus that varies in content across a population of organisms), the sample likely will contain multiple variants of the gene. Chromatograms of such mixed samples will show complex patterns reflecting combinations of alleles. This reduces the accuracy of the sequencing, often leading to the data being thrown out and the experiment repeated. Thus, existing base callers function best when a sample contains only a single sequence of DNA.
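To illustrate what a mixed sample looks like in the data, the sketch below flags positions where a second color channel is nearly as strong as the first. It is not a method taken from any particular base caller: the function name, the input format and the 0.3 ratio threshold are assumptions chosen for the example.

```python
def flag_mixed_positions(peak_intensities, ratio=0.3):
    """Return indices of positions that look like a mixture of bases.

    peak_intensities is a list with one dict per sequence position,
    mapping each base to its peak intensity. A position is flagged when
    the second-strongest channel reaches at least `ratio` times the
    strongest one (the threshold is an arbitrary illustrative value).
    """
    flagged = []
    for pos, intensities in enumerate(peak_intensities):
        ranked = sorted(intensities.values(), reverse=True)
        if ranked[0] > 0 and ranked[1] / ranked[0] >= ratio:
            flagged.append(pos)
    return flagged


peaks = [
    {"A": 100, "C": 3, "G": 2, "T": 1},  # clean A
    {"A": 55, "C": 2, "G": 50, "T": 1},  # A and G overlap: two variants present
    {"A": 1, "C": 2, "G": 3, "T": 90},   # clean T
]
print(flag_mixed_positions(peaks))  # [1]
```

A base caller that always reports the single strongest channel would call position 1 as A and silently discard the evidence for G.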