In genetics, the term sequencing may refer to methods for determining the primary structure or sequence of a biopolymer, including a nucleic acid (e.g., DNA or RNA). More specifically, DNA sequencing is the process of determining the order of nucleotide bases (adenine, guanine, cytosine and thymine) in a given DNA fragment. Such sequencing methods commonly include calling a base at a position in a nucleic acid, where the called base is used to determine a sequence for the nucleic acid.
Sequencing of target nucleic acids, for example, typically includes extracting target nucleic acids from a sample and fragmenting them. The fragmented nucleic acids are used to produce target nucleic acid templates that will generally include one or more adapters. The target nucleic acid templates may be subjected to amplification methods, such as bridge amplification to provide a cluster, or rolling circle replication to provide a nucleic acid “nanoball.” Sequencing is then performed on single-stranded nucleic acids, e.g., by sequencing by synthesis or by ligation techniques, including combinatorial probe anchor ligation (cPAL).
An intensity value (e.g., a fluorescence signal) corresponding to a base that is incorporated into a nucleic acid at a particular position can indicate the base at that position. Examples of incorporation are synthesis, ligation, and hybridization. For example, four different types of fluorescence may be used, corresponding to the four types of bases to be identified. The nucleic acids are amenable to relatively inexpensive and efficient imaging techniques in which the nucleic acids are captured in four color images, one for each type of fluorescence used. The four images can then be processed through software to extract intensity information.
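The extraction of intensity information from the four images can be illustrated with a minimal sketch. It assumes, purely for illustration, that each template maps to exactly one pixel, that one image exists per base, and that the channel order is A, C, G, T; the function name `extract_intensities`, the data layout, and the pixel values are hypothetical and are not taken from any particular sequencing platform.

```python
# Illustrative sketch only: names, data layout, and values are assumptions.
BASES = "ACGT"  # assumed channel order, one image per base


def extract_intensities(images, positions):
    """Return one [A, C, G, T] intensity vector per template.

    images    -- dict mapping each base to a 2D list of pixel values
    positions -- list of (row, col) pixel coordinates, one per template
                 (assumes exactly one pixel per template)
    """
    return [[images[base][row][col] for base in BASES]
            for (row, col) in positions]


# Four tiny 2x2 "images", one per fluorescence type (made-up values).
images = {
    "A": [[0.9, 0.1], [0.0, 0.2]],
    "C": [[0.1, 0.8], [0.1, 0.1]],
    "G": [[0.0, 0.1], [0.7, 0.2]],
    "T": [[0.1, 0.0], [0.1, 0.9]],
}
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]

intensities = extract_intensities(images, positions)
for pos, vec in zip(positions, intensities):
    print(pos, vec)
```

In practice a template may span several pixels, or several templates may share one pixel, so a real extraction step would need to aggregate or separate signals rather than read a single value.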
As mentioned above, the intensity values (signals) can be used to call a base at a position of the nucleic acid, i.e., to perform basecalling. The intensity value for a target nucleic acid template can correspond to one pixel or to multiple pixels of an image, or a single pixel can cover more than one template. Regardless, an intensity value for each of the four bases can be assigned to a template. Naively, one can call the base corresponding to the maximum intensity value, but this approach has a high error rate. For example, the determined intensity value can be incorrect due to optical effects (e.g., spectral overlap among the various intensity signals) and spatial effects (e.g., when multiple templates correspond to a single pixel). Additionally, the biochemistry of the sequencing process can cause artifacts, and the intensity signals can vary significantly from one position to another, from one template to another (e.g., due to differences in amplification between templates), and from sample to sample.
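The naive maximum-intensity call, and one way it can fail, can be sketched as follows. The sketch assumes a simple linear mixing model in which the observed channels are a crosstalk matrix applied to the template's own signal plus a weighted contribution from a neighboring template sharing the pixel. The matrix entries, the neighbor weight, and the function names are hypothetical values chosen to make the failure visible; they are not measured quantities.

```python
# Illustrative sketch only: the mixing model and all numbers are assumptions.
BASES = "ACGT"  # assumed channel order


def call_base_naive(intensity):
    """Call the base whose channel has the maximum intensity."""
    best = max(range(len(BASES)), key=lambda i: intensity[i])
    return BASES[best]


def observe(own, neighbor, neighbor_weight, crosstalk):
    """Simulate observed channel intensities.

    mixed[j]    = own[j] + neighbor_weight * neighbor[j]   (spatial effect)
    observed[i] = sum_j crosstalk[i][j] * mixed[j]          (optical effect)
    """
    mixed = [o + neighbor_weight * n for o, n in zip(own, neighbor)]
    return [sum(crosstalk[i][j] * mixed[j] for j in range(4))
            for i in range(4)]


# Hypothetical crosstalk: the A channel also responds to the C dye.
CROSSTALK = [
    [1.0, 0.6, 0.0, 0.0],  # A channel
    [0.0, 1.0, 0.0, 0.0],  # C channel
    [0.0, 0.0, 1.0, 0.0],  # G channel
    [0.0, 0.0, 0.0, 1.0],  # T channel
]

own = [0.0, 1.0, 0.0, 0.0]       # the template's true base is C
neighbor = [1.0, 0.0, 0.0, 0.0]  # a neighboring template emitting A

observed = observe(own, neighbor, 0.5, CROSSTALK)
print("clean call:   ", call_base_naive(own))
print("observed call:", call_base_naive(observed))
```

With these assumed numbers, the A channel reading exceeds the C channel reading even though the template's true base is C, so the maximum-intensity rule returns the wrong base. Variation in signal strength between templates, positions, and samples is not modeled here.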
Accordingly, it would be desirable to provide improved methods and systems for making base calls.