Many current next-generation sequencing (NGS) technologies use a form of sequencing by synthesis (SBS). NGS technologies can sequence millions of DNA templates in a massively parallel fashion. To attain high throughput, many millions of single-stranded templates are arrayed across a chip and the sequence of each template is read independently. Second-generation NGS platforms clonally amplify DNA templates on a solid support and then sequence them cyclically. Third-generation NGS platforms employ single-molecule, PCR-free protocols and cycle-free chemistry (Schadt et al., Hum Mol Genet., 19(R2):R227-40 (2010)).
Major limitations of NGS methods and other high-throughput sequencing methods include sequencing and amplification error and bias. Because of the error and bias associated with amplification and sequencing, these technologies deviate from the ideal uniform distribution of reads, which can impair many scientific and medical applications. For clinical applications, labs must verify the accuracy of a mutation or single nucleotide polymorphism (SNP) call before reporting it to a patient. Typically, sequence verification is done by making a Sanger library of the target after obtaining the sequences and “Sanger qualifying” the NGS results. To overcome the higher error rate of NGS platforms compared to traditional Sanger sequencing, a high level of redundancy, or sequence coverage, is required to call bases accurately. A 30-50× coverage is typically required for accurate base calling, although this can vary based on the accuracy of the sequencing platform, variant detection methods, and the material being sequenced (Koboldt D C et al., Brief Bioinform., 11:484-98 (2010)). In general, all second-generation platforms produce data of a similar accuracy (98-99.5%) and rely upon adequate sequence depth (e.g., coverage) to make higher-accuracy base calls.
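The relationship between per-read accuracy and coverage can be illustrated with a minimal Python sketch. It assumes a deliberately simplified model that is not taken from the cited references: every read covering a position is wrong independently with the same probability, and a consensus call fails only when more than half of the reads are wrong. The function name `majority_error_probability` is hypothetical. Real data include systematic errors and heterozygous sites that this toy model ignores, so it should not be read as a justification for any particular depth.

```python
from math import comb


def majority_error_probability(depth: int, per_read_error: float) -> float:
    """Probability that a strict majority of `depth` reads is wrong at one position.

    Assumption (illustrative only): read errors are independent and
    identically distributed, so the number of wrong reads is binomial.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not 0.0 <= per_read_error <= 1.0:
        raise ValueError("per_read_error must lie in [0, 1]")

    p = per_read_error
    min_wrong = depth // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(depth, k) * p**k * (1.0 - p) ** (depth - k)
        for k in range(min_wrong, depth + 1)
    )


if __name__ == "__main__":
    # 2% per-read error corresponds to the 98% end of the accuracy range above.
    for depth in (1, 5, 10, 30, 50):
        print(depth, majority_error_probability(depth, 0.02))
```

Under these assumptions the failure probability falls steeply as depth increases, which is the intuition behind using redundancy to compensate for a per-read accuracy of 98-99.5%.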
Sequencing bias can manifest as coverage bias (deviation from a uniform distribution of reads) and error bias (deviations from uniform mismatch, insertion, and deletion rates). Current sequencing technologies are limited because the chemistries used in high-throughput sequencing methods are inherently biased. Some nucleotide sequences are read more frequently than others, and reads carry an inherent error rate. Depending on many factors, including the sequencing platform used, read errors (most of which are misidentified bases due to low-quality base calls) can occur at rates ranging from one error per 100 bases to one error per 2000 bases. While coverage bias is an important sequencing metric, variations in sequence accuracy are also important.
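Both quantities in this paragraph can be put on simple numeric scales. The Python sketch below is illustrative and not part of the cited work. It converts an error rate to a Phred-style quality score using the standard definition Q = -10·log10(p), and it summarizes coverage unevenness as the coefficient of variation of per-position read depth. The function names and the choice of coefficient of variation as the unevenness measure are assumptions made for this example.

```python
import math
import statistics
from typing import Sequence


def phred_from_error_rate(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality score."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must lie in (0, 1]")
    return -10.0 * math.log10(error_rate)


def coverage_cv(depths: Sequence[int]) -> float:
    """Coefficient of variation of per-position depth (0.0 means perfectly even).

    Assumption: unevenness is summarized by population std dev / mean;
    other measures could be used equally well.
    """
    mean_depth = statistics.fmean(depths)
    if mean_depth == 0:
        raise ValueError("mean depth is zero; coverage unevenness is undefined")
    return statistics.pstdev(depths) / mean_depth


if __name__ == "__main__":
    # One error per 100 bases and one error per 2000 bases, as in the text.
    print(phred_from_error_rate(1 / 100))
    print(phred_from_error_rate(1 / 2000))
    # Hypothetical depths at five positions, for illustration only.
    print(coverage_cv([30, 28, 5, 31, 29]))
```

On this scale, the range of one error per 100 to 2000 bases corresponds to quality scores of roughly 20 to 33.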
Another major limitation is PCR amplification bias, because conditions during library construction of nucleotide templates for sequencing can significantly influence sequencing bias. PCR amplification for library construction has been shown to be a source of sequencing data error (Keohavong P et al., PNAS 86:9253-9257 (1989); Cariello et al., Nucleic Acids Res., 19:4193-4198 (1991); Cline et al., Nucleic Acids Res., 24:3546-3551 (1996)). Library construction methods can also affect evenness of coverage. For example, PCR amplification is a known source of under-coverage of GC-extreme regions during library construction (Aird et al., Genome Biol., 12:R18 (2011); Oyola et al., BMC Genomics, 13:1; 22 (2012); Benjamini et al., Nucleic Acids Res., 40:e72 (2012)). Similar biases may also be introduced during bridge PCR for cluster amplification, and on some NGS platforms strand-specific errors can lead to coverage biases by impairing aligner performance (Nakamura et al., Nucleic Acids Res., 39:e90 (2011)). Other platforms that utilize a terminator-free chemistry can be limited in their ability to accurately sequence long homopolymers, and can also be sensitive to coverage biases introduced by emulsion PCR in library construction (Rothberg et al., Nature, 475:348-352 (2011); Margulies et al., Nature, 437:376-380 (2005); Huse et al., Genome Biol., 8:R143 (2007); Merriman et al., Electrophoresis, 33:3397-3417 (2012)).