High-throughput sequencing (next-generation sequencing, NGS) has expanded from its early use in de novo sequencing into application areas that require high sensitivity and high accuracy. It shows great promise in the detection of low-frequency mutations, e.g., in formalin-fixed, paraffin-embedded (FFPE) DNA or in circulating tumor DNA (ctDNA) in plasma for non-invasive early diagnosis; in RNA expression analysis; and in the detection of low-copy-number targets (e.g., pathogens, drug-resistant variants, etc.) through deep sequencing. PCR is an intrinsic part of NGS technology and has been incorporated into almost all sample preparation methods for NGS. Thermostable DNA polymerases have been engineered and used as key enzymes in the sequencing chemistry. One of the major problems of high-sensitivity sequencing is the presence of a vast number of random errors. These errors are produced in PCR during sample preparation, in hybridization capture through chemical modification of bases, or in the sequencing step through errors of the DNA polymerase. They may also originate from the FFPE sample itself or from oxidation on exposure to air. Usually, hundreds to thousands of random errors occur at frequencies of 0.1-0.2% and lower, making it impossible to distinguish true low-frequency de novo variants from this background.
Using a short stretch of random (or partially random, or fixed) nucleotide sequence to label individual target molecules, and thereby eliminate PCR duplicates and reduce random base errors, has been reported since 2007 (Nucleic Acids Res 2007, 35:e91; Nucl. Acids Res. 2011 39: e81; Proc. Natl. Acad. Sci. 2011 108: 9026-9031; Nat. Methods 2011 9: 72-74). Different names, including molecular barcodes, molecular indexes, single molecular identifiers (SMI), unique identifiers (UID), unique molecular identifiers (UMI), primer ID, duplex barcodes, etc., have been used to describe these short nucleotide sequences. Molecular barcodes are usually added onto the target molecules by ligation or through primers during PCR or reverse transcription. They have been widely used in quantitative studies of gene expression through RNA sequencing, in studies of single cells, and in detecting low-frequency mutations in FFPE-derived DNA and cfDNA through deep sequencing. After sequencing, they are used to trace the amplified end molecules back to their original molecules, by consolidating the sequences of the end molecules that carry identical molecular barcodes into a consensus sequence. These original molecules can be either strand of the DNA target, or both strands. Labeling both strands of a DNA target with identical molecular barcodes, a technique named “duplex sequencing” (Proc Natl Acad Sci USA 2012, 109: 14508-14513; U.S. Pat. No. 9,752,188), allows a further round of consensus calling and removal of random errors. Duplex sequencing has superior sensitivity and significantly reduces the number of random errors. However, the published methods of various forms of duplex sequencing require ligation to add molecular barcodes onto the targets (Nature Medicine 2014, Nucleic Acids Res 2016, 44:e22 doi:10.1093/nar/gkv915; Nature Biotechnology, 2016, doi:10.1038/nbt.3520; Sci. Transl. Med. 2017, 9, eaan2415; Nature 2017, 7:3356, DOI:10.1038/s41598-017-03448-8).
These methods require several hours to two days of working time, as well as numerous reagents and pieces of equipment. They demand up to several hundred nanograms of DNA and detect rare mutations with low efficiency. It has appeared impossible to use PCR-based methods to add identical molecular barcodes onto both strands of the same target for duplex sequencing. The key problem is that an original target molecule is amplified into multiple molecules, each with a different molecular barcode. These redundant barcodes make it impossible to trace the amplified molecules back to the sense and antisense strands of the original double-stranded DNA. For example, previous reports (U.S. Patent No. 2014/0227705, U.S. Pat. Nos. 8,741,606, 8,728,766, 8,685,678, 8,722,368, 8,715,967; Nucl. Acids Res. 2016 1-7) describe methods that use 1 to 3 PCR cycles to introduce molecular barcodes onto the amplification products. These methods cannot support duplex sequencing because either only one strand of the target DNA is labeled with a molecular barcode, or redundant barcodes are produced from one original DNA molecule.
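The barcode-based consensus described above can be illustrated with a minimal Python sketch. It is not the method of any cited publication. It assumes that reads have already been aligned, trimmed to equal length, and oriented to the same reference strand, and that each read carries a barcode and a strand-of-origin label. The function names, the 0.6 majority threshold, and the example reads are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_fraction=0.6):
    """Column-wise majority vote over equal-length sequences.

    A position becomes 'N' when no base reaches min_fraction
    (an assumed, illustrative threshold).
    """
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


def single_strand_consensus(reads):
    """Group reads by (barcode, strand) and collapse each family."""
    families = defaultdict(list)
    for barcode, strand, seq in reads:
        families[(barcode, strand)].append(seq)
    return {key: consensus(seqs) for key, seqs in families.items()}


def duplex_consensus(sscs):
    """Keep a base only where both strand consensuses of a barcode agree.

    Barcodes seen on only one strand are skipped, because they cannot
    support a duplex call.
    """
    duplex = {}
    for barcode in {bc for bc, _ in sscs}:
        plus = sscs.get((barcode, "+"))
        minus = sscs.get((barcode, "-"))
        if plus is None or minus is None:
            continue
        duplex[barcode] = "".join(
            a if a == b else "N" for a, b in zip(plus, minus)
        )
    return duplex


# Hypothetical reads: (barcode, strand of origin, aligned sequence)
reads = [
    ("AACG", "+", "ACGTAC"),
    ("AACG", "+", "ACGTAC"),
    ("AACG", "+", "ACCTAC"),  # random error at the third base
    ("AACG", "-", "ACGTAC"),
    ("AACG", "-", "ACGTAC"),
    ("TTGA", "+", "ACGTTC"),  # barcode seen on one strand only
    ("TTGA", "+", "ACGTTC"),
]

sscs = single_strand_consensus(reads)
duplex = duplex_consensus(sscs)
print(sscs)
print(duplex)
```

In this sketch the random error in the third "AACG" read is outvoted within its barcode family, and the "TTGA" family yields no duplex call because only one strand was observed. The same grouping fails when one original molecule gives rise to products carrying different barcodes, since those products no longer share a key.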
Usually, tens to hundreds of amplicons are each labeled with individual molecular barcodes and amplified simultaneously in the same reaction vessel with a panel of primer pairs. In such a multiplex primer extension reaction, a large number of non-specific amplification products are created between primers, between primers and template, or both. These non-specific amplification products must be removed to make specific amplification of the target sequences possible and to reduce the sequencing depth that would otherwise be spent reading them. Described herein are methods that reduce redundant molecular barcodes while simultaneously removing non-specific amplification products.