Next-generation sequencing has greatly increased the throughput of sequencing methods and has led to new applications with important real-world implications, such as improvements in cancer diagnostics and non-invasive prenatal testing for disorders such as Down syndrome. There are various technologies for performing next-generation sequencing, each of which is associated with specific types of errors. In addition, these methods share general sources of error, such as errors that occur during sample preparation.
Sample preparation for next-generation sequencing typically involves numerous amplification steps, each of which generates errors. Amplification reactions, such as PCR, used in sample preparation for high-throughput sequencing can include amplifying the initial nucleic acid in the sample to generate the library to be sequenced, clonally amplifying the library, typically onto a solid support, and further amplification reactions that add information or functionality, such as sample-identifying barcodes. Errors can be introduced during any of the amplification reactions, for example through the misincorporation of bases by a polymerase used for the amplification. It can be difficult to distinguish errors introduced during sample preparation, and errors that occur during a sequencing reaction, from real and informative SNPs or mutations present in the initial sample, especially when the SNPs or mutations are present at a low frequency. In addition, base calling at each nucleotide position can itself introduce errors, usually caused by low signal intensity and/or the surrounding nucleic acid sequence.
There are several known methods to identify errors caused by sample preparation. One method is to increase sequencing depth so that the sample nucleic acid segment is read multiple times, either from the same molecule or from different copies of the same nucleic acid molecule. These multiple reads can be aligned and a consensus sequence can be generated. However, SNPs or mutations present at low frequency in the population of nucleic acid molecules will appear similar to errors introduced during amplification or base calling. Another method to identify these errors involves tagging nucleic acid molecules such that each nucleic acid molecule incorporates a unique identifier before being sequenced. The sequencing results from identically tagged nucleic acid molecules are pooled, and the consensus sequence from these pooled results is more likely to be the true sequence of the nucleic acid from the sample. Amplification errors can be identified if some of the identically tagged nucleic acid molecules have a different sequence.
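The tagging approach described above can be illustrated with a minimal sketch. It assumes the reads have already been aligned and trimmed to equal length, and that each read arrives as an (identifier, sequence) pair. The function name `consensus_by_identifier` and the thresholds `min_family_size` and `min_agreement` are hypothetical and chosen only for illustration; they are not taken from any particular prior method.

```python
from collections import Counter, defaultdict


def consensus_by_identifier(reads, min_family_size=3, min_agreement=0.6):
    """Pool reads that share an identifier and build a per-family consensus.

    reads: iterable of (identifier, sequence) pairs. Sequences within a
    family are assumed to be aligned and of equal length (an assumption of
    this sketch, not a property of real sequencer output).
    Returns {identifier: (consensus_sequence, discordant_positions)}.
    """
    # Group reads into families by their identifier.
    families = defaultdict(list)
    for identifier, sequence in reads:
        families[identifier].append(sequence)

    results = {}
    for identifier, seqs in families.items():
        # Families that are too small cannot support a consensus.
        if len(seqs) < min_family_size:
            continue
        consensus = []
        discordant = []
        # Walk the family column by column (one column per position).
        for pos, column in enumerate(zip(*seqs)):
            base, count = Counter(column).most_common(1)[0]
            # Accept the majority base only if enough reads agree.
            consensus.append(base if count / len(column) >= min_agreement else "N")
            # Any disagreement within a family flags a candidate
            # amplification or base-calling error at this position.
            if count < len(column):
                discordant.append(pos)
        results[identifier] = ("".join(consensus), discordant)
    return results


if __name__ == "__main__":
    example_reads = [
        ("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACTT"),
        ("GGC", "ACGA"), ("GGC", "ACGA"), ("GGC", "ACGA"),
        ("TTT", "ACGT"),
    ]
    for tag, (seq, flagged) in consensus_by_identifier(example_reads).items():
        print(tag, seq, flagged)
```

In this sketch, a position where reads with the same identifier disagree is treated as a likely error, whereas a difference that is consistent across an entire family is carried into that family's consensus and remains a candidate for a real variant.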
Despite these prior methods, there is a need to discover advantageous combinations of parameters for methods of tagging nucleic acid molecules that are highly effective and readily manufacturable, especially for analyzing complex samples, including mammalian cDNA or genomic samples such as, for example, circulating DNA samples. Many prior art methods require the generation of large numbers of unique identifiers, which may in turn require longer unique identifiers. The reaction mixtures in such methods are designed so that there is a large excess of unique identifiers relative to sample nucleic acid molecules. In addition to the high cost of making such libraries of unique identifiers, increasing the length of the unique identifiers reduces the amount of sample nucleic acid sequence that can be read within the already limited read lengths of most next-generation sequencers. Other prior art disclosures, some of which are only prophetic, do not provide detailed combinations of parameters, such as the diversity of identifiers or the diversity of combinations of any two identifiers versus the number of copies of the region of interest, the diversity of identifiers versus the total number of sample nucleic acid molecules, and the total number of identifiers versus the total number of sample nucleic acid molecules. This is especially true for samples that are complex and isolated from nature, such as cDNA or genomic samples, including fragmented genomic samples, such as circulating free DNA in mammalian blood.
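The trade-off between identifier diversity and the number of copies of a region of interest can be made concrete with a short calculation. The sketch below assumes identifiers are random sequences over the four DNA bases and are attached uniformly and independently, which is a simplification; the function names `identifier_diversity` and `collision_probability` are hypothetical and used only for illustration.

```python
def identifier_diversity(length, identifiers_per_molecule=1):
    """Number of distinct identifier combinations available.

    Assumes every sequence of the given length over A, C, G, T is a valid
    identifier, so one identifier has 4**length possibilities. With two
    identifiers per molecule (for example, one at each end), the number of
    ordered combinations is the square of that.
    """
    return (4 ** length) ** identifiers_per_molecule


def collision_probability(diversity, copies):
    """Probability that a given molecule shares its identifier (or
    identifier combination) with at least one of the other copies of the
    same region, assuming uniform and independent attachment."""
    if diversity < 1 or copies < 1:
        raise ValueError("diversity and copies must be positive")
    return 1.0 - (1.0 - 1.0 / diversity) ** (copies - 1)


if __name__ == "__main__":
    copies = 10_000  # illustrative number of copies of the region of interest
    for length in (4, 6, 8, 10):
        for per_molecule in (1, 2):
            d = identifier_diversity(length, per_molecule)
            p = collision_probability(d, copies)
            bases_used = length * per_molecule  # read length spent on identifiers
            print(f"length={length} per_molecule={per_molecule} "
                  f"diversity={d} bases_used={bases_used} p_shared={p:.4g}")
```

Under these assumptions, longer identifiers lower the chance that two copies carry the same tag, but every added identifier base is one fewer base of read length available for sample sequence, which is the cost noted above.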
There remains a need for a low-cost tagging method, and for identification of combinations of key parameters for tagging complex samples isolated from nature. Such a method would be beneficial, for example, for detecting amplification and base calling errors when used in a high-throughput sequencing workflow, especially in the analysis of complex, clinically relevant samples.