Due to the recent availability of next generation sequencing systems, amplicon sequencing and ultra deep sequencing have become powerful analytical tools for mutational analysis. Ultra deep sequencing requires sequencing of many individual molecules derived from and representing a polymorphic target sequence. In particular, ultra deep sequencing allows for detection and quantification of minority species sequences of a certain target nucleic acid within a background of wildtype sequences. For example, such detection and quantification are particularly useful for detecting residual tumor cells in the field of oncology, or drug resistances in the field of virology.
However, ultra deep sequencing always requires a step of amplification of the target nucleic acid sample to be analyzed. During this amplification, errors are introduced at a certain frequency by the polymerase-catalyzed PCR process. As a consequence, such artificially introduced sequence changes usually cannot be discriminated from minority species sequences, such as real mutations or sequence alterations that exist within the sample to be analyzed only at low abundance (Vandenbroucke et al., Biotechniques 2011, 51 (3), 167-177).
One possibility to overcome this issue is to tag each individual target nucleic acid molecule derived from the sample with a unique nucleic acid sequence prior to amplification. In the art, such a tag is termed a UID sequence (Unique sequence Identifier).
For example, Jabara et al. (PNAS 2011, 108 (50), 20166-20171) disclose the introduction of a random sequence tag in the initial amplification primer. As a consequence, subsequent sequencing allows identification of all sequence reads which are derived from the same individual target molecule originally contained within the sample. Jabara et al. used an 8mer wobble tag in which each position can be any of the 4 nucleotides, resulting in 4^8 = 65536 different tag sequences. This approach, however, has the disadvantage that, because the number of different tag sequences is too low, identical tag sequences will become attached to different target molecules, rendering the UID approach ineffective in such cases (Sheward et al., PNAS 2012, 109 (21), E1330).
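By way of illustration only, the grouping of sequence reads by their UID may be sketched as follows. The sketch assumes that the UID occupies the first 8 bases of each read and that all reads within one UID family have the same length; the function names, the example reads, and the majority-vote consensus step are hypothetical and are not taken from the cited references.

```python
from collections import Counter, defaultdict

UID_LENGTH = 8  # assumption: the UID is the first 8 bases of every read


def group_reads_by_uid(reads, uid_length=UID_LENGTH):
    """Collect the target portion of each read under its UID."""
    families = defaultdict(list)
    for read in reads:
        uid, target = read[:uid_length], read[uid_length:]
        families[uid].append(target)
    return dict(families)


def majority_consensus(targets):
    """Column-wise majority vote; assumes equal-length, aligned targets."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*targets)
    )


# Hypothetical reads: UID followed by a short target sequence.
reads = [
    "ACGTACGT" + "TTGACC",
    "ACGTACGT" + "TTGACC",
    "ACGTACGT" + "TTGTCC",  # one read carrying an amplification error
    "GGCATCAG" + "TTGACA",
    "GGCATCAG" + "TTGACA",
]

families = group_reads_by_uid(reads)
consensus = {uid: majority_consensus(t) for uid, t in families.items()}
```

In this sketch, a sequence change seen in only a minority of the reads sharing one UID is outvoted, whereas a change present in all reads of a UID family is retained. If two different target molecules receive the same UID, their reads fall into one family and this distinction is lost.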
Similarly, WO 2012/0388239 and its equivalent US2012/0071331 disclose a method of estimating the number of starting polynucleotide molecules sequenced using degenerate UID sequences (in the applications termed “DBR”). The claimed method always comprises the steps of a) tagging, b) pooling, c) amplification, and d) sequencing. The references also disclose the application of such a method to attaching individual UID tags to individual sequences from a polymorphic region, followed by deep sequencing and subsequent allele calling.
Identification and quantification of rare sequence variants within a high wildtype background requires a high degree of accuracy, in particular when such an analysis is performed within a diagnostic setting. However, the use of completely randomized individual sequence tags has certain disadvantages with a negative impact on the accuracy of tag identification.
First, tags comprising homopolymer stretches within a UID sequence cannot be read with high accuracy using commercially available sequencing-by-synthesis platforms such as the 454 Genome Sequencer system or the Ion Torrent Proton system. Secondly, particular stretches such as G tetrads strongly interfere with PCR amplification. Thirdly, complementary homopolymer stretches result in partially self-complementary sequence tags, leading to undesired side reactions during an amplification or a subsequent sequencing reaction.
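The three disadvantages named above may be illustrated by a simple screen of candidate tag sequences. The following is a minimal sketch only; the thresholds (a homopolymer run of 3 or more bases, a run of 4 G residues, a 4-base stretch whose reverse complement also occurs in the tag) and all function names are assumptions chosen for illustration and are not prescribed by the cited art.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_homopolymer(tag, max_run=2):
    """True if any base is repeated more than max_run times in a row."""
    return re.search(r"(.)\1{%d,}" % max_run, tag) is not None


def has_g_run(tag, run=4):
    """True if the tag contains `run` consecutive G residues."""
    return "G" * run in tag


def is_self_complementary(tag, stem=4):
    """True if any stem-length stretch has its reverse complement in the tag."""
    for i in range(len(tag) - stem + 1):
        if reverse_complement(tag[i:i + stem]) in tag:
            return True
    return False


def is_acceptable(tag):
    """Apply all three screens (illustrative thresholds only)."""
    return not (
        has_homopolymer(tag) or has_g_run(tag) or is_self_complementary(tag)
    )


# Hypothetical candidate tags.
candidates = ["ACTCAGTC", "AAAACGTC", "ACTGGGGA", "ACGTACGT"]
accepted = [tag for tag in candidates if is_acceptable(tag)]
```

Of the four hypothetical candidates, only the first passes all three screens: the second contains a homopolymer stretch, the third a run of four G residues, and the fourth is self-complementary. With completely randomized tags, no such screen is applied, so tags of all three problematic kinds occur.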
A further disadvantage of the state of the art is that tagging of each individual target sequence with a unique sequence tag is not reliably accomplished, a limitation known as the “birthday problem” (Sheward et al., PNAS 2012, 109 (21), E1330).
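The extent of the birthday problem can be estimated with a short calculation. The following sketch assumes that every tag sequence is drawn independently and with equal probability from the full set of possible tags, which is an idealization; the function names are hypothetical.

```python
import math


def tag_space(length, alphabet=4):
    """Number of distinct tags of the given length."""
    return alphabet ** length


def prob_all_unique(n_molecules, n_tags):
    """Probability that n_molecules randomly tagged molecules all differ."""
    if n_molecules > n_tags:
        return 0.0
    # Sum of logs of (1 - i / n_tags) avoids underflow for large inputs.
    log_p = sum(math.log1p(-i / n_tags) for i in range(n_molecules))
    return math.exp(log_p)


def prob_molecule_shares_tag(n_molecules, n_tags):
    """Probability that one given molecule shares its tag with another."""
    return 1.0 - (1.0 - 1.0 / n_tags) ** (n_molecules - 1)


n_tags = tag_space(8)  # 8mer tag over 4 nucleotides
p_collision = 1.0 - prob_all_unique(300, n_tags)
p_shared = prob_molecule_shares_tag(10_000, n_tags)
```

Under these assumptions, roughly 300 tagged molecules already give about even odds that at least two of them carry the same 8mer tag, although 65536 different tags are available.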
Thus, it is an object of the present invention to provide a solution for ultra deep sequencing applications with improved sequence tags.