Next-generation sequencing can be used to determine the sequences of millions of DNA strands in parallel. For instance, in Illumina's sequencing technology, multiple clonal clusters of DNA are formed randomly on a surface, and sequencing by synthesis is performed using the cluster DNA as a substrate. During each sequencing cycle, one new base is evaluated on each of the DNA strands in parallel. Thus, it is important that the clusters be unambiguously identified during the synthesis steps. If all the clusters on a flowcell contain the same base at the same position, the software is unable to call the base correctly, and the sequencing quality can decrease or the sequencing run can fail. Most sequencing platforms encounter this technical problem when a majority of the DNA strands to be sequenced have an identical base at the same position.
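The low-diversity condition described above can be illustrated with a minimal sketch that tallies base composition at each cycle across a set of reads. The function names, the toy reads, and the 0.8 threshold are assumptions chosen for illustration; they are not parameters of any sequencing platform's software.

```python
from collections import Counter


def per_cycle_base_fractions(reads):
    """For each cycle (position), return the fraction of reads showing A, C, G, and T."""
    if not reads:
        return []
    # Only compare positions that every read covers.
    n_cycles = min(len(r) for r in reads)
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(r[cycle] for r in reads)
        total = sum(counts.values())
        fractions.append({b: counts.get(b, 0) / total for b in "ACGT"})
    return fractions


def low_diversity_cycles(reads, threshold=0.8):
    """Return 0-based cycles where a single base accounts for at least `threshold` of reads.

    The threshold is an illustrative assumption, not a platform specification.
    """
    return [
        i
        for i, frac in enumerate(per_cycle_base_fractions(reads))
        if max(frac.values()) >= threshold
    ]


# Toy reads: positions 0, 2, 3 and 5 carry the same base in every read.
toy_reads = ["ACGTAC", "ATGTCC", "AGGTGC", "AAGTTC"]
print(low_diversity_cycles(toy_reads))
```

In this toy set, the cycles that are flagged are exactly those at which every read carries the same base, which is the situation in which base calling becomes unreliable.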
For some applications, it is desirable to label, tag, or barcode certain DNA molecules before sequencing. This means that the molecules to be sequenced can contain at least two regions: (1) a barcode, and (2) capture nucleotides. It is typically necessary to have a large number of different barcodes, as well as many identical copies of each barcode, so that DNA molecules that are meant to be grouped together can be labeled with the same barcode. Identical barcodes can either be held in separate wells or tubes, or in other cases, they can be physically linked to a plurality of beads.
There are a number of different ways to create the barcode and capture nucleotides. For instance, primer extension can be used for synthesis. By employing this method, the barcode nucleotides can have at least two regions: (1) nucleotides used as a barcode, and (2) nucleotides used as priming, hybridization, or linking sites. As such, the entire barcoding region can be synthesized from building blocks that are pieced together by, e.g., ligation, hybridization, or PCR utilizing constant, universal priming sites. These constant, universal priming sites can cause sequencing problems. For example, the majority of the sequences can have identical or nearly identical nucleotide patterns (sequences) at the same positions along the DNA strands. In some cases, the sequences may have low diversity at every position.
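The effect of constant sites on positional diversity can be sketched as follows. The example assembles barcodes from two variable blocks joined by a constant linker and computes the Shannon entropy of the base distribution at each position. The linker sequence `GACT`, the block length, and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import product

LINKER = "GACT"  # hypothetical constant linker/priming site, for illustration only


def build_barcodes(block_len=2, linker=LINKER):
    """Combine every pair of variable blocks around one constant linker."""
    blocks = ["".join(p) for p in product("ACGT", repeat=block_len)]
    return [b1 + linker + b2 for b1 in blocks for b2 in blocks]


def positional_entropy(seqs):
    """Shannon entropy in bits of the base distribution at each position.

    2.0 bits means all four bases are equally represented; 0.0 means a single base.
    """
    entropies = []
    for column in zip(*seqs):  # iterate over positions across all sequences
        counts = Counter(column)
        n = len(column)
        entropies.append(sum(-(c / n) * math.log2(c / n) for c in counts.values()))
    return entropies


barcodes = build_barcodes()
for position, bits in enumerate(positional_entropy(barcodes)):
    print(position, bits)
```

Under these assumptions, the variable block positions reach the maximum of 2 bits, while every linker position has zero entropy, which corresponds to the identical nucleotide patterns at fixed positions described above.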
Existing solutions to this problem include adding random, non-relevant sequences, such as a PhiX control (Illumina), to the sequencing run to increase diversity across the clusters. This creates sufficient variation during each sequencing cycle, but it dilutes the actual samples and reduces the sequencing capacity available to them.
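The trade-off of a spike-in control can be made concrete with a short arithmetic sketch. It assumes, for illustration only, that the control contributes each of the four bases at about 25% at any given cycle and that the sample reads all carry the same base at a constant position. The function name and the numbers are hypothetical.

```python
def spike_in_tradeoff(total_reads, spike_fraction, control_base_share=0.25):
    """Estimate the reads left for the sample and the dominant-base fraction at a constant site.

    Assumptions (illustrative only): every sample read shows the same base at the
    constant site, and the control shows that base in `control_base_share` of its reads.
    """
    if not 0.0 <= spike_fraction <= 1.0:
        raise ValueError("spike_fraction must be between 0 and 1")
    sample_fraction = 1.0 - spike_fraction
    sample_reads = round(total_reads * sample_fraction)
    dominant_base_fraction = sample_fraction + spike_fraction * control_base_share
    return sample_reads, dominant_base_fraction


for fraction in (0.0, 0.2, 0.5):
    reads_left, dominant = spike_in_tradeoff(1_000_000, fraction)
    print(fraction, reads_left, round(dominant, 3))
```

Under these assumptions, every increase in the spike-in fraction lowers the dominant-base fraction at constant positions, but it removes the same proportion of reads from the sample.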