The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies, in which millions of sequences are produced in parallel. For example, 454 Life Sciences, now Roche Applied Sciences, developed a high-throughput sequencing technology for sample DNA that involves the steps of fragmenting the DNA, ligating adapters to the DNA fragments, capturing single DNA fragments with a bead coated with primers, amplifying each DNA fragment on a bead inside water droplets in oil (emulsion PCR), and subsequently loading each bead into a picoliter well and sequencing each amplified DNA fragment by pyrosequencing. In general, high-throughput sequencing technologies involve the ligation of adapters to DNA fragments, which adapters may comprise primer binding sites used for capture, amplification and/or sequencing of the DNA fragments. Because large numbers of sequences can be produced, samples of different origin are often combined in a single high-throughput sequencing run. In order to trace back the origin of each sample from a pool of samples, current high-throughput sequencing applications rely on the use of nucleotide sequence identifiers. The terms nucleotide sequence identifier (NSI), sequence-based barcode and sequence index are interchangeable and have the same meaning. A nucleotide sequence identifier is a particular nucleotide sequence that is used as an identifier. A nucleotide sequence identifier is included in the adapter downstream of the primer binding site, such that when sequencing starts from the primer binding site, the nucleotide sequence of the identifier is determined. Different adapters comprising different nucleotide sequence identifiers are ligated to different samples, after which the samples can be pooled. When the sequences of the pooled samples are determined, the nucleotide sequence identifier is sequenced along with part of the sequence of the fragment to which the adapter is ligated.
The presence or absence of the nucleotide sequence identifier thus determines the presence or absence of a sample DNA in the pool. The internal sequence that is sequenced along with the nucleotide sequence identifier can furthermore be assigned to the particular sample from which it originated, as the nucleotide sequence identifier serves to identify the origin of the sample DNA.
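The assignment of reads to samples described above can be sketched as follows. This is a minimal illustration only, assuming that each read begins with a fixed-length identifier followed directly by the internal sequence; the identifier sequences, sample names and the function `demultiplex` are hypothetical and are not taken from any particular sequencing platform.

```python
from collections import defaultdict

# Hypothetical identifier-to-sample table (sequences invented for illustration).
IDENTIFIERS = {
    "AACCGGTTAC": "sample_A",
    "TTGGCCAATG": "sample_B",
}
ID_LENGTH = 10  # assumed fixed identifier length


def demultiplex(reads, identifiers=IDENTIFIERS, id_length=ID_LENGTH):
    """Group the internal sequences of reads by the identifier at the read start."""
    by_sample = defaultdict(list)
    for read in reads:
        # Split the read into the identifier and the internal sequence.
        tag, internal = read[:id_length], read[id_length:]
        # Reads whose identifier is not in the table are kept apart.
        sample = identifiers.get(tag, "unassigned")
        by_sample[sample].append(internal)
    return dict(by_sample)


pooled_reads = [
    "AACCGGTTAC" + "TTGACCA",
    "TTGGCCAATG" + "GGATCCA",
    "AACCGGTTAC" + "CCATGGA",
    "GGGGGGGGGG" + "ACGTACG",  # identifier not in the table
]
assigned = demultiplex(pooled_reads)

# Presence of an identifier among the reads indicates presence of that sample.
samples_present = sorted(s for s in assigned if s != "unassigned")
```

In this sketch an exact match on the identifier is required; tolerance of sequencing errors within the identifier is not addressed.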
For example, in the high-throughput sequencing system developed by Roche, the Genome Sequencer FLX system, multiplexed identifier sequences (MIDs) are used. The MIDs are 10-mer sequences that are incorporated into the adapters in order to assign sequence reads to individual samples. Over 100 different MIDs are currently in use (454 Life Science Corp (2009) Technical Bulletin No. 005-2009). Similar nucleotide sequence identifiers are available for other sequencing systems.
Methods in which nucleotide sequence identifiers are incorporated in the 5′-end of a primer are described, e.g., by Rigola et al. PLoS ONE. 2009; 4(3): e4761 and in WO 2007/037678. Typically, the nucleotide sequence identifiers do not have significant complementarity with the target sequence. A primer thus comprises at the 5′-end a section comprising a nucleotide sequence identifier and at the 3′-end the sequence which is complementary to the target sequence. When a sample is amplified with a primer pair of which one primer comprises a nucleotide sequence identifier, the amplicon will include the nucleotide sequence identifier. When samples are subsequently pooled and subjected to high-throughput sequencing methods, the nucleotide sequence identifier will serve to identify the origin of the sequenced amplicon. Hence, the origin of the amplicon is determined by determining the nucleotide sequence identifier. Concomitantly, the internal sequence which has been amplified and is sequenced along with the nucleotide sequence identifier can also be traced back to the sample from which it originates.
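The primer layout described above, with the identifier at the 5′-end and the target-complementary section at the 3′-end, can be sketched as follows. This is an illustration under simplifying assumptions: all sequences are written 5′ to 3′, a read is assumed to start at the 5′-end of the tagged primer, and the sequences, sample names and the functions `tagged_primer` and `amplicon_origin` are hypothetical.

```python
# Hypothetical 3' section of the primer, complementary to the target sequence.
TARGET_SPECIFIC = "GTCAGCTTGACC"

# Hypothetical identifiers, one per sample (short for readability).
SAMPLE_IDENTIFIERS = {
    "sample_1": "ACAC",
    "sample_2": "TGTG",
}


def tagged_primer(identifier, target_specific=TARGET_SPECIFIC):
    """Build a primer: identifier at the 5' end, target-specific section at the 3' end."""
    return identifier + target_specific


def amplicon_origin(read, sample_identifiers=SAMPLE_IDENTIFIERS,
                    target_specific=TARGET_SPECIFIC):
    """Return (sample, internal sequence) for a read that starts with a tagged primer.

    Returns (None, read) when no tagged primer matches the start of the read.
    """
    for sample, identifier in sample_identifiers.items():
        primer = tagged_primer(identifier, target_specific)
        if read.startswith(primer):
            # Everything after the primer is the amplified internal sequence.
            return sample, read[len(primer):]
    return None, read


# A read from an amplicon of sample_2: tagged primer followed by internal sequence.
read = tagged_primer(SAMPLE_IDENTIFIERS["sample_2"]) + "AAGGTT"
origin, internal = amplicon_origin(read)
```

Because the identifier is part of the primer, every amplicon produced from a given sample carries that sample's identifier, which is what allows the pooled reads to be traced back.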
In both scenarios, i.e. whether the nucleotide sequence identifier is comprised in an adapter or in a primer, the concept is the same, namely to determine the sample origin of sequences produced using high-throughput sequencing platforms from a plurality of DNA samples that have been multiplexed, e.g. combined or pooled, somewhere in the sample preparation process.