Identifiers (e.g., sample barcodes or molecular barcodes) can be present in nucleic acids for a variety of purposes. Most commonly, sample barcodes are added to target nucleic acid molecules prior to the amplification and/or sequencing of such molecules, so that the origin or source of sequence information can be identified. Nucleic acid molecules from different samples can then be pooled together and subjected to massively parallel sequencing in order to efficiently determine sequence information from numerous different samples. Adding sample identifiers (often referred to as sample barcodes) to the nucleic acid molecules prior to sequencing facilitates grouping, analysis, and interpretation of the resulting sequence information. As another example, molecular barcodes can be added to target nucleic acid molecules prior to amplification, so that the replicates of each initial target molecule can subsequently be identified and grouped together.
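The grouping described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each read has already been parsed into a (sample barcode, molecular barcode, sequence) tuple; the function name `group_reads` and the example barcode strings are hypothetical and do not come from any particular sequencing toolkit.

```python
from collections import defaultdict


def group_reads(reads):
    """Group reads first by sample barcode, then by molecular barcode.

    `reads` is an iterable of (sample_barcode, molecular_barcode, sequence)
    tuples. This input format is an assumption made for the example.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for sample_bc, molecular_bc, sequence in reads:
        # Reads sharing both barcodes are treated as replicates of one
        # initial target molecule from one sample.
        grouped[sample_bc][molecular_bc].append(sequence)
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {sample: dict(molecules) for sample, molecules in grouped.items()}


# Hypothetical pooled reads from two samples.
example_reads = [
    ("ACGT", "TTAA", "GATTACA"),
    ("ACGT", "TTAA", "GATTACA"),
    ("ACGT", "CCGG", "GATTTCA"),
    ("TGCA", "TTAA", "CATCATC"),
]
example_groups = group_reads(example_reads)
```

In this sketch the sample barcode assigns each read to its sample of origin, and the molecular barcode collects the amplified replicates of a single starting molecule within that sample.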
Sample barcodes are frequently used with target molecules that will be analyzed by massively parallel sequencing, so that nucleic acid molecules from different samples can be pooled for sequencing, and the sequence information can be assigned to a sample. Scientists and laboratories that perform massively parallel sequencing occasionally detect a sample barcode in the sequence data from a pool even though that sample barcode was not included in the pool. This indicates that a contaminating sample barcode is present in the pooled nucleic acids, which may be caused by a sample barcode aliquot containing more than one sample barcode sequence, namely the expected barcode sequence and the contaminating barcode sequence. Contaminating barcodes could be introduced at any stage of the preparation of sample barcode aliquots, from the earliest stages, including the synthesis and purification of DNA oligos, through the handling steps involved in diluting and aliquoting sample barcode sequences. Even when present at low frequencies, such as 1% or lower, the presence of contaminating sample barcodes can create problems with regard to the reliability and interpretation of the sequence information.
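As a rough illustration of how an unexpected sample barcode might be noticed in sequencing data, the sketch below counts the barcodes observed in a pool and reports any known barcode that was not deliberately included. The function name `find_unexpected_barcodes`, its parameters, and the example counts are assumptions made for this example only; real pipelines would also need to account for sequencing errors within the barcode itself.

```python
from collections import Counter


def find_unexpected_barcodes(observed, expected, known, min_fraction=0.0):
    """Return {barcode: fraction of reads} for barcodes that belong to the
    known barcode set but were not deliberately included in the pool.

    observed     -- iterable of barcode strings, one per read (assumed input)
    expected     -- set of barcodes deliberately included in the pool
    known        -- set of all barcodes in the barcode kit
    min_fraction -- only report barcodes at or above this read fraction
    """
    counts = Counter(observed)
    total = sum(counts.values())
    report = {}
    for barcode, n in counts.items():
        if barcode in known and barcode not in expected:
            fraction = n / total
            if fraction >= min_fraction:
                report[barcode] = fraction
    return report


# Hypothetical pool: two expected barcodes plus a 1% contaminant.
observed_barcodes = ["AAAA"] * 60 + ["GGGG"] * 39 + ["CCCC"] * 1
unexpected = find_unexpected_barcodes(
    observed_barcodes,
    expected={"AAAA", "GGGG"},
    known={"AAAA", "GGGG", "CCCC", "TTTT"},
)
```

Here the hypothetical barcode `CCCC` accounts for 1 of 100 reads, which matches the low-frequency contamination scenario described above.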
Sample barcodes are often provided in a set of containers, such as a well plate, where each container holds a different sample barcode. When the sample barcodes are used in laboratory analysis, such as by pipetting the sample barcodes from their containers to the various samples to be analyzed, there is a risk that a container or sample may become contaminated.
Contamination of sample barcodes could be detected by preparing an individual sequencing library for each sample barcode and sequencing each library separately. Alternatively, contamination could be detected with a pooling scheme that makes it possible to distinguish a sample barcode from contamination by another sample barcode in at least one of the pools. However, a large number of pools would have to be prepared and sequenced in separate sequencing runs in order to isolate sample barcodes from a large number of samples, such as 48 or 96 samples. This would be expensive, inefficient, and time-consuming. It also risks erroneously attributing contamination to a sample barcode aliquot when the contaminant was not present in the tube but was instead introduced during one of the many library preparation steps, leading to false positives.
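One way to read the pooling requirement above is that, for every pair of barcodes, at least one pool must contain the first barcode while deliberately omitting the second, so that any reads carrying the second barcode in that pool point to contamination. The sketch below checks a candidate pooling scheme against that interpretation. The interpretation itself, the function name `unresolved_pairs`, and the example pools are assumptions for illustration and are not taken from the text.

```python
from itertools import permutations


def unresolved_pairs(pools, barcodes):
    """List (target, contaminant) pairs that no pool can tell apart.

    A pair counts as resolved when some pool contains `target` but not
    `contaminant`; this criterion is an assumed reading of the scheme.
    """
    unresolved = []
    for target, contaminant in permutations(barcodes, 2):
        separated = any(
            target in pool and contaminant not in pool for pool in pools
        )
        if not separated:
            unresolved.append((target, contaminant))
    return unresolved


barcode_names = ["A", "B", "C"]

# Sequencing every barcode on its own resolves all pairs but needs one
# pool (and one run) per barcode.
individual_pools = [{"A"}, {"B"}, {"C"}]
individual_result = unresolved_pairs(individual_pools, barcode_names)

# Combining A and B into a single pool saves a run but leaves that pair
# indistinguishable.
merged_pools = [{"A", "B"}, {"C"}]
merged_result = unresolved_pairs(merged_pools, barcode_names)
```

Under this reading, sequencing each barcode individually requires as many pools as there are barcodes, which is the source of the cost noted above for sets of 48 or 96 barcodes.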