Next Generation Sequencing (NGS) has evolved into a powerful tool in molecular biology, enabling rapid progress in fields such as genomic identification, genetic testing, drug discovery, and disease diagnosis. As this technology continues to advance, the volume of nucleic acids that can be sequenced at one time is increasing. This allows researchers to sequence larger samples, as well as to increase the number of reads per sample, enabling the detection of small sequence variations within that sample.
As the volume and complexity of NGS processing increase, so does the rate of experimental error. While much of this error occurs in the sequencing and processing steps, errors can also occur during the sample preparation steps. This is particularly true during the conversion of the sample into a readable NGS library, in which adaptor sequences are attached to the ends of each fragment of a fragmented sample (library fragment) in a uniform fashion.
Several types of errors can occur during NGS, and it is important to be able to differentiate between true rare variants, such as rare alleles or mutations that exist in the patient, and errors that arise from sequencing and/or sample preparation. Particularly problematic are errors that are introduced during library construction, prior to library amplification via polymerase chain reaction (PCR). Such errors can propagate during PCR, leading to multiple copies of sequences containing the error, making it difficult to distinguish between the errors and true variants. The general strategy used to overcome this is consensus calling, whereby sequence reads that are PCR copies of a single, original fragment are grouped together and compared to similar groups of copies, derived from other original fragments, which overlap in sequence. If a variation is present in one group of clones and not the others, then it is most likely an error propagated by PCR, whereas variations present in several groups are most likely true variants. In order to perform this analysis, one must be able to differentiate between clones derived from one molecule and those derived from another.
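By way of illustration only, the comparison between groups of clones can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular pipeline: the function name `classify_variant`, the threshold `min_supporting_families`, and the representation of each group of clones as a single agreed base at one position are all hypothetical.

```python
def classify_variant(family_bases, reference_base, min_supporting_families=2):
    """Classify each non-reference base observed at one position.

    family_bases: one base per group of clones (each group being the PCR
    copies of a single original fragment) that overlaps the position.
    min_supporting_families: assumed threshold, chosen for illustration.
    """
    support = {}
    for base in family_bases:
        if base != reference_base:
            support[base] = support.get(base, 0) + 1
    # A base seen in several independent groups is treated as a likely true
    # variant; a base confined to one group is treated as a likely error.
    return {
        base: ("likely true variant" if count >= min_supporting_families
               else "likely error")
        for base, count in support.items()
    }


# Six groups overlap the position; the reference base is assumed to be "A".
calls = classify_variant(["A", "A", "T", "T", "T", "G"], reference_base="A")
print(calls)
```

In this example, "T" is supported by three independent groups while "G" appears in only one, so only "T" is treated as a likely true variant.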
The terms “fragments”, “target fragments”, or “inserts”, as used herein, refer to fragments of DNA, created from the fragmentation of a DNA sample, which are processed into an NGS library and sequenced. The processing of these fragments usually involves end repair and A-tailing, followed by the addition of sequencing adapters and amplification.
The terms “depth of coverage”, “coverage depth” or “target coverage”, as used herein, refer to the number of sequenced DNA fragments (i.e., reads) that map to a genomic target. The deeper the coverage of a target region (i.e., the more times the region is sequenced), the greater the reliability and sensitivity of the sequencing assay. In general, a coverage depth of 500-1000×, or higher, is often required for the detection of low frequency sequence variations.
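As a rough illustration of how coverage depth may be computed, the sketch below counts, for each base of a target, how many reads overlap it. The function name `coverage_depth` and the representation of reads as half-open `(start, end)` coordinate pairs are assumptions made for the example.

```python
def coverage_depth(reads, target_start, target_end):
    """Per-base depth over the half-open target [target_start, target_end).

    reads: iterable of (start, end) half-open intervals (assumed format).
    """
    depth = [0] * (target_end - target_start)
    for read_start, read_end in reads:
        # Clip each read to the target before counting.
        lo = max(read_start, target_start)
        hi = min(read_end, target_end)
        for pos in range(lo, hi):
            depth[pos - target_start] += 1
    return depth


reads = [(0, 5), (2, 8), (4, 10)]
depth = coverage_depth(reads, 0, 10)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

A per-base loop is used here for clarity; it is not intended to be efficient for real data sets.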
The terms “adenylated”, or “pre-adenylated”, as used herein, refer to a state by which a strand of DNA has an adenosine 5′-monophosphate (AMP) covalently attached to its 5′-terminal phosphate via a pyrophosphate bond. The terms “adenylate”, or “adenylation”, as used herein, refer to the process of covalently attaching an AMP either to a protein side chain or to the 5′-terminal phosphate of a DNA strand. The term “adenyl group”, as used herein, refers to an AMP that is either covalently attached to, or transferred between, a protein sidechain and/or DNA strand.
The term “consensus sequence”, as used herein, refers to a sequence obtained by comparing multiple sequences within a family of sequences. Sequence variations that are present in some, but not in the majority of sequences, in the family may be designated as errors and subsequently removed from the analysis. On the other hand, sequence variations that are present in the majority of sequences within a family may be designated as true variants that were present in the original genetic material being analyzed. The term “consensus calling”, as used herein, refers to the process of determining if a genetic variation is a true variation or an error.
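The majority rule described above can be sketched, under simplifying assumptions, as a column-by-column vote over the reads in a family. The function name `consensus_sequence`, the `min_fraction` parameter, and the use of "N" for positions without a majority are hypothetical choices; the reads are assumed to be already aligned and of equal length.

```python
from collections import Counter


def consensus_sequence(family, min_fraction=0.5):
    """Call a consensus by strict majority at each position.

    family: list of equal-length, pre-aligned read sequences (assumed).
    Positions where no base exceeds min_fraction are reported as 'N'.
    """
    if not family:
        raise ValueError("family must contain at least one read")
    if any(len(read) != len(family[0]) for read in family):
        raise ValueError("reads must have equal length")
    consensus = []
    for column in zip(*family):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) > min_fraction else "N")
    return "".join(consensus)


# The third read carries a G->C change that the other three reads lack.
print(consensus_sequence(["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC"]))
# With only two disagreeing reads, no majority exists at that position.
print(consensus_sequence(["ACGT", "ACCT"]))
```

The second call shows why very small families are a problem: when only two reads disagree, neither base holds a majority, so the position cannot be called.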
The term “variant calling”, as used herein, refers to the process of determining whether a sequence variation is a true variant derived from the original sample, and thus retained in the analysis, or the result of a processing error, and thus discarded.
The term “family”, as used herein, refers to a group of reads that are determined to be duplicates based on their having the same start stop sites and/or UMIs. In variant calling, large families with multiple clones are desirable since they can be used to build stronger consensus sequences than those with only a few clones to compare. For very small family sizes with one or two clones, a consensus cannot be called, resulting in potentially important data being thrown out.
The term “deduplication”, or “dedup”, as used herein, refers to the removal of reads that are determined to be duplicates, from the analysis. Reads are determined to be duplicates if they share the same start stop sites and/or UMI sequences. One purpose of deduplication is to create a consensus sequence whereby those duplicates that contain errors are removed from the analysis. Another purpose of deduplication is to estimate the complexity of the library. A library's “complexity”, or “size”, as used herein, refers to the number of individual sequence reads that represent unique, original fragments and that map to the sequence being analyzed.
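For illustration, deduplication and a simple complexity estimate might be sketched as follows. The read representation (dictionaries with `start`, `stop`, `umi`, and `seq` keys) and the function name `deduplicate` are assumptions; keeping the first read of each family as its representative is an arbitrary simplification, whereas a consensus sequence could be built from each family instead.

```python
from collections import defaultdict


def deduplicate(reads):
    """Group reads into families and keep one representative per family.

    Reads sharing the same start site, stop site, and UMI are treated
    as duplicates of one original fragment (assumed criterion).
    """
    families = defaultdict(list)
    for read in reads:
        key = (read["start"], read["stop"], read["umi"])
        families[key].append(read)
    # Simplification: the first read seen stands in for the whole family.
    representatives = [members[0] for members in families.values()]
    return representatives, dict(families)


reads = [
    {"start": 100, "stop": 250, "umi": "ACGT", "seq": "GATTACA"},
    {"start": 100, "stop": 250, "umi": "ACGT", "seq": "GATTACA"},
    {"start": 100, "stop": 250, "umi": "ACGT", "seq": "GATTGCA"},
    {"start": 100, "stop": 250, "umi": "TTAG", "seq": "GATTACA"},
    {"start": 130, "stop": 290, "umi": "ACGT", "seq": "CCATGGA"},
    {"start": 130, "stop": 290, "umi": "ACGT", "seq": "CCATGGA"},
]
unique_reads, families = deduplicate(reads)
complexity = len(unique_reads)  # number of distinct original fragments
family_sizes = sorted(len(members) for members in families.values())
print(complexity, family_sizes)
```

Here six reads collapse to three families, so the estimated complexity is three; the family of size one is too small to support a consensus.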
The terms “start stop sites”, or “fragment ends”, as used herein, refer to the sequences at the 5′ and 3′ ends of a sheared library fragment that become directly ligated to the sequencing adapters. Start stop sites can be used to determine if two similar sequences are derived from separate molecules or are cloned copies of the same original fragment. In order for different original fragments to have the same start stop sites, the shearing events that created them would have had to cleave at exactly the same sites, which has a low probability. Clones, on the other hand, should always have the same start stop sites. As such, because shearing is random, any fragments that share the same start stop sites are usually considered duplicates. The term “position-based”, as used herein, refers to the use of start stop sites as a criterion for determining whether or not a read is a duplicate of another.
A “start stop collision”, as defined herein, is the occurrence of multiple unique fragments that contain the same start stop sites. Due to the rarity of start stop collisions, they are usually only observed when either performing ultra deep sequencing with a very high number of reads, such as when performing low frequency variant detection, or when working with DNA samples that have a small size distribution, such as plasma DNA. As such, start stop sites alone may not be sufficient in those scenarios, since one would run the risk of erroneously removing unique fragments, mistaken as duplicates, during the deduplication step. In these cases, the incorporation of UMIs into the workflow can potentially rescue a lot of complexity.
The term “UMI”, or “Unique Molecular Identifier”, as used herein, refers to a tag, consisting of a sequence of degenerate or varying bases, which is used to label original molecules in a sheared nucleic acid sample. In theory, due to the extremely large number of different UMI sequences that can be generated, no two original fragments should have the same UMI sequence. As such, UMIs can be used to determine if two similar sequence reads are each derived from a different, original fragment or if they are simply duplicates, created during PCR amplification of the library, which were derived from the same original fragment.
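To give a sense of scale, the sketch below computes the number of possible UMI sequences of a given length and the chance that at least two fragments in a set receive the same UMI. It assumes fully degenerate bases (four choices per position) and that UMIs are assigned uniformly and independently, which real libraries may not satisfy; the function names are hypothetical.

```python
def umi_space(length):
    """Number of distinct UMIs of the given length (4 bases per position)."""
    return 4 ** length


def collision_probability(num_fragments, umi_length):
    """Probability that at least two fragments share a UMI.

    Assumes each fragment draws its UMI uniformly and independently.
    """
    n = umi_space(umi_length)
    if num_fragments > n:
        return 1.0  # more fragments than UMIs: a shared UMI is certain
    p_all_distinct = 1.0
    for i in range(num_fragments):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct


print(umi_space(8))                  # size of an 8-base UMI space
print(collision_probability(10, 8))  # 10 fragments drawing 8-base UMIs
```

When UMIs are combined with start stop sites, the relevant number of fragments is the number sharing the same start stop sites rather than the whole library, which is typically far smaller.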
UMIs are especially useful, when used in combination with start stop sites, for consensus calling of rare sequence variants. For example, if two fragments have the same start and stop sites but have different UMI sequences, what would otherwise have been considered two clones arising from the same original fragment can now be properly designated as unique molecules. As such, the use of UMIs combined with start stop sites often leads to a jump in the coverage number, since unique fragments that would have been labeled as duplicates using start stop sites alone will be labeled as unique from each other because they have different UMIs. It also helps improve the Positive Predictive Value (“PPV”) by removing false positives. There is currently a lot of demand for UMIs, as there are some rare variants that can only be found via consensus calling using UMIs.
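The effect of adding UMIs to position-based duplicate detection can be illustrated with a small, made-up example. The reads are represented as hypothetical `(start, stop, umi)` tuples, and `count_unique` is an illustrative helper rather than part of any existing tool.

```python
def count_unique(reads, use_umi):
    """Count distinct molecules among (start, stop, umi) tuples."""
    keys = {(start, stop, umi) if use_umi else (start, stop)
            for start, stop, umi in reads}
    return len(keys)


reads = [
    (100, 250, "ACGT"),  # molecule 1
    (100, 250, "ACGT"),  # PCR copy of molecule 1
    (100, 250, "GGCA"),  # molecule 2, same start stop sites as molecule 1
    (300, 460, "TTAG"),  # molecule 3
]
position_only = count_unique(reads, use_umi=False)
position_and_umi = count_unique(reads, use_umi=True)
print(position_only, position_and_umi)
```

With start stop sites alone, molecule 2 is mistaken for a duplicate of molecule 1 and only two unique molecules are counted; including the UMI recovers all three.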
“PPV”, or Positive Predictive Value, is the probability that a sequence called as unique is actually unique. PPV = true positives / (true positives + false positives). “Sensitivity” is the probability that a sequence that is unique will be called as unique. Sensitivity = true positives / (true positives + false negatives).
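The two formulas above translate directly into code. The sketch below is illustrative only; the function names and the example counts are made up and do not represent measured results.

```python
def ppv(true_positives, false_positives):
    """Positive Predictive Value: TP / (TP + FP)."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no sequences were called as unique")
    return true_positives / called


def sensitivity(true_positives, false_negatives):
    """Sensitivity: TP / (TP + FN)."""
    actual = true_positives + false_negatives
    if actual == 0:
        raise ValueError("no unique sequences are present")
    return true_positives / actual


# Hypothetical counts: 90 correct unique calls, 10 incorrect unique calls,
# and 30 unique sequences that were wrongly removed as duplicates.
print(ppv(90, 10))          # 0.9
print(sensitivity(90, 30))  # 0.75
```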
Two errors that occur during library construction, and which are reduced by the present invention, are the formation of (1) fragment chimeras and (2) adapter dimers.
Fragment chimeras are the result of library fragments ligating with one another without the adaptor sequences, resulting in longer fragments that contain unrelated sequences juxtaposed to one another. These unrelated sequences would thus be mistakenly read as a continuous sequence. As such, suppression of fragment chimera formation during library construction is important for reducing downstream sequencing errors.
Adapter dimers are the result of self-ligation of the adapters without a library insert sequence. These dimers form clusters very efficiently, reduce reaction efficiencies, and consume valuable space on the flow cell. This is especially problematic when dealing with ultra-low DNA input quantities in the picogram range. At such low DNA input levels, adapter dimers can constitute a majority of the NGS library molecules formed, thus reducing the amount of useful information generated by DNA sequencing. For this reason, suppression of adapter dimer formation during library construction is a very important but challenging task.
Provided herein are high throughput methods for NGS library construction based on novel adapter ligation strategies that can minimize the formation of both fragment chimeras and adapter dimers and accurately convert DNA samples into sequencing libraries in under a day. These and other advantages of the invention, as well as additional inventive features, will be apparent from the description of the invention provided herein.