All commercially available next-generation sequencing (NGS) technologies require library preparation, whereby a pair of specific adapter sequences is ligated to the ends of DNA fragments in order to enable sequencing by the instrument. Most NGS adapters comprise three functional domains: (1) unique PCR primer annealing sequences for library and clonal amplification, (2) unique sequencing primer annealing sequences and (3) unique sample indexing sequences. Currently, most platforms utilize clonal amplification to make hundreds of copies of each individual DNA library molecule. This is achieved by bridge amplification or emulsion PCR, and its purpose is to amplify the signal generated by the particular mode of sequence detection (e.g. fluorescence or pH) for each library molecule. For sequencing by synthesis, annealing domains for sequencing primers are positioned adjacent to the adapter-insert junctions; to enable paired-end sequencing, each adapter possesses a unique sequence for primer annealing. Sample index sequences are short unique sequences, typically 6-8 bases, that, when sequenced, identify the sample source of a particular sequence read, enabling samples to be multiplexed or co-sequenced. There are existing and emerging single-molecule sequencing technologies that do not rely on clonal amplification for signal detection but still require the attachment of adapter sequences to the termini of the DNA fragments for other purposes, such as adding a terminal hairpin loop to DNA duplexes to enable sequencing of both strands as a single molecule or introducing a leader sequence for nanopore entry.
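The role of the sample index can be illustrated with a minimal Python sketch of demultiplexing, in which each read is assigned to a sample by comparing its index read against the known index sequences. The index sequences, sample names, reads, and the function names `hamming` and `demultiplex` are hypothetical and chosen only for illustration; the sketch does not describe the software of any particular instrument.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, index_to_sample, max_mismatches=0):
    """Group insert reads by sample using the index read.

    reads: iterable of (index_read, insert_read) pairs.
    index_to_sample: dict mapping an index sequence to a sample name.
    Reads matching no index, or more than one, go to "undetermined".
    """
    bins = defaultdict(list)
    for index_read, insert_read in reads:
        hits = [
            sample
            for index, sample in index_to_sample.items()
            if len(index) == len(index_read)
            and hamming(index, index_read) <= max_mismatches
        ]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(insert_read)
    return dict(bins)


# Hypothetical 6-base indexes and reads, for illustration only.
INDEXES = {"ATCACG": "sample_A", "CGATGT": "sample_B"}
READS = [
    ("ATCACG", "GGCTTAAC"),
    ("CGATGT", "TTGACCAG"),
    ("ATCACG", "CCGTAGGA"),
    ("NNNNNN", "ACGTACGT"),
]

result = demultiplex(READS, INDEXES)
```

Allowing a small `max_mismatches` tolerates sequencing errors in the index read, provided the indexes in use differ from one another at enough positions that a read cannot match two of them.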
Typically, preparation of an NGS DNA library involves five steps: (1) DNA fragmentation, (2) polishing, (3) adapter ligation, (4) size selection, and (5) library amplification (see FIGS. 1 and 2).
(1) Fragmentation: Fragmentation of DNA can be achieved by enzymatic digestion or physical methods such as sonication, nebulization or hydrodynamic shearing. Each fragmentation method has its advantages and limitations. Enzymatic digestion produces DNA ends that can be efficiently polished and ligated to adapter sequences. However, it is difficult to control the enzymatic reaction and produce fragments of predictable length. In addition, enzymatic fragmentation is frequently base-specific, thus introducing representation bias into the sequence analysis. Physical methods fragment DNA more randomly and allow the DNA size distribution to be controlled more easily, but the DNA ends produced by physical fragmentation are damaged, and the conventional polishing reaction is insufficient to generate ample ligation-compatible ends.
(2) Polishing: Typical polishing mixtures contain T4 DNA polymerase and T4 polynucleotide kinase (PNK). The 3′-5′ exonuclease activity of T4 DNA polymerase excises 3′ overhangs, and its 5′-3′ polymerase activity fills in 3′ recessed ends; together these activities remove damaged 3′ bases and create blunt DNA ends. The T4 polynucleotide kinase in the polishing mix adds a phosphate to the 5′ ends of DNA fragments that lack one, thus making them ligation-compatible with NGS adapters.
What has remained unknown in the art is that a significant number of 5′ ends produced by physical fragmentation are damaged in an unidentified manner and do not get phosphorylated by PNK. There is no enzyme in a conventional polishing mix that can trim a damaged 5′ terminal base. As a result, a substantial fraction of DNA fragments in the preparation do not get converted into NGS library molecules because their 5′ termini remain ligation-incompatible with NGS adapters. Although it is known in the art that adapter ligation is inefficient, ligation is typically performed on both strands simultaneously, so it has remained unknown which strand is limiting. We separated the reactions into strand-specific ligations to test the efficiency of each. Through this analysis, we were able to pinpoint the rate-limiting step in the overall process to the 5′ termini, which, for a significant fraction of the DNA fragments, are poor substrates for PNK and, as a result, for adapter ligation.
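The reasoning behind measuring each strand-specific ligation separately can be sketched with a deliberately simple model. It assumes, purely for illustration, that a strand becomes a library molecule only when adapters are joined at both its 5′ and 3′ ends and that the two ligation events are independent. The function names and the efficiency values are hypothetical and are not measurements from this disclosure.

```python
def strand_conversion(p5: float, p3: float) -> float:
    """Fraction of strands with adapters at both ends.

    p5: efficiency of ligation at the fragment 5' terminus (0..1).
    p3: efficiency of ligation at the fragment 3' terminus (0..1).
    Assumes the two events are independent (illustrative assumption).
    """
    for name, value in (("p5", p5), ("p3", p3)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return p5 * p3


def limiting_end(p5: float, p3: float) -> str:
    """Name the terminus with the lower ligation efficiency."""
    if p5 == p3:
        return "neither"
    return "5' terminus" if p5 < p3 else "3' terminus"


# Hypothetical efficiencies, chosen only to show the arithmetic.
p5_example, p3_example = 0.5, 0.9
overall = strand_conversion(p5_example, p3_example)
bottleneck = limiting_end(p5_example, p3_example)
```

Under this model, a combined reaction reports only the product of the two efficiencies, so the same overall yield is consistent with either terminus being the bottleneck. Measuring each ligation on its own is what reveals which factor is small.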
(3) Adapter Ligation: Another factor that contributes to low NGS library yield apart from a lack of 5′ phosphate groups is the ligation reaction itself. Prior to ligation, adenylation of repaired DNA using a DNA polymerase that lacks 3′-5′ exonuclease activity is often performed in order to minimize chimera formation and adapter-adapter (dimer) ligation products. In these methods, DNA fragments with a single 3′ A-overhang are ligated to adapters with a single 3′ T-overhang, whereas neither A-overhang fragments nor T-overhang adapters have cohesive ends compatible with self-ligation. However, the adenylation reaction is incomplete and generates non-specific side products, further reducing the number of molecules available for ligation, which reduces library yield. A more efficient, alternative approach to minimize concatamer formation is presented herein.
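The way A/T overhangs suppress chimeras and adapter dimers can be shown with a toy compatibility check. The sketch models each duplex end only by its single-base overhang (an empty string stands for a blunt end) and ignores phosphorylation state and ligation kinetics. The function name `can_ligate` is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def can_ligate(overhang_a: str, overhang_b: str) -> bool:
    """Return True if two duplex ends are compatible in this toy model.

    Each end is described by its overhang: "" for a blunt end or a
    single base for a one-nucleotide overhang.
    """
    for overhang in (overhang_a, overhang_b):
        if overhang != "" and overhang not in COMPLEMENT:
            raise ValueError("overhang must be '' or one of A, C, G, T")
    if overhang_a == "" and overhang_b == "":
        return True  # blunt ends can join any other blunt end
    if overhang_a == "" or overhang_b == "":
        return False  # a blunt end cannot pair with an overhang
    return COMPLEMENT[overhang_a] == overhang_b


fragment_to_adapter = can_ligate("A", "T")   # desired product
fragment_to_fragment = can_ligate("A", "A")  # chimera
adapter_to_adapter = can_ligate("T", "T")    # adapter dimer
blunt_to_blunt = can_ligate("", "")          # no A-tailing step
```

In this model only the fragment-adapter pairing is permitted once the overhangs are present, whereas blunt ends can join in any combination, which is why blunt ligation without such a step is prone to chimeras and dimers.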
(4) Size Selection: The size selection process also impacts library yield. During size selection, fragments of undesired size are eliminated from the library using gel or bead-based selection in order to optimize the library insert size for the desired sequencing read length. This maximizes sequence data output by minimizing the overlap between paired-end reads that occurs with short DNA library inserts. In the case of samples with extremely limited input quantities, this step can be skipped, and in exchange for a higher degree of paired-end overlap, a greater number of rare fragments are sequenced.
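The relationship between insert size and paired-end overlap is simple arithmetic and can be sketched as follows. The function name `paired_end_overlap` and the example read and insert lengths are hypothetical; the sketch assumes both reads have the same length and start at opposite ends of the insert.

```python
def paired_end_overlap(read_length: int, insert_length: int) -> int:
    """Number of insert bases covered by both reads of a pair."""
    if read_length <= 0 or insert_length <= 0:
        raise ValueError("lengths must be positive")
    # Each read covers at most the whole insert.
    covered_per_read = min(read_length, insert_length)
    # Bases covered at least once cannot exceed the insert length.
    covered_once = min(insert_length, 2 * covered_per_read)
    return 2 * covered_per_read - covered_once


# Hypothetical 150-base reads on inserts of three different sizes.
long_insert = paired_end_overlap(150, 400)   # reads do not meet
mid_insert = paired_end_overlap(150, 200)    # partial overlap
short_insert = paired_end_overlap(150, 100)  # both reads span the insert
```

Every overlapping base is sequenced twice, so the unique sequence obtained per read pair falls as inserts become shorter than twice the read length.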
(5) Amplification: The problem of low library yield results in the necessity to amplify libraries by PCR prior to NGS analysis, which leads to loss of library complexity and introduction of base composition bias. The only current solution to avoid this problem is to use higher quantities of input DNA for library prep, but up to 20% of clinical samples submitted for NGS analysis have insufficient DNA quantity, so instead, additional PCR cycles are applied to overcome the insufficient DNA input. This reduces the amount of useful sequence data because an unacceptable percentage of the reads are PCR duplicates.
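The link between library yield and the number of PCR cycles can be illustrated with an idealized amplification model. The sketch assumes a constant per-cycle efficiency and no plateau, which real reactions do not strictly follow. The function name `cycles_needed` and the quantities used are hypothetical.

```python
def cycles_needed(input_ng: float, target_ng: float, efficiency: float = 1.0) -> int:
    """Smallest number of PCR cycles that reaches the target amount.

    efficiency: fraction of molecules copied per cycle (1.0 = perfect
    doubling). Idealized model: constant efficiency, no plateau.
    """
    if input_ng <= 0 or target_ng <= 0:
        raise ValueError("amounts must be positive")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    cycles = 0
    amount = input_ng
    while amount < target_ng:
        amount *= 1.0 + efficiency  # each cycle multiplies the library
        cycles += 1
    return cycles


# Hypothetical library yields after ligation, same hypothetical target.
low_yield_cycles = cycles_needed(1.0, 1000.0)
high_yield_cycles = cycles_needed(10.0, 1000.0)
```

In this model a tenfold lower amount of ligated library costs roughly three additional cycles, and each extra cycle copies the same starting molecules again, which is the origin of the PCR duplicates described above.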