Precise intramolecular ligation of nucleic acid fragments has many important applications in genomics. In the method described by Collins and Weissman (Proc Natl Acad Sci USA 81:6812-6816, 1984), intramolecular ligation (circularization) of long DNA fragments is employed to juxtapose distal co-linear DNA segments, producing so-called “genomic jumping libraries” for studying gene structure at a chromosomal scale (Collins et al., Science 235:1046-1049, 1987). Despite several technical challenges with this method, such as difficulties in producing large, circular DNA molecules and artifacts arising from the generation of intermolecularly ligated DNA species, genomic jumping libraries have contributed to important gene discoveries, including the identification of the cystic fibrosis locus (Rommens et al., Science 245:1059-1065, 1989). Building on the approach of Collins and Weissman, the method of Ng et al. (Nature Methods 2:105-111, 2005) circularizes individual cDNAs to link 5′- and 3′-derived “serial analysis of gene expression” (SAGE) tags, producing “paired-end ditags” (PETs) that demarcate gene boundaries.
“Next generation” massively parallel sequencers, with their capacity to generate tens of millions of individual sequence reads per instrument run, have changed the field of genomics and related disciplines. For review, see Mardis, Annu Rev Genomics Hum Genet 9:387-402, 2008; and Shendure and Aiden, Nature Biotechnology 30:1084-1094, 2012. Accurate intramolecular ligation is integral to the mate-pair or paired-end read technologies used on the new DNA sequencing platforms to identify human genomic variations and to produce comprehensive scaffolds for de novo genomic assembly. See: Edgren et al., Genome Biol 12:R6, 2011; Hillmer et al., Genome Res 21:665-675, 2011; Hampton et al., Cancer Genet 204:447-457, 2011; and Wetzel et al., BMC Bioinformatics 12:95, 2011. Despite the potential usefulness of mate-pair sequencing, the method as currently practiced is hampered by difficulties in producing circular nucleic acid molecules by intramolecular ligation. Intramolecular ligation is a critical step in the construction of mate-pair libraries, especially those of long and useful separation distances. Competing intermolecular ligation of DNA fragments during mate-pair library construction results in unwanted juxtaposition of random DNA fragments, creating so-called “chimeric” mate-pair reads. The chimeric mate-pair reads constitute unacceptable background for the identification of structural variations and for use in de novo sequence assembly.
The theoretical basis of ligating linear DNA molecules in solution has been described. Jacobson and Stockmayer (J Chem Phys 18:1600-1606, 1950) modeled DNA as a series of rigid segments of length b, joined by freely movable joints, and of total contour length l. For intramolecular ligation to take place, the effective concentration, j, of one end of a long DNA molecule in the neighborhood of the other end can be represented by the equation:

$$j = \left(\frac{3}{2\pi l b}\right)^{3/2}\ \text{ends per ml}$$

The value for b has been estimated by Hearst and Stockmayer (J Chem Phys 37:1425-1433, 1962) to be $7.2 \times 10^{-2}$ micrometer from sedimentation data, leading to the simplification:

$$j = \frac{63.4}{(\text{kb})^{1/2}}\ \mu\text{g/ml},$$

where kb is the length of the DNA fragment in kilobases.

This theoretical treatment is consistent with the experimental results of Collins and Weissman (Proc Natl Acad Sci USA 81:6812-6816, 1984). When a ligation reaction is carried out at a DNA concentration i that is less than j, the formation of circles by intramolecular ligation is favored. However, when i is greater than j, intermolecular ligation of DNA fragments yielding chimeric DNA molecules is favored. Accordingly, at any given DNA concentration, i, the fraction of circles formed can be predicted by the equation:

$$\%\ \text{circles} = \frac{j}{i + j} \times 100$$

Ligation of DNA is therefore highly dependent on the concentration of DNA ends.
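The relationships above can be illustrated numerically. The following Python sketch is offered only as an illustration and is not part of any cited method: it assumes the simplified form j = 63.4/(kb)^(1/2) μg/ml and reads the circle-fraction expression as j/(i + j) × 100. The function names and the example fragment lengths and concentration are hypothetical.

```python
import math


def effective_end_concentration(length_kb: float) -> float:
    """Return j in ug/ml, assuming the simplification j = 63.4 / sqrt(kb)."""
    if length_kb <= 0:
        raise ValueError("fragment length must be positive")
    return 63.4 / math.sqrt(length_kb)


def percent_circles(dna_conc_ug_per_ml: float, length_kb: float) -> float:
    """Return the predicted percentage of circles, j / (i + j) * 100.

    dna_conc_ug_per_ml is the DNA concentration i of the ligation reaction.
    """
    if dna_conc_ug_per_ml < 0:
        raise ValueError("DNA concentration must be non-negative")
    j = effective_end_concentration(length_kb)
    return j / (dna_conc_ug_per_ml + j) * 100.0


# Illustrative (hypothetical) fragment lengths at a fixed concentration i.
example_conc = 1.0  # ug/ml, chosen only for illustration
for kb in (1, 4, 10, 40):
    j = effective_end_concentration(kb)
    pct = percent_circles(example_conc, kb)
    print(f"{kb:>3} kb: j = {j:6.2f} ug/ml, predicted circles = {pct:5.1f}%")
```

At a fixed concentration i, the predicted percentage of circles falls as fragment length increases, because j decreases with the square root of the length; when i equals j, the predicted yield of circles is 50%.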
The above theoretical considerations underpin the observed difficulties in producing efficient mate-pair libraries, especially those of long separation distance. A necessary trade-off to favor intramolecular ligation over intermolecular ligation during mate-pair library construction is that the ligation reaction must be performed at ever-increasing dilution as the DNA fragment length increases, with a consequent loss of efficiency. Most critically, even when carried out under theoretically optimal conditions favoring intramolecular ligation, there is still a significant background of intermolecular ligation events that is unacceptable for stringent applications such as the generation of scaffolds for de novo assembly of large complex genomes or the identification of rearrangements in cancer genomes.
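The dilution trade-off can be made concrete by rearranging the circle-fraction relationship. The Python sketch below is illustrative only: it assumes that the fraction of circles is f = j/(i + j) with j = 63.4/(kb)^(1/2) μg/ml, so that the highest DNA concentration compatible with a target fraction f is i = j(1 − f)/f. The function name, the target fraction, and the fragment lengths are hypothetical.

```python
import math


def max_concentration_for_circle_fraction(length_kb: float,
                                          target_fraction: float) -> float:
    """Return the highest DNA concentration i (ug/ml) predicted to give at
    least target_fraction circles, assuming f = j / (i + j) and
    j = 63.4 / sqrt(kb)."""
    if length_kb <= 0:
        raise ValueError("fragment length must be positive")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    j = 63.4 / math.sqrt(length_kb)
    # Solve f = j / (i + j) for i.
    return j * (1.0 - target_fraction) / target_fraction


# Hypothetical target: 95% circles, over a range of fragment lengths.
for kb in (1, 10, 100):
    i_max = max_concentration_for_circle_fraction(kb, 0.95)
    print(f"{kb:>4} kb: ligate at or below {i_max:.3f} ug/ml")
```

Under these assumptions, the permissible concentration scales with the inverse square root of the fragment length, so a 100-fold increase in length calls for a 10-fold greater dilution to hold the same predicted fraction of circles.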
Two general methods for constructing mate-pair libraries are presently in use. See Korbel et al. (Science 318:420-426, 2007) and Lok (U.S. Pat. Nos. 7,932,029 and 8,329,400). In the method of Korbel et al., a common biotin-labeled adapter sequence is ligated to the terminal ends of target DNA. The adapter-ligated DNA is then circularized to juxtapose the terminal ends. The resulting circularized molecule is then randomly fragmented, and the newly joined junction fragments are recovered by biotin affinity chromatography. The recovered fragments are then ligated to sequencing adapters to generate a mate-pair library ready for amplification and sequencing. In the method of Lok, a target DNA fragment is ligated to a short DNA backbone under dilute conditions to create a circular molecule. The bulk of the target DNA insert is then digested with enzymes to create a linear DNA molecule comprising short terminal fragments of the target DNA insert attached to the DNA backbone. This linear DNA fragment is then re-circularized by a second ligation reaction, juxtaposing the terminal regions of the target DNA insert to create the mate-pair library. During the critical circularization steps in both methods, a significant proportion of the ligation products are unwanted intermolecular, chimeric ligation products. The intermolecular products lead to artifactual mate-pair sequence reads, which greatly compromise the library and subsequent data analysis.
Methods and tools are therefore needed for reducing the effects of intermolecular, chimeric ligation products, for example in the generation and analysis of mate-pair libraries.