There are a variety of methods and applications for which it is desirable to generate a library of fragmented and tagged DNA molecules from double-stranded DNA (dsDNA) target molecules. Often, the purpose is to generate smaller, single-stranded DNA (ssDNA) molecules (e.g., DNA fragments) from larger dsDNA molecules for use as templates in DNA or RNA polymerase reactions (e.g., for use as templates in DNA sequencing reactions or in DNA or RNA amplification reactions in which a primer anneals to the tag and is extended by a polymerase).
Until recently, most DNA sequencing was performed using the Sanger dideoxy chain termination sequencing method, in which a primer is extended by a polymerase using the DNA to be sequenced as a template. Four reactions are conducted, each with a mixture of all four canonical nucleotides (dATP, dCTP, dGTP, and dTTP) and one of the four chain-terminating dideoxynucleotides (ddATP, ddCTP, ddGTP, or ddTTP), and each reaction produces a nested set of chain-terminated fragments that begin with the primer and terminate with the dideoxynucleotide. When these chain-terminated DNA molecules are separated by size by electrophoresis, the order in which the ddNTPs were incorporated reflects the sequence of the template DNA. Using these methods, the sequence could be determined for a few hundred to about a thousand bases from the primer site. Determination of larger sequences required piecing the larger sequence together from overlapping information from numerous clones.
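The read-out logic of the four dideoxy reactions described above can be illustrated with a short sketch. It is an idealized model, not a description of any instrument: it assumes that every position of the extended strand yields exactly one terminated fragment in the matching reaction, and the function names `sanger_fragments` and `read_sequence` are hypothetical names chosen for this illustration.

```python
def sanger_fragments(extended_strand, primer_len=0):
    """Model the four dideoxy reactions (idealized assumption: one
    terminated fragment per position of the extended strand).

    Returns a dict mapping each terminating base (A, C, G, T) to the
    list of fragment lengths produced in that reaction.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(extended_strand, start=1):
        # A fragment ending at this position appears in the lane of the
        # ddNTP that matches the incorporated base.
        lanes[base].append(primer_len + position)
    return lanes


def read_sequence(lanes):
    """Order all fragments by length, shortest first, and read off the
    terminating base of each one, as is done after size separation."""
    by_length = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in by_length)


lanes = sanger_fragments("GATTACA")
print(lanes["A"])            # [2, 5, 7]
print(read_sequence(lanes))  # GATTACA
```

The sketch reads the extended strand, which is complementary to the template; it ignores signal decay with fragment length, which is what limits real reads to roughly the lengths stated above.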
Because these traditional methods require large amounts of DNA template, and because they produce poor results if large amounts of non-template DNA are present, Sanger dideoxy sequencing is often performed using cloned or amplified DNA. For example, most of the sequencing carried out during the Human Genome Project, which formally began in 1990 and culminated with the announcement of the completion of a ‘rough draft’ of the human genome sequence in 2000 and publication of the sequence of the last human chromosome in 2006, was based on using genomic libraries consisting of a population of host bacteria, each of which carried a DNA molecule that was cloned into a DNA vector, such that the collection of all DNA clones, each carrying a piece of the genomic DNA, represented the entire genome. This was a tedious and highly iterative process, involving construction and banking of large numbers of DNA clones (e.g., BAC clones), which, in turn, were often subcloned to generate libraries of smaller DNA clones, which were used as sequencing templates. Often, the primers used in these methods were designed to anneal to the vector such that they would be extended into the unknown cloned DNA during the sequencing reactions. This approach allowed the same set of primers to be used for analyzing many different clones.
In order to decrease the amount of subcloning required for the human genome sequencing project, one method that was sometimes used was “in vitro transposition.” The in vitro transposition method comprises using mobile genetic elements called transposons to insert a small piece of DNA of known sequence into the middle of the unknown DNA. The method comprises incubating a DNA clone from a genomic library with a transposon under conditions wherein a single insertion of the transposon into the DNA clone occurs, then transforming E. coli cells with an aliquot of the in vitro transposition reaction, and selecting cells that contained a marker, such as an antibiotic resistance marker, encoded by the transposon. Thus, the in vitro transposition reaction generates a library of “transposon insertion clones” from the parent DNA clone, each of which contains the transposon inserted at a different location in the DNA clone. Each insertion clone is then sequenced outward from each end of the transposon using a different primer for each DNA strand. As described above, the complete sequence of the parent DNA clone is constructed by overlapping the sequences obtained from different insertion clones. Examples of the use of this transposon insertion method for the Human Genome Project were described by Butterfield, Y S N et al., Nucleic Acids Res 30: 2460-2468, 2002; Shevchenko, Y et al., Nucleic Acids Res 30: 2469-2477, 2002; and Haapa, S et al., Genome Res 9: 308-315, 1999. Use of the in vitro transposition process for the Human Genome Project facilitated the complete sequencing of both genomic DNA clones and clones of cDNA generated from mRNA encoded by the genomic DNA. However, one disadvantage of this in vitro transposition method was that it was not totally in vitro, since it required the steps of transforming E. coli cells, selecting E. coli colonies that contained transposon insertions, and then isolating the DNA from the transposon insertion clones for sequencing.
In order to eliminate the requirement to transform E. coli cells with an aliquot of the in vitro transposition reaction and culture the E. coli cells on selective medium to obtain transposon insertion clones, Tenkanen et al. (U.S. Pat. No. 6,593,113) developed totally in vitro transposon-based methods comprising an in vitro transposition reaction and a PCR amplification reaction to select sequencing templates. According to Tenkanen et al., the examined DNA or target DNA used in their methods can range from a few base pairs up to 40 kilobase pairs, the only factor limiting the use of even longer DNA segments as target DNA being the inability of amplification reactions, such as PCR, to amplify longer segments. Thus, in some embodiments for generating sequencing templates using this method, the examined DNA or target DNA of up to about 40 Kb is first subjected to an in vitro transposition reaction, and then is PCR amplified using, as a first PCR primer, a fixed primer that is complementary to a known sequence in the target DNA or, if the target DNA is cloned in a vector, a fixed primer that is complementary to a sequence in the vector, and as a second PCR primer, a selective primer that is complementary to a sequence of the transposon end to which the target DNA is joined, plus, optionally, one to ten additional nucleotides of known identity at its 3′ end. In another embodiment, two selective primers are used for the PCR amplification step, at least one of which has one to ten additional nucleotides of known identity at its 3′ end. The methods of Tenkanen et al. provide certain benefits for Sanger sequencing because they eliminate the need to use E. coli cells to select DNA molecules that have transposon insertions. However, these methods are limited to target DNA of a size up to about 40 Kb and, due to the use of fixed or selective primers, the methods select for DNA molecules that exhibit only a portion of the sequences exhibited by the target DNA.
Therefore, although these methods were useful for Sanger sequencing, they are not suitable for generating sequencing templates for the newer “next-generation” DNA sequencing methods, which are capable of generating sequence data from up to millions of sequencing templates in a single sequencing run using a massively parallel or multiplex format.
Next-generation sequencing platforms include the 454 FLX™ or 454 TITANIUM™ (Roche), the SOLEXA™ Genome Analyzer (Illumina), the HELISCOPE™ Single Molecule Sequencer (Helicos Biosciences), and the SOLID™ DNA Sequencer (Life Technologies/Applied Biosystems) instruments, as well as other platforms still under development by companies such as Intelligent Biosystems and Pacific Biosciences. Although the chemistry by which sequence information is generated varies for the different next-generation sequencing platforms, all of them share the common feature of generating sequence data from a very large number of sequencing templates, on which the sequencing reactions are run simultaneously. In general, the data from all of these sequencing reactions are collected using a scanner, and then assembled and analyzed using computers and powerful bioinformatics programs. The sequencing reactions are performed, read, assembled, and analyzed in a “massively parallel” or “multiplex” fashion. The massively parallel nature of these instruments has required a change in thinking about what kind of sequencing templates are needed and how to generate them in order to obtain the maximum possible amounts of sequencing data from these powerful instruments. Thus, rather than requiring genomic libraries of DNA clones in E. coli, it is now necessary to think in terms of in vitro systems for generating DNA fragment libraries comprising a collection or population of DNA fragments generated from target DNA in a sample, wherein the combination of all of the DNA fragments in the collection or population exhibits sequences that are qualitatively and/or quantitatively representative of the sequence of the target DNA from which the DNA fragments were generated.
In fact, in some cases, it is necessary to think in terms of generating DNA fragment libraries consisting of multiple genomic DNA fragment libraries, each of which is labeled with a different address tag or bar code to permit identification of the source of each fragment sequenced.
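The use of address tags or bar codes to identify the source of each sequenced fragment can be sketched as follows. This is a minimal illustration under stated assumptions: the bar code is taken to be a fixed-length prefix of each read and is matched exactly, and the bar code sequences, sample names, and the function name `demultiplex` are hypothetical and are not drawn from any particular platform.

```python
def demultiplex(reads, barcodes):
    """Assign each read to its source library by its leading bar code.

    reads    -- list of read sequences (strings)
    barcodes -- dict mapping bar code sequence to library name; all bar
                codes are assumed to have the same length
    Returns (bins, unassigned): bins maps each library name to its reads
    with the bar code removed; unassigned holds reads with no match.
    """
    tag_len = len(next(iter(barcodes)))
    bins = {library: [] for library in barcodes.values()}
    unassigned = []
    for read in reads:
        library = barcodes.get(read[:tag_len])
        if library is None:
            unassigned.append(read)
        else:
            bins[library].append(read[tag_len:])
    return bins, unassigned


# Hypothetical bar codes and reads, for illustration only.
barcodes = {"ACGT": "library_1", "TGCA": "library_2"}
reads = ["ACGTGGGTTT", "TGCAAAACCC", "ACGTCCCC", "NNNNGGGG"]
bins, unassigned = demultiplex(reads, barcodes)
print(bins["library_1"])  # ['GGGTTT', 'CCCC']
print(unassigned)         # ['NNNNGGGG']
```

Exact prefix matching is the simplest possible choice; practical schemes typically tolerate sequencing errors in the bar code, which this sketch does not attempt.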
In general, these next-generation sequencing methods require fragmentation of genomic DNA or double-stranded cDNA (prepared from RNA) into smaller ssDNA fragments and addition of tags to at least one strand or preferably both strands of the ssDNA fragments. In some methods, the tags provide priming sites for DNA sequencing using a DNA polymerase. In some methods, the tags also provide sites for capturing the fragments onto a surface, such as a bead (e.g., prior to emulsion PCR amplification for some of these methods; e.g., using methods as described in U.S. Pat. No. 7,323,305). In most cases, the DNA fragment libraries used as templates for next-generation sequencing comprise 5′- and 3′-tagged DNA fragments or “di-tagged DNA fragments.” In general, current methods for generating DNA fragment libraries for next-generation sequencing comprise fragmenting the target DNA that one desires to sequence (e.g., target DNA comprising genomic DNA or double-stranded cDNA after reverse transcription of RNA) using a sonicator, nebulizer, or a nuclease, and joining (e.g., by ligation) oligonucleotides consisting of adapters or tags to the 5′ and 3′ ends of the fragments.
There are a number of problems and inefficiencies with current methods for generating next-generation sequencing templates, as is illustrated by the workflow used at the Wellcome Trust Sanger Institute, one of the world's largest genome centers (e.g., described in Quail, M A et al., Nature Methods 5: 1005-1010, 2008). For example, Quail et al. found that nebulization of genomic DNA for sequencing resulted in loss of approximately half of the DNA by mass and only about 5% of the original DNA consisted of fragments in the approximately 200-bp size range desired for sequencing using the Illumina Genome Analyzer. They found that an alternative method, called “adaptive focused acoustics,” gave higher yields of fragmented DNA and about 17% of the original DNA consisted of fragments in the desired 200-bp size range, but even this process is wasteful in terms of the sample or target DNA. Still further, the resulting DNA fragments often require size selection by gel electrophoresis, and additional steps to tag the size-selected DNA fragments, all of which are difficult, laborious, time-consuming, and expensive.
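The practical meaning of the yields reported above can be made concrete with a small calculation. The 5% and 17% figures are those stated above; the 1000 ng input mass is a hypothetical value chosen only for illustration, and the function name `usable_mass_ng` is likewise hypothetical.

```python
def usable_mass_ng(input_ng, fraction_in_range):
    """Mass of DNA (in ng) that ends up in the desired size range, given
    the input mass and the fraction of the original DNA in that range."""
    return input_ng * fraction_in_range


INPUT_NG = 1000.0  # hypothetical starting amount of genomic DNA

# Fractions of the original DNA in the ~200-bp range, as stated above.
for method, fraction in [("nebulization", 0.05), ("focused acoustics", 0.17)]:
    usable = usable_mass_ng(INPUT_NG, fraction)
    wasted = INPUT_NG - usable
    print(f"{method}: about {usable:.0f} ng usable, about {wasted:.0f} ng not usable")
```

Under these assumptions, even the better of the two methods leaves more than four-fifths of the starting DNA outside the desired size range.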
Thus, many of the methods currently used for fragmentation and tagging of double-stranded DNA for use in next-generation sequencing are wasteful of the DNA and require expensive instruments for fragmentation, and the procedures for fragmentation, tagging and recovering tagged DNA fragments are difficult, tedious, laborious, time-consuming, inefficient, and costly, and require relatively large amounts of sample nucleic acids. In addition, many of these methods generate tagged DNA fragments that are not fully representative of the sequences contained in the sample nucleic acids from which they were generated. Thus, what is needed in the art are methods for generating libraries of di-tagged DNA fragments in a massively parallel manner that overcome the limitations of the current methods.
Some of the next-generation sequencing methods use circular ssDNA substrates in their sequencing process. For example, U.S. Patent Application Nos. 20090011943; 20090005252; 20080318796; 20080234136; 20080213771; 20070099208; and 20070072208 of Drmanac et al., each incorporated herein by reference, disclose generation of circular ssDNA templates for massively parallel DNA sequencing. U.S. Patent Application No. 20080242560 of Gunderson and Steemers discloses methods comprising: making digital DNA balls (see, e.g., FIG. 8 in U.S. Patent Application No. 20080242560); and/or locus-specific cleavage and amplification of DNA, such as genomic DNA, including for amplification by multiple displacement amplification or whole genome amplification (e.g., FIG. 17 therein) or by hyperbranched RCA (e.g., FIG. 18 therein) for generating amplified nucleic acid arrays (e.g., ILLUMINA BeadArrays™; ILLUMINA, San Diego Calif., USA).
What is needed are improved methods, compositions and kits for making tagged circular ssDNA fragments from DNA from a biological sample (e.g., from genomic DNA or mitochondrial DNA or episomal DNA, including DNA cloned in a plasmid, BAC, fosmid or other episomal vector) for use in amplification or DNA sequencing methods (such as the methods described in U.S. Patent Application Nos. 20090011943; 20090005252; 20080318796; 20080234136; 20080213771; 20070099208; and 20070072208 of Drmanac et al.; or in U.S. Patent Application No. 20080242560 of Gunderson and Steemers or by Turner et al. of Pacific Biosciences and posted on their website at www.pacificbiosciences.com).
Still further, some methods for amplification, such as whole genome amplification, also require fragmentation and tagging of genomic DNA. Some of these methods are reviewed in: Whole Genome Amplification, ed. by S. Hughes and R. Lasken, 2005, Scion Publishing Ltd (on the worldwide web at scionpublishing.com), incorporated herein by reference.
What is needed are improved methods for generating libraries of DNA fragments from target DNA molecules for amplification, including amplification of whole or partial genomes from one organism (e.g., from a clinical sample) or from multiple organisms (e.g., metagenomic target DNA from an environmental sample), for further analysis (e.g., by real-time PCR, emulsion PCR, comparative genomic hybridization (CGH), comparative genomic sequencing (CGS), or for preparing DNA-specific labeled probes (e.g., chromosome-specific probes, e.g., chromosome paints, or e.g., gene-specific probes, e.g., for fluorescent in situ hybridization (FISH))), for a variety of purposes (e.g., for research, diagnostic, and industrial purposes).
Thus, what is needed in the art are better and more efficient methods for making libraries of tagged DNA fragments from target DNA for use in nucleic acid analysis methods such as next-generation sequencing and amplification methods. What is needed are methods for generating DNA fragment libraries that do not require specialized instruments, and that are easier, faster, require less hands-on time, can be performed with smaller DNA samples and smaller volumes, are efficient in tagging one or both ends of the fragments, and generate tagged DNA fragments that are qualitatively and quantitatively representative of the target nucleic acids in the sample from which they are generated.