DNA sequencing is the most important analytical tool for understanding the genetic basis of living systems. The process involves determining the positions of each of the four major nucleotide bases, adenine (A), cytosine (C), guanine (G), and thymine (T) along the DNA molecule(s) of an organism. Short sequences of DNA are usually determined by creating a nested set of DNA fragments that begin at a unique site and terminate at a plurality of positions comprised of a specific base. The fragments terminated at each of the four natural nucleic acid bases (A, T, G and C) are then separated according to molecular size in order to determine the positions of each of the four bases relative to the unique site. The pattern of fragment lengths caused by strands that terminate at a specific base is called a “sequencing ladder.” The interpretation of base positions as the result of one experiment on a DNA molecule is called a “read.” There are different methods of creating and separating the nested sets of terminated DNA molecules (Adams et al., 1994; Primrose, 1998; Cantor and Smith, 1999).
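The relationship between sequencing ladders and a read can be illustrated with a minimal sketch. It assumes each ladder is given as a set of fragment lengths measured from the unique site; the function name `read_from_ladders` and the example data are hypothetical and are not taken from any cited method.

```python
def read_from_ladders(ladders):
    """Rebuild a read from four base-specific sequencing ladders.

    `ladders` maps each base to the set of fragment lengths (counted
    from the unique start site) that terminate at that base.
    """
    position_to_base = {}
    for base, lengths in ladders.items():
        for length in lengths:
            if length in position_to_base:
                raise ValueError(f"position {length} is claimed by two bases")
            position_to_base[length] = base
    # Sorting by fragment length stands in for the size-separation step.
    return "".join(position_to_base[n] for n in sorted(position_to_base))


# Hypothetical ladders for a seven-base read.
ladders = {
    "A": {1, 5},
    "C": {2, 6},
    "G": {3},
    "T": {4, 7},
}
print(read_from_ladders(ladders))  # ACGTACT
```

Each fragment length identifies one position, and the ladder in which it appears identifies the base at that position.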
Because the amount of any specific DNA molecule that can be isolated from even a large number of cells is usually very small, the only practical methods to prepare enough DNA molecules for most applications, including sequencing, involve amplification of specific DNA molecules in vivo or in vitro. There are six general methods important for manipulating DNA for analysis: 1) in vivo cloning of unique fragments of DNA; 2) in vitro amplification of unique fragments of DNA; 3) in vivo cloning of libraries (mixtures) of DNA fragments; 4) in vitro preparation of random libraries of DNA fragments; 5) in vivo cloning of ordered libraries of DNA; and 6) in vitro preparation of ordered libraries of DNA. The benefit of amplifying mixtures of DNA is that it facilitates analysis of large pieces of DNA (e.g., chromosomes) by creating libraries of molecules that are small enough to be analyzed by existing techniques. For example, the largest molecule that can be subjected to DNA sequencing methods is less than 2000 bases long, which is many orders of magnitude shorter than single chromosomes of organisms. Although short molecules can be analyzed, considerable effort is required to assemble the information from the analysis of the short molecules into a description of the larger piece of DNA.
1. In vivo Cloning of Unique DNA
Unique-sequence source DNA molecules can be amplified by separating them from other molecules (e.g., by electrophoresis), ligating them into an autonomously replicating genetic element (e.g., a bacterial plasmid), transfecting a host cell with the recombinant genetic element, and growing a clone of a single transfected host cell to produce many copies of the genetic element having the insert with the same unique sequence as the source DNA (Sambrook et al., 1989).
2. In vitro Amplification of Unique DNA
There are many methods designed to amplify DNA in vitro. Usually these methods are used to prepare unique DNA molecules from a complex mixture, e.g., genomic DNA or an artificial chromosome. Alternatively, a restricted set of molecules can be prepared as a library that represents a subset of sequences in the complex mixture. These amplification methods include PCR™, rolling circle amplification, and strand displacement (Walker, et al., 1996a; Walker, et al. 1996b; U.S. Pat. Nos. 5,648,213; 6,124,120).
The polymerase chain reaction (PCR™) can be used to amplify specific regions of DNA between two known sequences (U.S. Pat. Nos. 4,683,195, 4,683,202; Frohman et al., 1995). PCR™ involves the repetition of a cycle consisting of denaturation of the source (template) DNA, hybridization of two oligonucleotide primers to known sequences flanking the region to be amplified, and primer extension using a DNA polymerase to synthesize strands complementary to the DNA region located between the two primer sites. Because the products of one cycle of amplification serve as source DNA for succeeding cycles, the amplification is exponential. PCR™ can synthesize large numbers of specific molecules quickly and inexpensively.
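The two ideas in this description, a region delimited by two flanking primers and doubling in each cycle, can be sketched as follows. This is an idealized illustration only: exact string matching stands in for primer hybridization, perfect doubling is an upper bound that real reactions do not reach, and the template and primer sequences are made up for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def amplicon(template, forward_primer, reverse_primer):
    """Return the top-strand region bounded by a primer pair, or None.

    The forward primer matches the top strand directly. The reverse primer
    anneals to the top strand, so its site on the top strand reads as the
    reverse complement of the primer.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    site = reverse_complement(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]


def ideal_copies(initial_copies, cycles):
    """Copies after `cycles` rounds, assuming perfect doubling per cycle."""
    return initial_copies * 2 ** cycles


# Hypothetical template and primers.
template = "TTGACCGTAAGCTTGGCATCGATTACGGA"
print(amplicon(template, "GACCGT", "TAATCG"))
print(ideal_copies(1, 30))
```

Under the perfect-doubling assumption, thirty cycles turn one template into roughly a billion copies, which is why the amplification is described as exponential.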
The major disadvantages of the PCR™ method to amplify DNA are that 1) information about two flanking sequences must be known in order to specify the sequences of the primers; 2) synthesis of primers is expensive; 3) the level of amplification achieved depends strongly on the primer sequences, source DNA sequence, and the molecular weight of the amplified DNA; and 4) the length of amplified DNA is usually limited to less than 5 kb, although “long-distance” PCR™ (Cheng, 1994) allows molecules as long as 20 kb to be amplified.
“One-sided PCR™” techniques are able to amplify unknown DNA adjacent to one known sequence. These techniques can be divided into 4 categories: a) ligation-mediated PCR™, facilitated by addition of a universal adaptor sequence to a terminus usually created by digestion with a restriction endonuclease; b) universal primer-mediated PCR™, facilitated by a primer extension reaction initiated at arbitrary sites; c) terminal transferase-mediated PCR™, facilitated by addition of a homonucleotide “tail” to the 3′ end of DNA fragments; and d) inverse PCR™, facilitated by circularization of the template molecules. These techniques can be used to amplify successive regions along a large DNA template in a process sometimes called “chromosome walking” (Hui et al., 1998).
Ligation-mediated PCR™ is practiced in many forms. Rosenthal et al. (1990) outlined the basic process of amplifying an unknown region of DNA immediately adjacent to a known sequence located near the end of a restriction fragment. Reiley et al. (1990) used primers that were not exactly complementary with the adaptors in order to suppress amplification of molecules that did not have a specific priming site. Jones (1993) and Siebert (1995; U.S. Pat. No. 5,565,340) used long universal primers that formed intrastrand “panhandle” structures that suppressed PCR™ of molecules having two universal adaptors. Arnold (1994) used “vectorette” primers having unpaired central regions to increase the specificity of one-sided PCR™. Macrae and Brenner (1994) amplified short inserts from a Fugu genomic clone library using nested primers from a specific sequence and from vector sequences. Lin et al. (1995) ligated an adaptor to restriction fragment ends that had an overhanging 5′ end and employed hot-start PCR™ with a single universal anchor primer and nested specific-site primers to specifically amplify human sequences. Liao et al. (1997) used two specific site primers and 2 universal adaptors, one of which had a blocked 3′ end to reduce non-specific background, to amplify zebrafish promoters. Devon et al. (1995) used “splinkerette-vectorette” adaptors with special secondary structure in order to decrease non-specific amplification of molecules with two universal sequences during ligation-mediated PCR™. Padegimas and Reichert (1998) used phosphorothioate-blocked oligonucleotides and exoIII digestion to remove the unligated and partially ligated molecules from the reactions before performing PCR™, in order to increase the specificity of amplification of maize sequences. Zhang and Gurr (2000) used ligation-mediated hot-start PCR™ of restriction fragments using nested primers in order to amplify up to 6 kb of a fungal genome. 
The large amplicons were subsequently directly sequenced using primer extension.
To increase the specificity of ligation-mediated PCR™ products, many methods have been used to “index” the amplification process by selection for specific sequences adjacent to one or both termini (e.g., Smith, 1992; Unrau, 1994; Guilfoyle, 1997; U.S. Pat. No. 5,508,169).
One-sided PCR™ can also be achieved by direct amplification using a combination of unique and non-unique primers. Liu and Whittier (1995) developed an efficient PCR strategy, thermal asymmetric interlaced (TAIL)-PCR, that utilizes nested sequence-specific primers together with a shorter arbitrary degenerate primer so that the relative amplification efficiencies of specific and non-specific products can be thermally controlled. Harrison et al. (1997) performed one-sided PCR™ using a degenerate oligonucleotide primer that was complementary to an unknown sequence and three nested primers complementary to a known sequence in order to sequence transgenes in mouse cells. U.S. Pat. No. 5,994,058 specifies using a unique PCR™ primer and a second, partially degenerate PCR™ primer to achieve one-sided PCR™. Weber et al. (1998) used direct PCR™ of genomic DNA with nested primers from a known sequence and 1-4 primers complementary to frequent restriction sites. This technique does not require restriction digestion and ligation of adaptors to the ends of restriction fragments.
Terminal transferase can also be used in one-sided PCR™. Cormack and Somssich (1997) were able to amplify the termini of genomic DNA fragments using a method called RAGE (rapid amplification of genome ends) by a) restricting the genome with one or more restriction enzymes; b) denaturing the restricted DNA; c) providing a 3′ polythymidine tail using terminal transferase; and d) performing two rounds of PCR™ using nested primers complementary to a known sequence as well as the adaptor. Rudi et al. (1999) used terminal transferase to achieve chromosome walking in bacteria using a method of one-sided PCR™ that is independent of restriction digestion by a) denaturation of the template DNA; b) linear amplification using a primer complementary to a known sequence; c) addition of a poly C “tail” to the 3′ end of the single-stranded products of linear amplification using a reaction catalyzed by terminal transferase; and d) PCR™ amplification of the products using a second primer within the known sequence and a poly-G primer complementary to the poly-C tail in the unknown region. The products amplified by Rudi (1999) have a very broad size distribution, probably caused by a broad distribution of lengths of the linearly-amplified DNA molecules.
RNA polymerase can also be used to achieve one-sided amplification of DNA. U.S. Pat. No. 6,027,913 shows how one-sided PCR™ can be combined with transcription with RNA polymerase to amplify and sequence regions of DNA with only one known sequence.
Inverse PCR™ (Ochman et al., 1988) is another method to amplify DNA based on knowledge of a single DNA sequence. The template for inverse PCR™ is a circular molecule of DNA created by a complete restriction digestion, which contains a small region of known sequence as well as adjacent regions of unknown sequence. The oligonucleotide primers are oriented such that during PCR™ they give rise to primer extension products that extend away from the known sequence. This “inside-out” PCR™ results in linear DNA products with known sequences at the termini.
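The geometry of the “inside-out” product can be sketched with simple string operations. This is a simplified model and not the published protocol: it assumes a single restriction fragment that circularizes on itself, primers that sit exactly at the two edges of the known region, and a hypothetical `primer_len` parameter.

```python
def inverse_pcr_product(fragment, known_start, known_end, primer_len):
    """Model the linear product of inverse PCR on a circularized fragment.

    `fragment[known_start:known_end]` is the known region. After
    circularization the outward-facing primers extend away from the known
    region, so the product runs from the right edge of the known region,
    through the right-hand unknown DNA, across the ligation junction,
    through the left-hand unknown DNA, to the left edge of the known region.
    """
    known = fragment[known_start:known_end]
    if primer_len > len(known):
        raise ValueError("primers must fit inside the known region")
    left_unknown = fragment[:known_start]
    right_unknown = fragment[known_end:]
    return known[-primer_len:] + right_unknown + left_unknown + known[:primer_len]


# Hypothetical fragment: unknown "ATAT", known "GGCCGGCC", unknown "TCTC".
fragment = "ATAT" + "GGCCGGCC" + "TCTC"
print(inverse_pcr_product(fragment, 4, 12, 3))
```

The printed product begins and ends with known primer sequence, with the formerly flanking unknown regions joined in the middle.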
The disadvantages of all “one-sided PCR™” methods are that a) the length of the products is restricted by the limitation of PCR™ (normally about 2 kb, but with special reagents up to 50 kb); b) whenever the products are single DNA molecules longer than 1 kb they are too long to directly sequence; c) in ligation-mediated PCR™ the amplicon lengths are very unpredictable due to random distances between the universal priming site and the specific priming site(s), resulting in some products that are too short to walk a significant distance, some that are preferentially amplified due to small size, and some that are too long to amplify and analyze; and d) in methods that use terminal transferase to add a polynucleotide tail to the end of a primer extension product, there is great heterogeneity in the length of the amplicons due to sequence-dependent differences in the rate of primer extension.
Strand displacement amplification (Walker, et al. 1996a; Walker, et al. 1996b; U.S. Pat. Nos. 5,648,213; 6,124,120) is a method to amplify one or more termini of DNA fragments using an isothermal strand displacement reaction. The method is initiated at a nick near the terminus of a double-stranded DNA molecule, usually generated by a restriction enzyme, followed by a polymerization reaction by a DNA polymerase that is able to displace the strand complementary to the template strand. Linear amplification of the complementary strand is achieved by reusing the template multiple times by nicking each product strand as it is synthesized. The products are strands with 5′ ends at a unique site and 3′ ends that are various distances from the 5′ ends. The extent of the strand displacement reaction is not controlled and therefore the lengths of the product strands are not uniform. The polymerase used for strand displacement amplification does not have a 5′ exonuclease activity.
Rolling circle amplification (U.S. Pat. No. 5,648,245) is a method to increase the effectiveness of the strand displacement reaction by using a circular template. The polymerase, which does not have a 5′ exonuclease activity, makes multiple copies of the information on the circular template as it makes multiple continuous cycles around the template. The length of the product is very large, typically too large to be directly sequenced. Additional amplification is achieved if a second strand displacement primer is added to the reaction to use the first-strand displacement product as a template.
3. In vivo Cloning of DNA of Random Libraries
Libraries are collections of small DNA molecules that represent all parts of a larger DNA molecule or collection of DNA molecules (Primrose, 1998; Cantor and Smith, 1999). Libraries can be used for analytical and preparative purposes. Genomic clone libraries are the collection of bacterial clones containing fragments of genomic DNA. cDNA clone libraries are collections of clones derived from mRNA molecules.
Cloning of non-specific DNA is commonly used to separate and amplify DNA for analysis. DNA from an entire genome, one chromosome, a virus, or a bacterial plasmid is fragmented by a suitable method (e.g., hydrodynamic shearing or digestion with restriction enzymes), ligated into a special region of a bacterial plasmid or other cloning vector, transfected into competent cells, amplified as a part of a plasmid or chromosome during proliferation of the cells, and harvested from the cell culture. Critical to the specificity of this technique is the fact that the mixture of cells carrying different DNA inserts can be diluted and aliquoted such that some of the aliquots, whether on a surface or in a volume of solution, contain a single transfected cell containing a unique fragment of DNA. Proliferation of this single cell (in vivo cloning) amplifies this unique fragment of DNA so that it can be analyzed. This “shotgun” cloning method is used very frequently, because: 1) it is inexpensive; 2) it produces very pure sequences that are usually faithful copies of the source DNA; 3) it can be used in conjunction with clone screening techniques to create an unlimited amount of specific-sequence DNA; 4) it allows simultaneous amplification of many different sequences; 5) it can be used to amplify DNA as large as 1,000,000 bp long; and 6) the cloned DNA can be directly used for sequencing and other purposes.
Cloning is inexpensive, because many pieces of DNA can be simultaneously transfected into host cells. The general term for this process of mixing a number of different entities (e.g., electronic signals or molecules) is “multiplexing,” and is a common strategy for increasing the number of signals or molecules that can be processed simultaneously and subsequently separated to recover the information about the individual signals or molecules. In the case of conventional cloning, the recovery process involves diluting the bacterial culture such that an aliquot contains a single bacterium carrying a single plasmid, allowing the bacterium to multiply to create many copies of the original plasmid, and isolating the cloned DNA for further analysis.
The principle of multiplexing different molecules in the same transfection experiment is critical to the economy of the cloning method. However, after the transfection each clone must be grown separately and the DNA isolated separately for analysis. These steps, especially the DNA isolation step, are costly and time consuming. Several attempts have been made to multiplex steps after cloning, whereby hundreds of clones can be combined during the steps of DNA isolation and analysis and the characteristics of the individual DNA molecules recovered later. In one version of multiplex cloning the DNA fragments are separated into a number of pools (e.g., one hundred pools). Each pool is ligated into a different vector, possessing a nucleic acid tag with a unique sequence, and transfected into the bacteria. One clone from each transfection pool is combined with one clone from each of the other transfection pools in order to create a mixture of bacteria having a mixture of inserted sequences, where each specific inserted sequence is tagged with a unique vector sequence, and therefore can be identified by hybridization to the nucleic acid tag. This mixture of cloned DNA molecules can be subsequently separated and subjected to any enzymatic, chemical, or physical processes for analysis such as treatment with polymerase or size separation by electrophoresis. The information about individual molecules can be recovered by detection of the nucleic acid tag sequences by hybridization, PCR™ amplification, or DNA sequencing. Church has shown methods and compositions to use multiplex cloning to sequence DNA molecules by pooling clones tagged with different labels during the steps of DNA isolation, sequencing reactions, and electrophoretic separation of denatured DNA strands (U.S. Pat. Nos. 4,942,124 and 5,149,625). The tags are added to the DNA as parts of the vector DNA sequences. 
The tags used can be detected using oligonucleotides labeled with radioactivity, fluorescent groups, or volatile mass labels (Cantor and Smith, 1999; U.S. Pat. Nos. 4,942,124; 5,149,625; and 5,112,736; Richterich and Church, 1993). A later patent was directed to a technique whereby the tag sequences are ligated to the DNA fragments before cloning using a universal vector (U.S. Pat. No. 5,714,318). Another patent specifies a method whereby the tag sequences added before transfection are amplified using PCR™ after electrophoretic separation of the denatured DNA (PCT WO 98/15644).
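The recovery step of multiplex cloning, in which pooled molecules are assigned back to their source by tag, can be sketched as follows. This is an illustration only: a prefix match stands in for hybridization to the nucleic acid tag, and the pool names, tag sequences, and inserts are hypothetical.

```python
def demultiplex(pooled_molecules, tags):
    """Assign each pooled molecule to the pool whose tag it carries.

    `tags` maps a pool name to its unique tag sequence. A molecule is
    modeled as tag + insert; molecules with no recognized tag are skipped.
    """
    recovered = {name: [] for name in tags}
    for molecule in pooled_molecules:
        for name, tag in tags.items():
            if molecule.startswith(tag):  # stand-in for hybridization to the tag
                recovered[name].append(molecule[len(tag):])
                break
    return recovered


# Hypothetical tags and a pooled mixture of tagged inserts.
tags = {"pool_1": "ACGT", "pool_2": "TGCA"}
pooled = ["ACGTGGGAAA", "TGCACCCTTT", "ACGTGGG"]
print(demultiplex(pooled, tags))
```

Because every insert carries the tag of the vector it was cloned into, the mixture can be processed as one sample and the individual results separated afterward.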
4. In vitro Preparation of DNA as Random Libraries
DNA libraries can be formed in vitro and subjected to various selection steps to recover information about specific sequences. In vitro libraries are rarely used in genomics, because the methods that exist for creating such libraries do not offer advantages over cloned libraries. In particular, the methods used to amplify the in vitro libraries are not able to amplify all the DNA in an unbiased manner, because of the size and sequence dependence of amplification efficiency. PCT WO 00/18960 describes how different methods of DNA amplification can be used to create a library of DNA molecules representing a specific subset of the sequences within the genome for purposes of detecting genetic polymorphisms. “Random-prime PCR™” (U.S. Pat. Nos. 5,043,272; 5,487,985), “random-prime strand displacement” (U.S. Pat. No. 6,124,120), and “AFLP” (U.S. Pat. No. 6,045,994) are three examples of methods to create libraries that represent subsets of complex mixtures of DNA molecules.
Single-molecule PCR™ can be used to amplify individual randomly-fragmented DNA molecules (Lukyanov et al., 1996). In one method, the source DNA is first fragmented into molecules usually less than 10,000 bp in size, ligated to adaptor oligonucleotides, and extensively diluted and aliquoted into separate fractions such that the fractions often contain only a single molecule. PCR™ amplification of a fraction containing a single molecule creates a very large number of molecules identical to one of the original fragments. If the molecules are randomly fragmented, the amplified fractions represent DNA from random positions within the source DNA.
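The effect of dilution on how many fractions hold exactly one molecule can be estimated with a short calculation. The sketch assumes that the number of molecules per fraction follows a Poisson distribution, which is a common modeling assumption and is not stated in the cited method; the function names are illustrative.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a fraction holds exactly k molecules."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def single_fraction_among_occupied(mean):
    """Share of non-empty fractions that hold exactly one molecule."""
    occupied = 1.0 - poisson_pmf(0, mean)
    return poisson_pmf(1, mean) / occupied


# Compare a moderate dilution with a more extensive one.
for mean in (1.0, 0.1):
    share = single_fraction_among_occupied(mean)
    print(f"mean {mean} molecules per fraction: {share:.3f} of occupied fractions are single")
```

Under this assumption, more extensive dilution raises the share of occupied fractions that contain a single molecule, at the cost of leaving more fractions empty.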
WO0015779A2 describes how a specific sequence can be amplified from a library of circular molecules with random genomic inserts using rolling circle amplification.
5. Direct in vivo Cloning of Ordered Libraries of DNA
Directed cloning is a procedure to clone DNA from different parts of a larger piece of DNA, usually for the purpose of sequencing DNA from different positions along the source DNA. Methods to clone DNA with “nested deletions” have been used to make “ordered libraries” of clones that have DNA starting at different regions along a long piece of source DNA. In one version, one end of the source DNA is digested with one or more exonuclease activities to delete part of the sequence (McCombie et al., 1991; U.S. Pat. No. 4,843,003). By controlling the extent of exonuclease digestion, the average amount of the deletion can be controlled. The DNA molecules are subsequently separated based on size and cloned. By cloning molecules with different molecular weights, many copies of identical DNA plasmids are produced that have inserts ending at controlled positions within the source DNA. Transposon insertion (Berg et al., 1994) is also used to clone different regions of source DNA by facilitating priming or cleavage at random positions in the plasmids. The size separation and recloning steps make both of these methods labor intensive and slow. They are generally limited to covering regions less than 10 kb in size and cannot be used directly on genomic DNA, only on cloned DNA molecules. No in vivo methods are known to directly create ordered libraries of genomic DNA.
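The idea of a nested-deletion library can be illustrated with a toy model. It assumes perfectly controlled digestion in fixed steps from one end, which real exonuclease reactions only approximate; the function name, step size, and sequence are hypothetical.

```python
def nested_deletions(source, step, min_length=1):
    """Return clones of `source` with progressively longer deletions
    from the left end, ordered from the smallest deletion to the largest."""
    last_offset = len(source) - min_length
    return [source[offset:] for offset in range(0, last_offset + 1, step)]


# Hypothetical 10-base source digested in 3-base steps.
library = nested_deletions("ACGTACGTAC", step=3)
for clone in library:
    # Reading from the deletion endpoint samples a new region of the source.
    print(len(clone), clone[:4])
```

Each clone in the ordered library starts at a different position, so a short read from the deleted end of each clone covers a different part of the source DNA.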
6. Direct in vitro Preparation of Ordered Libraries of DNA
Ordered libraries have not been frequently created in vitro. Hagiwara (1996) used one-sided PCR™ to create an ordered library of PCR™ products that was used to sequence about 14 kb of a cosmid. The cosmids were first digested with multiple restriction enzymes, followed by ligation of vectorette adaptors to the products, PCR™ amplification of the products using primers complementary to a unique sequence in the cosmid and to the adaptor, size separation of the amplified DNA to establish the order of the restriction sites, and sequencing of the ordered PCR™ products. Because of the non-uniform spacing of the restriction sites, 2 kb of the 16 kb region were not sequenced. This method required substantial effort to produce and order the PCR™ products for the job of sequencing cloned DNA. No in vitro methods are known to directly create ordered genomic libraries of DNA.
7. Preparation of DNA
In methods known and used in the art, molecules for sequencing are prepared (see, for example, Sambrook et al. (1989) or Ausubel et al. (1994)).
Furthermore, Japan Patent No. JP8173164A2 describes a method of preparing DNA by sorting-out PCR™ amplification in the absence of cloning, fragmenting a double-stranded DNA, ligating a known-sequence oligomer to the cut end, and amplifying the resultant DNA fragment with a primer having the sorting-out sequence complementary to the oligomer. The sorting-out sequences consist of a fluorescent label and one to four bases at 5′ and 3′ termini to amplify the number of copies of the DNA fragment.
U.S. Pat. No. 6,107,023 describes a method of isolating duplex DNA fragments which are unique to one of two fragment mixtures, i.e., fragments which are present in a mixture of duplex DNA fragments derived from a positive source, but absent from a fragment mixture derived from a negative source. In practicing the method, double-strand linkers are attached to each of the fragment mixtures, and the number of fragments in each mixture is amplified by successively repeating the steps of (i) denaturing the fragments to produce single fragment strands; (ii) hybridizing the single strands with a primer whose sequence is complementary to the linker region at one end of each strand, to form strand/primer complexes; and (iii) converting the strand/primer complexes to double-strand fragments in the presence of polymerase and deoxynucleotides. After the desired fragment amplification is achieved, the two fragment mixtures are denatured, then hybridized under conditions in which the linker regions associated with the two mixtures do not hybridize. DNA species which are unique to the positive-source mixture, i.e., which are not hybridized with DNA fragment strands from the negative-source mixture, are then selectively isolated.
U.S. Pat. No. 6,114,149 regards a method of amplifying a mixture of different-sequence DNA fragments that may be formed from RNA transcription, or derived from genomic single- or double-stranded DNA fragments. The fragments are treated with terminal deoxynucleotide transferase and a selected deoxynucleotide, to form a homopolymer tail at the 3′ end of the anti-sense strands, and the sense strands are provided with a common 3′-end sequence. The fragments are mixed with a homopolymer primer that is homologous to the homopolymer tail of the anti-sense strands, and a defined-sequence primer which is homologous to the sense-strand common 3′-end sequence, with repeated cycles of fragment denaturation, annealing, and polymerization, to amplify the fragments. In one embodiment, the defined-sequence and homopolymer primers are the same, i.e., only one primer is used. The primers may contain selected restriction-site sequences, to provide directional restriction sites at the ends of the amplified fragments.
Thus, the present invention provides a new way of preparing DNA templates for more efficient sequencing of difficult DNA molecules, higher sequence quality, and longer reads.