Analysis of DNA with currently available techniques provides a spectrum of information ranging from the confirmation that a test DNA is the same as or different from a standard sequence or an isolated fragment, to the express identification and ordering of each nucleotide of the test DNA. Not only are such techniques crucial for understanding the function and control of genes and for applying many of the basic techniques of molecular biology, but they have also become increasingly important as tools in genomic analysis and a great many non-research applications, such as genetic identification, forensic analysis, genetic counseling, medical diagnostics and many others. In these latter applications, both techniques providing partial sequence information, such as fingerprinting and sequence comparisons, and techniques providing full sequence determination have been employed (Gibbs et al., Proc. Natl. Acad. Sci USA 1989; 86:1919-1923; Gyllensten et al., Proc. Natl. Acad. Sci USA 1988; 85:7652-7656; Carrano et al., Genomics 1998; 4:129-136; Caetano-Anolles et al., Mol. Gen. Genet. 1992; 235:157-165; Brenner and Livak, Proc. Natl. Acad Sci USA 1989; 86:8902-8906; Green et al., PCR Methods and Applications 1991; 1:77-90; and Versalovic et al., Nucleic Acid Res. 1991; 19:6823-6831).
DNA sequencing methods currently available require the generation of a set of DNA fragments that are ordered by length according to nucleotide composition. The generation of this set of ordered fragments occurs in one of two ways: chemical degradation at specific nucleotides using the Maxam Gilbert method (Maxam A M and W Gilbert, Proc Natl Acad Sci USA 1977; 74:560-564) or dideoxy nucleotide incorporation using the Sanger method (Sanger F, S Nicklen, and A R Coulson, Proc Natl Acad Sci USA 1977; 74:5463-5467) so that the type and number of required steps inherently limits both the number of DNA segments that can be sequenced in parallel, and the number of operations which may be carried out in sequence. Furthermore, both methods are prone to error due to the anomalous migration of DNA fragments in denaturing gels. Time and space limitations inherent in these gel-based methods have fueled the search for alternative methods.
Several methods are under development that are designed to sequence DNA in a solid state format without a gel resolution step. The method that has generated the most interest is sequencing by hybridization. In sequencing by hybridization, the DNA sequence is read by determining the overlaps between the sequences of hybridized oligonucleotides. This strategy is possible because a long sequence can be deduced by matching up distinctive overlaps between its constituent oligomers (Strezoska Z, T Paunesku, D Radosavljevic, I Labat, R Drmanac, R Crkvenjakov, Proc Natl Acad Sci USA 1991; 88:10089-10093; Drmanac R, S Drmanac, Z Strezoska, T Paunesku, I Labat, M Zeremski, J Snoddy, W K Funkhouser, B Koop, L Hood, R Crkvenjakov, Science 1993; 260:1649-1652). This method uses hybridization conditions for oligonucleotide probes that distinguish between complete complementarity with the target sequence and a single nucleotide mismatch, and does not require resolution of fragments on polyacrylamide gels (Jacobs, K A, R Rudersdorf, S D Neill, J P Dougherty, E L Brown, and E F Fritsch, Nucleic Acids Res. 1988; 16:4637-4650). Recent versions of sequencing by hybridization add a DNA ligation step in order to increase the ability of this method to discriminate between mismatches, and to decrease the length of the oligonucleotides necessary to sequence a given length of DNA (Broude N E, T Sano, C L Smith, C R Cantor, Proc. Natl. Acad. Sci. USA 1994;91:3072-3076, Drmanac R T, International Business Communications, Southborough, Mass.). 
Significant obstacles with this method are its inability to accurately position repetitive sequences in DNA fragments, inhibition of probe annealing by the formation of internal duplexes in the DNA fragments, and the influence of nearest neighbor nucleotides within and adjacent to an annealing domain on the melting temperature for hybridization (Riccelli P V, A S Benight, Nucleic Acids Res 1993;21:3785-3788, Williams J C, S C Case-Green, K U Mir, E M Southern. Nucleic Acids Res 1994;22:1365-1367). Furthermore, sequencing by hybridization cannot determine the length of tandem short repeats, which are associated with several human genetic diseases (Warren S T, Science 1996; 271:1374-1375). These limitations have prevented its use as a primary sequencing method.
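The overlap principle and the tandem-repeat limitation described above can be illustrated with a small sketch. It is a simplified model rather than any published reconstruction algorithm: the function names, the fixed probe length k, and the assumption of an error-free set of hybridized oligomers are all hypothetical choices made for illustration.

```python
def kmer_spectrum(seq, k):
    """Return the set of all length-k oligomers present in seq.

    This stands in for an idealized, error-free hybridization result
    (an assumption made for illustration only).
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(spectrum, start):
    """Extend `start` by repeatedly finding the one unused oligomer that
    overlaps the current end by k-1 nucleotides.

    Returns (sequence, ambiguous), where `ambiguous` is True if the walk
    stopped because more than one oligomer could extend the sequence.
    """
    k = len(start)
    if k < 2:
        raise ValueError("probe length must be at least 2")
    seq = start
    used = {start}
    while True:
        suffix = seq[-(k - 1):]
        candidates = [suffix + base for base in "ACGT"
                      if suffix + base in spectrum and suffix + base not in used]
        if len(candidates) != 1:
            return seq, len(candidates) > 1
        seq += candidates[0][-1]
        used.add(candidates[0])


# A sequence without repeats is recovered from its overlapping 4-mers.
unique = "ATGCGTAC"
print(reconstruct(kmer_spectrum(unique, 4), "ATGC"))

# Two tandem-repeat sequences of different length give the same 4-mer set,
# so the repeat length cannot be recovered from the oligomers alone.
print(kmer_spectrum("ATGCATGCATGG", 4) == kmer_spectrum("ATGCATGG", 4))
```

In this toy model the two repeat-containing sequences yield identical oligomer sets, which is one way of seeing why hybridization data alone cannot fix the length of a short tandem repeat.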
The base addition DNA sequencing scheme uses fluorescently labeled reversible terminators of polymerase extension, with a distinct and removable fluorescent label for each of the four nucleotide analogs (Metzker M L, Raghavachari R, Richards S, Jacutin S E, Civitello A, Burgess K and R A Gibbs, Nucleic Acids Res. 1994; 22:4259-4267; Canard B and R S Sarfati, Gene 1994; 148:1-6). Incorporation of one of these base analogs into the growing primer strand allows identification of the incorporated nucleotide by its fluorescent label. This is followed by removal of the protecting/fluorescent group, creating a new substrate for template-directed polymerase extension. Iteration of these steps is designed to permit sequencing of a multitude of templates in a solid state format. Technical obstacles include a relatively low efficiency of extension and deprotection, and interference with primer extension caused by single-strand DNA secondary structure. A fundamental limitation to this approach is inherent in iterative methods that sequence consecutive nucleotides. That is, in order to sequence more than a handful of nucleotides, each cycle of analog incorporation and deprotection must approach 100% efficiency. Even if the base addition sequencing scheme is refined so that each cycle occurs at 95% efficiency, one will have less than 75% of the product of interest after only 6 cycles (0.95^6 = 0.735). This will severely limit the ability of this method to sequence anything but very short DNA sequences. Only one cycle of template-directed analog incorporation and deprotection appears to have been demonstrated so far (Metzker M L, Raghavachari R, Richards S, Jacutin S E, Civitello A, Burgess K and R A Gibbs, Nucleic Acids Res. 1994; 22:4259-4267; Canard B and R S Sarfati, Gene 1994; 148:1-6).
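The compounding loss in an iterative scheme can be checked with a short calculation. The sketch below assumes that every cycle has the same efficiency and that losses are independent from cycle to cycle; the function names and the 50% threshold in the usage lines are illustrative choices, not values from the cited work.

```python
def cumulative_yield(efficiency, cycles):
    """Fraction of correct product left after `cycles` steps, assuming the
    same per-cycle efficiency and independent losses at every step."""
    return efficiency ** cycles


def max_cycles(efficiency, min_yield):
    """Largest number of cycles for which the cumulative yield stays at or
    above `min_yield` (requires 0 < efficiency < 1)."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must be between 0 and 1 (exclusive)")
    cycles = 0
    remaining = 1.0
    while remaining * efficiency >= min_yield:
        remaining *= efficiency
        cycles += 1
    return cycles


# Six cycles at 95% efficiency leave about 73.5% of the product of interest.
print(round(cumulative_yield(0.95, 6), 3))

# Number of cycles at 95% efficiency before the yield drops below one half
# (the one-half threshold is an arbitrary illustrative cutoff).
print(max_cycles(0.95, 0.5))
```

Under these assumptions the yield falls geometrically with the number of cycles, so the achievable read length is set almost entirely by the per-cycle efficiency.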
A related earlier method, which is designed to sequence only one nucleotide per template, uses radiolabeled nucleotides or conventional non-reversible terminators attached to a variety of labels (Sokolov B P, Nucleic Acids Research 1989;18:3671; Kuppuswamy M N, J W Hoffman, C K Kasper, S G Spitzer, S L Groce, and S P Bajaj, Proc. Natl. Acad Sci. USA 1991; 88:1143-1147). Recently, this method has been called solid-phase minisequencing (Syvanen A C, E Ikonen, T Manninen, M Bengstrom, H Soderlund, P Aula, and L Peltonen, Genomics 1992; 12:590-595; Kobayashi M, Rappaport E, Blasband A, Semeraro A, Sartore M, Surrey S, Fortina P., Molecular and Cellular Probes 1995; 9:175-182) or genetic bit analysis (Nikiforov T T, R B Rendle, P Goelet, Y H Rogers, M L Kotewicz, S Anderson, G L Trainor, and M R Knapp, Nucleic Acids Research 1994; 22:4167-4175), and it has been used to verify the parentage of thoroughbred horses (Nikiforov T T, R B Rendle, P Goelet, Y H Rogers, M L Kotewicz, S Anderson, G L Trainor, and M R Knapp, Nucleic Acids Research 1994; 22:4167-4175).
An alternative method for DNA sequencing that remains in the development phase entails the use of flow cytometry to detect single molecules. In this method, one strand of a DNA molecule is synthesized using fluorescently labeled nucleotides, and the labeled DNA molecule is then digested by a processive exonuclease, with identification of the released nucleotides in real time using flow cytometry. Technical obstacles to the implementation of this method include the fidelity of incorporation of the fluorescently labeled nucleotides and turbulence created around the microbead to which the single molecule of DNA is attached (Davis L M, F R Fairfield, C A Harger, J H Jett, R A Keller, J H Hahn, L A Krakowski, B L Marrone, J C Martin, H L Nutter, R L Ratliff, E B Shera, D J Simpson, S A Soper, Genetic Analysis, Techniques, and Applications 1991; 8:1-7). Furthermore, this method is not amenable to sequencing numerous DNA segments in parallel.
Another DNA sequencing method has recently been developed that uses class-IIS restriction endonuclease digestion and adaptor ligation to sequence at least some nucleotides offset from a terminal nucleotide. Using this method, four adjacent nucleotides have reportedly been sequenced and read following the gel resolution of DNA fragments. However, a limitation of this sequencing method is that it has built-in product losses, and requires many iterative cycles (International Application PCT/US95/03678).
Another problem exists with currently available technologies in the area of diagnostic sequencing. An ever widening array of disorders, susceptibilities to disorders, prognoses of disease conditions, and the like, have been correlated with the presence of particular DNA sequences, or the degree of variation (or mutation) in DNA sequences, at one or more genetic loci. Examples of such phenomena include human leukocyte antigen (HLA) typing, cystic fibrosis, tumor progression and heterogeneity, p53 proto-oncogene mutations, and ras proto-oncogene mutations (Gyllensten et al., PCR Methods and Applications, 1:91-98 (1991); International application PCT/US92/01675; and International application PCT/CA90/00267). A difficulty in determining DNA sequences associated with such conditions to obtain diagnostic or prognostic information is the frequent presence of multiple subpopulations of DNA, e.g., allelic variants, multiple mutant forms, and the like. Distinguishing the presence and identity of multiple sequences with current sequencing technology is impractical due to the amount of DNA sequencing required.
The present invention provides an alternative approach for sequencing DNA that does not require high resolution separations and that generates signals more amenable to analysis. The methods of the present invention can also be easily automated. This provides a means for readily analyzing DNA from many genetic loci. Furthermore, the DNA sequencing method of the present invention does not require the gel resolution of DNA fragments, which allows for the simultaneous sequencing of cDNA or genomic DNA library inserts. Therefore, full length transcribed sequences or genomes can be obtained very rapidly with the methods of the present invention. The method of the present invention further provides a means for the rapid sequencing of previously uncharacterized viral, bacterial or protozoan human pathogens, as well as the sequencing of plants and animals of interest to agriculture, conservation, and/or science.
The present invention pertains to methods which can sequence multiple DNA segments in parallel, without running a gel. Each DNA sequence is determined without ambiguity, as this novel method sequences DNA in discrete intervals that start at one end of each DNA segment. The method of the present invention is carried out on DNA that is almost entirely double-stranded, thus preventing the formation of secondary structures that complicate the known sequencing methods that rely on hybridization to single-stranded templates (e.g., sequencing by hybridization), and overcoming obstacles posed by microsatellite repeats, other direct repeats, and inverted repeats, in a given DNA segment. The iterative and regenerative DNA sequencing method described herein also overcomes the obstacles to sequencing several thousand distinct DNA segments attached to addressable sites on a matrix or a chip, because it is carried out in iterative steps and in various embodiments effectively preserves the sample through a multitude of sequencing steps, or creates a nested set of DNA segments to which a few steps are applied in common. It is, therefore, highly suitable for automation. Furthermore, the present invention particularly addresses the problem of increasing throughput in DNA sequencing, both in number of steps and parallelism of analyses, and it will facilitate the identification of disease-associated gene polymorphisms, with particular value for sequencing entire genomes and for characterizing the multiple gene mutations underlying polygenic traits. Thus, the invention pertains to novel methods for generating staggered templates and for iterative and regenerative DNA sequencing as well as to methods for automated DNA sequencing.
Accordingly, the invention features a method for identifying a first nucleotide n and a second nucleotide n+x in a double stranded nucleic acid segment. The method includes (a) digesting the double stranded nucleic acid segment with a restriction enzyme to produce a double stranded molecule having a single stranded overhang sequence corresponding to an enzyme cut site; (b) providing an adaptor having a cycle identification tag, a restriction enzyme recognition domain, a sequence identification region, and a detectable label; (c) hybridizing the adaptor to the double stranded nucleic acid having the single-stranded overhang sequence to form a ligated molecule; (d) identifying the nucleotide n by identifying the ligated molecule; (e) amplifying the ligated molecule from step (d) with a primer specific for the cycle identification tag of the adaptor; and (f) repeating steps (a) through (d) on the amplified molecule from step (e) to yield the identity of the nucleotide n+x, wherein x is less than or equal to the number of nucleotides between a recognition domain for a restriction enzyme and an enzyme cut site.
In another aspect, the invention features a method for sequencing an interval within a double stranded nucleic acid segment by identifying a first nucleotide n and a second nucleotide n+x in a plurality of staggered double stranded molecules produced from the double stranded nucleic acid segment. The method includes (a) attaching an enzyme recognition domain to different positions along the double stranded nucleic acid segment within an interval no greater than the distance between a recognition domain for a restriction enzyme and an enzyme cut site, such attachment occurring at one end of the double stranded nucleic acid segment; (b) digesting the double stranded nucleic acid segment with a restriction enzyme to produce a plurality of staggered double stranded molecules each having a single stranded overhang sequence corresponding to the cut site; (c) providing an adaptor having a restriction enzyme recognition domain, a sequence identification region, and a detectable label; (d) hybridizing the adaptor to the double stranded nucleic acid having the single-stranded overhang sequence to form a ligated molecule; (e) identifying a nucleotide n within a staggered double stranded molecule by identifying the ligated molecule; (f) repeating steps (b) through (e) to yield the identity of the nucleotide n+x in each of the staggered double stranded molecules having the single strand overhang sequence thereby sequencing an interval within the double stranded nucleic acid segment, wherein x is greater than one and no greater than the number of nucleotides between a recognition domain for a restriction enzyme and an enzyme cut site.
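The way staggered templates fill in an interval can be pictured with a toy model. It is a sketch under strong simplifying assumptions and not the claimed procedure: it ignores all enzymology, assumes exactly one nucleotide is read per cycle, and assumes each cycle advances the read position by a fixed step x. The names `read_positions` and `sequence_interval` are hypothetical.

```python
def read_positions(offset, step, cycles):
    """Positions read on one template whose first read is at `offset`,
    assuming each cycle moves the read position forward by `step`."""
    return [offset + cycle * step for cycle in range(cycles)]


def sequence_interval(segment, step, cycles):
    """Combine the reads from `step` staggered templates (offsets 0..step-1).

    Each template on its own yields every `step`-th nucleotide (n, n+x, ...);
    taken together, the staggered templates cover every position in the
    interval. `segment` stands in for the true sequence being read.
    """
    calls = {}
    for offset in range(step):
        for pos in read_positions(offset, step, cycles):
            if pos < len(segment):
                calls[pos] = segment[pos]
    return "".join(calls[pos] for pos in sorted(calls))


# One template with offset 1 and x = 3 reads positions 1, 4 and 7.
print(read_positions(1, 3, 3))

# Three staggered templates, three cycles each, cover positions 0 through 8.
print(sequence_interval("ACGTTGCAAG", 3, 3))
```

In this model a single template samples the segment at intervals of x, and it is the set of templates staggered within one such interval that produces a contiguous read.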
In another aspect, the invention features a method for identifying a first nucleotide n and a second nucleotide n+x in a double stranded nucleic acid segment. The method includes (a) digesting the double stranded nucleic acid segment with a restriction enzyme to produce a double stranded molecule having a 5′ single stranded overhang sequence corresponding to an enzyme cut site; (b) identifying the nucleotide n by template-directed polymerization with a labeled nucleotide or nucleotide terminator; (c) providing an adaptor having a cycle identification tag and a restriction enzyme recognition domain; (d) ligating the adaptor to the double stranded nucleic acid to form a ligated molecule; (e) amplifying the ligated molecule from step (d) with a primer specific for the cycle identification tag of the adaptor; and (f) repeating steps (a) through (b) on the amplified molecule from step (e) to yield the identity of the nucleotide n+x, wherein x is less than or equal to the number of nucleotides between a recognition domain for a restriction enzyme and an enzyme cut site.
Yet another aspect of the invention pertains to a method for sequencing an interval within a double stranded nucleic acid segment by identifying a first nucleotide n and a second nucleotide n+x in a plurality of staggered double stranded molecules produced from the double stranded nucleic acid segment. The method includes (a) attaching an enzyme recognition domain to different positions along the double stranded nucleic acid segment within an interval no greater than the distance between a recognition domain for a restriction enzyme and an enzyme cut site, such an attachment occurring at one end of the double stranded nucleic acid segment; (b) digesting the double stranded nucleic acid segment with a restriction enzyme to produce a plurality of staggered double stranded molecules each having a 5′ single stranded overhang sequence corresponding to the cut site; (c) identifying a nucleotide n within a staggered double stranded molecule by template-directed polymerization with a labeled nucleotide or nucleotide terminator; (d) providing an adaptor having a restriction enzyme recognition domain; (e) ligating the adaptor to the double stranded nucleic acid to form a ligated molecule; (f) repeating steps (b) through (c) to yield the identity of the nucleotide n+x in each of the staggered double stranded molecules having the single strand overhang sequence thereby sequencing an interval within the double stranded nucleic acid segment, wherein x is greater than one and no greater than the number of nucleotides between a recognition domain for a restriction enzyme and an enzyme cut site.
The invention also pertains to a method for removing all or a part of a primer sequence from a primer extended product. The method includes (a) providing a primer sequence encoding a methylated portion of a restriction endonuclease recognition domain, wherein recognition of the domain by a restriction endonuclease requires at least one methylated nucleotide; (b) polymerizing by a template-directed primer extension using the primer and a nucleic acid segment to generate a primer extended product; and (c) digesting the primer extended product with a restriction endonuclease that recognizes the resulting double-stranded restriction endonuclease recognition domain encoded by the primer sequence in the primer extended product.
A still further aspect of the invention pertains to a method for blocking a restriction endonuclease recognition domain in a primer extended product. The method includes (a) providing a primer with at least one modified nucleotide, wherein the modified nucleotide blocks an enzyme recognition domain, and at least a portion of the enzyme recognition domain sequence is encoded in the primer; (b) polymerizing by a template-directed primer extension using the primer and a nucleic acid segment to generate a primer extended product; and (c) digesting the primer extended product with an enzyme that recognizes a double-stranded enzyme recognition domain in the primer extended product.
In another aspect of the invention there is provided a method and device for automated sequencing of double-stranded DNA segments with nested single strand overhang templates, wherein a plurality of double-stranded DNA segments are immobilized at sites of a microtiter support or chip array having a plurality of sample holders arrayed in a matrix of positions on the support. Each DNA segment has an end comprising a single-strand overhang template sequence no longer than about twenty nucleotides in length. The device then implements a protocol simultaneously treating all sample holders with one or more reagents which selectively react with at least one nucleotide of the single-strand overhang template to effectively label the material at each holder, then reading the array by automated detection to determine at least one nucleotide of the single-strand overhang template at each position. Thereafter, the method proceeds by reducing length of each strand of the DNA segment at each holder by a fixed number n greater than 1 at the overhang end, thus yielding a homologously ordered array of shorter and nested DNA segments, each with a single-strand overhang template sequence, which preferably remain immobilized at the same positions on the support where the treatment protocol is repeated to determine at least one nucleotide at each single-strand overhang sequence. The steps of treating, reading and reducing the length of the strands of the DNA segment at each holder by a number of n greater than 1 nucleotides are iteratively performed as automated process steps to produce nested and progressively shorter DNA segments and to sequence the plurality of DNA segments immobilized at the array of sample holders in situ.
In another aspect the invention includes a method for automated sequencing of double stranded DNA segments by attaching a recognition domain to each segment to form a set of DNA segments having the recognition domain nested at an interval no greater than the distance between the recognition domain and its cut site for a given enzyme that recognizes the recognition domain; treating the DNA segments with an enzyme that recognizes the attached recognition domain and cuts each strand of each DNA segment to create an overhang template at a distance of greater than 1 nucleotide along the DNA segment from the recognition domain so as to generate a set of nested overhang templates; and determining at least one nucleotide of each of the nested overhang templates. Thereafter, the method proceeds by reducing the length of each strand at the end of the DNA segment with the overhang template by greater than 1 nucleotide to produce a corresponding set of shorter DNA segments each with an overhang template. The step of reducing is performed by removing a block of nucleotides, so that each shorter DNA segment with an overhang template is a known subinterval of a previous DNA segment with overhang.
In another aspect of the invention there is provided a method and device for automated sequencing of double-stranded DNA segments, wherein a plurality of double-stranded DNA segments are immobilized at sites of a microtiter support or chip array having a plurality of sample holders arrayed in a matrix of positions on the support. Each DNA segment has an end comprising a single-strand overhang template sequence no longer than about twenty nucleotides in length. The device then simultaneously treats all sample holders with one or more reagents which selectively react with at least one nucleotide of the single-strand overhang template to effectively label the material at each holder, and reads the array by automated detection to determine at least one nucleotide of the single-strand overhang template at each position. Thereafter, the method proceeds by regenerating material at the respective sample holders by DNA amplification in vitro and reducing the length of each strand of the regenerated DNA segment at each holder by a fixed number n≥1 at the overhang end, thus yielding a homologously ordered array of shorter and nested DNA segments, each with a single-strand overhang template sequence, which preferably remain immobilized at the same positions on the support, and the treatment protocol is repeated to determine at least one nucleotide at each single-strand overhang sequence. The steps of treating, reading, regenerating and reducing the length of the strands of the DNA segment at each holder by a number of n greater than 1 nucleotides are iteratively performed as automated process steps to produce nested and progressively shorter DNA segment ends and to sequence the plurality of DNA segments immobilized at the array of sample holders in situ.
In another aspect the invention includes a method for automated sequencing of double stranded DNA segments by attaching a recognition domain to each segment to form DNA segments having the recognition domain, regenerating the template precursor by DNA amplification in vitro, treating the DNA segments with an enzyme that recognizes the attached recognition domain and cuts each strand of each DNA segment to create an overhang template at a distance of ≥1 nucleotide along the DNA segment from the recognition domain, and determining at least one nucleotide of the overhang template. The method includes the step of reducing the length of each strand at the end of the DNA segment with the overhang template by greater than 1 nucleotide to produce a corresponding set of shortened DNA segments each with an overhang template, the step of reducing being performed by removing a block of nucleotides, so that each shortened DNA segment with an overhang template is a known subinterval of a previous DNA segment with overhang.
The invention further contemplates an automated instrument for effectively performing the sequencing, wherein a stage carries the support on a device equipped for providing the respective buffers, solutions and reagents, for stepping or positioning the array for reading, and in some embodiments robotic manipulation for sample transfer, and heating for amplification, e.g., treating at least a portion of material at each sample holder with a primer and heat cycling to regenerate material at the respective sample holders. The stage may be rotatable, spinning to cause fluid provided at a central position to centrifugally flow across the array to alter material immobilized in the sample holders. Preferably the stage holds plural support arrays, and may operate robotically to transfer material from the sites of one support array to the sites of another support array, so that all the samples on one support may undergo one set of process steps in common (e.g., washing, digestion, labeling) while those on the other support undergo another (e.g., heating/amplification or scintillation reading).
Generally, the methods of the invention are applicable to all tasks where DNA sequencing is employed, including medical diagnostics, genetic mapping, genetic identification, forensic analysis, molecular biology research, and the like.