The present invention relates generally to methods and compositions for determining the sequence of nucleic acid molecules, and more specifically, to methods and compositions which allow the determination of multiple nucleic acid sequences simultaneously.
Deoxyribonucleic acid (DNA) sequencing is one of the basic techniques of biology. It is at the heart of molecular biology and plays a rapidly expanding role in the rest of biology. The Human Genome Project is a multi-national effort to read the entire human genetic code. It is the largest project ever undertaken in biology, and has already begun to have a major impact on medicine. The development of cheaper and faster sequencing technology will ensure the success of this project. Indeed, the NIH and DOE branches of the Human Genome Project have funded a substantial effort to improve sequencing technology, but that effort has not had a substantial impact on current practices (Sulston and Waterston, Nature 376:175, 1995).
In the past two decades, determination and analysis of nucleic acid sequence has formed one of the building blocks of biological research. This, along with new investigational tools and methodologies, has allowed scientists to study genes and gene products in order to better understand the function of these genes, as well as to develop new therapeutics and diagnostics.
Two different DNA sequencing methodologies that were developed in 1977 are still in wide use today. Briefly, the enzymatic method described by Sanger (Proc. Natl. Acad. Sci. (USA) 74:5463, 1977), which utilizes dideoxy-terminators, involves the synthesis of a DNA strand from a single-stranded template by a DNA polymerase. The Sanger method of sequencing depends on the fact that dideoxynucleotides (ddNTPs) are incorporated into the growing strand in the same way as normal deoxynucleotides (albeit at a lower efficiency). However, ddNTPs differ from normal deoxynucleotides (dNTPs) in that they lack the 3′-OH group necessary for chain elongation. When a ddNTP is incorporated into the DNA chain, the absence of the 3′-hydroxy group prevents the formation of a new phosphodiester bond and the DNA fragment is terminated with the ddNTP complementary to the base in the template DNA. The Maxam and Gilbert method (Maxam and Gilbert, Proc. Natl. Acad. Sci. (USA) 74:560, 1977) employs chemical degradation of the original DNA (in both cases the DNA must be clonal). Both methods produce populations of fragments that begin from a particular point and terminate in every base that is found in the DNA fragment that is to be sequenced. The termination of each fragment is dependent on the location of a particular base within the original DNA fragment. The DNA fragments are separated by polyacrylamide gel electrophoresis and the order of the DNA bases (adenine, cytosine, thymine, guanine; also known as A, C, T, G, respectively) is read from an autoradiograph of the gel.
A cumbersome DNA pooling sequencing strategy (Church and Kieffer-Higgins, Science 24:185, 1988) is one of the more recent approaches to DNA sequencing. A pooling sequencing strategy consists of pooling a number of DNA templates (samples) and processing the samples as pools. In order to separate the sequence information at the end of the processing, the DNA molecules of interest are ligated to a set of oligonucleotide "tags" at the beginning. The tagged DNA molecules are pooled, amplified and chemically fragmented in 96-well plates. After electrophoresis of the pooled samples, the DNA is transferred to a solid support and then hybridized with a sequential series of specific labeled oligonucleotides. These membranes are then probed as many times as there are tags in the original pool, producing, in each set of probing, autoradiographs similar to those from standard DNA sequencing methods. Thus each reaction and gel yields a quantity of data equivalent to that obtained from conventional reactions and gels multiplied by the number of probes used. If alkaline phosphatase is used as the reporter enzyme, a 1,2-dioxetane substrate can be used, which is detected in a chemiluminescent assay format. However, this pooling strategy's major disadvantage is that the sequences can only be read by Southern blotting the sequencing gel and hybridizing this membrane once for each clone in the pool.
In addition to advances in sequencing methodologies, advances in speed have occurred due to the advent of automated DNA sequencing. Briefly, these methods use fluorescently labeled components in place of the radiolabeled components employed previously. Fluorescent dyes are attached either to the sequencing primers or to the ddNTP-terminators. Robotic components now utilize polymerase chain reaction (PCR) technology, which has led to the development of linear amplification strategies. Current commercial sequencing allows all 4 dideoxy-terminator reactions to be run in a single lane. Each dideoxy-terminator reaction is represented by a unique fluorescent primer (one fluorophore for each base type: A, T, C, G). Only one template DNA (i.e., DNA sample) is represented per lane. Current gels permit the simultaneous electrophoresis of up to 64 samples in 64 different lanes. Different ddNTP-terminated fragments are detected by irradiating the gel lane with light and then detecting the light emitted from the fluorophore. Each electrophoresis step is about 4-6 hours long. Each electrophoresis separation resolves about 400-600 nucleotides (nt) per lane; therefore, about 6000 nt can be sequenced per hour per sequencer.
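The throughput figure quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming that the "about 6000 nt per hour" figure is obtained by multiplying the 64 lanes by the read length per lane and dividing by the run time; the function name is illustrative only and does not appear in the cited literature.

```python
def throughput_nt_per_hour(lanes: int, nt_per_lane: float, run_hours: float) -> float:
    """Nucleotides read per hour for one gel run on one sequencer."""
    return lanes * nt_per_lane / run_hours


# Ranges taken from the text: 64 lanes, 400-600 nt per lane, 4-6 hours per run.
low = throughput_nt_per_hour(64, 400, 6)    # shortest reads, longest run
mid = throughput_nt_per_hour(64, 500, 5)    # midpoints of both ranges
high = throughput_nt_per_hour(64, 600, 4)   # longest reads, shortest run

print(f"{low:.0f} to {high:.0f} nt/hour, midpoint {mid:.0f} nt/hour")
```

Under these assumptions the midpoint is 6400 nt per hour, which is consistent with the rounded figure of about 6000 nt per hour.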
The use of mass spectrometry for the study of monomeric constituents of nucleic acids has also been described (Hignite, In Biochemical Applications of Mass Spectrometry, Waller and Dermer (eds.), Wiley-Interscience, Chapter 16, p. 527, 1972). Briefly, for larger oligomers, significant early success was obtained by plasma desorption for protected synthetic oligonucleotides up to 14 bases long, and for unprotected oligos up to 4 bases in length. As with proteins, the applicability of electrospray ionization mass spectrometry (ESI-MS) to oligonucleotides has been demonstrated (Covey et al., Rapid Comm. in Mass Spec. 2:249-256, 1988). These species are ionized in solution, with the charge residing at the acidic bridging phosphodiester and/or terminal phosphate moieties, and yield in the gas phase multiply charged molecular anions, in addition to sodium adducts.
Sequencing DNA with fewer than 100 bases by the common enzymatic ddNTP technique is more complicated than it is for larger DNA templates, so that chemical degradation is sometimes employed. However, the chemical decomposition method requires about 50 pmol of radioactive 32P end-labeled material, 6 chemical steps, electrophoretic separation, and film exposure. For small oligonucleotides (fewer than 14 nt) the combination of electrospray ionization (ESI) and Fourier transform (FT) mass spectrometry (MS) is far faster and more sensitive. Dissociation products of multiply-charged ions measured at high (10⁵) resolving power represent consecutive backbone cleavages, providing the full sequence in less than one minute on a sub-picomole quantity of sample (Little et al., J. Am. Chem. Soc. 116:4893, 1994). For molecular weight measurements, ESI/MS has been extended to larger fragments (Potier et al., Nuc. Acids Res. 22:3895, 1994). ESI/FTMS appears to be a valuable complement to classical methods for sequencing and for pinpointing mutations in nucleotides as large as 100-mers. Spectral data have recently been obtained by loading 3×10⁻¹³ mol of a 50-mer using a more sensitive ESI source (Valaskovic, Anal. Chem. 68:259, 1995).
The other approach to DNA sequencing by mass spectrometry is one in which DNA is labeled with individual isotopes of an element, so that the mass spectral analysis simply has to distinguish the isotopes after a mixture of sizes of DNA has been separated by electrophoresis. (The approach described above instead relies on the resolving power of the mass spectrometer to both separate and detect the DNA oligonucleotides of different lengths, a difficult proposition at best.) All of the procedures described below employ the Sanger procedure to convert a sequencing primer to a series of DNA fragments that vary in length by one nucleotide. The enzymatically synthesized DNA molecules each contain the original primer, a replicated sequence of part of the DNA of interest, and the dideoxy terminator. That is, a set of DNA molecules is produced that contain the primer and differ in length from each other by one nucleotide residue.
Brennen et al. (Biol. Mass Spec., New York, Elsevier, p. 219, 1990) have described methods that use the four stable isotopes of sulfur as DNA labels, enabling one to detect DNA fragments that have been separated by capillary electrophoresis. Using the α-thio analogues of the ddNTPs, a single sulfur isotope is incorporated into each of the DNA fragments. Therefore each of the four types of DNA fragments (ddTTP, ddATP, ddGTP, ddCTP-terminated) can be uniquely labeled according to the terminal nucleotide (for example, 32S for fragments ending in A, 33S for G, 34S for C, and 36S for T) and mixed together for electrophoresis. From the electrophoresis column, fractions of a few picoliters are obtained by a modified ink-jet printer head, and then subjected to complete combustion in a furnace. This process oxidizes the thiophosphates of the labeled DNA to SO2, which is subjected to analysis in a quadrupole or magnetic sector mass spectrometer. The SO2 mass unit representation is 64 for fragments ending in A, 65 for G, 66 for C, and 68 for T. Maintenance of the resolution of the DNA fragments as they emerge from the column depends on taking sufficiently small fractions. Because the mass spectrometer is coupled directly to the capillary gel column, the rate of analysis is determined by the rate of electrophoresis. This process is unfortunately expensive, liberates radioactive gas and has not been commercialized. Two other basic constraints also operate on this approach: (a) no other components with a mass of 64, 65, 66, or 68 (isobaric contaminants) can be tolerated, and (b) the % natural abundances of the sulfur isotopes (32S is 95.0, 33S is 0.75, 34S is 4.2, and 36S is 0.11) govern the sensitivity and cost. Since 32S is 95% naturally abundant, the other isotopes must be enriched to greater than 99% to eliminate contaminating 32S. Isotopes that are less than 1% abundant are quite expensive to obtain at 99% enrichment; even when 36S is purified 100-fold it contains as much or more 34S as it does 36S.
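The SO2 mass values quoted above follow from adding the nominal mass of two 16O atoms (32 mass units) to each sulfur isotope. The following is a minimal sketch of that arithmetic and of the resulting base call, assuming nominal integer masses and oxygen present only as 16O; all names are illustrative and are not part of the cited method.

```python
OXYGEN_16 = 16  # nominal mass of 16O (assumption: no heavier oxygen isotopes)

# Example isotope assignment given in the text: terminal base per sulfur isotope.
ISOTOPE_TO_BASE = {32: "A", 33: "G", 34: "C", 36: "T"}


def so2_mass(sulfur_isotope: int) -> int:
    """Nominal mass of SO2 formed from one sulfur isotope and two 16O atoms."""
    return sulfur_isotope + 2 * OXYGEN_16


# Lookup table from detected SO2 mass to the terminal base of the fragment.
SO2_MASS_TO_BASE = {so2_mass(s): base for s, base in ISOTOPE_TO_BASE.items()}


def read_bases(masses_in_elution_order):
    """Translate SO2 masses, one per fraction in elution order, into bases."""
    bases = []
    for mass in masses_in_elution_order:
        if mass not in SO2_MASS_TO_BASE:
            # Constraint (a) in the text: other species at these masses,
            # or unexpected masses, cannot be interpreted.
            raise ValueError(f"unexpected mass {mass}")
        bases.append(SO2_MASS_TO_BASE[mass])
    return "".join(bases)


print(SO2_MASS_TO_BASE)
print(read_bases([64, 66, 68, 65]))
```

The table reproduces the 64, 65, 66 and 68 mass units given in the text for fragments ending in A, G, C and T.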
Gilbert has described an automated DNA sequencer (EPA, 92108678.2) that consists of an oligomer synthesizer, an array on a membrane, a detector which detects hybridization and a central computer. The synthesizer synthesizes and labels multiple oligomers of arbitrary predicted sequence. The oligomers are used to probe immobilized DNA on membranes. The detector identifies hybridization patterns and then sends those patterns to a central computer which constructs a sequence and then predicts the sequence of the next round of synthesis of oligomers. Through an iterative process, a DNA sequence can be obtained in an automated fashion.
Brennen has described a method for sequencing nucleic acids based on ligation of oligomers (U.S. Pat. No. 5,403,708). Methods and compositions are described for forming a ligation product hybridized to a nucleic acid template. A primer is hybridized to a DNA template and then a pool of random extension oligonucleotides is also hybridized to the primed template in the presence of ligase(s). The ligase enzyme covalently ligates the hybridized oligomers to the primer. Modifications permit the determination of the nucleotide sequence of one or more members of a first set of target nucleotide residues in the nucleic acid template that are spaced at intervals of N nucleotides. In this method, a labeled ligation product is formed wherein the position and type of label incorporated into the ligation product provide information concerning the nucleotide residue in the nucleic acid template with which the labeled nucleotide residue is base paired.
Koster has described a method for sequencing DNA by mass spectrometry after degradation of DNA by an exonuclease (PCT/US94/02938). The method described is simple in that the DNA sequence is directly determined (the Sanger reaction is not used). DNA is cloned into standard vectors, the 5′ end is immobilized, the strands are then sequentially degraded at the 3′ end via an exonuclease, and the enzymatic products (nucleotides) are detected by mass spectrometry.
Weiss et al. have described an automated hybridization/imaging device for fluorescent multiplex DNA sequencing (PCT/US94/11918). The method is based on the concept of hybridizing enzyme-linked probes to a membrane containing size-separated DNA fragments arising from a typical Sanger reaction.
The demand for sequencing information is larger than can be supplied by the currently existing sequencing machines, such as the ABI377 and the Pharmacia ALF. One of the principal limitations of the current technology is the small number of tags which can be resolved using the current tagging system. The Church pooling system discussed above uses more tags but the use and detection of these tags is laborious.
The present invention discloses novel compositions and methods which may be utilized to sequence nucleic acid molecules with greatly increased speed and sensitivity compared with the methods described above, and further provides other related advantages.
Briefly stated, the present invention provides methods, compounds, compositions, kits and systems for determining the sequence of nucleic acid molecules. Within one aspect of the invention, methods are provided for determining the sequence of a nucleic acid molecule. The methods include the steps of: (a) generating tagged nucleic acid fragments which are complementary to a selected target nucleic acid molecule, wherein a tag is correlative with a particular nucleotide and detectable by non-fluorescent spectrometry or potentiometry; (b) separating the tagged fragments by sequential length; (c) cleaving the tags from the tagged fragments; and (d) detecting the tags by non-fluorescent spectrometry or potentiometry, and therefrom determining the sequence of the nucleic acid molecule. In preferred embodiments, the tags are detected by mass spectrometry, infrared spectrometry, ultraviolet spectrometry or potentiostatic amperometry.
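The data flow implied by steps (b) through (d) can be illustrated as follows. This is a minimal sketch only, assuming that each detected tag arrives paired with the length of the fragment from which it was cleaved and that each tag corresponds to exactly one terminal nucleotide; the tag names and the helper functions are hypothetical and are not taken from the invention.

```python
# Hypothetical tag identifiers, each correlative with one terminal nucleotide.
TAG_TO_BASE = {"T1": "A", "T2": "C", "T3": "G", "T4": "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def read_synthesized_strand(detections):
    """Order (fragment_length, tag) pairs by length and translate tags to bases.

    Sorting by length stands in for the separation of step (b); the tag lookup
    stands in for the detection of step (d).
    """
    ordered = sorted(detections, key=lambda pair: pair[0])
    return "".join(TAG_TO_BASE[tag] for _, tag in ordered)


def reverse_complement(sequence):
    """Sequence of the strand complementary to the given strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


# Fragments may be reported in any order; lengths differ by one nucleotide.
detections = [(12, "T2"), (10, "T1"), (13, "T3"), (11, "T1")]
synthesized = read_synthesized_strand(detections)
target = reverse_complement(synthesized)
print(synthesized, target)
```

Because the tagged fragments are complementary to the target, the sketch reports both the strand read from the tags and its reverse complement.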
In another aspect, the invention provides a compound of the formula:
Tms-L-X
wherein Tms is an organic group detectable by mass spectrometry, comprising carbon, at least one of hydrogen and fluoride, and optional atoms selected from oxygen, nitrogen, sulfur, phosphorus and iodine; L is an organic group which allows a Tms-containing moiety to be cleaved from the remainder of the compound, wherein the Tms-containing moiety comprises a functional group which supports a single ionized charge state when the compound is subjected to mass spectrometry and is selected from tertiary amine, quaternary amine and organic acid; X is a functional group selected from hydroxyl, amino, thiol, carboxylic acid, haloalkyl, and derivatives thereof which either activate or inhibit the activity of the group toward coupling with other moieties, or is a nucleic acid fragment attached to L at other than the 3′ end of the nucleic acid fragment; with the provisos that the compound is not bonded to a solid support through X nor has a mass of less than 250 daltons.
In another aspect, the invention provides a composition comprising a plurality of compounds of the formula Tms-L-MOI, wherein Tms is an organic group detectable by mass spectrometry, comprising carbon, at least one of hydrogen and fluoride, and optional atoms selected from oxygen, nitrogen, sulfur, phosphorus and iodine; L is an organic group which allows a Tms-containing moiety to be cleaved from the remainder of the compound, wherein the Tms-containing moiety comprises a functional group which supports a single ionized charge state when the compound is subjected to mass spectrometry and is selected from tertiary amine, quaternary amine and organic acid; MOI is a nucleic acid fragment wherein L is conjugated to the MOI at a location other than the 3′ end of the MOI; and wherein no two compounds have either the same Tms or the same MOI.
In another aspect, the invention provides a composition comprising water and a compound of the formula Tms-L-MOI, wherein Tms is an organic group detectable by mass spectrometry, comprising carbon, at least one of hydrogen and fluoride, and optional atoms selected from oxygen, nitrogen, sulfur, phosphorus and iodine; L is an organic group which allows a Tms-containing moiety to be cleaved from the remainder of the compound, wherein the Tms-containing moiety comprises a functional group which supports a single ionized charge state when the compound is subjected to mass spectrometry and is selected from tertiary amine, quaternary amine and organic acid; and MOI is a nucleic acid fragment wherein L is conjugated to the MOI at a location other than the 3′ end of the MOI.
In another aspect, the invention provides for a composition comprising a plurality of sets of compounds, each set of compounds having the formula Tms-L-MOI, wherein Tms is an organic group detectable by mass spectrometry, comprising carbon, at least one of hydrogen and fluoride, and optional atoms selected from oxygen, nitrogen, sulfur, phosphorus and iodine; L is an organic group which allows a Tms-containing moiety to be cleaved from the remainder of the compound, wherein the Tms-containing moiety comprises a functional group which supports a single ionized charge state when the compound is subjected to mass spectrometry and is selected from tertiary amine, quaternary amine and organic acid; MOI is a nucleic acid fragment wherein L is conjugated to the MOI at a location other than the 3′ end of the MOI; wherein within a set, all members have the same Tms group, and the MOI fragments have variable lengths that terminate with the same dideoxynucleotide selected from ddAMP, ddGMP, ddCMP and ddTMP; and wherein between sets, the Tms groups differ by at least 2 amu.
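The requirement that the Tms groups of different sets differ by at least 2 amu can be expressed as a simple pairwise check. The following is a minimal sketch, assuming one representative tag mass per set; the mass values and function names are hypothetical and serve only to illustrate the check.

```python
from itertools import combinations


def min_pairwise_gap(tag_masses):
    """Smallest absolute mass difference between any two tag masses."""
    if len(tag_masses) < 2:
        raise ValueError("at least two tag masses are needed")
    return min(abs(a - b) for a, b in combinations(tag_masses, 2))


def sets_are_distinguishable(tag_masses, min_gap_amu=2.0):
    """True if every pair of sets has tags that differ by at least min_gap_amu."""
    return min_pairwise_gap(tag_masses) >= min_gap_amu


# Hypothetical tag masses (amu), one per set of fragments.
ok_masses = [300.0, 302.0, 304.0, 306.0]
bad_masses = [300.0, 301.0, 304.0]

print(sets_are_distinguishable(ok_masses))
print(sets_are_distinguishable(bad_masses))
```

The first list satisfies the 2 amu spacing; the second fails because two of its tags differ by only 1 amu.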
In another aspect, the invention provides for a composition comprising a first plurality of sets of compounds as described in the preceding paragraph, in combination with a second plurality of sets of compounds having the formula Tms-L-MOI, wherein Tms is an organic group detectable by mass spectrometry, comprising carbon, at least one of hydrogen and fluoride, and optional atoms selected from oxygen, nitrogen, sulfur, phosphorus and iodine; L is an organic group which allows a Tms-containing moiety to be cleaved from the remainder of the compound, wherein the Tms-containing moiety comprises a functional group which supports a single ionized charge state when the compound is subjected to mass spectrometry and is selected from tertiary amine, quaternary amine and organic acid; MOI is a nucleic acid fragment wherein L is conjugated to the MOI at a location other than the 3′ end of the MOI; and wherein all members within the second plurality have an MOI sequence which terminates with the same dideoxynucleotide selected from ddAMP, ddGMP, ddCMP and ddTMP; with the proviso that the dideoxynucleotide present in the compounds of the first plurality is not the same dideoxynucleotide present in the compounds of the second plurality.
In another aspect, the invention provides for a kit for DNA sequencing analysis. The kit comprises a plurality of container sets, each container set comprising at least five containers, wherein a first container contains a vector, and second, third, fourth and fifth containers contain compounds of the formula Tms-L-MOI, wherein Tms is an organic group detectable by mass spectrometry, comprising carbon, at least one of hydrogen and fluoride, and optional atoms selected from oxygen, nitrogen, sulfur, phosphorus and iodine; L is an organic group which allows a Tms-containing moiety to be cleaved from the remainder of the compound, wherein the Tms-containing moiety comprises a functional group which supports a single ionized charge state when the compound is subjected to mass spectrometry and is selected from tertiary amine, quaternary amine and organic acid; and MOI is a nucleic acid fragment wherein L is conjugated to the MOI at a location other than the 3′ end of the MOI; such that the MOI for the second, third, fourth and fifth containers is identical and complementary to a portion of the vector within the set of containers, and the Tms group within each container is different from the other Tms groups in the kit.
In another aspect, the invention provides for systems for determining the sequence of a nucleic acid molecule in a sample. In one embodiment, a system for determining the sequence of a nucleic acid molecule in a sample, the sample including tagged nucleic acid fragments having nucleic acid fragments and tags attached to the nucleic acid fragments, comprises a separation apparatus that separates tagged nucleic acid fragments, a cleavage apparatus that receives separated tagged nucleic acid fragments and cleaves the tags from the nucleic acid fragments, each tag being correlative with a particular nucleotide of the nucleic acid fragment and detectable by electrochemical detection, and an apparatus for electrochemical detection that receives and detects electrochemical signatures of the tags. In a preferred embodiment, the system further includes a data processor that correlates the electrochemical signature of a tag to a particular nucleotide and to a specific sample. In another embodiment, a system for determining the sequence of a nucleic acid molecule in a sample, the sample including tagged nucleic acid fragments having nucleic acid fragments and tags attached to the nucleic acid fragments, comprises a separation apparatus that separates tagged nucleic acid fragments, a cleavage apparatus that receives separated tagged nucleic acid fragments and cleaves the tags from the nucleic acid fragments, each tag being correlative with a particular nucleotide of the nucleic acid fragment and detectable by mass spectrometry, a mass spectrometer that receives the tags and detects a mass of a tag, and a data processor that correlates the mass of a tag to a particular nucleotide and to a specific sample.
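The role of the data processor in the mass spectrometry embodiment can be illustrated with a lookup that maps each detected tag mass to a sample and a nucleotide. This is a minimal sketch under stated assumptions: the tag masses, sample names, matching tolerance and function names are all hypothetical, and the detected masses are assumed to arrive in order of increasing fragment length.

```python
from collections import defaultdict

# Hypothetical table: tag mass (amu) -> (sample, nucleotide).
TAG_TABLE = {
    301.0: ("sample_1", "A"), 303.0: ("sample_1", "C"),
    305.0: ("sample_1", "G"), 307.0: ("sample_1", "T"),
    321.0: ("sample_2", "A"), 323.0: ("sample_2", "C"),
    325.0: ("sample_2", "G"), 327.0: ("sample_2", "T"),
}


def match_tag(mass, tolerance=0.5):
    """Return (sample, nucleotide) for the nearest tag mass, or None if too far."""
    nearest = min(TAG_TABLE, key=lambda known: abs(known - mass))
    if abs(nearest - mass) > tolerance:
        return None
    return TAG_TABLE[nearest]


def correlate(detected_masses, tolerance=0.5):
    """Build one read per sample from masses detected in fragment-length order."""
    reads = defaultdict(str)
    for mass in detected_masses:
        hit = match_tag(mass, tolerance)
        if hit is None:
            continue  # unassigned signal: ignored in this sketch
        sample, base = hit
        reads[sample] += base
    return dict(reads)


detected = [301.1, 323.0, 305.0, 321.2, 350.0, 307.0]
print(correlate(detected))
```

Giving each sample its own set of four tag masses is what allows several samples to share one separation while their reads remain distinguishable.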
Within other embodiments of the invention, 4, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 200, 250, 300, 350, 400, 450 or greater than 500 different and unique tagged molecules may be utilized within a given reaction simultaneously, wherein each tag is unique for a selected nucleic acid fragment, probe, or first or second member, and may be separately identified.
These and other aspects of the present invention will become evident upon reference to the following detailed description and attached drawings. In addition, various references are set forth below which describe in more detail certain procedures or compositions (e.g., plasmids, etc.), and are therefore incorporated by reference in their entirety.