Complete reference genome sequences for a number of model organisms as well as humans are currently available or are expected to become available in the near future. A parallel challenge is to characterize the type and extent of variation in the sequences of interest because it underlies the heritable differences among individuals and populations. In humans, the vast majority of sequence variation consists of nucleotide substitutions referred to as single nucleotide polymorphisms (SNPs). DNA sequencing is the most sensitive method to discover polymorphisms [Eng C. and Vijg J. et al., Nature Biotechnol. 15: 422–426 (1997)]. A growing panel of such sequence variants, together with powerful methods to monitor them [Landegren U. et al., Genome Res. 8: 769–776 (1998)], is useful in linkage studies to identify even the most subtle disease susceptibility loci [Lander E. and Schork N., Science 265: 2037–2048 (1994); Risch N. and Merikangas K., Science 273: 1516–1517 (1996)]. Also, the identification of all (functional) allelic variants will require the re-sequencing of particular regions in a large number of samples [Nickerson D. et al., Nature Genet. 19: 233–240 (1998)]. Although a number of methods to monitor known SNPs have been developed [Landegren U. et al., Genome Res. 8: 769–776 (1998)], re-sequencing is likely to be routinely applied to secure diagnoses of patients. Indeed, in a significant number of disease-associated genes that have been surveyed thus far, literally hundreds or even thousands of different mutations have been identified and catalogued. Consequently, sequence determination represents the ultimate level of resolution and may be the preferred method to monitor which mutation or combination of mutations, out of a large number of mutations of known clinical relevance, is present.
It would appear that the field of biomedical genetics will rely heavily on sequencing technology. Hence, there is a need for advanced sequencing methods that are time- and cost-competitive, and at the same time accurate and robust. Recent developments in this area include improvements to the basic dideoxy chain termination sequencing method [Sanger et al. Proc. Natl. Acad. Sci. USA 74: 5463–5467 (1977); reviewed by Lipshutz R. and Fodor S. et al., Current Opinion in Structural Biology 4: 376–380 (1994)], as well as new approaches that are based on entirely new paradigms. Two such novel approaches are sequencing-by-hybridization (SBH) [Drmanac R. et al., Science 260: 1649–1652 (1993)] and pyro-sequencing [Ronaghi M. et al., Science 281: 363–365 (1998); Ronaghi M. et al., Anal. Biochem. 242: 84–89 (1996)]. While the concepts of these approaches have been experimentally validated, their ultimate acceptance and usage may depend on the type of application, e.g., de novo sequencing, re-sequencing, or genotyping of known SNPs.
Recently, progress has also been made in the use of mass spectrometry (MS) to analyze nucleic acids [Crain, P. F. and McCloskey, J. A., Current Opinion in Biotechnology 9: 25–34 (1998), and references cited therein]. One promising development has been the application of MS to the sequence determination of DNA and RNA oligonucleotides [Limbach P., Mass Spectrom. Rev. 15: 297–336 (1996); Murray K., J. Mass Spectrom. 31: 1203–1215 (1996)]. MS, and more particularly matrix-assisted laser desorption/ionization MS (MALDI MS), has the potential for very high throughput due to high-speed signal acquisition and automated analysis off solid surfaces. It has been pointed out that MS, in addition to saving time, measures an intrinsic property of the molecules, and therefore yields a significantly more informative signal [Köster H. et al., Nature Biotechnol., 14: 1123–1128 (1996)].
Sequence information can be derived directly from gas-phase fragmentation [see for example Nordhoff E. et al., J. Mass Spectrom., 30: 99–112 (1995); Little D. et al., J. Am. Chem. Soc., 116: 4893–4897 (1994); Wang B. et al., WO 98/03684 and WO 98/40520; Blocker H. et al., EP 0 103 677; Foote S. et al., WO 98/54571]. In contrast, indirect methods measure the mass of fragments obtained by a variety of methods in the solution phase, i.e., prior to the generation of gas phase ions. In its simplest form, mass analysis replaces the gel-electrophoretic fractionation of the fragment-ladder (i.e., a nested set of fragments that share one common endpoint) generated by the sequencing reactions. The sequencing reactions need not be base-specific because the base-calling may also be based on accurate mass measurement of fragments that terminate at successive positions and that differ from one another by one nucleotide residue. The fragment-ladder can be generated by the Sanger method [Köster H. et al., Nature Biotechnol., 14: 1123–1128 (1996); Reeve M. A., Howe R. P., Schwarz T., U.S. Pat. No. 5,849,542; Köster H., U.S. Pat. No. 5,547,835; Levis R. and Romano L., U.S. Pat. No. 5,210,412 and U.S. Pat. No. 5,580,733; Chait B. and Beavis R., U.S. Pat. No. 5,453,247], by base-specific partial RNA digestion [Hahner S. et al., Nucleic Acids Res., 25: 1957–1964 (1997); Köster H., WO 98/20166] or by chemical cleavage [Isola N. et al., Anal. Chem., 71: 2266–2269 (1999); references cited in Limbach P., Mass Spectrom. Rev., 15: 297–336 (1996)]. An alternative method consists of analyzing the ladder generated by exonuclease digestion from either the 3′- or 5′-end [Pieles U. et al., Nucleic Acids Res., 21: 3191–3196 (1993); Köster H., U.S. Pat. No. 5,851,765; Engels J. et al., WO 98/45700; Tarr G. and Patterson D., WO 96/36986; Patterson D., U.S. Pat. No. 5,869,240].
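The principle of calling bases from the mass differences between successive ladder fragments can be illustrated with a minimal Python sketch. It is an idealized illustration only, not a procedure taken from the cited references: the function names `simulate_ladder` and `call_bases`, the starting mass of 5000.0 Da, and the 1.0 Da tolerance are hypothetical, and the sketch assumes singly charged DNA fragments, average residue masses, and no adducts or missing peaks.

```python
# Average masses (Da) of DNA nucleotide residues within a chain
# (nucleoside monophosphate minus water).
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def simulate_ladder(start_mass, extension):
    """Build an idealized ladder: one peak per added residue.

    start_mass is a hypothetical mass for the shared part of all fragments.
    """
    masses = [start_mass]
    for base in extension:
        masses.append(masses[-1] + RESIDUE_MASS[base])
    return masses


def call_bases(ladder_masses, tolerance=1.0):
    """Call one base per step between successive ladder peaks.

    Each mass difference is matched to the closest residue mass; a step
    that matches no residue within the tolerance is reported as 'N'.
    """
    peaks = sorted(ladder_masses)
    calls = []
    for lighter, heavier in zip(peaks, peaks[1:]):
        delta = heavier - lighter
        base, residue = min(
            RESIDUE_MASS.items(), key=lambda item: abs(item[1] - delta)
        )
        calls.append(base if abs(residue - delta) <= tolerance else "N")
    return "".join(calls)


if __name__ == "__main__":
    ladder = simulate_ladder(5000.0, "GATTACA")
    print(call_bases(ladder))
```

The closest pair of residue masses (A and T) differs by only about 9 Da, so the tolerance has to stay well below half that value for the calls to remain unambiguous; this is one way to see why mass accuracy, and hence fragment length, limits ladder-based approaches.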
A severe limitation of both the direct and indirect MS methodologies under the current performance conditions is the poor applicability to chain lengths beyond ~30–50 nucleotides. As a consequence, it has been suggested that the prospects for MS lie with DNA diagnostic assays, rather than large-scale sequencing [Smith L., Nature Biotechnol., 14: 1084–1087 (1996)]. Given the fact that MS represents an exquisite means to analyze short nucleotide fragments, the various MS-based processes that have been described for nucleic acid based diagnostic purposes generally involve the derivation and analysis of such relatively short fragments [see for example Köster H., WO 96/29431; Köster H. et al., WO 98/20166; Shaler T. et al., WO 98/12355; Kamb A., U.S. Pat. No. 5,869,242; Monforte J. et al., WO 97/33000; Foote S. et al., WO 98/54571].
Some of the MS-based assays have been used for the scoring of defined mutations or polymorphisms. Other processes derive multiple oligonucleotide fragments and yield a ‘mass-fingerprint’ so as to analyze a larger target nucleic acid region for mutations and/or polymorphisms. The latter MS analyses are, however, considerably less informative in that they are essentially restricted to the detection of sequence variations. The methods cannot be applied to diagnostic sequencing of nucleic acids, where the term diagnostic sequencing means the unequivocal determination of the presence, the nature and the position of sequence variations. At best, the measurements confirm the base composition of small fragments whose masses are determined with sufficient accuracy to reduce the number of possible compositional isomers. Also, it will be realized that only certain changes in composition (as revealed by shifts in the mass spectrum) can be unambiguously assigned to a polymorphism or mutation. A match between the spectrum of the interrogated sequence and a reference spectrum obtained from a wild-type sequence, or from sequences known to contain a given polymorphism, is assumed to indicate that the interrogated nucleic acid region is wild-type or incorporates the previously known polymorphisms, thereby disregarding certain other possible interpretations.
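The relationship between mass accuracy and the number of candidate base compositions can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than a method from the cited art: the function name `compositions_for_mass`, the `offset` parameter (a hypothetical placeholder for terminal groups), and the example tolerances are assumptions, and average DNA residue masses are used.

```python
from itertools import product

# Average masses (Da) of DNA nucleotide residues within a chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
BASES = ("A", "C", "G", "T")


def compositions_for_mass(observed_mass, max_length, tolerance, offset=0.0):
    """List every (A, C, G, T) count tuple consistent with a measured mass.

    offset is a hypothetical constant for terminal groups; it is 0.0 here
    so that only the summed residue masses are compared.
    """
    hits = []
    for counts in product(range(max_length + 1), repeat=len(BASES)):
        length = sum(counts)
        if length == 0 or length > max_length:
            continue
        mass = offset + sum(
            count * RESIDUE_MASS[base] for count, base in zip(counts, BASES)
        )
        if abs(mass - observed_mass) <= tolerance:
            hits.append(counts)
    return hits


if __name__ == "__main__":
    # Mass of a fragment with composition A2 C1 G1 T1 (order unknown).
    observed = 2 * 313.21 + 289.18 + 329.21 + 304.20
    for tol in (0.5, 10.0):
        matches = compositions_for_mass(observed, max_length=6, tolerance=tol)
        print(tol, len(matches), matches)
```

Even when a single composition remains, the order of the bases within the fragment is not fixed by its mass, which is the sense in which a mass match alone leaves other interpretations open.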
While most methods in the art do yield sequence-related information, they do not disclose that a combination of several different mass spectra, obtained after complementary digestion reactions, allows for the effective survey of a nucleic acid region and provides an unambiguous assignment of both known as well as previously unknown sequence variations that occur relative to a reference nucleic acid with a known nucleotide sequence.
In view of these limitations, the art would clearly benefit from a new procedure for the diagnostic sequencing of nucleic acids that overcomes the shortcomings of the processes discussed above.
In comparison with conventional sequencing technology, i.e., the gel-electrophoretic analysis of fragment ladders, the methods of the present invention are more suited for the simultaneous analysis of multiple target sequences. In general, each particular sequence or sequence variant is associated with a distinct set of mass peaks. Consequently, the sequencing reactions according to the methods of the present invention lend themselves readily to (i) multiplexing (i.e., the analysis of two or more non-contiguous target regions from a single biological sample), (ii) the analysis of heterozygous samples, as well as (iii) pooling strategies (i.e., the simultaneous sequencing of the analogous regions derived from two or more different biological samples).
Because of the multiplex capacity, the present methods can be adapted as a tool for the genome-wide discovery and scoring of polymorphisms (e.g., SNPs) useful as markers in genetic linkage studies. The unambiguous identification/diagnosing of a number of variant positions is less demanding than full sequencing and, consequently, a considerable number of target genomic loci can be combined and analyzed at the same time, especially when their lengths are kept relatively small. The number of markers that can be scored in parallel will depend on the level of genetic diversity in the species of interest and on the precise method used to prepare and analyze the target nucleic acids, but may typically be on the order of a few tens up to 100 with current MS capabilities. The addition of multiplexing to the high-precision and high-speed characteristics of MS constitutes a new marker technology that enables the large-scale and cost-effective scoring of several (tens of) thousands of markers. Some aspects of the application of the present methods to genome-wide genotyping are described in Example 5.
Sequencing reactions according to the methods of the present invention yield, in principle, a discrete set of fragments for each individual sequence or sequence variant, whereas conventional sequence ladders stack on top of one another. Therefore, such sequences or sequence variants can be analyzed even when present as a lesser species. This is a useful quality for the analysis of clinical samples, which are often genetically heterogeneous either because they contain both normal and diseased cells or because the diseased material is itself heterogeneous (e.g., cancerous tissue, viral quasi-species). Additionally, the ability to detect mutations at a low ratio of mutant over wild-type allele makes it practicable to pool individual biological samples, a strategy which should permit a more cost-effective search for genomic sequence variations in a population.
The present invention rests in part on the insight that integration of the data obtained in a set of complementary fingerprints produced by an appropriate set of complementary cleavage reactions of the invention represents a level of characterization of a sample nucleic acid essentially equal to sequence determination. The present invention is also directed to the use of cleavage protocols that result in the generation of cleavage products that range from mono- and dinucleotides to fragments of a few tens of nucleotides that are particularly suited for analysis by MS. At the same time, the present method is distinct from the other fragmentation processes that are limited to screening target nucleic acids for a wide range of potential mutations. According to the present invention, a combination of several different mass spectra, obtained after complementary digestion reactions, coupled with systematic computational analysis allows the survey of a selected target nucleic acid or region thereof and leads to the unambiguous assignment of both known and previously unknown sequence variations. In certain aspects of the present invention, knowledge of the reference sequence in combination with the methods disclosed herein allows modeling of the experimental approach, anticipation of potential ambiguities, and the design of an adequate resolution.
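The idea that complementary cleavage reactions resolve what a single fingerprint cannot, and that the reference sequence allows the experiment to be modeled beforehand, can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the cleavage chemistry or the computational analysis of the invention: the functions `cleave_after`, `fingerprint` and `compare` are hypothetical, digestion is assumed to be complete and to cut after every occurrence of one base, and each fragment is represented by its base composition (which is what an accurate mass measurement reports) rather than by a computed mass.

```python
from collections import Counter


def cleave_after(sequence, base):
    """Split a sequence after every occurrence of `base` (idealized digest)."""
    fragments, current = [], ""
    for nucleotide in sequence:
        current += nucleotide
        if nucleotide == base:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments


def fingerprint(sequence, base):
    """Multiset of fragment compositions for one base-specific digest.

    Sorting the letters discards base order, mimicking the fact that a
    fragment mass reveals composition but not sequence.
    """
    return Counter("".join(sorted(f)) for f in cleave_after(sequence, base))


def compare(reference, sample, bases="ACGT"):
    """Report, per digest, which compositions disappear and which appear."""
    report = {}
    for base in bases:
        ref, obs = fingerprint(reference, base), fingerprint(sample, base)
        report[base] = {"lost": ref - obs, "gained": obs - ref}
    return report


if __name__ == "__main__":
    reference = "ACGTAG"
    sample = "CAGTAG"  # first two bases exchanged relative to the reference
    for base, change in compare(reference, sample).items():
        print(base, dict(change["lost"]), dict(change["gained"]))
```

In this toy case the G-specific digest gives identical fingerprints for the two sequences, while the A-specific digest does not, so the variation only becomes visible when the digests are considered together. Running the same comparison on a reference sequence and its simulated variants is one simple way to anticipate which variations a given set of digests would leave ambiguous.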