Determining the nucleotide sequence of nucleic acids (DNA and RNA) is critical to understanding the function and control of genes and their relationship, for example, to disease discovery and disease management. Analysis of genetic information plays a crucial role in biological experimentation. This has become especially true with regard to studies directed at understanding the fundamental genetic and environmental factors associated with disease and the effects of potential therapeutic agents on the cell. This paradigm shift has led to an increasing need within the life science industries for more sensitive, more accurate and higher-throughput technologies for performing analysis on genetic material obtained from a variety of biological sources.
Because sequencing the enormously large number of nucleic acids in each human cell is necessarily a time-consuming process, there is always a pressing need for faster and higher-throughput analyses that do not sacrifice sensitivity and accuracy. A number of techniques have been developed, including, inter alia, electrophoresis, enzymatic and chemical analysis, array technology and mass spectrometry, to determine the nucleotide sequence of nucleic acids.
Electrophoretic Techniques
Slab or capillary polyacrylamide gel electrophoresis technologies, such as those employed in automated DNA sequencers, provide highly accurate de novo sequence information for relatively long (500-700 residues or bases) segments of DNA. Although electrophoresis-based techniques provide a great amount of information per sample, they require long sample preparation and set-up times and thereby limit throughput.
Enzymatic and Chemical Analysis
A number of enzymatic and chemical techniques exist to determine the de novo nucleotide sequence of nucleic acids. However, each technique has inherent limitations. For example, Maxam and Gilbert [Proc. Natl. Acad. Sci. USA 74:5460 (1977)] disclose a chemical degradation approach and Sanger et al. [Proc. Natl. Acad. Sci. USA 74:5463 (1977)] disclose a chain termination method using complementary strand primer extension. Each of these techniques utilizes four separate reaction mixtures to create a nested set of fragments differing by a single nucleotide in length, thus representing a complete nucleotide sequence. A resolution of the fragments based on their size and terminating nucleotide is carried out to determine the order of the fragments and hence the nucleotide sequence.
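The nested-fragment idea described above can be illustrated with a minimal sketch. It is a simplification made for illustration only: the function names `termination_ladders` and `read_sequence` are hypothetical, the strand is treated as a plain string, and no chemistry, primer or separation step is modeled. Each of the four base-specific reactions is represented by the list of fragment lengths that end in that base, and sorting all fragments by length recovers the order of the bases.

```python
def termination_ladders(strand):
    """Model four base-specific reactions on a strand.

    Returns a dict mapping each base to the lengths of the nested
    fragments that terminate in that base (hypothetical helper).
    """
    ladders = {base: [] for base in "ACGT"}
    for length, base in enumerate(strand, start=1):
        ladders[base].append(length)
    return ladders


def read_sequence(ladders):
    """Resolve fragments by size and read off the terminating bases."""
    by_length = sorted(
        (length, base)
        for base, lengths in ladders.items()
        for length in lengths
    )
    return "".join(base for _, base in by_length)


ladders = termination_ladders("GATTACA")
print(ladders)                 # fragment lengths per terminating base
print(read_sequence(ladders))  # sequence recovered from the ladders
```

Because the fragments differ by a single nucleotide in length, every position appears in exactly one of the four ladders, which is why the sort yields a complete sequence.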
Single-stranded conformation polymorphism (SSCP) analysis is a useful technique for detecting relatively small differences among similar sequences. The technique is simple to implement and, when combined with multiple-dye detection or mass-tag methodologies, may be multiplexed and thereby improve throughput. However, like techniques that rely on detecting heteroduplexes, such as denaturing gradient gel electrophoresis (DGGE), chemical cleavage (CCM), enzymatic cleavage (using cleavase) of mismatches, and denaturing high performance liquid chromatography (DHPLC), the technique is only qualitative, i.e., the technique only reveals whether a mutation is present within the target nucleic acid but gives minimal information about the identity and location of the mutation.
Other techniques employing ligase and polymerase extension assays are useful for determining whether a mutation is present at a defined location in an otherwise known target nucleic acid sequence. U.S. Pat. No. 4,988,617, for example, discloses a method for determining whether a mutation is present at a defined location in an otherwise known target nucleic acid sequence by assaying for the ligation of two natural oligonucleotides that are designed to hybridize adjacent to one another along the target sequence. U.S. Pat. No. 5,494,810 discloses a method that utilizes a thermostable ligase and the ligase chain reaction (LCR) to detect specific nucleotide substitutions, deletions, insertions and translocations within an otherwise known target nucleic acid sequence using only natural nucleic acids. U.S. Pat. No. 5,403,709 discloses a method for determining the nucleotide sequence by using another oligonucleotide as an extension and a third, bridging oligonucleotide to hold the first two together for ligation, and WO 97/35033 discloses methods for determining the identity of a nucleotide 3' to a defined primer using a polymerase extension assay. Although the assays may be performed with a relatively high throughput, they are sequence specific and thus require a different set of reagents for each target to be analyzed.
U.S. Pat. Nos. 5,521,065, 4,883,750 and 5,242,794 (Whiteley, et al.) disclose methods of testing for the presence or absence of a target sequence in a mixture of single-stranded nucleic acid fragments. The method involves reacting a mixture of single-stranded nucleic acid fragments with a first probe that is complementary to a first region of the target sequence and with a second probe that is complementary to a second region of the target sequence. The first and second target regions are contiguous with one another. Hybridization conditions are used in which the two probes become stably hybridized to their associated target regions. Following hybridization, any of the first and second probes hybridized to contiguous first and second target regions are ligated, and the sample is subsequently tested for the presence of expected probe ligation product.
Array Technology
Techniques employing hybridization to surface-bound DNA probe arrays are useful for analyzing the nucleotide sequence of target nucleic acids. These techniques rely upon the inherent ability of nucleic acids to form duplexes via hydrogen bonding according to Watson-Crick base-pairing rules. In theory, and to some extent in practice, hybridization to surface-bound DNA probe arrays can provide a relatively large amount of information in a single experiment. For example, array technology has identified single nucleotide polymorphisms within relatively long (1,000 residues or bases) sequences (Kozal, M., et al., Nature Med. 7:753-759, July 1996). In addition, array technology is useful for some types of gene expression analysis, relying upon a comparative analysis of complex mixtures of mRNA target sequences (Lockhart, D., et al., (1996) Nat. Biotech. 14, 1675-1680). Although array technologies offer the advantages of being reasonably sensitive and accurate when developed for specific applications and for specific sets of target sequences, they lack a generic implementation that can simultaneously be applied to multiple and/or different applications and targets. This is in large part due to the need for relatively long probe sequences, which are required to form and subsequently detect the probe/target duplexes. Moreover, this use of relatively long probes makes it difficult to interrogate single nucleotide differences due to the inherently small thermodynamic difference between the perfect complement and the single mismatch within the probe/target duplex. In addition, detection depends upon solution diffusion properties and hydrogen bonding between complementary target and probe sequences.
Mass Spectrometry Techniques
Mass spectrometry (MS) is a powerful tool for analyzing complex mixtures of compounds, including nucleic acids. In addition to accurately determining an intact mass, primary structure information can be obtained by several different MS strategies. The use of MS for DNA analysis has potential application to the detection of DNA modifications, DNA fragment mass determination, and DNA sequencing (see, for example, Fields, G. B., Clinical Chemistry 43, 1108 (1997)). Both fast atom bombardment (FAB) and electrospray ionization (ESI) collision-induced dissociation/tandem MS have been applied for identification of DNA modification sites.
Although MS is a powerful tool for analyzing complex mixtures of related compounds, including nucleic acids, its utility for analyzing the sequence of nucleic acids is limited by available ionization and detection methods. For example, ESI spectrometry produces a distribution of highly charged ions having a mass-to-charge ratio in the range of commercially available quadrupole mass analyzers. While ESI is sensitive, requiring only femtomole quantities of sample, it relies on multiple charges to achieve efficient ionization and produces complex and difficult-to-interpret multiply-charged spectra for even simple nucleic acids.
Matrix-assisted laser desorption ionization (MALDI) used in conjunction with a time-of-flight (TOF) mass analyzer holds great potential for sequencing nucleic acids because of its relatively broad mass range, high resolution (m/Δm ≤ 1.0 at mass 5,000) and sampling rate (up to 1 sample/second). In one aspect MALDI offers a potential advantage over ESI and FAB in that biomolecules of large mass can be ionized and analyzed readily. Furthermore, in contrast to ESI, MALDI produces predominantly singly charged species.
However, in general, MALDI analysis of DNA may suffer from lack of resolution of high molecular weight DNA fragments, DNA instability, and interference from sample preparation reagents. Longer oligonucleotides can give broader, less-intense signals, because MALDI imparts greater kinetic energies to ions of higher molecular weights. Although it may be used to analyze high molecular-weight nucleic acids, MALDI-TOF induces cleavage of the nucleic acid backbone, which further complicates the resulting spectrum. As a result, the length of nucleic acid sequences that may currently be analyzed via MALDI-TOF is limited to about 100 bases or residues. Wang et al. (WO 98/03684) have taken advantage of "in source fragmentation" and coupled it with delayed pulsed ion extraction methods for determining the sequence of nucleic acid analytes.
A number of methods have been disclosed that take advantage of standard sequencing methods for generating target fragments for analysis by mass spectroscopy. For example, U.S. Pat. No. 5,288,644 (Beavis, et al.); U.S. Pat. No. 5,547,835 (Koster) and U.S. Pat. No. 5,622,824 (Koster) disclose methods for determining the sequence of a target nucleic acid using MALDI-TOF of ladders of the target produced either by exonuclease digestion or by standard Sanger sequencing methods. Beavis discusses a method for DNA sequencing that uses different base-specific reactions to generate different sets of DNA fragments from a piece of DNA of unknown sequence. Each of the different sets of DNA fragments has a common origin and terminates at a particular base along the unknown sequence. The molecular weights of the DNA fragments in each of the different sets are determined by a MALDI mass spectrometer, and these molecular weights are then used to deduce the nucleotide sequence of the DNA.
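Reading a sequence from the masses of a fragment ladder can be sketched as follows. This is an illustrative simplification, not the method of any cited patent: the residue masses are approximate average values for the four natural deoxynucleotide residues and are an assumption of this sketch, terminal groups and ionization adducts are ignored, and the helper names `ladder_masses` and `read_ladder` are hypothetical. The sketch sorts the ladder by mass and assigns each step between neighboring fragments to the single base whose residue mass matches within a tolerance.

```python
# Approximate average residue masses in Da (assumed values for illustration).
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def ladder_masses(strand):
    """Return cumulative masses of the nested fragments of a strand.

    End groups are ignored, so these are relative, not absolute, masses.
    """
    masses, total = [], 0.0
    for base in strand:
        total += RESIDUE_MASS[base]
        masses.append(total)
    return masses


def read_ladder(masses, tolerance=0.5):
    """Deduce the base order from mass differences between ladder steps."""
    bases, previous = [], 0.0
    for mass in sorted(masses):
        step = mass - previous
        matches = [b for b, m in RESIDUE_MASS.items() if abs(step - m) <= tolerance]
        if len(matches) != 1:
            raise ValueError(f"cannot assign a base to mass step {step:.2f}")
        bases.append(matches[0])
        previous = mass
    return "".join(bases)


print(read_ladder(ladder_masses("GATTACA")))
```

The smallest gap between two assumed residue masses (A and T) is about 9 Da, so the tolerance must stay well below that for the assignment to remain unique.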
Koster utilizes the Sanger sequencing strategy and assembles the sequence information by analysis of the nested fragments obtained by base-specific chain termination via their different molecular masses using mass spectrometry such as MALDI or ESI mass spectrometry. This method has been coupled with a solid-phase sequencing approach in which the template is labeled with biotin and bound to streptavidin-coated magnetic beads. Using this method, it was possible to sequence exons 5 and 8 of the p53 gene using 21 defined primers (Fu et al., Nat. Biotechnol 16, 381 (1998)). Throughput can be increased by introducing mass modifications in the oligonucleotide primer, chain-terminating nucleoside triphosphates and/or in the chain-elongating nucleoside triphosphates, as well as using integrated tag sequences that allow multiplexing by hybridization of tag specific probes with mass differentiated molecular weights (U.S. Pat. No. 5,547,835). It is important to note, however, that all of these sequencing methods require either some prior knowledge of the target sequence or introduction of a known sequence to serve as the primer-binding site.
Efforts have been made to use mass spectrometry with enzymatic assays to determine the presence, location and identity of mutations in otherwise known sequences wherein at least some information is known a priori about the presence, location and/or identity of the mutation. U.S. Pat. No. 5,605,798, for example, discloses a method wherein a DNA primer that is complementary to a known target molecule in a region adjacent to the known region of interest is extended with a DNA polymerase in the presence of mass-tagged dideoxynucleotides. The identity of the mutation is then determined by analyzing the mass of the dideoxy-extended DNA primer. The multiplexing method is disclosed to be useful for simultaneously detecting all possible mutants/variants at a defined site by extending with a dideoxynucleotide and determining which specific dideoxynucleotide was incorporated.
Efforts have been made to address some of the aforementioned deficiencies with mass spectroscopic analyses of nucleic acids. For example, Gut (WO 96/27681) discloses methods for altering the charge properties of the phosphodiester backbone of nucleic acids in ways that make them more suitable for MS analyses. Methods for introducing modified nucleotides that stabilize the nucleic acid against fragmentation have also been described (Schneider and Chait, Nucleic Acids Res, 23, 1570 (1995), Tang et al., J Am Soc Mass Spectrom, 8, 218-224, 1997).
The use of non-cleavable mass tags has also been exploited to address some of the aforementioned deficiencies. For example, Japanese Patent No. 59-131909 discloses a mass spectrometer design that detects nucleic acid fragments separated by electrophoresis, liquid chromatography or high speed gel filtration, wherein atoms have been incorporated into the nucleic acids. The atoms, which normally do not occur in DNA, are sulfur, bromine, iodine, silver, gold, platinum, and mercury.
Cleavable mass tags have been exploited to circumvent some of the problems associated with MS analysis of nucleic acids. For example, PCT Application WO 95/04160 (Southern, et al.) discloses an indirect method for analyzing the sequence of target nucleic acids using target-mediated ligation between a surface-bound DNA probe and cleavable mass-tagged oligonucleotides containing reporter groups using mass spectrometric techniques. The sequence to be determined is first hybridized to an oligonucleotide attached to a solid support. The solid support carrying the hybrids from above is incubated with a solution of coded oligonucleotide reagents that form a library comprising all sequences of a given length. Ligase is introduced so that the oligonucleotide on the support is ligated to the member of the library that is hybridized to the target adjacent the oligonucleotide. Non-ligated reagents are removed by washing. A linker that is part of the member of the library ligated to the oligonucleotide is broken to detach a tag, which is recovered and analyzed by mass spectrometry.
A common focus of the above technologies is to provide methods for increasing the number of target sites (either intra- or inter-target) that can be interrogated in a single determination where some portion of the target sequence is known. This multiplexing theme is either directly stated or implied in the teachings of the above patent applications. The use of more than one oligonucleotide as either a hybridization probe or primer for extension or ligation is defined by the sequence surrounding the site of interest and, therefore, the specific application. Thus, with the exception of the mass-tag technology disclosed by Southern, the oligonucleotide reagents described above are not generic in terms of target sequence, but must be generated for each defined application. As such, the number of distinct oligonucleotides used in a multiplexed interrogation is generally only a small subset of the theoretical sequence-complete set. This ratio of actual sequence coverage provided by a particular oligonucleotide mixture to the theoretical coverage provided by the sequence-complete set is defined as the mixture coverage complexity (see discussion below). For example, in many of the methods described (e.g., U.S. Pat. No. 5,605,798, WO 92/15712, and WO 97/35033), the probe lengths vary from about 8 to 20 nucleotides depending upon the specific application and method of detection. The number of probes in a sequence-complete set can be described by the expression 4^L, where L equals the length of the probes. Thus for 8-mer probes, the sequence-complete set has 4^8 or 65,536 members. If the number of interrogation sites in the multiplexed determination is about 500, which is a reasonable upper boundary for the number of oligonucleotide probes in a single determination for the types of technologies described above, then the mixture coverage complexity (see discussion below) of the interrogating 8-mer probe mixture would be equal to 500/65,536 or approximately 1/130.
In most cases, however, the probes are 15-20 nucleotides in length. While this increased length ensures specificity of the probe for a defined target sequence, it makes the mixture coverage complexity of the probe mixture significantly smaller. Thus, it is clear that for the types of multiplexing methods and applications described above, the interrogating oligonucleotide mixtures are not designed to be sequence complete with regard to target sequence coverage and could not therefore be considered generic reagents.
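The coverage arithmetic above can be checked with a short sketch. The function names are hypothetical and chosen only for this illustration; the figure of 500 probes is the upper boundary quoted in the text, and the longer probe lengths are included to show how quickly the ratio shrinks.

```python
from fractions import Fraction


def sequence_complete_set_size(probe_length):
    """Number of distinct probes of a given length over four bases (4^L)."""
    return 4 ** probe_length


def mixture_coverage_complexity(num_probes, probe_length):
    """Ratio of probes actually used to the sequence-complete set size."""
    return Fraction(num_probes, sequence_complete_set_size(probe_length))


for length in (8, 15, 20):
    ratio = mixture_coverage_complexity(500, length)
    print(length, sequence_complete_set_size(length), float(ratio))
```

For 8-mers the ratio is 500/65,536, or roughly 1/131, consistent with the approximate figure given in the text. For 15-mers and 20-mers the denominator grows to 4^15 and 4^20, so the same 500 probes cover a vanishingly small fraction of the sequence-complete set.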
The object of many array-based sequencing techniques is to determine the "short word" content, i.e., all of the oligonucleotide subsequences present, in the target nucleic acid sequence. For example, in techniques employing hybridization to surface-bound DNA probe arrays, a set of oligonucleotides of a particular length are arranged in spatially distinct locations on a substrate to form an array, and the target sequence is permitted to hybridize to the array (see for example, U.S. Pat. No. 5,202,231, U.S. Pat. No. 5,492,806, and U.S. Pat. No. 5,695,940). The target sequence will bind at locations that contain a short word complementary to one of the short words in its sequence. Others have disclosed methods for probing surface-bound targets with a sequential set of oligonucleotide probes (see for example, U.S. Pat. No. 5,202,231, U.S. Pat. No. 5,492,806, and U.S. Pat. No. 5,695,940). By identifying the hybridization locations, or knowing the identity of the probing oligonucleotide via a fluorescence measurement or the like, the precise short word content of the target nucleic acid sequence may theoretically be determined. This information can then be used to reconstruct the sequence of the target nucleic acid (see, for example, Pevzner, P. A., J Biomolecular Structure Dynamics 7, 63 (1989), Pevzner P. A., et al., J Biomolecular Structure Dynamics 9, 399 (1991), Ukkonen, E., Theoretical Computer Science 92, 191 (1992)). It is important to emphasize, however, that relatively sequence-complete sets of oligonucleotide probes are required in order to generically determine the short word content of an unknown target.
Techniques that identify the short-word content of the target nucleic acid sequence are useful for applications such as de novo sequencing, re-sequencing, mutation detection and mutational change detection. As the length of the target sequence increases, the success rate with which the analysis may be carried out decreases. Because some of the applications, e.g., mutation detection, require only qualitative information, the success rate may typically be higher than the success rate for an application requiring quantitative information, e.g., de novo sequencing. For example, the presence of a few short word repeats would severely reduce the success rate for de novo sequencing but would have a lesser effect on the success rate for mutation detection. In other applications, substantial prior information is available to assist in the interpretation of the short-word content, thus increasing the success rate of the results.
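The notion of short word content, and its use in reconstruction, can be sketched as follows. This is a deliberately naive illustration and not the algorithm of the cited references: the helper names are hypothetical, the word length is assumed to be at least 2, the first word and the target length are assumed to be known, and reconstruction simply extends the sequence while exactly one word overlaps the current end.

```python
def short_word_content(target, k):
    """Return the set of all length-k subsequences (short words) in target."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def reconstruct(words, first_word, length):
    """Extend from first_word by unique (k-1)-base overlaps.

    Returns the reconstructed sequence, or None when the next word is
    missing or ambiguous (for example, because of a repeated short word).
    Assumes k >= 2.
    """
    k = len(first_word)
    sequence = first_word
    while len(sequence) < length:
        suffix = sequence[-(k - 1):]
        candidates = [w for w in words if w.startswith(suffix)]
        if len(candidates) != 1:
            return None
        sequence += candidates[0][-1]
    return sequence


words = short_word_content("ATGCGTAC", 3)
print(sorted(words))
print(reconstruct(words, "ATG", 8))

# A repeated overlap ("AT" occurs twice) makes the extension ambiguous.
print(reconstruct(short_word_content("ATGATC", 3), "ATG", 6))
```

The second call returns None because two words begin with the same overlap, which is a small-scale instance of how repeated short words undermine de novo reconstruction.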
The purpose of the present invention is to determine the short word content of a target nucleic acid sequence using mass spectroscopy. However, the success rate of such an analysis is expected to be relatively low because the presence of a particular mass in the mass spectrum only reveals that one of many possible nucleic acid sequences is present. For example, using only natural nucleotides, the sequence of GGCTTTA is indistinguishable by mass from the sequence of GCTTTAG, and the presence of a mass peak at 2,142 atomic mass units merely reveals that at least one nucleic acid sequence with 3 T's, 2 G's and 1 A and 1 C is present in the mixture. The ambiguity is further confounded by mass coincidences. For example, the mass peak at 2,193 may contain contributions from nucleic acid sequences containing 6 A's and 1 T or 1 A, 2 C's, 3 G's and 1 T. The purpose of the present invention is to reduce these types of ambiguities within the short-word content of a target nucleic acid sequence.
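The two kinds of ambiguity described above, shared base composition and mass coincidence between different compositions, can be illustrated with a short sketch. The residue masses are approximate average values assumed for illustration, terminal groups are ignored, and the helper names are hypothetical; the absolute totals therefore depend on the mass convention used and are not expected to reproduce the peak values quoted in the text. Only the relative comparisons are meaningful here.

```python
from collections import Counter
from math import factorial

# Approximate average residue masses in Da (assumed values for illustration).
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def composition(sequence):
    """Base composition as a sorted tuple of (base, count) pairs."""
    return tuple(sorted(Counter(sequence).items()))


def relative_mass(sequence):
    """Sum of residue masses; end groups are deliberately ignored."""
    return sum(RESIDUE_MASS[base] for base in sequence)


def sequences_sharing_composition(sequence):
    """Count distinct sequences with the same base composition."""
    count = factorial(len(sequence))
    for n in Counter(sequence).values():
        count //= factorial(n)
    return count


# Same composition, hence the same mass: indistinguishable by mass alone.
print(composition("GGCTTTA") == composition("GCTTTAG"))
print(sequences_sharing_composition("GGCTTTA"))

# Different compositions whose masses nearly coincide.
difference = abs(relative_mass("AAAAAAT") - relative_mass("ACCGGGT"))
print(round(difference, 2))
```

Under these assumed masses, five A residues weigh almost exactly as much as two C plus three G residues, which is why the two compositions named in the text fall on nearly the same peak.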