1. Field of the Invention
This invention relates to methods for detecting and sequencing nucleic acids with sequencing by hybridization technology and molecular weight analysis, to probes and probe arrays useful in sequencing and detection and to kits and apparatus for determining sequence information.
2. Description of the Background
Since the recognition of nucleic acid as the carrier of the genetic code, a great deal of interest has centered on determining the sequence of that code in the many forms in which it is found. Two landmark studies made the process of nucleic acid sequencing, at least with DNA, a common and relatively rapid procedure practiced in most laboratories. The first describes a process whereby terminally labeled DNA molecules are chemically cleaved at single base repetitions (A. M. Maxam and W. Gilbert, Proc. Natl. Acad. Sci. USA 74:560-64, 1977). Each base position in the nucleic acid sequence is then determined from the molecular weights of fragments produced by partial cleavages. Individual reactions were devised to cleave preferentially at guanine, at adenine, at cytosine and thymine, and at cytosine alone. When the products of these four reactions are resolved by molecular weight, using, for example, polyacrylamide gel electrophoresis, DNA sequences can be read from the pattern of fragments on the resolved gel.
The second study describes a procedure whereby DNA is sequenced using a variation of the plus-minus method (F. Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463-67, 1977). This procedure takes advantage of the chain terminating ability of dideoxynucleoside triphosphates (ddNTPs) and the ability of DNA polymerase to incorporate ddNTPs with nearly the same fidelity as its natural substrates, deoxynucleoside triphosphates (dNTPs). Briefly, a primer, usually an oligonucleotide, and a template DNA are incubated together in the presence of a useful concentration of all four dNTPs plus a limited amount of a single ddNTP. The DNA polymerase occasionally incorporates a dideoxynucleotide which terminates chain extension. Because the dideoxynucleotide has no 3′-hydroxyl, the initiation point for the polymerase enzyme is lost. Polymerization produces a mixture of fragments of varied sizes, all having identical 3′ termini. Fractionation of the mixture by, for example, polyacrylamide gel electrophoresis, produces a pattern which indicates the presence and position of each base in the nucleic acid. Reactions with each of the four ddNTPs allow one of ordinary skill to read an entire nucleic acid sequence from a resolved gel.
Despite their advantages, these procedures are cumbersome and impractical when one wishes to obtain megabases of sequence information. Further, these procedures are, for all practical purposes, limited to sequencing DNA. Although variations have been developed, it is still not possible using either process to obtain sequence information directly from any other form of nucleic acid.
A relatively new method for obtaining sequence information from a nucleic acid has recently been developed whereby the sequences of groups of contiguous bases are determined simultaneously. In comparison to traditional techniques whereby one determines base-specific information of a sequence individually, this method, referred to as sequencing by hybridization (SBH), represents a many-fold amplification in speed. Due, at least in part, to the increased speed, SBH presents numerous advantages including reduced expense and greater accuracy. Two general approaches to sequencing by hybridization have been suggested and their practicality has been demonstrated in pilot studies. In one format, a complete set of 4^n oligonucleotides of length n is immobilized as an ordered array on a solid support and an unknown DNA sequence is hybridized to this array (K. R. Khrapko et al., J. DNA Sequencing and Mapping 1:375-88, 1991). The resulting hybridization pattern provides all “n-tuple” words in the sequence. This is sufficient to determine short sequences except for simple tandem repeats.
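The first format can be illustrated with a minimal, idealized sketch. It assumes error-free hybridization and, for simplicity, records the words of the target itself rather than their complements; the function names and the example target are hypothetical and are not taken from the cited studies.

```python
from itertools import product

def all_probes(n):
    """Every oligonucleotide of length n over A, C, G, T (4**n in total)."""
    return ["".join(p) for p in product("ACGT", repeat=n)]

def hybridization_spectrum(target, n):
    """Set of n-tuple words present in the target (idealized, error-free)."""
    return {target[i:i + n] for i in range(len(target) - n + 1)}

# Hypothetical 8-base target probed with the complete set of 3-mers.
probes = all_probes(3)
spectrum = hybridization_spectrum("ATGCATGG", 3)
# Array positions that would give a positive signal in this idealized model.
positive = [p for p in probes if p in spectrum]
```

Note that the word ATG occurs twice in the example target but appears only once in the spectrum, which is why repeats are a source of ambiguity in this format.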
In the second format, an array of immobilized samples is hybridized with one short oligonucleotide at a time (Z. Strezoska et al., Proc. Natl. Acad. Sci. USA 88:10089-93, 1991). When repeated 4^n times, once for each oligonucleotide of length n, much of the sequence of all the immobilized samples would be determined. In both approaches, the intrinsic power of the method is that many sequenced regions are determined in parallel. In actual practice the array size is about 10^4 to 10^5.
Another aspect of the method is that the information obtained is quite redundant, especially as the size of the nucleic acid probe grows. Mathematical simulations have shown that the method is quite resistant to experimental errors and that far fewer than all probes are necessary to determine reliable sequence data (P. A. Pevzner et al., J. Biomol. Struc. & Dyn. 9:399-410, 1991; W. Bains, Genomics 11:295-301, 1991).
In spite of an overall optimistic outlook, there are still a number of potentially severe drawbacks to actual implementation of sequencing by hybridization. First and foremost among these is that 4^n rapidly becomes quite a large number if chemical synthesis of all of the oligonucleotide probes is actually contemplated. Various schemes of automating this synthesis and compressing the products into a small scale array, a sequencing chip, have been proposed.
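The growth in the number of probes can be made concrete with a short calculation. This is only an arithmetic sketch; the function name is hypothetical and the probe lengths shown are arbitrary examples.

```python
def probe_count(n):
    """Number of distinct oligonucleotides of length n over four bases."""
    return 4 ** n

# Each additional base multiplies the size of a complete probe set by four.
for n in (6, 7, 8, 10):
    print(f"n={n}: {probe_count(n)} probes")
```

A complete set of 7-mers (16,384 probes) or 8-mers (65,536 probes) falls within the array sizes of about ten thousand to one hundred thousand mentioned above, while 10-mers already require more than one million probes.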
There is also a poor level of discrimination between correctly hybridized, perfectly matched duplexes and duplexes with end mismatches. These drawbacks have been addressed, at least to a small degree, by the method of continuous stacking hybridization as reported by Khrapko et al. (FEBS Lett. 256:118-22, 1989). Continuous stacking hybridization is based upon the observation that when a single-stranded oligonucleotide is hybridized adjacent to a double-stranded oligonucleotide, the two duplexes are mutually stabilized as if they are positioned side-to-side due to a stacking contact between them. The stability of the interaction decreases significantly as stacking is disrupted by nucleotide displacement, gap or terminal mismatch. Internal mismatches are presumably ignorable because their thermodynamic stability is so much less than that of perfect matches. Although the approach is promising, a related problem arises: the inability to distinguish between weak, but correct, duplex formation and simple background such as non-specific adsorption of probes to the underlying support matrix.
Detection is also monochromatic, so that separate sequential positive and negative controls must be run to discriminate between a correct hybridization match, a mismatch, and background. All too often, ambiguities develop in reading sequences longer than a few hundred base pairs on account of sequence recurrences. For example, if a sequence one base shorter than the probe recurs three times in the target, the sequence position cannot be uniquely determined. The locations of these sequence ambiguities are called branch points.
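A simplified way to see where such ambiguities arise is to look for words one base shorter than the probe that are followed by more than one different base in the target. The sketch below uses that criterion as a stand-in for a full reconstruction analysis; the function name and the example sequence are hypothetical.

```python
from collections import defaultdict

def branch_points(target, n):
    """Map each (n-1)-base word to its possible next bases, keeping only
    words with more than one distinct continuation in the target."""
    successors = defaultdict(set)
    for i in range(len(target) - n + 1):
        word = target[i:i + n]
        successors[word[:-1]].add(word[-1])
    return {k: sorted(v) for k, v in successors.items() if len(v) > 1}

# Hypothetical target in which ACG recurs three times; probes are 4-mers.
ambiguous = branch_points("GACGTTACGAAACGC", 4)
```

Here the word ACG is followed by T, A and C at its three occurrences, so a 4-mer spectrum alone does not say which continuation belongs to which occurrence.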
Secondary structures often develop in the target nucleic acid, affecting the accessibility of the sequences. This could lead to blocks of sequence that are unreadable if the secondary structure is more stable than that which occurs on the complementary strand.
A final drawback is the possibility that certain probes will behave anomalously and, for one reason or another, be recalcitrant to hybridization under whatever standard sets of conditions are ultimately used. A simple example of this is the difficulty in finding matching conditions for probes rich in G/C content. A more complex example could be sequences with a high propensity to form triple helices. The only way to rigorously explore these possibilities is to carry out extensive hybridization studies with all possible oligonucleotides of length “n” under the particular format and conditions chosen. This is clearly impractical if many sets of conditions are involved.
Among the early publications discussing sequencing by hybridization, E. M. Southern (WO 89/10977) described methods whereby unknown, or target, nucleic acids are labeled and hybridized to a set of oligonucleotides of chosen length on a solid support, and the nucleotide sequence of the target is determined, at least partially, from knowledge of the sequence of the bound fragments and the pattern of hybridization observed. Although promising, as a practical matter, this method has numerous drawbacks. Probes are entirely single-stranded and binding stability is dependent upon the size of the duplex. However, every additional nucleotide of the probe necessarily increases the size of the array fourfold, creating a dichotomy which severely restricts its plausible use. Further, there is an inability to deal with branch point ambiguities or secondary structure of the target, and hybridization conditions will have to be tailored, or in some way accounted for, for each binding event. Attempts have been made to overcome or circumvent these problems.
R. Drmanac et al. (U.S. Pat. No. 5,202,231) is directed to methods for sequencing by hybridization using sets of oligonucleotide probes with random or variable sequences. These probes, although useful, suffer from some of the same drawbacks as the methodology of Southern (1989), and like Southern, fail to recognize the advantages of stacking interactions.
K. R. Khrapko et al. (FEBS Lett. 256:118-22, 1989; and J. DNA Sequencing and Mapping 1:375-88, 1991) attempt to address some of these problems using a technique referred to as continuous stacking hybridization. With continuous stacking, conceptually, the entire sequence of a target nucleic acid can be determined. Basically, the target is hybridized to an array of probes, again single-stranded, then denatured from the array, and the dissociation kinetics of denaturation are analyzed to determine the target sequence. Although also promising, discrimination between matches and mismatches (and simple background) is low and, further, as hybridization conditions are inconstant for each duplex, discrimination becomes increasingly reduced with increasing target complexity.
Another major problem with current sequencing formats is the inability to efficiently detect sequence information. In conventional procedures, individual sequences are separated by, for example, electrophoresis using capillary or slab gels. This step is slow, expensive and requires the talents of a number of highly trained individuals, and, more importantly, is prone to error. One attempt to overcome these difficulties has been to utilize the technology of mass spectrometry.
Mass spectrometry of organic molecules was made possible by the development of instruments able to volatize large varieties of organic compounds and by the discovery that the molecular ion formed by volatization breaks down into charged fragments whose structures can be related to the intact molecule. Although the process itself is relatively straightforward, actual implementation is quite complex. Briefly, the sample molecule or analyte is volatized and the resulting vapor passed into an ion chamber where it is bombarded with electrons accelerated to a compatible energy level. Electron bombardment ionizes the molecules of the sample analyte, and the ions formed are then directed to a mass analyzer. The mass analyzer, with its combination of electrical and magnetic fields, separates impacting ions according to their mass/charge (m/e) ratios. From these ratios, the molecular weights of the impacting ions can be determined and the structure and molecular weight of the analyte deduced. The entire process requires less than about 20 microseconds.
Attempts to apply mass spectrometry to the analysis of biomolecules such as proteins and nucleic acids have been disappointing. Mass spectrometric analysis has traditionally been limited to molecules with molecular weights of a few thousand daltons. At higher molecular weights, samples become increasingly difficult to volatize and large polar molecules generally cannot be vaporized without catastrophic consequences. The energy requirement is so significant that the molecule is destroyed or, even worse, fragmented. Mass spectra of fragmented molecules are often difficult or impossible to read. Fragment linking order, particularly useful for reconstructing a molecular structure, is lost in the fragmentation process. Both signal-to-noise ratio and resolution are significantly degraded. In addition, and specifically with regard to biomolecular sequencing, extreme sensitivity is necessary to detect the single base differences between biomolecular polymers to determine sequence identity.
A number of new methods have been developed based on the idea that heat, if applied with sufficient rapidity, will vaporize the sample biomolecule before decomposition has an opportunity to take place. This rapid heating technique is referred to as plasma desorption and there are many variations. For example, one method of plasma desorption involves placing a radioactive isotope such as Californium-252 on the surface of a sample analyte which forms a blob of plasma. From this plasma, a few ions of the sample molecule will emerge intact. Field desorption ionization, another form of desorption, utilizes strong electrostatic fields to literally extract ions from a substrate. In secondary ionization mass spectrometry or fast ion bombardment, an analyte surface is bombarded with ions which encourage the release of intact ions. Fast atom bombardment involves bombarding a surface with accelerated ions which are neutralized by a charge exchange before they hit the surface. Presumably, neutralization of the charge lessens the probability of molecular destruction, but not the creation of ionic forms of the sample. In laser desorption, photons comprise the vehicle for depositing energy on the surface to volatize and ionize molecules of the sample. Each of these techniques has had some measure of success with different types of sample molecules. Recently, there have also been a variety of techniques and combinations of techniques specifically directed to the analysis of nucleic acids.
Brennan et al. used nuclide markers to identify terminal nucleotides in a DNA sequence by mass spectrometry (U.S. Pat. No. 5,003,059). Stable nuclides, detectable by mass spectrometry, were placed in each of the four dideoxynucleotides used as reagents to polymerize cDNA copies of the target DNA sequence. Polymerized copies were separated electrophoretically by size and the terminal nucleotide identified by the presence of the unique label.
Fenn et al. describe a process for the production of a mass spectrum containing a multiplicity of peaks (U.S. Pat. No. 5,130,538). Peak components comprised multiply charged ions formed by dispersing a solution containing an analyte into a bath gas as highly charged droplets. An electrostatic field charged the surface of the solution and dispersed the liquid into a spray of charged droplets referred to as an electrospray (ES). This nebulization provided a high charge/mass ratio for the droplets, increasing the upper limit of volatization. Detection was still limited to less than about 100,000 daltons.
Jacobson et al. utilize mass spectrometry to analyze a DNA sequence by incorporating stable isotopes into the sequence (U.S. Pat. No. 5,002,868). Incorporation required the steps of enzymatically introducing the isotope into a strand of DNA at a terminus, electrophoretically separating the strands to determine fragment size and analyzing the separated strand by mass spectrometry. Although accuracy was stated to have been increased, electrophoresis was necessary to isolate the labeled strand.
Brennan also utilized stable markers to label the terminal nucleotides in a nucleic acid sequence, but added the step of completely degrading the components of the sample prior to analysis (U.S. Pat. Nos. 5,003,059 and 5,174,962). Nuclide markers, enzymatically incorporated into either dideoxynucleotides or nucleic acid primers, were electrophoretically separated. Bands were collected, subjected to combustion and passed through a mass spectrometer. Combustion converts the DNA into oxides of carbon, hydrogen, nitrogen and phosphorus, and the label into sulfur dioxide. Labeled combustion products were identified and the mass of the initial molecule reconstructed. Although fairly accurate, the process does not lend itself to large scale sequencing of biopolymers.
A recent advancement in the mass spectrometric analysis of high molecular weight molecules in biology has been the development of time of flight mass spectrometry (TOF-MS) with matrix-assisted laser desorption ionization (MALDI). This process involves placing the sample into a matrix which contains molecules which assist in the desorption process by absorbing energy at the frequency used to desorb the sample. The theory is that volatization of the matrix molecules encourages volatization of the sample without significant destruction. Time of flight analysis utilizes the travel time or flight time of the various ionic species as an accurate indicator of molecular mass. There have been some notable successes with these techniques.
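The link between flight time and mass can be written out for an idealized linear instrument. The relation below is the standard textbook form and is not taken from the cited references; the symbols are assumptions introduced here: an ion of mass m carrying z elementary charges e is accelerated through a potential difference V and then drifts with velocity v over a field-free length L in time t.

```latex
\begin{align}
  z e V &= \tfrac{1}{2} m v^{2}, \\
  t &= \frac{L}{v} = L \sqrt{\frac{m}{2 z e V}}, \\
  \frac{m}{z} &= \frac{2 e V t^{2}}{L^{2}}.
\end{align}
```

Under these assumptions the flight time grows with the square root of the mass-to-charge ratio, so heavier ions of the same charge arrive later at the detector.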
Beavis et al. proposed to measure the molecular weights of DNA fragments in mixtures prepared by either Maxam-Gilbert or Sanger sequencing techniques (U.S. Pat. No. 5,288,644). Each of the different DNA fragments to be generated would have a common origin and terminate at a particular base along an unknown sequence. The separate mixtures would be analyzed by laser desorption time of flight mass spectroscopy to determine fragment molecular weights. Spectra obtained from each reaction would be compared using computer algorithms to determine the location of each of the four bases and ultimately, the sequence of the fragment.
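The comparison of the four spectra can be sketched as follows. This is a minimal illustration of the idea rather than the algorithm of the cited patent: it assumes one clean peak per fragment, that each fragment appears in exactly one base-specific mixture, and that heavier fragments extend further from the common origin. The function name and all mass values are hypothetical.

```python
def read_sequence(ladders):
    """Merge base-specific fragment masses and read bases in order of mass.

    ladders maps each base to the fragment masses (in daltons) observed in
    the mixture whose fragments terminate at that base.
    """
    peaks = [(mass, base) for base, masses in ladders.items() for mass in masses]
    return "".join(base for mass, base in sorted(peaks))

# Hypothetical peak lists from four base-specific reaction mixtures.
ladders = {
    "A": [1500.0, 2750.5],
    "C": [1800.2],
    "G": [2100.4, 3080.7],
    "T": [2420.6],
}
sequence = read_sequence(ladders)
```

With these illustrative values the merged ladder reads ACGTAG, moving outward from the common origin.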
Williams et al. utilized a combination of pulsed laser ablation, multiphoton ionization and time of flight mass spectrometry. Effective laser desorption was accomplished by ablating a frozen film of a solution containing sample molecules. When ablated, the film produces an expanding vapor plume which entrains the intact molecules for analysis by mass spectrometry.
Even more recent developments in mass spectrometry have further increased the upper limits of molecular weight detection and determination. Mass spectrograph systems with reflectors in the flight tube have effectively doubled resolution. Reflectors also compensate for errors in mass caused by the fact that the ionization/acceleration region of the instrument is not a point source, but an area of finite size wherein ions can begin accelerating at any point. Spatial differences between the origination points of the particles, problematic in conventional instruments because arrival times at the detector will vary, are overcome. Particles that spend more time in the accelerating field will also spend more time in the retarding field. Therefore, all particles emerging from the reflector should be synchronous, vastly improving resolution.
Despite these advances, it is still not possible to generate coordinated spectra representing a continuous sequence. Furthermore, throughput is slow enough to make these methods impractical for large scale analysis of sequence information.