1. Technical Field
This invention is directed to methods for sequencing nucleic acids by positional hybridization, to procedures combining these methods with more conventional sequencing techniques, to the creation of probes useful for nucleic acid sequencing by positional hybridization, to diagnostic aids useful for screening biological samples for nucleic acid variations, and to methods for using these diagnostic aids.
2. Description of the Prior Art
Since the recognition of nucleic acid as the carrier of the genetic code, a great deal of interest has centered around determining the sequence of that code in the many forms in which it is found. Two landmark studies made the process of nucleic acid sequencing, at least with DNA, a common and relatively rapid procedure practiced in most laboratories. The first describes a process whereby terminally labeled DNA molecules are chemically cleaved at single base repetitions (A. M. Maxam and W. Gilbert, Proc. Natl. Acad. Sci. USA 74:560-564, 1977). Each base position in the nucleic acid sequence is then determined from the molecular weights of fragments produced by partial cleavages. Individual reactions were devised to cleave preferentially at guanine, at adenine, at cytosine and thymine, and at cytosine alone. When the products of these four reactions are resolved by molecular weight, using, for example, polyacrylamide gel electrophoresis, DNA sequences can be read from the pattern of fragments on the resolved gel.
The second study describes a procedure whereby DNA is sequenced using a variation of the plus-minus method (F. Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463-67, 1977). This procedure takes advantage of the chain terminating ability of dideoxynucleoside triphosphates (ddNTPs) and the ability of DNA polymerase to incorporate ddNTPs with nearly the same fidelity as its natural substrates, deoxynucleoside triphosphates (dNTPs). Briefly, a primer, usually an oligonucleotide, and a template DNA are incubated together in the presence of a useful concentration of all four dNTPs plus a limited amount of a single ddNTP. The DNA polymerase occasionally incorporates a dideoxynucleotide which terminates chain extension. Because the dideoxynucleotide has no 3'-hydroxyl, the initiation point for the polymerase enzyme is lost. Polymerization produces a mixture of fragments of varied sizes, all sharing the primer's 5' terminus and ending in the same dideoxynucleotide at their 3' termini. Fractionation of the mixture by, for example, polyacrylamide gel electrophoresis, produces a pattern which indicates the presence and position of each base in the nucleic acid. Reactions with each of the four ddNTPs allow one of ordinary skill to read an entire nucleic acid sequence from a resolved gel.
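The way a sequence is read from the four chain-termination reactions can be illustrated with a small idealized model. The sketch below is only an assumption-laden illustration, not a description of the laboratory procedure: it supposes that every position of the newly synthesized strand yields exactly one terminated fragment and that fragment length is measured in nucleotides added to the primer. The function names `sanger_ladder` and `read_gel` are hypothetical.

```python
def sanger_ladder(synthesized):
    """Idealized model: one terminated fragment per position.

    Returns, for each ddNTP reaction (lane), the lengths of the fragments
    that end in that dideoxynucleotide. Length counts the nucleotides
    added to the primer, so the first added base has length 1.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from shortest to longest fragment across lanes."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


lanes = sanger_ladder("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Because each fragment length appears in exactly one lane, ordering the bands by size across the four lanes recovers the synthesized strand.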
Despite their advantages, these procedures are cumbersome and impractical when one wishes to obtain megabases of sequence information. Further, these procedures are, for all practical purposes, limited to sequencing DNA. Although variations have developed, it is still not possible using either process to obtain sequence information directly from any other form of nucleic acid.
A new method of sequencing, in which sequence information is obtained in multiple discrete packages by hybridization, has been developed which overcomes some of the problems associated with current methodologies. Instead of having a particular nucleic acid sequenced one base at a time, groups of contiguous bases are determined simultaneously. Advantages in speed, expense and accuracy are clear.
Two general approaches of sequencing by hybridization have been suggested. Their practicality has been demonstrated in pilot studies. In one format, a complete set of 4^n oligonucleotides of length n is immobilized as an ordered array on a solid support and an unknown DNA sequence is hybridized to this array (K. R. Khrapko et al., J. DNA Sequencing and Mapping 1:375-88, 1991). The resulting hybridization pattern provides all n-tuple words in the sequence. This is sufficient to determine short sequences except for simple tandem repeats.
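The idea that the set of n-tuple words determines a short sequence, but not a simple tandem repeat, can be sketched as follows. This is a minimal illustration under strong assumptions (error-free hybridization, a known starting word, and n of at least 2); the function names `spectrum` and `reconstruct` are hypothetical and do not come from the cited studies.

```python
def spectrum(target, n):
    """Set of all n-tuple words present in the target (idealized pattern)."""
    return {target[i:i + n] for i in range(len(target) - n + 1)}


def reconstruct(words, start):
    """Extend from `start` by unique (n-1)-base overlaps.

    Stops when no unused word, or more than one unused word, overlaps the
    current end. Returns the assembled sequence and the candidate words
    at the stopping point (an empty list means a clean end).
    """
    n = len(start)
    sequence = start
    remaining = set(words) - {start}
    while True:
        suffix = sequence[-(n - 1):]
        candidates = sorted(w for w in remaining if w.startswith(suffix))
        if len(candidates) != 1:
            return sequence, candidates
        sequence += candidates[0][-1]
        remaining.remove(candidates[0])


words = spectrum("ATGCAGT", 3)
print(reconstruct(words, "ATG"))

# Tandem repeats of different lengths give the same set of words,
# so the repeat count cannot be recovered from the word set alone.
print(spectrum("ACACAC", 3) == spectrum("ACACACAC", 3))
```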
In the second format, an array of immobilized samples is hybridized with one short oligonucleotide at a time (Z. Strezoska et al., Proc. Natl. Acad. Sci. USA 88:10,089-93, 1991). When repeated 4^n times, once for each oligonucleotide of length n, much of the sequence of all the immobilized samples would be determined. In both approaches, the intrinsic power of the method is that many sequence regions are determined in parallel. In actual practice the array size is about 10^4 to 10^5.
Another powerful aspect of the method is that information obtained is quite redundant, especially as the size of the nucleic acid probe grows. Mathematical simulations have shown that the method is quite resistant to experimental errors and that far fewer than all probes are necessary to determine reliable sequence data (P. A. Pevzner et al., J. Biomol. Struc. & Dyn. 9:399-410, 1991; W. Bains, Genomics 11:295-301, 1991).
In spite of an overall optimistic outlook, there are still a number of potentially severe drawbacks to actual implementation of sequencing by hybridization. First and foremost among these is that 4^n rapidly becomes quite a large number if chemical synthesis of all of the oligonucleotide probes is actually contemplated. Various schemes of automating this synthesis and compressing the products into a small scale array, a sequencing chip, have been proposed.
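How quickly the number of probes grows with probe length can be checked with simple arithmetic. The short sketch below is illustrative only; the function name `probe_count` and the chosen range of lengths are assumptions. The comparison with array sizes of about 10^4 to 10^5 uses the figure quoted earlier in this section.

```python
def probe_count(n):
    """Number of distinct oligonucleotides of length n over A, C, G, T."""
    return 4 ** n


# Complete probe sets for a few probe lengths.
for n in range(6, 11):
    print(n, probe_count(n))

# Probe lengths whose complete set fits an array of 10^4 to 10^5 elements.
fits = [n for n in range(1, 16) if 10**4 <= probe_count(n) <= 10**5]
print(fits)
```

Each added base multiplies the size of the complete probe set by four, so a set that is manageable at one length is four times larger at the next.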
A second drawback is the poor level of discrimination between a correctly hybridized, perfectly matched duplex and a duplex with an end mismatch. In part, these drawbacks have been addressed, at least to a small degree, by the method of continuous stacking hybridization as reported by Khrapko et al. (FEBS Lett. 256:118-22, 1989). Continuous stacking hybridization is based upon the observation that when a single stranded oligonucleotide is hybridized adjacent to a double stranded oligonucleotide, the two duplexes are mutually stabilized as if they are positioned side to side due to a stacking contact between them. The stability of the interaction decreases significantly as stacking is disrupted by nucleotide displacement, gap, or terminal mismatch. Internal mismatches are presumably ignorable because their thermodynamic stability is so much less than that of perfect matches. Although the method is promising, a related problem arises: distinguishing between weak but correct duplex formation and simple background such as non-specific adsorption of probes to the underlying support matrix.
A third drawback is that detection is monochromatic. Separate sequential positive and negative controls must be run to discriminate between a correct hybridization match, a mismatch, and background.
A fourth drawback is that ambiguities develop in reading sequences longer than a few hundred base pairs on account of sequence recurrences. For example, if a sequence the same length as the probe recurs three times in the target, the sequence position cannot be uniquely determined. The locations of these sequence ambiguities are called branch points.
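Recurrences of this kind can be located in a known test sequence by counting where each word of the probe length occurs. The sketch below is a minimal illustration, not a method taken from the cited work; the function name `recurrences`, the `min_count` parameter, and the example target are hypothetical.

```python
from collections import defaultdict


def recurrences(target, length, min_count=2):
    """Map each word of the given length that occurs at least `min_count`
    times in the target to the list of its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(target) - length + 1):
        positions[target[i:i + length]].append(i)
    return {word: where for word, where in positions.items()
            if len(where) >= min_count}


# A made-up target in which the 4-base word ATGC occurs three times.
target = "ATGCAATGCTATGCG"
print(recurrences(target, 4, min_count=3))
```

In this example the word ATGC starts at three different positions, so a hybridization signal for that word alone does not say which of the three flanking contexts follows it.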
A fifth drawback is the effect of secondary structures in the target nucleic acid. This could lead to blocks of sequences that are unreadable if the secondary structure is more stable than that which occurs on the complementary strand.
A final drawback is the possibility that certain probes will have anomalous behavior and, for one reason or another, be recalcitrant to hybridization under whatever standard set of conditions is ultimately used. A simple example of this is the difficulty in finding matching conditions for probes rich in G/C content. A more complex example could be sequences with a high propensity to form triple helices. The only way to rigorously explore these possibilities is to carry out extensive hybridization studies with all possible oligonucleotides of length n, under the particular format and conditions chosen. This is clearly impractical if many sets of conditions are involved.