The determination of the nucleotide sequence of DNA has become a routine analytical procedure for research in molecular biology. Current methods of sequencing involve the subdivision of a parent DNA segment into numerous, overlapping pieces all of which have one end in common. The pieces of different sizes are separated into different bands by a polyacrylamide electrophoretic gel. The bands correspond to the nucleotides making up the DNA sequence of each piece. The bands of all of the pieces are read off the gel to obtain the complete sequence of up to about 300 nucleotides of DNA adjacent to the common end. The actual accumulation of sequence data for short (less than 300 nucleotides) DNA segments using this approach is relatively rapid if either the dideoxynucleotide triphosphate ("dideoxy") chain termination method of Sanger [Sanger, Nicklen and Coulson, Proc. Natl. Acad. Sci. (hereinafter, PNAS) USA, 74 (12), 5463 (1977)], or the chemical degradation reactions of Maxam and Gilbert [Maxam and Gilbert, Methods in Enzymology, 65, 499 (1980)] is used.
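The way a sequence is read from pieces sharing one common end can be illustrated with a minimal sketch. It is a simplified model and not a description of either cited laboratory method: the function names `chain_termination_ladder` and `read_gel` are hypothetical, the input is taken to be the strand being read, and every band is assumed to be perfectly resolved.

```python
def chain_termination_ladder(strand):
    """Model the nested pieces that share a common end.

    Each piece is represented as (length, terminal_base): the piece of
    length n ends at the n-th nucleotide counted from the common end.
    """
    return [(i + 1, base) for i, base in enumerate(strand)]


def read_gel(fragments):
    """Read bands from shortest to longest, one lane per terminal base."""
    lanes = {base: set() for base in "ACGT"}
    for length, base in fragments:
        lanes[base].add(length)
    longest = max(length for length, _ in fragments)
    sequence = []
    for length in range(1, longest + 1):
        for base, bands in lanes.items():
            if length in bands:
                sequence.append(base)
                break
        else:
            sequence.append("N")  # no band found at this length
    return "".join(sequence)


ladder = chain_termination_ladder("GATTACA")
print(read_gel(ladder))
```

In this model the order in which the pieces are supplied does not matter, since the gel sorts them by length. A missing band is reported as `N` rather than guessed.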
While the complete sequence analysis of short DNA segments can be done relatively rapidly, the rapid, complete sequence analysis of long DNA segments (greater than 300 nucleotides) has proven significantly more difficult. This is because electrophoretic gel band resolution decreases exponentially as the length of the piece of DNA to be electrophoresed increases. While separation between adjacent bands is readily visible when the pieces of DNA are less than approximately 200 nucleotides long, the exponential decrease in resolution of adjacent bands puts a practical upper limit of approximately 300 nucleotides on the length of a DNA segment that can be sequenced using a gel.
In the past, complete sequencing of long DNA segments by the Sanger method has been accomplished using a variety of techniques that involve the subdivision of the segment into smaller pieces followed by subcloning of the small pieces. These techniques make use of recombinant DNA technology, using DNA cloning vehicles and other tools of genetic engineering, to produce a cloned set of DNA pieces for sequencing. See Old and Primrose (Eds), Principles of Gene Manipulation, An Introduction to Genetic Engineering, 2nd Ed., University of California Press, (1981) for a survey of such techniques. The prior techniques may be classified into two broad groupings: (1) random subdivision of the parent DNA segment followed by random selection of subclones for sequencing [see Anderson, Nucl. Acids Res., 8, 5541 (1980); Messing, Crea and Seeburg, Nucl. Acids Res., 9, 309 (1981); Messing, Methods in Enzymology, 101, 20 (1983); and Sanger, Coulson, Hong, Hill and Peterson, J. Mol. Biol., 162, 729 (1982)]; and (2) random subdivision of the parent DNA followed by nonrandom selection of subclones for sequencing [see Frischauf, Garoff, and Lehrach, Nucl. Acids Res., 8, 5541 (1980); Barnes, Bevan and Son, Methods in Enzymology, 101, 98 (1983); Barnes and Bevan, Nucl. Acids Res., 11, 349 (1983); Hong, J. Mol. Biol., 158, 4298 (1982); and Poncz, Solowiejzyk, Ballantine, Schwartz and Surrey, PNAS USA, 79, 4298 (1982)].
In the random subdivision/random selection approach to sequencing, long segments of DNA are randomly divided into smaller pieces. The pieces are subcloned into single-stranded phage vectors and selected at random. The major disadvantage of this technique is that complete sequencing of large segments becomes increasingly tedious and, thus, more difficult when only a small portion of the sequence remains to be determined. This difficulty arises because the remaining small portions of DNA require the use of relatively time-consuming nonrandom sequencing procedures such as restriction enzyme analysis and further subcloning. Additionally, repeating portions of the sequence can cause ambiguities in the computer-assisted alignment of overlapping pieces.
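The diminishing return of random selection can be illustrated with a small simulation. This is a sketch under stated assumptions and is not taken from any of the cited publications: the function name `clones_needed`, the 3000-nucleotide segment, the fixed 300-nucleotide read, and the uniform choice of piece start positions are all hypothetical simplifications.

```python
import random


def clones_needed(segment_len=3000, read_len=300,
                  targets=(0.5, 0.9, 1.0), seed=0):
    """Count randomly selected clones sequenced before each coverage target.

    Pieces start at uniformly random positions (assumption); a piece that
    overhangs either end of the segment is clipped to the segment.
    Returns {target_fraction: number_of_clones}.
    """
    rng = random.Random(seed)
    covered = [False] * segment_len
    n_covered = 0
    clones = 0
    results = {}
    remaining = sorted(targets)
    while remaining:
        start = rng.randrange(-read_len + 1, segment_len)
        clones += 1
        for pos in range(max(start, 0), min(start + read_len, segment_len)):
            if not covered[pos]:
                covered[pos] = True
                n_covered += 1
        # Record the clone count the first time each target is reached.
        while remaining and n_covered >= remaining[0] * segment_len:
            results[remaining.pop(0)] = clones
    return results


for fraction, count in clones_needed().items():
    print(f"{fraction:.0%} of the segment covered after {count} clones")
```

Running the sketch with different seeds typically shows that closing the last few gaps requires many more clones than reaching the first half of the sequence, which is the tedium described above. Exact counts depend on the seed and on the assumed parameters.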
The problems associated with random DNA sequencing techniques have led to the development of a second approach to the sequencing of long DNA segments. In this approach, the DNA segment to be sequenced, which is usually present as part of a circular recombinant DNA molecule, is first broken at random to create linear DNA molecules. The linear DNA molecules are then cleaved with a restriction enzyme which is chosen so as to release the target DNA from the vector and which introduces a common end to be used for sequencing. The randomly divided target segments may be selected on the basis of size either before or after the final step of reinsertion into a cloning vector. This approach may also involve the introduction of a common site at different points located about 200-300 nucleotides apart in the long target DNA segment, which helps make all regions accessible to DNA sequencing methods after the above-mentioned steps of release from the vector and size selection.
Hong (supra, 1982) describes the use of one form of this nonrandom selection technique to linearize cloned DNA by making double-stranded cuts in the M13 DNA molecule at random. He then cut, with a restriction endonuclease, the linear double-stranded DNA adjacent to the primer binding site on the M13 vector portion of the DNA molecule before recircularizing by ligation. This was followed by transfection of cells to produce a mixture of deletions wherein the primer binding site was placed adjacent to random points within the long DNA segment. Next, the nonrandom selection of deletion clones was attempted by the choice of bacteriophage plaques, since large plaques are made by bacteriophages with extensive deletions and small plaques are made by bacteriophages with small deletions. Unfortunately, this selection method was found to be extremely crude and unreliable.
Hong's technique is similar to an earlier approach which constructed deletions in DNA carried in plasmid vectors [Frischauf et al., supra, (1980)]. In that approach, the nonrandom selection was done using gel electrophoresis prior to circularization. Unfortunately, obtaining an ordered set of deletions that differ by only about 200 nucleotides for DNA carried in bacteriophage vectors has proven to be impractical because of the larger size of these vectors. Bacteriophage vectors are most useful for sequencing long segments of DNA because they allow the direct application of the Sanger dideoxynucleotide sequencing method.
Another nonrandom selection method is that of Poncz et al. (supra, 1982) wherein the exonuclease Bal 31 is used to obtain a set of randomly deleted fragments, which are subcloned into an M13 bacteriophage vector. As best understood, this method appears to suffer from the same difficulties as that of Hong (supra, 1982). The practicality of this method is also unclear since neither the cloning efficiency nor the actual distribution of deletion breakpoints in the segments is described in the Poncz publication.
Still another example is the nonrandom method of Barnes, Bevan and Son (Methods in Enzymology, supra, 1983), wherein infectious M13 bacteriophages were fractionated by agarose gel electrophoresis after random deletions were made in the inserted, target DNA by a series of enzymatic treatments. The result is a group of cloned DNA fragments with a succession of deleted intervals. For each interval, half of the deletion breakpoints are within 1000 nucleotides (Barnes and Bevan, supra, 1983). Although very long DNA segments can be usefully subdivided using this technique, random selection within each fairly large interval is necessary in order to obtain deletions that differ in length by approximately 200 nucleotides, as required for sequence determination.
The advantages of nonrandom sequencing extend beyond sequencing of large DNA molecules. For example, nonrandom sequencing allows a particular region within a larger cloned DNA fragment to be targeted for dideoxy sequencing or other manipulations without the need for extensive prior knowledge of the restriction enzyme sites within that region. The deletions also can assist in mapping genes, regulatory regions, and even functionally significant regions within genes.
The strategies described briefly above represent prior attempts to subdivide a large segment into smaller overlapping intervals and to nonrandomly select the resulting subclones for sequence analysis. The main disadvantage of these techniques is that the size of the intervals between the generated deletions is generally larger (more than approximately 200 nucleotides) than can be practically sequenced without choosing and sorting a great many clones at random from each interval. Further, none of the above-described nonrandom methods offers significant advantages over purely random methods for sequencing a large DNA segment, as attested to by the fact that none of the methods is widely employed at present.
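The effect of the interval between deletion breakpoints on sequence coverage can be checked with a short sketch. It assumes, for illustration only, that each deletion clone yields one read of 300 nucleotides beginning at its breakpoint; the function name `unsequenced_gaps` and the example spacings are hypothetical.

```python
def unsequenced_gaps(breakpoints, segment_len, read_len=300):
    """Return the (start, end) intervals that no read covers.

    Each breakpoint is assumed to lie within the segment and to give a
    read covering positions [breakpoint, breakpoint + read_len).
    """
    gaps = []
    reach = 0  # first position not yet covered by any read
    for start in sorted(breakpoints):
        if start > reach:
            gaps.append((reach, start))
        reach = max(reach, start + read_len)
    if reach < segment_len:
        gaps.append((reach, segment_len))
    return gaps


# Breakpoints every 200 nucleotides: consecutive reads overlap.
print(unsequenced_gaps(range(0, 3000, 200), 3000))
# Breakpoints every 1000 nucleotides: most of each interval is unread.
print(unsequenced_gaps(range(0, 3000, 1000), 3000))
```

Under these assumptions, breakpoints spaced closer than one read length leave no gaps, whereas breakpoints spaced about 1000 nucleotides apart leave most of each interval unsequenced unless further clones are picked from within it.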
In general, this invention describes a nonrandom process for initially generating overlapping shortened derivatives of a long DNA segment, suitable for DNA sequencing. As will be better understood from the following description, it differs from the methods described above, which rely on random generation of deletions, primarily in that nonrandom selection is unnecessary. As a result of this difference, intervals between successive deletion breakpoints can be chosen which are sufficiently small to allow complete sequencing of long DNA segments.