DNA sequencing is a fundamental research tool with wide-ranging applications. A common approach to DNA sequencing involves the subcloning of a large DNA fragment as smaller, overlapping fragments, the sequences of which are subsequently determined using the dideoxynucleotide chain termination approach (Sanger and Coulson, Proc. Natl. Acad. Sci. USA 74: 5463 (1977)).
Subcloning, and the restriction mapping required to efficiently subclone fragments, is a time-consuming and labor-intensive process. However, given the limited amount of sequence which can be determined from a single extension reaction, it is necessary to initiate a new sequencing reaction about every 300-400 base pairs along the fragment whose sequence is to be determined.
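The spacing requirement stated above can be illustrated with a short calculation. The following is a minimal sketch only: the function name `priming_sites`, the 3,000 base pair fragment, and the default spacing of 350 base pairs (the midpoint of the 300-400 base pair range) are assumptions chosen for illustration and are not taken from the cited references.

```python
def priming_sites(fragment_length_bp, spacing_bp=350):
    """Return 0-based positions at which new sequencing reactions would start.

    spacing_bp is an illustrative value within the 300-400 bp range;
    it is an assumption, not a measured read length.
    """
    if fragment_length_bp <= 0 or spacing_bp <= 0:
        raise ValueError("fragment length and spacing must be positive")
    return list(range(0, fragment_length_bp, spacing_bp))


if __name__ == "__main__":
    # Hypothetical 3,000 bp fragment, evaluated at both ends of the range.
    for spacing in (300, 400):
        sites = priming_sites(3000, spacing)
        print(f"spacing {spacing} bp: {len(sites)} reactions, starts at {sites}")
```

Under these assumptions, a 3,000 base pair fragment calls for roughly eight to ten separate starting points on one strand, each of which must be supplied by a subclone, a deletion endpoint, or a primer.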
One alternative to the subcloning approach is described by Henikoff et al. in U.S. Pat. Nos. 4,843,003 and 4,889,799. More specifically, Henikoff et al. describe a method in which a vector containing a DNA sequence of interest is linearized by digestion at two restriction endonuclease recognition sites, one generating a 5' overhang and the other a blunt end or 3' overhang. Timed digestion with E. coli Exo III from the 5' overhang, followed by treatment with a single-strand-specific nuclease, generates a nested array of deletions. Unfortunately, this technique is also limited by the need for conveniently located restriction endonuclease recognition sequences.
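The relationship between digestion time and deletion size in such a timed digestion can be sketched as follows. This is an idealized illustration only: it assumes a constant digestion rate, and the function name `nested_deletions`, the rate of 300 base pairs per minute, the 3,000 base pair insert, and the time points are all hypothetical values, not parameters reported by Henikoff et al.

```python
def nested_deletions(insert_length_bp, rate_bp_per_min, timepoints_min):
    """Return the insert length remaining after each timed digestion.

    Assumes an idealized, constant digestion rate (an assumption made
    for illustration); digestion cannot remove more than the insert.
    """
    remaining = []
    for minutes in timepoints_min:
        deleted = min(insert_length_bp, rate_bp_per_min * minutes)
        remaining.append(insert_length_bp - deleted)
    return remaining


if __name__ == "__main__":
    # Hypothetical values: 3,000 bp insert, 300 bp/min, samples every minute.
    for minutes, length in zip(range(5), nested_deletions(3000, 300, range(5))):
        print(f"{minutes} min: {length} bp of insert remaining")
```

Each sampled time point yields a clone whose deletion endpoint lies progressively deeper in the insert, so that a single vector-based primer can read a different region from each member of the nested set.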
An alternative to the approach described above was outlined by Chang et al. (Gene 127: 95 (1993)). Chang et al. describe a method in which a single-stranded nick is introduced at a position adjacent to the site at which a DNA fragment having a sequence which is to be determined is inserted in a cloning vector. The nick in the DNA is then extended under controlled digestion conditions to produce a single-stranded gap. The single-stranded gap is then treated with a nuclease which specifically digests single-stranded DNA, thereby producing a deletion within the DNA sequence of interest.
Chang et al. specifically report that the single-stranded nick in the DNA of interest cannot be expanded by treatment with E. coli Exo III. Given the fact that Exo III is a well-understood, relatively inexpensive enzyme, Chang et al. note that this is an unfortunate finding (page 96, column 2). The development of protocols which would enable the use of Exo III in such a DNA sequencing strategy would represent an important improvement in the art.
In the last decade significant effort and resources have been devoted to the sequencing of entire genomes and expressed sequences of various organisms. The human genome specifies an estimated 60,000 to 100,000 proteins and the mouse genome a comparable number. Large-scale single-pass sequencing of cDNAs has been very successful in gene identification. The current version of UniGene, the NCBI clustering of known genes and expressed sequence tags, contains more than 60,000 clusters of human expressed sequences, more than 8,000 of which correspond to known genes. The corresponding collection from mouse contains more than 15,000 clusters, almost 4,500 of which correspond to known genes. Such efforts rely on normalized cDNA libraries from a variety of tissues.
Although cDNA clones need not be full length to provide valuable information by single-pass sequencing from their ends, the complete sequence of full-length cDNA clones provides the coding sequences for the proteins and, by comparing with genome sequence, the locations of introns. Full-length clones are also needed for expressing proteins for functional analysis. Several efforts are underway to generate normalized cDNA libraries that will be enriched for full-length clones (Bonaldo et al., Genome Res. 6(9): 791-806 (1996); Ohara et al., DNA Res. 4(1): 53-9 (1997); Suzuki et al., Gene. 200(1-2): 149-56 (1997)).
Ultimately, sequences of tens of thousands of full-length cDNAs from human and mouse will be needed for understanding the functions of human mRNAs and proteins. The variety of uses to which these sequences will be put requires that they be determined at the highest possible accuracy, certainly at or better than the 99.99% accuracy required of genome sequences. Obtaining the complete sequence of these cDNA clones, most of which will range in length from a few hundred to several thousands of base pairs, is not particularly efficient with the shotgun sequencing procedures currently used for high-throughput sequencing of large-insert genomic clones. Inserts must first be purified from vector fragments and then shotgun libraries must be prepared either from the insert itself (Ohara et al., DNA Res. 4(1): 53-9 (1997)) or from mixtures of inserts from different clones ligated together (Yu et al., Genome Res. 7(4): 353-8 (1997)). Primer walking may be a viable alternative, particularly for shorter clones: the cost of the many primers needed for primer walking has decreased markedly, whether obtained commercially, produced locally with high capacity synthesizers (Lashkari et al., Proc. Natl. Acad. Sci. USA 92(17): 7912-5 (1995); Rayner et al., Genome Res. 8(7): 741-7 (1998)), or generated by ligation from a hexamer library (Dunn et al., Anal Biochem. 228(1): 91-100 (1995)). Other possible approaches include sequencing by priming from the ends of an appropriately spaced collection of insertion elements selected after random insertion events (Strathmann et al., Proc. Natl. Acad. Sci. USA 88(4): 1247-50 (1991); Berg et al., Gene. 113(1): 9-16 (1992); Martin et al., Proc. Natl. Acad. Sci. USA 92(18): 8398-402 (1995); York et al., Nucleic Acids Res. 26(8): 1927-33 (1998)) or the use of nested deletions to generate sequencing substrates (reviewed in Ausubel, F. M., R. Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl. 1995. Current Protocols in Molecular Biology, vol. 1. Wiley, New York).
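The practical meaning of the 99.99% accuracy figure can be illustrated with a short calculation. This is a minimal sketch under a simplifying assumption that errors occur independently at each base; the function names and the 2,000 base pair clone length are hypothetical and chosen only for illustration.

```python
def expected_errors(length_bp, error_rate=1e-4):
    """Expected number of erroneous bases at a given per-base error rate.

    error_rate=1e-4 corresponds to 99.99% per-base accuracy.
    """
    return length_bp * error_rate


def prob_error_free(length_bp, error_rate=1e-4):
    """Probability that a clone has no errors, assuming independent errors.

    Independence between bases is a simplifying assumption.
    """
    return (1.0 - error_rate) ** length_bp


if __name__ == "__main__":
    length = 2000  # hypothetical cDNA clone length in base pairs
    print(f"expected errors: {expected_errors(length):.2f}")
    print(f"probability of an error-free clone: {prob_error_free(length):.3f}")
```

Under these assumptions, a 2,000 base pair clone sequenced at 99.99% accuracy carries on average about 0.2 errors, so that roughly one clone in five would still be expected to contain at least one incorrect base.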
Sequencing highly repeated regions of human DNA, such as those found near telomeres and centromeres, presents its own set of difficulties. Sequencing by producing nested deletions is a viable solution for sequencing such difficult DNA sequences.
Several procedures have been developed to generate nested deletions (Henikoff, S., Gene. 28(3): 351-9 (1984); Chang et al., Gene. 127(1): 95-8 (1993); Kawarabayasi et al., DNA Res. 1(6): 289-96 (1994); Shearer, G., Jr., Anal Biochem. 223(1): 105-10 (1994); Hattori et al., Nucleic Acids Res. 25(9): 1802-8 (1997); Ren et al., Anal Biochem. 245(1): 112-4; Fradkov et al., Anal Biochem. 258(1): 138-41 (1998)).
Vectors currently used for preparing and normalizing cDNA libraries are multi-copy (Soares et al., Proc. Natl. Acad. Sci. USA 91(20): 9228-32 (1994); Bonaldo et al., Genome Res. 6(9): 791-806 (1996); Ohara et al., DNA Res. 4(1): 53-9 (1997)). This may be convenient for obtaining the substantial amounts of DNA needed for normalization procedures or for preparing hybridization arrays with the resulting clones, but longer clones, and clones that can express proteins, are typically more stable in single-copy vectors, such as those based on the F or P1 lysogenic replicons (Shizuya et al., Proc. Natl. Acad. Sci. USA 89(18): 8794-7 (1992); Ioannou et al., Nat. Genet. 6(1): 84-9 (1994)). Thus, a single-copy vector might be expected to provide a more uniform representation of full-length cDNA clones than a multi-copy one.
Although a tremendous amount of progress has been made, the development of more efficient and accurate methods for high-throughput sequencing is needed to further expedite the process.