The complete analysis of large complex genomes, such as genomes of higher eukaryotes, including human, requires the extensive isolation, purification and analysis of large fragments of DNA by cloning, generally in E. coli. In the past, the lambda bacteriophage cloning system has been used most frequently to generate genomic libraries. The lambda bacteriophage vectors usually accommodate inserts up to about 20 kb. Presently the primary system used to clone and manipulate large DNA fragments is that of cosmid vectors. Cosmid vectors allow the packaging of DNA fragments of up to about 45 kb in plasmids containing bacteriophage cos sites for in vitro packaging.
The analysis of complex genomes involves the application of both "top-down" and "bottom-up" mapping strategies. The "top-down" strategy depends on the separation on pulsed field gels of large DNA fragments generated using rare restriction endonucleases for physical linkage of DNA markers and the construction of long-range maps [Schwartz et al., Cell 37, 67 (1984); Southern et al., Nucleic Acids Res. 15, 5925 (1987); Burke et al., Science 236, 806 (1987)]. The "bottom-up" strategy depends on identifying overlapping sequences in a large number of randomly selected bacteriophage or cosmid clones by unique restriction enzyme "fingerprinting" and on assembling them into overlapping sets of clones. "Top-down" mapping is inherently more rapid and less labor-intensive but does not generate sets of DNA clones for further structural or biological analysis. "Bottom-up" mapping generates the required sets of overlapping clones, but application of current strategies and pattern matching algorithms to mammalian genomes will require the analysis of thousands to tens of thousands of individual clones for the generation of complete maps.
In the past few years, "bottom-up" mapping strategies have been successfully applied to generate complete or partial genomic maps of E. coli, C. elegans and S. cerevisiae.
Olson et al., Proc. Natl. Acad. Sci. USA 83, 7826 (1986), fingerprinted 5000 randomly selected lambda clones containing inserts of about 15 kb of genomic DNA from S. cerevisiae, by measuring the restriction fragment lengths obtained upon double digestion with EcoRI and HindIII. They used a pattern matching algorithm to construct overlapping sets of clones (contigs) extending over about 60% of the S. cerevisiae genome.
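The fingerprint comparison described above can be illustrated with a minimal sketch. It is not the algorithm of Olson et al.; it only assumes that each clone is represented by a list of restriction fragment lengths (in bp) and that two clones are candidates for overlap when enough fragments agree within a relative sizing tolerance. The function names, the 2% tolerance, and the threshold of shared fragments are hypothetical choices made for illustration.

```python
from typing import Sequence


def shared_fragments(a: Sequence[float], b: Sequence[float], tol: float = 0.02) -> int:
    """Count fragments of clone a that pair with a distinct fragment of clone b.

    Two fragments are treated as the same band when their lengths differ by
    no more than `tol` (relative) of the larger one. Both lists are sorted and
    walked with two pointers, so each fragment is used at most once.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    i = j = matches = 0
    while i < len(a_sorted) and j < len(b_sorted):
        x, y = a_sorted[i], b_sorted[j]
        if abs(x - y) <= tol * max(x, y):
            matches += 1
            i += 1
            j += 1
        elif x < y:
            i += 1
        else:
            j += 1
    return matches


def likely_overlap(a: Sequence[float], b: Sequence[float],
                   min_shared: int = 3, tol: float = 0.02) -> bool:
    """Declare a candidate overlap when at least `min_shared` fragments match."""
    return shared_fragments(a, b, tol) >= min_shared


# Hypothetical EcoRI/HindIII double-digest fingerprints (fragment lengths in bp).
clone_a = [5200, 3100, 1800, 950, 400]
clone_b = [3120, 1790, 960, 7000, 2500]
print(shared_fragments(clone_a, clone_b), likely_overlap(clone_a, clone_b))
```

Raising `min_shared` makes a declared overlap more reliable but also increases the amount of true overlap needed before two clones can be joined.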
Coulson et al., Proc. Natl. Acad. Sci. USA 83, 7821 (1986) adopted a somewhat different methodology to construct a physical map of the genome of Caenorhabditis elegans, a nematode having a genome of approximately 8 × 10^7 base pairs. They digested cosmid DNAs with the restriction enzyme HindIII, which has a 6-bp specificity, filled the 5'-overhang with radioactive nucleotides, digested with the 4-bp specific enzyme Sau3A, and determined the size of the labeled fragments by electrophoresis in a sequencing gel followed by autoradiography. The mean size of the DNA inserts in the cosmid vectors was about 34 kb. Eight hundred sixty clusters of clones, totaling about 60% of the Caenorhabditis elegans genome, have been characterized.
Kohara et al., Cell 50, 495 (1987) analyzed 1025 lambda phage clones containing about 15.5-kb inserts of genomic E. coli DNA. For each clone they constructed a complete restriction map by means of eight restriction enzymes. The data for the 1025 clones were processed and sorted into 70 groups, including seven stand-alone clones, representing about 94% of the entire genome of E. coli.
While effective, the application of these "fingerprinting" and pattern matching strategies to mammalian genomes would require the individual analysis of tens or hundreds of thousands of clones for map construction, as well as highly efficient computer algorithms for pattern recognition. Moreover, these and similar "fingerprinting" protocols require a substantial overlap of 5 to 25% between clones for the overlapping region to be detected. A theoretical analysis of "fingerprinting" techniques has suggested that the efficiency of the analysis is strongly dependent on the criteria used to declare overlaps between clones. According to Lander et al., Genomics 2, 231 (1988), the minimum detectable overlap has a major effect on the progress of the mapping project. Reducing the degree of overlap required for detection would substantially decrease the number of clones which must be analyzed to obtain map closure.
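The dependence on the minimum detectable overlap can be made concrete with a short calculation. The sketch below assumes the standard random-clone model associated with the Lander et al. analysis, in which the expected number of apparent islands (contigs plus isolated clones) after fingerprinting N clones is N·exp(-c(1 - θ)), where c = N·L/G is the redundancy of coverage, L the insert length, G the genome length, and θ the fraction of a clone that must be shared for an overlap to be detected. The function name and the clone count are illustrative; the genome and insert sizes are the C. elegans figures quoted above.

```python
import math


def expected_islands(n_clones: int, insert_len: float,
                     genome_len: float, min_overlap_frac: float) -> float:
    """Expected number of apparent islands under the random-clone model.

    min_overlap_frac is theta, the fraction of a clone's length that two
    clones must share before the overlap can be detected.
    """
    coverage = n_clones * insert_len / genome_len   # c = N * L / G
    sigma = 1.0 - min_overlap_frac                  # usable fraction of each clone
    return n_clones * math.exp(-coverage * sigma)


# Illustrative inputs: 34-kb cosmid inserts, 8 x 10^7 bp genome,
# and a hypothetical library of 10,000 fingerprinted clones.
for theta in (0.5, 0.25, 0.05):
    islands = expected_islands(10_000, 34_000, 8e7, theta)
    print(f"theta = {theta:.2f}: about {islands:.0f} islands expected")
```

Under these assumptions, lowering θ at a fixed number of clones reduces the expected number of islands, which is the sense in which a smaller detectable overlap brings the map closer to closure with fewer clones.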
Another way of detecting overlaps is the identification of overlapping clones by hybridization with RNA probes instead of pattern recognition. The identification of several bacteriophage-encoded RNA polymerases and the sequencing of their promoters has spawned a new technology for producing RNA probes. Cloning vectors are now available in which the promoters for a single polymerase, or for two different polymerases, lie adjacent to a cloning site. Transcription with any of the available polymerases enables one to produce large quantities of high-specific activity RNA probes which correspond to either the coding or the non-coding strands [Wahl et al., Methods in Enzymology 152, 572 (1987)].
Wahl et al., Proc. Natl. Acad. Sci. USA 84, 2160 (1987) (see also U.S. Ser. No. 181,836 filed Apr. 15, 1988) have designed special cosmid vectors for rapid genomic "walking" and restriction mapping. These vectors (designated as pWE for "walking easily") contain the transcription promoters from either bacteriophage SP6, T7, or T3 flanking a unique cloning site for the insertion of genomic DNA fragments. These vectors allow the synthesis of end-specific RNA probes directly from the DNA inserts, and are suitable for the detection of overlapping regions of several hundred bp in contiguous cosmids.
One practical limitation of cloning in cosmid vectors, including the above pWE vectors, is that most vectors require the initial preparation of very high quality genomic DNA, digestion to the appropriate size range for cloning, and the careful purification of appropriately sized DNA fragments on gradients or gels [DiLella et al., Methods in Enzymology 152, 199 (1987)]. In the traditional cosmid cloning procedure, linearized cosmid vectors are dephosphorylated to avoid concatamerization prior to ligation to the DNA fragments. Since the DNA inserts cannot then be dephosphorylated as well, their size fractionation is required to avoid recombinational rearrangements caused by multiple inserts ligated into a single cosmid. For these manipulations, a substantial quantity of genomic DNA is required to construct a representative genomic library, and cosmid cloning has not been practical in situations where only submicrogram amounts of DNA can be isolated. Bates et al., Gene 26, 137 (1983) described cosmid vectors with two cos sites separated by a blunt-end restriction enzyme site. They found that the double cos-site vectors eliminate the need to prepare two separate cosmid arms, and that the internal blunt-end restriction site prevents cosmid concatamerization. Thus, a double restriction enzyme digestion was found to be sufficient to prepare a vector for subsequent ligation with DNA fragments which were dephosphorylated to prevent their self-ligation. This technique eliminated the need to purify insert DNA of the proper size (30-45 kb).
Ehrich et al., Gene 57, 229 (1987) have shown that the use of cosmid vectors with two or more cos sites simplifies the cloning procedure by eliminating the complex preparation of cloning "arms".