A primary goal of any genomic sequencing project is to determine the entire DNA sequence for a target organism. A related goal is to construct ordered clone maps of DNA sequences at 100 kilobase (kb) resolution for these organisms (D. R. Cox, et al., Science, 265:2031, 1994). Integrated maps that localize clones together with polymorphic genetic markers are particularly useful for positionally cloning human disease genes (F. Collins, Nature Genet., 1:3, 1992). Strategies for large-scale genomic DNA sequencing currently require physical mapping, followed by detailed mapping, and finally sequencing. The level of mapping detail determines the amount of effort, or sequence redundancy, required to finish a project. Efficient strategies for performing the requisite experimentation are critical for sequencing and mapping chromosomes or entire genomes. Current strategies attempt to find a balance between mapping and sequencing efforts.
The starting point for an effective sequencing method is a complete ordered clone map of a genome. Assembling the cloned genome requires ordering and linking all of the clones that make up the genomic DNA library. Mapping strategies can be “top-down” or “bottom-up”. The “top-down” strategy depends on separating, on pulsed field gels, large DNA fragments generated with rare-cutting restriction endonucleases, in order to physically link DNA markers and construct a long-range map. The “bottom-up” strategy depends on identifying overlapping sequences in a large number of randomly selected clones by unique restriction enzyme “fingerprinting” and assembling them into overlapping sets of clones. The clones are linked computationally rather than physically, and thousands of individual clones must be analyzed to generate complete maps. This process is labor intensive and expensive because the difficulties increase rapidly with larger genomes, requiring continual advances in mapping approaches, instrumentation and computational expertise (See, e.g., Venter et al., Science 280:1540, 1998). Regardless of the linking strategy, the common prior art approach relied on using as large a fragment as possible in order to minimize the number of “puzzle pieces” that had to be linked to obtain the genomic map.
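The “bottom-up” linking step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each clone's fingerprint is a list of restriction fragment sizes in base pairs, that two clones are taken to overlap when they share at least `min_shared` identical sizes, and that overlapping clones are merged into groups with a simple union-find. The clone names, fragment sizes, threshold, and function names are hypothetical; real fingerprinting tolerates measurement error in fragment sizes and scores overlaps statistically.

```python
from itertools import combinations


def shared_fragments(a, b):
    """Count fragment sizes present in both fingerprints (exact match only)."""
    return len(set(a) & set(b))


def group_into_contigs(fingerprints, min_shared=3):
    """Group clones whose fingerprints share at least min_shared fragment sizes.

    fingerprints maps a clone name to a list of fragment sizes (bp).
    Returns a sorted list of sorted clone-name lists, one per group.
    """
    parent = {name: name for name in fingerprints}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every pair of clones is compared, so the work grows with the
    # square of the number of clones in the library.
    for a, b in combinations(fingerprints, 2):
        if shared_fragments(fingerprints[a], fingerprints[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical fingerprints: A overlaps B, B overlaps C, D stands alone.
clones = {
    "A": [1200, 850, 400, 2300, 610],
    "B": [400, 2300, 610, 975, 1500],
    "C": [975, 1500, 3100, 720, 610],
    "D": [5000, 4200, 330, 1800],
}
contigs = group_into_contigs(clones)
print(contigs)
```

In this toy data, clones A and C share only one fragment size directly, yet they fall into the same group because each overlaps B. The all-against-all comparison loop is also why the analysis effort rises quickly as the library grows.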
Current strategies for ordering clones build contiguous sequences (contigs) by reassembling contiguous stretches of DNA (See, e.g., Watson, J. D. et al. (1992) Recombinant DNA, (W.H. Freeman and Company, New York), pp. 583-618) using short-range comparison data. The number of experiments needed for any short-range clone mapping approach increases with the number of clones in the library. A useful goal is to significantly reduce cost and increase throughput by making the number of required experiments largely independent of library size. For example, contigs of small genomic regions have been constructed by oligonucleotide fingerprinting of gridded cosmid filters (A. G. Craig et al., Nucleic Acids Res., 18:2653, 1990). However, complex hybridization probes generate data containing considerable noise, thus precluding high-resolution mapping of clones using this technique.
Currently, two competing strategies are being used to sequence large genomes. The clone-by-clone (CBC) strategy has produced highly accurate sequences of E. coli (Blattner et al., Science 277:1453, 1997), yeast (Goffeau et al., Science 274:546, 1996), human chromosomes 22 (Dunham et al., Nature 402:489, 1999) and 21 (Hattori et al., Nature 405:311, 2000), and a draft sequence covering about 90% of the human genome. These successes have largely benefited from the construction of sequence-ready maps. For future projects such as the mouse genome, for which map resources are relatively scarce, the advantage of this strategy will be less obvious. The whole genome shotgun (WGS) strategy involves sequencing all of the naturally occurring DNA sequences (i.e., genomic DNA) constituting the genome of an organism without prior mapping of large clones. WGS sequencing essentially involves randomly breaking DNA into segments of various sizes and cloning these fragments into vectors. The clones are sequenced from both ends, which improves the efficiency of overlap-based sequence assembly.
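The random fragmentation and end sequencing described for WGS can be sketched as a small simulation. The sketch assumes a toy genome string, a hypothetical insert size range and read length, and error-free reads; none of these values come from the text. Each simulated clone yields one read from each end of its insert, with the second read reported on the opposite strand.

```python
import random

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def shotgun_read_pairs(genome, n_clones, insert_range=(200, 400),
                       read_length=50, seed=0):
    """Simulate end reads from randomly placed inserts of varying size.

    Returns a list of (start, insert_length, forward_read, reverse_read).
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_clones):
        insert_len = rng.randint(*insert_range)
        start = rng.randint(0, len(genome) - insert_len)
        insert = genome[start:start + insert_len]
        forward = insert[:read_length]
        # The second read starts at the far end of the insert and is
        # read from the opposite strand.
        reverse = reverse_complement(insert)[:read_length]
        pairs.append((start, insert_len, forward, reverse))
    return pairs


# Toy genome of 2,000 random bases (hypothetical data).
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(2000))
read_pairs = shotgun_read_pairs(toy_genome, n_clones=20)
print(len(read_pairs), "clones,", 2 * len(read_pairs), "end reads")
```

Because both reads of a pair come from the same insert, an assembler knows their relative orientation and approximate separation, which is the information that makes paired end reads more useful for overlap assembly than unpaired reads.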
WGS obviates the need for a sequence-ready map, but relies heavily on immense computational power for assembling random shotgun reads into long continuous sequence contigs, which are finally anchored to chromosomes using other mapped sequence information. Recent success in applying this strategy to sequence the 120 Mb euchromatic portion of the Drosophila genome provides proof of principle for WGS (Adams et al., Science 287:2185, 2000). This impressive achievement does not, however, guarantee that the strategy will work on the human or mouse genome. Each is more than 20 times larger than the Drosophila genome, and the computational requirements to perform the necessary pair-wise comparisons increase approximately as the square of the size of the genome. Indeed, the reported experience with the Drosophila WGS (Myers et al., Science 287:2196, 2000) indicates that achievable computational power will not be sufficient to assemble the human genome sequence purely from random shotgun reads. Inevitably, binned sequence reads from the individual bacterial artificial chromosomes (BACs) in the public database will have to be used to anchor the whole genome shotgun reads in order to resolve ambiguities and lower the computational load.
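The quadratic growth in pair-wise comparisons can be made concrete with a rough calculation. The sketch below assumes a naive all-against-all comparison of reads, a fixed 500 bp read length, and 10-fold coverage; these values are illustrative assumptions, not figures from the text. The only inputs taken from the text are the 120 Mb Drosophila euchromatic size and the factor of 20 for a larger genome.

```python
def read_count(genome_size_bp, coverage, read_length_bp):
    """Reads needed for the given average coverage, rounded up."""
    total_bases = genome_size_bp * coverage
    return -(-total_bases // read_length_bp)  # integer ceiling division


def pairwise_comparisons(n_reads):
    """Number of distinct read pairs in a naive all-against-all comparison."""
    return n_reads * (n_reads - 1) // 2


# Assumed sequencing parameters (hypothetical, for illustration only).
COVERAGE = 10
READ_LENGTH_BP = 500

drosophila_bp = 120_000_000       # 120 Mb euchromatic portion (from the text)
larger_bp = 20 * drosophila_bp    # a genome 20 times larger

small_reads = read_count(drosophila_bp, COVERAGE, READ_LENGTH_BP)
large_reads = read_count(larger_bp, COVERAGE, READ_LENGTH_BP)

ratio = pairwise_comparisons(large_reads) / pairwise_comparisons(small_reads)
print(f"reads: {small_reads:,} vs {large_reads:,}")
print(f"comparison ratio: about {ratio:.0f}x")
```

Under these assumptions a 20-fold larger genome needs 20 times as many reads but roughly 400 times as many pair-wise comparisons. Restricting comparisons to reads binned by clone reduces the count, because each bin contributes only its own pairs.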
As can be seen from the foregoing discussion, determining the complete sequence of complex mammalian or plant genomes to a high standard of accuracy remains a considerable problem. Thus, a need exists in the art for a sequencing method that can lead to the rapid identification of genes and regulatory sequences in complex eukaryotic genomes. In particular, there is a need to reduce the amount of computing power needed to sequence complex genomes. The present invention provides methods and systems for determining the sequence of the genome of an organism or species through the use of a novel, unobvious, and highly effective clone array pooled shotgun strategy. Such sequence information can be used for finding genes of known utility, determining structure/function properties of genes and their products, elucidating metabolic networks, understanding the growth and development of humans and other organisms, and making comparisons of genetic information between species. From these studies, diagnostic tests and pharmacological agents of great utility can be developed for preventing and treating human and other diseases.