The ability to determine nucleotide sequences has had enormous impact on biology, medicine and biotechnology. An appreciation of the benefits of knowing the nucleotide sequences of genes, chromosomes, and entire genomes has led to the current proposals to determine the nucleotide sequence of the human genome and the genomes of other well studied or economically important organisms.
Cloning and mapping specific DNA fragments is an important part of the strategy for sequencing large genomes. The entire human genome of about 3×10⁹ base pairs could be contained in a set of about 100,000 cosmids, each of which contains about 40,000 or more base pairs of human DNA. Even larger segments can be cloned in yeast artificial chromosomes. The genome sequencing problem then reduces to the problem of sequencing a large number of DNAs of 40,000 or more base pairs. Such an enterprise represents a tremendous increase in scale over the most ambitious sequencing projects that have been undertaken heretofore. If cosmids were sequenced at the rate of one a day, a formidable task for a sequencing center using today's technology, centuries would be required to complete the task.
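The scale quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, the figures are the approximate ones given in the text, and a 365-day working year with no downtime is assumed.

```python
# Back-of-the-envelope check of the cosmid numbers quoted in the text.
# All names here are illustrative; the figures are the approximate
# values from the passage above.

GENOME_BP = 3_000_000_000   # about 3×10⁹ base pairs (human genome)
COSMID_INSERT_BP = 40_000   # about 40,000 bp of human DNA per cosmid
COSMID_SET_SIZE = 100_000   # size of the cosmid set quoted in the text


def min_cosmids(genome_bp: int, insert_bp: int) -> int:
    """Fewest non-overlapping inserts that could tile the genome."""
    return -(-genome_bp // insert_bp)  # ceiling division


def years_to_sequence(n_cosmids: int, cosmids_per_day: float = 1.0,
                      days_per_year: int = 365) -> float:
    """Years needed at a fixed daily rate (assumes no downtime)."""
    return n_cosmids / cosmids_per_day / days_per_year


print(min_cosmids(GENOME_BP, COSMID_INSERT_BP))   # 75000
print(round(years_to_sequence(COSMID_SET_SIZE)))  # 274
```

A perfect tiling would need 75,000 cosmids, so the quoted set of about 100,000 leaves room for overlap between neighboring clones; at one cosmid a day that set corresponds to roughly 274 years, consistent with "centuries".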
Currently useful methods for determining nucleotide sequence involve generating nucleic acid fragments having defined ends and resolving them according to size, using gel electrophoresis. These defined fragments are produced chemically (Maxam & Gilbert, Proc. Nat. Acad. Sci. USA, 74, 560-564 (1977); Methods in Enzymology 65, 499-560 (1980)), enzymatically (Sanger et al., Proc. Nat. Acad. Sci. USA 74, 5463-5467 (1977)), or by some combination of the two, and are typically identified in electrophoresis patterns by radioactivity, fluorescence or chemical reactivity.
The enzymatic sequencing technique has been highly developed, and several different DNA polymerases and reverse transcriptases are used for this purpose. These enzymes can be used for sequencing double-stranded or single-stranded DNA or RNA. Oligonucleotide primers direct DNA synthesis from a specific site in the molecule, which generates the common end needed for sequence analysis. The variable end is typically generated by incorporation of specific chain terminators, such as dideoxynucleotide triphosphates, or by incorporation of nucleoside triphosphate derivatives and subsequent cleavage of the molecule at the site of incorporation.
Specific priming is critical for the success of the enzymatic sequencing technique. Much is known about the specific association between oligonucleotides and longer nucleic acids, and about the ability of specifically associated oligonucleotides to prime DNA synthesis by the enzymes used for nucleotide sequencing (for example, M. Smith, in "Methods of DNA and RNA Sequencing", edited by S.M. Weissman, Praeger Publishers, New York, pp 23-68, 1983). Oligonucleotides as short as three or four bases have been reported to prime DNA synthesis, and a mixture of hexamers is widely used to prime random DNA synthesis for labeling hybridization probes. Oligonucleotides of length 6 or longer are useful for priming specific sequencing reactions.
In practice, blocks of nucleotide sequence up to several hundred but rarely as long as a thousand nucleotides can be determined from the products of a single sequencing reaction or set of reactions. Cosmid DNAs, and in fact most nucleic acids of interest, are much longer than the few hundred nucleotides that constitute the basic unit of sequence determination. Therefore, a substantial part of the effort involved in sequencing genes or genomes, or almost any nucleic acid, must be devoted to obtaining and assembling the many individual blocks of a few hundred nucleotides of sequence that make up the entire nucleic acid to be sequenced. If an average of 500 nucleotides of sequence could be obtained in each analysis, a minimum of 160 sets of sequencing reactions would have to be prepared and analyzed to obtain the sequence of both strands of one cosmid DNA.
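The figure of 160 follows directly from the numbers in the text, as the short sketch below shows. The function name is hypothetical, and the calculation assumes perfectly abutting blocks with no overlap, which is why it gives a minimum rather than a realistic count.

```python
def min_reaction_sets(length_bp: int, block_bp: int = 500,
                      strands: int = 2) -> int:
    """Minimum sets of sequencing reactions to read every base of each
    strand once, assuming blocks abut exactly with no overlap."""
    blocks_per_strand = -(-length_bp // block_bp)  # ceiling division
    return strands * blocks_per_strand


# One cosmid of about 40,000 bp, 500 nucleotides per analysis, both strands.
print(min_reaction_sets(40_000))  # 160
```

Any overlap needed to join adjacent blocks, or any repeated analysis, raises the count above this minimum.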
Several strategies have been developed for obtaining and ordering the many individual blocks of sequence needed to determine the entire sequence of larger molecules. One strategy is to use restriction enzymes to obtain and map specific fragments of the DNA molecule. The nucleotide sequences of appropriate fragments are determined, and the sequence of the entire molecule is assembled from the known positions of the fragments. An example of the use of this strategy is the determination of the sequence of T7 DNA, a double-stranded molecule about 40,000 bp long (Dunn & Studier, J. Mol. Biol. 166, 477-535 (1983)). However, such a strategy is too labor intensive to be economical for sequencing large numbers of DNA molecules.
A more typical strategy is to subclone random fragments of the DNA into a cloning vector, typically derived from M13. The sequence of the cloned DNA is usually determined by the enzymatic sequencing technique, starting from a unique priming site within the vector DNA. Randomly selected subclones are sequenced, and the sequence of the original DNA is reconstructed from overlaps among the many blocks of sequence obtained from the different subclones. The sequence of lambda DNA, about 48,500 bp long (Sanger et al., J. Mol. Biol. 162, 729-773 (1982)), was determined by extensive use of such a strategy. Although relatively efficient in the early stages, a random cloning strategy becomes highly redundant in the later stages. In a purely random strategy, perhaps ten times the minimum possible number of sequence analyses may have to be done before all of the blocks of sequence can be overlapped. In practice, labor intensive mapping techniques are often used to close gaps.
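The redundancy of a purely random strategy can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not part of the text: block start positions are uniformly random, every block is exactly 500 nucleotides, only one strand is considered, the molecule is treated as circular to avoid end effects, and a base counts as done once any block covers it (no minimum overlap is required for joining blocks). All names are hypothetical.

```python
import random


def minimum_reads(length_bp: int = 40_000, read_len: int = 500) -> int:
    """Fewest blocks that could cover one strand if placed end to end."""
    return -(-length_bp // read_len)  # ceiling division


def reads_until_covered(length_bp: int = 40_000, read_len: int = 500,
                        seed: int = 0) -> int:
    """Draw random blocks until every position is covered at least once.

    The molecule is modeled as circular (a simplifying assumption) so
    that positions near the ends are as likely to be hit as any other.
    """
    rng = random.Random(seed)
    covered = bytearray(length_bp)  # 0 = not yet covered, 1 = covered
    remaining = length_bp
    reads = 0
    while remaining:
        start = rng.randrange(length_bp)
        reads += 1
        for offset in range(read_len):
            pos = (start + offset) % length_bp
            if not covered[pos]:
                covered[pos] = 1
                remaining -= 1
    return reads


# Compare a few random trials with the theoretical minimum.
floor = minimum_reads()
for seed in range(3):
    needed = reads_until_covered(seed=seed)
    print(f"seed {seed}: {needed} random blocks vs. minimum {floor} "
          f"({needed / floor:.1f}x)")
```

The ratio printed for each trial gives a rough sense of the redundancy; requiring real overlaps between blocks, or sequencing both strands, would push it higher.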
Modifications have improved the efficiency of random cloning strategies. The length of continuous sequence that can be generated from a single priming site in a cloning vector can be extended considerably by generating sets of nested deletions that bring different portions of the DNA close to the priming site (Barnes, Methods in Enzymology, 152, 538-(1987)). However, this remains relatively labor intensive for a large scale sequencing effort. Multiplexing improves sequencing efficiency by allowing a single gel electrophoresis pattern to be probed repeatedly to determine the sequence of many different cloned DNAs (Church & Kieffer-Higgins, Science 240, 185-188 (1988)). However, all subcloning strategies suffer from the necessity to prepare many different clones and isolate DNA from each of them, an effort that will typically be comparable to that required to do the sequence analyses themselves.
A directed priming, or "walking" strategy allows the sequence to be determined directly from a nucleic acid molecule of interest without mapping or subcloning, a considerable savings in effort. To use directed priming, at least a small portion of the nucleotide sequence in the molecule must be known or determined in some other way. This known sequence information is used to synthesize a primer for enzymatic sequencing reactions that will extend the sequence into the unknown region. Such primers are synthesized by well known techniques or can be purchased commercially and are typically at least 16 nucleotides long, so as to be unique in the entire molecule. In order to continue extending the sequence further along the molecule, a new primer must be synthesized for every few hundred nucleotides of sequence obtained. Although a directed priming strategy eliminates the considerable effort needed for mapping or subcloning, the cost of primers nevertheless makes the directed priming strategy very expensive for large scale sequencing.
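The link between primer length and uniqueness can be made concrete with a rough estimate. The sketch below assumes an idealized target in which the four bases occur independently and with equal frequency, which real DNA does not strictly satisfy; the function names are hypothetical.

```python
def expected_chance_sites(primer_len: int, target_bp: int,
                          strands: int = 2) -> float:
    """Expected number of exact matches to one given primer arising by
    chance, assuming independent, equally frequent bases (idealized)."""
    return strands * target_bp / 4 ** primer_len


def shortest_unique_length(target_bp: int, threshold: float = 1.0) -> int:
    """Smallest primer length whose expected chance matches fall below
    the threshold, under the same idealized assumption."""
    n = 1
    while expected_chance_sites(n, target_bp) >= threshold:
        n += 1
    return n


print(4 ** 16)                               # 4294967296 possible 16-mers
print(expected_chance_sites(16, 40_000))     # about 1.9e-05 for one cosmid
print(shortest_unique_length(40_000))        # 9
```

Under these assumptions a 16-nucleotide primer has a very large margin of safety on a cosmid-sized molecule, while considerably shorter primers would already be expected to be unique on average.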
The recently described polymerase chain reaction (PCR) for amplifying specific segments of a DNA molecule is also being used to prepare samples for sequencing (Saiki et al., Science 239, 487-491 (1988); Stoflet et al., Science 239, 491-494 (1988)). This technique can eliminate the subcloning steps, and the PCR primers themselves can be used as primers for sequencing by the enzymatic technique. However, the use of this technique requires knowledge of the nucleotide sequence flanking the region to be amplified, information that is generally not available at the outset, and the cost of primers would be comparable to that for the directed priming strategy.
Although determination of nucleotide sequences has become routine, high volume sequencing is still a difficult problem. The need for methods that allow more efficient high volume sequencing is widely recognized and is being addressed in various ways. Machines are being developed to carry out sequencing reactions and to automate DNA sample preparation and collection of data. Completely novel sequencing methods that do not require resolution of DNA fragments by gel electrophoresis are also being explored. For example, Drmanac et al. (Genomics 4, 114-128 (1989)) have proposed a method based on the pattern of hybridization of oligonucleotides to the DNA to be sequenced. However, these initiatives have not yet had a practical impact.
The current state of the art in high volume sequencing was summarized in a brief report in Science (242, 1245, Dec. 2, 1988). Bart Barrell and Ellson Chen, whose laboratories have led the way in high volume sequencing and had sequenced the largest contiguous stretches of DNA at that time, reportedly concluded that the current technology realistically allows one skilled technician to sequence about 50,000 bases a year, and even that output is difficult to sustain. This rate of sequencing is still far short of the capacity needed for projects like sequencing the human genome.
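The shortfall can be put in numbers with a one-line estimate. The sketch below uses the figures quoted in the text (about 50,000 bases per technician per year, about 3×10⁹ base pairs) and assumes, optimistically, that each base needs to be determined only once; the function name is hypothetical.

```python
def technician_years(genome_bp: int, bases_per_year: int = 50_000) -> float:
    """Technician-years to determine every base once at a fixed rate."""
    return genome_bp / bases_per_year


print(technician_years(3_000_000_000))  # 60000.0
```

Sequencing both strands or allowing for any redundancy would multiply this figure accordingly.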