The study of molecular and cellular biology is focused on the microscopic structure of cells. We now know that cells have a complex microstructure that determines the functionality of the cell. Much of the diversity associated with cellular structure and function is due to the ability of a cell to assemble various building blocks into diverse chemical compounds. The cell accomplishes this task by assembling polymers from a limited set of building blocks referred to as monomers. The key to the diverse functionality of polymers lies in the primary sequence of the monomers within the polymer, which is integral to understanding the basis for cellular function, such as why a cell differentiates in a particular manner or how a cell will respond to treatment with a particular drug.
The ability to identify the structure of polymers by identifying their sequence of monomers is integral to the understanding of each active component and the role that component plays within a cell. By determining the sequences of polymers it is possible to generate expression maps, to determine what proteins are expressed, to understand where mutations occur in a disease state, and to determine whether a polysaccharide has better function or loses function when a particular monomer is absent or mutated.
Expression maps relate to determining mRNA expression patterns. The need to identify differentially expressed mRNAs is critical in the understanding of genetic programming, both temporally and spatially. Different genes are turned on and off during the temporal course of an organism's life development, comprising embryonic, growth, and aging stages. In addition to developmental changes, there are also temporal changes in response to varying stimuli such as injury, drugs, foreign bodies, and stress. The ability to chart expression changes for specific sets of cells over time, either in response to stimuli or during growth, allows the generation of what are called temporal expression maps. On the other hand, there are also body expression maps, which include knowledge of differentially expressed genes for different tissues and cell types. Expression maps differ not only between species and between individuals, but also between diseased and disease-free states. Examination of differential gene expression has yielded key discoveries of genes in widely varying disciplines, such as signal transduction (Smith et al., 1990), circadian rhythms (Loros et al., 1989), fruit ripening (Wilkinson et al., 1995), hunger (Qu et al., 1996), cell cycle control (el-Deiry et al., 1993), apoptosis (Woronicz et al., 1994), and ischemic injury (Wang et al., 1995), among many others. Since generation of expression maps involves the sequencing and identification of cDNA or mRNA, more rapid sequencing necessarily means more rapid generation of multiple expression maps.
Currently, only 1% of the human genome and an even smaller amount of other genomes have been sequenced. In addition, only one very incomplete human body expression map using expressed sequence tags has been achieved (Adams et al., 1995). Current protocols for genomic sequencing are slow and involve laborious steps such as cloning, generation of genomic libraries, colony picking, and sequencing. The time to create even one partial genomic library is on the order of several months. Even after the establishment of libraries, there are time lags in the preparation of DNA for sequencing and the running of actual sequencing steps. Given the multiplicative effect of these unfavorable facts, it is evident that the sequencing of even one genome requires an enormous investment of money, time, and effort.
In general, DNA sequencing is currently performed using one of two methods. The first and more popular method is the dideoxy chain termination method described by Sanger et al. (1977). This method involves the enzymatic synthesis of DNA molecules terminating in dideoxynucleotides. By using the four ddNTPs, a population of molecules terminating at each position of the target DNA can be synthesized. Subsequent analysis yields information on the length of the DNA molecules and the base at which each molecule terminates (either A, C, G, or T). With this information, the DNA sequence can be determined. The second method is Maxam and Gilbert sequencing (Maxam and Gilbert, 1977), which uses chemical degradation to generate a population of molecules degraded at certain positions of the target DNA. With knowledge of the cleavage specificities of the chemical reactions and the lengths of the fragments, the DNA sequence is generated. Both methods rely on polyacrylamide gel electrophoresis and photographic visualization of the radioactive DNA fragments. Each process takes about 1-3 days. The Sanger sequencing reactions can generate only 300-800 bases in one run.
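The readout logic of the chain termination method can be illustrated with a minimal Python sketch. It assumes an idealized reaction in which exactly one fragment terminates at every position and each fragment is recorded only as a (length, terminal base) pair; the function names `chain_termination_fragments` and `read_ladder` are hypothetical and chosen for illustration only.

```python
import random


def chain_termination_fragments(synthesized: str) -> list[tuple[int, str]]:
    """Idealized Sanger products: one fragment ending at each position.

    Each fragment is reduced to (length, terminal base), which is the
    information the gel readout provides.
    """
    return [(i + 1, base) for i, base in enumerate(synthesized)]


def read_ladder(fragments: list[tuple[int, str]]) -> str:
    """Recover the sequence by ordering fragments from shortest to longest."""
    return "".join(base for _, base in sorted(fragments))


if __name__ == "__main__":
    fragments = chain_termination_fragments("GATTACA")
    random.shuffle(fragments)      # the reaction mixture has no inherent order
    print(read_ladder(fragments))  # GATTACA
```

Sorting by length plays the role of the electrophoretic separation; the sketch ignores gel resolution, which is what limits a real run to a few hundred bases.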
Methods to improve the output of sequence information using the Sanger method also have been proposed. These Sanger-based methods include multiplex sequencing, capillary gel electrophoresis, and automated gel electrophoresis. Recently, there has also been increasing interest in developing Sanger-independent methods. Sanger-independent methods use a completely different methodology to obtain the base information. This category contains the most novel techniques, which include scanning tunneling microscopy (STM), mass spectrometry, enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA) sequencing, exonuclease sequencing, and sequencing by hybridization. A brief summary of these methods is set forth below.
Currently, automated gel electrophoresis is the most widely used method of large-scale sequencing. Automation requires reading of fluorescently labeled Sanger fragments in real time with a charge coupled device (CCD) detector. The four different dideoxy chain termination reactions are run with different labeled primers. The reaction mixtures are combined and co-electrophoresed down a slab of polyacrylamide. Using laser excitation at the end of the gel, the separated DNA fragments are resolved and the sequence determined by computer. Many automated machines are available commercially, each employing different detection methods and labeling schemes. The most efficient of these is the Applied Biosystems Model 377XL, which generates a maximum actual rate of 115,200 bases per day.
In the method of capillary gel electrophoresis, reaction samples are analyzed in small-diameter, gel-filled capillaries. The small diameter of the capillaries (50 μm) allows for efficient dissipation of heat generated during electrophoresis. Thus, high field strengths (400 V/m) can be used without excessive Joule heating, lowering the separation time to about 20 minutes per reaction run. Not only are the bases separated more rapidly, there is also increased resolution over conventional gel electrophoresis. Furthermore, many capillaries are analyzed in parallel (Woolley and Mathies, 1995), multiplying the base information generated (the actual rate is 200,000 bases/day). The main drawback is that the capillaries cannot be loaded continuously, since a new gel-filled capillary tube must be prepared for each reaction. Capillary gel electrophoresis machines have recently been commercialized.
Multiplex sequencing is a method that uses electrophoretic gels more efficiently (Church and Kieffer-Higgins, 1988). Sanger reaction samples are first tagged with unique oligomers and then up to 20 different samples are run on one lane of the electrophoretic gel. The samples are then blotted onto a membrane. The membrane is then sequentially probed with oligomers that correspond to the tags on the Sanger reaction samples. The membrane is washed and reprobed successively until the sequences of all 20 samples are determined. Even though there is a substantial reduction in the number of gels run, the washing and hybridizing steps are as laborious as running electrophoretic gels. The actual sequencing rate is comparable to that of automated gel electrophoresis.
Sequencing by mass spectrometry was first introduced in the late 1980s. Recent developments in the field have allowed for better sequence determination (Crain, 1990; Little et al., 1994; Keough et al., 1993; Smirnov et al., 1996). Mass spectrometry sequencing first entails creating a population of nested DNA molecules that differ in length by one base. Subsequent analysis of the fragments is performed by mass spectrometry. In one example, an exonuclease is used to partially digest a 33-mer (Smirnov et al., 1996). A population of molecules with similar 5′ ends and varying points of 3′ termination is generated. The reaction mixture is then analyzed. The mass spectrometer is sensitive enough to distinguish mass differences between successive fragments, allowing sequence information to be generated.
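The way mass differences between successive nested fragments translate into base calls can be sketched as follows. This is an illustrative Python sketch only: the residue masses are approximate average values for nucleotide residues within a DNA chain, the `end_group_mass` offset is a placeholder, and the function names `ladder_masses` and `call_bases` are hypothetical.

```python
# Approximate average masses (Da) of nucleotide residues within a DNA chain.
# Values are rounded and used here for illustration only.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def ladder_masses(seq: str, end_group_mass: float = 0.0) -> list[float]:
    """Masses of nested fragments sharing a 5' end: seq[:1], seq[:2], ..."""
    masses, total = [], end_group_mass
    for base in seq:
        total += RESIDUE_MASS[base]
        masses.append(total)
    return masses


def call_bases(masses: list[float], tolerance: float = 1.0) -> str:
    """Assign a base to each mass difference between successive fragments."""
    ordered = sorted(masses)
    calls = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        base = min(RESIDUE_MASS, key=lambda b: abs(RESIDUE_MASS[b] - delta))
        if abs(RESIDUE_MASS[base] - delta) > tolerance:
            raise ValueError(f"unassignable mass difference {delta:.2f}")
        calls.append(base)
    return "".join(calls)


if __name__ == "__main__":
    # The shortest fragment is not itself identified by a difference,
    # so the read starts at the second base.
    print(call_bases(ladder_masses("GATTACA")))  # ATTACA
```

The smallest gap between two residue masses in this table is about 9 Da (A versus T), which indicates the mass resolution the instrument must maintain across the whole ladder.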
Mass spectrometry sequencing is highly accurate, inexpensive, and rapid compared to conventional methods. The major limitation, however, is that the read length is on the order of tens of bases. Even the best method, matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry (Smirnov et al., 1996), can only achieve maximum read lengths of 80-90 base pairs. Much longer read lengths are physically impossible due to fragmentation of longer DNA at guanines during the analysis step. Mass spectrometry sequencing is thus limited to verifying short primer sequences and has no practical application in large-scale sequencing.
The scanning tunneling microscope (STM) sequencing method (Ferrell, 1997) was conceived when the STM became commercially available. The initial promise of being able to read base-pair information directly from the micrographs no longer holds true. DNA molecules must be placed on conducting surfaces, which are usually highly ordered pyrolytic graphite (HOPG) or gold. These lack the binding sites to hold DNA strongly enough to resist removal by the physical and electronic forces exerted by the tunneling tip. With difficulty, DNA molecules can be electrostatically adhered to the surfaces. Even with successful immobilization of the DNA, it is difficult to distinguish base information because of the extremely high resolutions needed. With current technology, purines can be distinguished from pyrimidines, but the individual purines and pyrimidines cannot be identified. Achieving this feat would require the microscope to distinguish between aldehyde and amine groups on the purines and to detect the presence or absence of methyl groups on the pyrimidines.
Enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA) sequencing uses the detection of pyrophosphate release from DNA polymerization to determine the addition of successive bases. The pyrophosphate released by the DNA polymerization reaction is converted to ATP by ATP sulfurylase, and the ATP production is monitored continuously by firefly luciferase. To determine base specificity, the method uses successive washes of dATP, dCTP, dGTP, and dTTP. If a wash for dATP generates pyrophosphate, one or more adenines are incorporated. The number of incorporated bases is directly proportional to the amount of pyrophosphate generated. The amount of sequence information generated can be increased by analyzing many ELIDA reactions in parallel.
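The decoding step described above, in which each wash yields a signal proportional to the number of bases incorporated, can be sketched in Python. The sketch assumes noise-free signals expressed directly as integer incorporation counts and a fixed wash order of A, C, G, T; the names `DISPENSE_ORDER`, `simulate_signals`, and `decode_signals` are hypothetical.

```python
DISPENSE_ORDER = "ACGT"  # assumed order of the nucleotide washes in each cycle


def simulate_signals(synthesized: str, cycles: int) -> list[int]:
    """Idealized signal per wash: number of bases incorporated in that wash."""
    signals, pos = [], 0
    for _ in range(cycles):
        for base in DISPENSE_ORDER:
            run = 0
            # A wash extends the strand through an entire homopolymer run.
            while pos < len(synthesized) and synthesized[pos] == base:
                run += 1
                pos += 1
            signals.append(run)
    return signals


def decode_signals(signals: list[int]) -> str:
    """Rebuild the synthesized strand from the per-wash incorporation counts."""
    return "".join(
        DISPENSE_ORDER[i % len(DISPENSE_ORDER)] * count
        for i, count in enumerate(signals)
    )


if __name__ == "__main__":
    signals = simulate_signals("GGATTC", cycles=3)
    print(signals)                  # [0, 0, 2, 0, 1, 0, 0, 2, 0, 1, 0, 0]
    print(decode_signals(signals))  # GGATTC
```

In this example twelve washes are spent to read six bases, which shows how quickly the number of washes grows with read length.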
The main disadvantage is the short read length. Ronaghi et al. (1996) have achieved a maximum read length of only 15 bases because of the multiple washings needed. Since there are four washes per base read, a total of 400 washes must be performed for a read length of 100 bases. If there is even a 1% loss of starting material for each wash, after 400 washes only 1.8% of the starting material would remain, which is insufficient for detection.
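The 1.8% figure follows from compounding the per-wash loss over all washes. A short Python check, with the hypothetical helper name `fraction_remaining`, reproduces the arithmetic:

```python
def fraction_remaining(loss_per_wash: float, washes: int) -> float:
    """Fraction of starting material left after repeated washes."""
    return (1.0 - loss_per_wash) ** washes


if __name__ == "__main__":
    washes = 4 * 100  # four washes per base, 100-base read
    print(f"{fraction_remaining(0.01, washes):.1%}")  # 1.8%
```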
Exonuclease sequencing involves a fluorescently labeled, single-stranded DNA molecule which is suspended in a flowing stream and sequentially cleaved by an exonuclease. Individual fluorescent bases are then released and passed through a single molecule detection system. The temporal sequence of labeled nucleotide detection corresponds to the sequence of the DNA (Ambrose et al., 1993; Davis et al., 1992; Jett et al., 1989). Using a processive exonuclease, it theoretically is possible to sequence 10,000 bp or larger fragments at a rate of 10 bases per second.
In practice, exonuclease sequencing has encountered many difficulties in each of the steps. The labeling step requires that all four bases in the DNA be tagged with different fluorophores. Sterically, this is extremely unfavorable. Ambrose et al. (1993) have achieved complete labeling of two bases on a 7 kb strand of M13 DNA. Furthermore, difficult optical trapping is needed to suspend DNA molecules in a flowing stream. The step is time intensive and requires considerable expertise. Lastly, single molecules of fluorophore need to be detected with high efficiency. Even a 1% error is significant. Improvements in detection from 65% to 95% efficiency have been achieved. The efficiency of detection has been pushed to the limit and it would be difficult to achieve further improvements.
In the sequencing by hybridization method, a target DNA is sequentially probed with a set of oligomers consisting of all the possible oligomer sequences. The sequence of the target DNA is generated with knowledge of the hybridization patterns between the oligomers and the target (Bains, 1991; Cantor et al., 1992; Drmanac et al., 1994). There are two possible methods of probing target DNA. The “Probe Up” method includes immobilizing the target DNA on a substrate and probing successively with a set of oligomers. “Probe Down” on the other hand requires that a set of oligomers be immobilized on a substrate and hybridized with the target DNA. With the advent of the “DNA chip,” which applies microchip synthesis techniques to DNA probes, arrays of thousands of different DNA probes can be generated on a 1 cm² area, making Probe Down methods more practical. Probe Up methods would require, for an 8-mer, 65,536 successive probes and washings, which would take an enormous amount of time. On the other hand, Probe Down hybridization generates data in a few seconds. With perfect hybridization, 65,536 octamer probes would determine a maximum of 170 bases. With 65,536 “mixed” 11-mers, 700 bases can be generated.
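The figure of 65,536 probes is simply the number of distinct 8-mers over a four-letter alphabet (4^8). The following Python sketch enumerates a complete probe set and computes the set of probes that a given target would hybridize to under perfect hybridization; the function names `probe_count`, `all_probes`, and `spectrum` are hypothetical.

```python
from itertools import product


def probe_count(k: int) -> int:
    """Number of distinct oligomers of length k over A, C, G, T."""
    return 4 ** k


def all_probes(k: int):
    """Lazily generate every possible k-mer probe."""
    return ("".join(p) for p in product("ACGT", repeat=k))


def spectrum(target: str, k: int) -> set[str]:
    """k-mers present in the target, i.e. the ideal hybridization pattern."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


if __name__ == "__main__":
    print(probe_count(8))                 # 65536
    print(sorted(spectrum("ACGTAC", 3)))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

The spectrum is written here as k-mers of the target itself; on an actual array the immobilized probes would be the complementary sequences.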
In practice, Probe Up methods have been used to generate sequences of about 100 base pairs. Imperfect hybridization has led to difficulties in generating adequate sequence. Error in hybridization is amplified many times. A 1% error rate reduces the maximum length that can be sequenced by at least 10%. Thus, if 1% of 65,536 oligonucleotides gave false-positive hybridization signals when hybridizing to a 200-mer DNA target, 75% of the scored “hybridizations” would be false (Bains, 1997). Sequence determination would be impossible in such an instance. The conclusion is that hybridization must be extremely accurate in order to generate reasonable data. Furthermore, sequencing by hybridization also encounters problems when the target contains repeats that are one base shorter than the probe. When such sequences are present, multiple possible sequences are compatible with the hybridization data.
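Both limitations can be checked with a small Python sketch. The first function estimates the share of scored signals that are false, under the simplifying assumption that every k-mer in the target is distinct and hybridizes correctly; this simple estimate comes out near the roughly 75% figure cited. The second part shows two different sequences that share the same 3-mer spectrum because the 2-base repeat "AC" occurs three times. The function names and the example sequences are hypothetical.

```python
def false_fraction(n_probes: int, false_positive_rate: float,
                   target_length: int, k: int) -> float:
    """Share of scored hybridizations that are false positives."""
    false_hits = false_positive_rate * n_probes
    true_hits = target_length - k + 1  # assumes all target k-mers are distinct
    return false_hits / (false_hits + true_hits)


def spectrum(target: str, k: int) -> set[str]:
    """k-mers present in the target, i.e. the ideal hybridization pattern."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


if __name__ == "__main__":
    print(f"{false_fraction(65536, 0.01, 200, 8):.0%}")  # 77%

    # Repeat of length k - 1 = 2 ("AC"): the segments between the repeats
    # can be swapped without changing the set of 3-mers observed.
    first, second = "ACTACGAC", "ACGACTAC"
    print(spectrum(first, 3) == spectrum(second, 3))  # True
```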
The most common limitation of most of these techniques is a short read length. In practice, a short read length means that additional sequence must be determined before the linear order of a target DNA can be deciphered. The short fragments have to be bridged together with additional overlapping fragments. Theoretically, with a 500 base read length, a minimum of 9×10⁹ bases need to be sequenced before the linear sequence of all 3×10⁹ bases of the human genome is properly ordered. In reality, the number of bases needed to generate a believable genome is approximately 2×10¹⁰ bases. Comparisons of the different techniques show that only the impractical exonuclease sequencing has the theoretical capability of long read lengths. The other methods have short theoretical read lengths and even shorter realistic read lengths. To reduce the number of bases that need to be sequenced, it is clear that the read length must be improved.
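The scale of these figures can be made concrete with a short Python calculation of the implied redundancy, the number of 500-base reads, and the instrument time at the 115,200 bases per day quoted earlier for automated gel electrophoresis. The helper names are hypothetical, and the calculation assumes a single instrument running continuously.

```python
GENOME_BASES = 3 * 10**9  # human genome size used in the text
READ_LENGTH = 500         # bases per read


def fold_coverage(bases_sequenced: int, genome: int = GENOME_BASES) -> float:
    """How many times over the genome is sequenced."""
    return bases_sequenced / genome


def reads_needed(bases_sequenced: int, read_length: int = READ_LENGTH) -> int:
    """Number of reads required (ceiling division)."""
    return -(-bases_sequenced // read_length)


def instrument_days(bases_sequenced: int, bases_per_day: int) -> float:
    """Days of continuous operation for one instrument."""
    return bases_sequenced / bases_per_day


if __name__ == "__main__":
    print(fold_coverage(9 * 10**9))                      # 3.0
    print(f"{fold_coverage(2 * 10**10):.1f}")            # 6.7
    print(reads_needed(2 * 10**10))                      # 40000000
    print(round(instrument_days(2 * 10**10, 115_200)))   # 173611
```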
Protein sequencing generally involves chemically induced sequential removal and identification of the terminal amino acid residue, e.g., by Edman degradation. See Stryer, L., Biochemistry, W. H. Freeman and Co., San Francisco (1981) pp. 24-27. Edman degradation requires that the polypeptide have a free amino group, which is reacted with an isothiocyanate. The isothiocyanate is typically phenyl isothiocyanate. The adduct intramolecularly reacts with the nearest backbone amide group of the polymer, thereby forming a five-membered ring. This adduct rearranges and the terminal amino acid residue is then cleaved using strong acid. The released phenylthiohydantoin (PTH) of the amino acid is identified and the shortened polymer can undergo repeated cycles of degradation and analysis.
Further, several new methods have been described for carboxy terminal sequencing of polypeptides. See Inglis, A. S., Anal. Biochem. 195:183-96 (1991). Carboxy terminal sequencing methods mimic Edman degradation but involve sequential degradation from the opposite end of the polymer. Like Edman degradation, the carboxy terminal sequencing methods involve chemically induced sequential removal and identification of the terminal amino acid residue.
More recently, polypeptide sequencing has been described by preparing a nested set (sequence defining set) of polymer fragments followed by mass analysis. See Chait, B. T. et al., Science 257:1885-94 (1992). Sequence is determined by comparing the relative mass difference between fragments with the known masses of the amino acid residues. Though formation of a nested (sequence defining) set of polymer fragments is a requirement of DNA sequencing, this method differs substantially from the conventional protein sequencing method consisting of sequential removal and identification of each residue. Although this method has potential, in practice it has encountered several problems and has not been demonstrated to be an effective method.
Each of the known methods for sequencing polymers has drawbacks. For instance, most of the methods are slow and labor intensive. The gel-based DNA sequencing methods require approximately 1 to 3 days to identify the sequence of 300-800 units of a polymer. Methods such as mass spectrometry and ELIDA sequencing can only be performed on very short polymers.
An enormous need exists for de novo polymer sequence determination. The rate of sequencing has limited the capability to generate multiple body and temporal expression maps, which would undoubtedly aid the rapid determination of complex genetic function. A need also exists for improved methods for analyzing polymers in order to speed up the rate at which diagnosis of diseases and preparation of new medicines is carried out.