Several platforms for high-throughput DNA sequencing have recently emerged (see Shendure & Ji (2008) Nat Biotechnol 26, 1135-45 for a comprehensive review). The most notable include the Roche/454 Genome FLX (Margulies et al. (2005) Nature 437, 376-80; Rothberg & Leamon (2008) Nat Biotechnol 26, 1117-24), the Illumina/Solexa Genome Analyzer (Fedurco et al. (2006) Nucleic Acids Res. 34, e22; Turcatti et al. (2008) Nucleic Acids Res 36, e25), the Helicos Heliscope (Harris, T. D. et al. (2008) Science 320, 106-9) and the Life Technologies/ABI SOLiD system (Shendure et al. (2005) Science 309, 1728-32; Cloonan et al. (2008) Nat Methods 5, 613-9; Valouev et al. (2008) Genome Res 18, 1051-63). With these streamlined technologies, the cost of genome resequencing has been dramatically reduced. Except for the SOLiD system, which is based on DNA sequencing by ligation, these platforms are based on the DNA sequencing by synthesis (SBS) method, in which the DNA sequence is determined by the cyclic addition of nucleotide bases, one base type at a time, using either natural nucleotides or fluorescently labeled nucleotides with a reversible terminator.
The Illumina/Solexa system utilizes SBS with reversible terminators to sequence clonal DNA clusters amplified by in situ bridge PCR. All of the nucleotides in this method are fluorescently labeled and blocked at the 3′ hydroxyl group with a reversible termination group. The sequence of the template is interrogated one base at a time by performing cyclic single base extension (i.e., each of the four nucleotide bases in sequence) from a primer. The 3′-OH blocking group and fluorescent label are cleaved before the next cycle. Even though the technology is more streamlined and scalable and has a higher throughput per run, sequence read length is quite limited and accuracy is low. The limited read length is most likely due to the use of a relatively low number of templates (~1,000 copies), highly engineered non-natural DNA polymerases, and non-natural nucleotides having a cleavable fluorescent dye on the base and a reversible terminator on the 3′ hydroxyl group. The step-wise incorporation of such sterically hindered non-natural nucleotides is slow and inefficient, even with DNA polymerases engineered to work with these nucleotides. As a result, a significant fraction of the templates falls out of synchronization in each cycle. Various strategies can be used to improve read length and accuracy. These include the use of a better combination of DNA polymerase and fluorescently labeled nucleotides with a cleavable terminator to improve incorporation efficiency, and the addition of a resynchronization step (Wu et al. (2007) Proc Natl Acad Sci USA 104, 16462-7).
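The effect of templates falling out of synchronization on read length can be illustrated with a minimal sketch. It assumes, purely for illustration, that a constant fraction of the still-synchronized templates in a cluster is lost in every cycle and that lost templates never recover (that is, no resynchronization step). The function names and the loss rates of 1% and 2% are hypothetical and are not taken from any of the cited platforms.

```python
def in_phase_fraction(per_cycle_loss: float, cycles: int) -> float:
    """Fraction of templates still synchronized after `cycles` cycles.

    Assumption: each cycle loses the same fraction `per_cycle_loss`
    of the templates that are still in phase.
    """
    if not 0.0 <= per_cycle_loss <= 1.0:
        raise ValueError("per_cycle_loss must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return (1.0 - per_cycle_loss) ** cycles


def max_cycles(per_cycle_loss: float, min_in_phase: float) -> int:
    """Largest number of cycles that keeps the in-phase fraction
    at or above `min_in_phase` under the same assumption."""
    if not 0.0 < per_cycle_loss < 1.0:
        raise ValueError("per_cycle_loss must be strictly between 0 and 1")
    if not 0.0 < min_in_phase <= 1.0:
        raise ValueError("min_in_phase must be in (0, 1]")
    cycles = 0
    fraction = 1.0
    # Keep adding cycles while the next cycle would still meet the threshold.
    while fraction * (1.0 - per_cycle_loss) >= min_in_phase:
        fraction *= 1.0 - per_cycle_loss
        cycles += 1
    return cycles


if __name__ == "__main__":
    for loss in (0.01, 0.02):  # hypothetical per-cycle loss rates
        print(loss, max_cycles(loss, min_in_phase=0.5))
```

Under this simplified model, halving the per-cycle loss roughly doubles the number of cycles available before half of the templates are out of phase, which is why improved incorporation efficiency translates directly into longer reads.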
The Roche/454 platform makes use of natural DNA polymerases and nucleotides. It is based on pioneering pyrosequencing technology (Ronaghi et al. (1998) Science 281, 363, 365), in which the sequence is determined by detecting the chemiluminescence signal generated by a cascade of enzymes triggered by the pyrophosphate released upon nucleotide incorporation by a DNA polymerase. Since natural DNA polymerases have intrinsically high fidelity and can synthesize long, diverse DNA sequences, including homopolymer stretches (sequences with more than one base of the same type in tandem), long read lengths and high accuracy can be achieved with this technology. Read length has been improved from about 100 bases to more than 400 bases with high accuracy (Rothberg & Leamon (2008) Nat Biotechnol 26, 1117-24; Mashayekhi & Ronaghi (2007) Anal Biochem 363, 275-87). However, pyrosequencing involves a complex multi-enzyme cascade (polymerase, sulfurylase and luciferase) in which pyrophosphate is generated and then converted into a light signal. This reduces detection sensitivity, which in turn necessitates the use of a large number of templates (>1 million) and a large volume of reagents. To limit cross talk due to diffusion, large wells and an expensive high-density CCD camera coupled to an etched fiber optic plate are utilized for real-time signal detection. This limits the scalability and throughput of the system.
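The way a sequence, including homopolymer stretches, can be read from per-flow light signals may be sketched as follows. This is an idealized illustration only: it assumes that the signal in each flow is proportional to the number of bases incorporated, that one nucleotide type is flowed at a time in a fixed repeating order (the order "TACG" used here is an assumption), and that the sequence given is that of the strand being synthesized. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def ideal_flowgram(synthesized: str, flow_order: str = "TACG") -> List[int]:
    """Noise-free signal per flow: the number of bases incorporated
    when each nucleotide type is flowed in turn."""
    if any(base not in flow_order for base in synthesized):
        raise ValueError("sequence contains a base missing from flow_order")
    signals: List[int] = []
    pos = 0
    while pos < len(synthesized):
        for base in flow_order:
            run = 0
            # A homopolymer stretch is incorporated within a single flow.
            while pos < len(synthesized) and synthesized[pos] == base:
                run += 1
                pos += 1
            signals.append(run)
            if pos == len(synthesized):
                break
    return signals


def decode_flowgram(signals: Sequence[float], flow_order: str = "TACG") -> str:
    """Call bases by rounding each flow signal to the nearest
    non-negative integer homopolymer length."""
    bases = []
    for i, signal in enumerate(signals):
        count = max(0, math.floor(signal + 0.5))
        bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)


if __name__ == "__main__":
    print(ideal_flowgram("TTAGGC"))
    # Hypothetical noisy measurements of the same flows.
    print(decode_flowgram([2.1, 0.9, 0.1, 1.8, 0.05, 0.0, 1.2]))
```

In practice the signal is not perfectly proportional to homopolymer length, so simple rounding of this kind becomes less reliable as the stretch gets longer.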
Through massive parallelization and miniaturization, the throughput of DNA sequencing has been increased tremendously, while the cost of sequencing has been reduced by several orders of magnitude compared to conventional gel- or capillary-based sequencers using the Sanger dideoxy sequencing method. Emerging sequencing platforms seek to increase the throughput and reduce the cost of DNA sequencing even further, to deliver the so-called $1000 genome sequencing technology (Rothberg, J. M. and Leamon, J. H., Nat Biotechnol, 26:1117-1124 (2008); Schloss, J. A., Nat Biotechnol, 26:1113-1115 (2008); Shendure, J. and Ji, H., Nat Biotechnol, 26:1135-1145 (2008)).
Despite this recent progress, further improvements are still needed. Sequencing a mammalian-sized genome remains a time-consuming and expensive endeavor, with costs ranging from 1 to 10 million dollars per genome at 7-fold coverage (National Human Genome Research Institute, Revolutionary Genome Sequencing Technologies—The $1000 Genome (R01), available on the World Wide Web at genome.gov/10000368). Applications that require the sequencing of many individual human genomes are not practical without the development of faster and cheaper sequencing technology. The genomic sequences of normal, neoplastic, and malignant cells from a large number of individuals will be needed for comparative genomics and association studies to dissect the genetic basis of cancer and of complex traits and diseases, and to enable personalized medicine.
The present invention provides improved methods for sequencing genetic materials, e.g., for medical applications and biomedical research. The disclosed methods can be applied to rapid personalized medicine, genetic diagnosis, pathogen identification, and genome sequencing for any species in the biosphere.