A number of different approaches for sequencing nucleic acids exist. The traditional methods are the dideoxy-chain termination method described by Sanger et al., Proc. Natl. Acad. Sci. USA, (1977) 74: 5463-67 and the chemical degradation method described by Maxam et al., Proc. Natl. Acad. Sci. USA, (1977) 74: 560-564. Of these two methods, the Sanger procedure has been the more widely used. The original Sanger method relied on radioactive labeling of the reaction products and separation of the reaction products by slab gel electrophoresis.
Both the Sanger and Maxam methods are time and labor intensive. The start of the Human Genome Project was the impetus for the development of improved, automated systems to perform Sanger sequencing. As a result, detection of fluorescence has replaced autoradiography and capillary electrophoresis has replaced the ultrathin slab gels originally used to separate reaction products. Automated sequencers have been developed and are capable of processing large numbers of samples without operator intervention.
The completion of the Human Genome Project has refocused the community on the need for new technologies that are capable of rapidly and inexpensively determining the sequence of human genomes. There has been much discussion in recent years about personalized medicine. The vision of personalized medicine involves every individual having his or her complete genome sequenced at high accuracy and using this information to guide clinical care, specifically for risk stratification of patients and pharmacogenomics.
In recent years, a number of technological advances have been developed enabling a great reduction in the cost of sequencing and substantially increasing the amount of sequence data produced. All of the sequencing methods currently available utilize optical detection for the determination of the DNA sequence. The most prevalent sequencing methods are referred to as sequencing by synthesis (SBS).
Typical embodiments of SBS consist of the stepwise synthesis of a strand of DNA that is complementary to a template sequence from the target genome to be sequenced. The SBS methods can be divided into those that are performed in batch mode and those that are performed in real-time. The batch mode processes rely on the stepwise synthesis of the new DNA strand with the limitation that the synthesis is only allowed to proceed for one nucleotide position, for one nucleotide type, or for the combination of one nucleotide position and one nucleotide type. The incorporation of the nucleotide occurs in parallel for large numbers of templates and is detected using a variety of methods.
An embodiment of the batch mode utilizing a single nucleotide type is used by Roche for pyrosequencing with the 454 platform (see, e.g., Margulies et al. (2005) Nature, 437:376-380; U.S. Pat. Nos. 6,274,320; 6,258,568; 6,210,891). The method depends on several enzymes and cofactors to produce luminescence when a nucleotide is incorporated. A single nucleotide species is introduced into a large number of small reaction vessels each containing multiple copies of a single template. The incorporation of the nucleotide is accompanied by light emission. When the reaction has run to completion, the reagents are washed from the reaction volumes and a next nucleotide and required reagents are washed into the reactions. Each template is thus extended in an iterative fashion, one nucleotide type at a time. Multiple incorporations of the same nucleotide require the quantitative determination of the amount of light emitted. Homopolymer tracts in templates may be difficult to accurately sequence as the incremental amount of light emitted for each subsequent position in the homopolymer becomes small compared to the total amount emitted.
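The homopolymer limitation described above can be illustrated with a minimal Python sketch. It assumes an idealized model, not taken from the cited references, in which the light emitted during a flow is exactly proportional to the number of nucleotides incorporated. The function names `flowgram` and `relative_increment` and the flow order "TACG" are hypothetical and chosen only for illustration.

```python
def flowgram(synthesized_strand, flow_order="TACG", max_cycles=100):
    """Idealized single-nucleotide-type batch SBS (assumed model).

    synthesized_strand: the strand being built, read 5'->3'.
    Returns a list of (nucleotide, signal) pairs, one per flow, where
    signal is the number of bases incorporated during that flow and is
    assumed to be proportional to the light emitted.
    """
    pos = 0
    flows = []
    for _ in range(max_cycles):
        for nt in flow_order:
            run = 0
            # A homopolymer tract is extended completely within one flow.
            while pos < len(synthesized_strand) and synthesized_strand[pos] == nt:
                run += 1
                pos += 1
            flows.append((nt, run))
            if pos == len(synthesized_strand):
                return flows
    return flows


def relative_increment(n):
    """Fractional signal difference between tracts of n-1 and n bases."""
    return 1.0 / n


if __name__ == "__main__":
    print(flowgram("TTTAC"))
    for n in (2, 5, 10):
        print(n, relative_increment(n))
```

Under this assumed model, distinguishing a tract of ten bases from one of nine requires resolving a 10% difference in signal, whereas distinguishing two bases from one requires resolving a 50% difference.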
In a second embodiment of the SBS method, platforms by Helicos (see, e.g., Quake et al. Proc. Nat. Acad. Sci. USA (2003) 100: 3960-3964; U.S. Pat. Nos. 6,818,395; 6,911,345; 7,297,518; 7,462,449 and 7,501,245), Illumina (see, e.g., Bennett et al. Pharmacogenomics (2005) 6:373-382), and Intelligent Bio-Systems (see, e.g., Ju et al. Proc. Nat. Acad. Sci. USA (2006) 103:19635-19640) allow only the incorporation of a single nucleotide at each step. Template strands are attached to a solid support and a primer sequence is annealed. A polymerase is used to extend the primer to make a complement to the template. The nucleotides are derivatized such that after the incorporation of a single nucleotide, the growing strand is incapable of further extension. The nucleotides are further derivatized to make them fluorescent. In the Helicos technology, the four nucleotides are labeled with the same fluorescent tag. This requires that each nucleotide type be added separately. In contrast, the Illumina and Intelligent Bio-Systems technologies utilize four different fluorescent tags so that a mixture of all four derivatized nucleotides may be added at the same time. In each of these technologies, the incorporation of a nucleotide is accompanied by the appearance of fluorescence in the growing strand. In the case of Illumina, the wavelength of the fluorescence emission indicates the identity of the newly incorporated nucleotide. In the Helicos technology, only a single nucleotide type is added at each cycle. Thus, the appearance of fluorescence at a position on the solid support indicates the incorporation of the added nucleotide for that template. Templates that do not incorporate the nucleotide present in the reaction remain dark.
Following the observation of any incorporated fluorescence, the blocking groups and fluorescent tags are removed prior to the next cycle. Multiple cycles result in the acquisition of sequence data for many templates in a single run. The instrumentation typical for these technologies allows for the automated acquisition of sequence information for hundreds of thousands to millions of templates in parallel.
SBS methods may also be performed in real-time. In this embodiment, polymerase is used to incorporate fluorescently labeled nucleotides and the fluorescence is observed during DNA strand synthesis. The four nucleotides are labeled with different fluorescent tags. The fluorescent tags are attached to the terminal phosphate of the nucleotide triphosphate. During incorporation of the nucleotide into the growing strand the fluorophore is released to solution and the growing strand remains non-fluorescent. The identity of the incorporated nucleotide is determined while the nucleotide resides in the active site of the enzyme and before the cleaved diphosphate is released to bulk solution.
The fluorescence of the incorporated nucleotide typically is measured against background fluorescence from a much larger concentration of unincorporated nucleotide. Pacific Biosciences (see, e.g., U.S. Pat. Nos. 7,170,050; 7,302,146; 7,315,019; 7,476,503; and 7,476,504) identifies the incorporated nucleotide based on the residence time in the polymerase active site. Fluorescence emission from the active site for an appropriate time indicates incorporation and the emission wavelength determines the identity of the incorporated nucleotide. Polymerase is attached to the bottom of zero-mode waveguides. Zero-mode waveguides are reaction cells whose dimensions limit the fluorescence excitation to the evanescent wave from the light source. Thus, only fluorescent tags close to the bottom surface of the reaction volume are excited.
Visigen identifies the incorporated nucleotide through Fluorescence Resonance Energy Transfer (FRET) between an acceptor in the polymerase active site and a fluorescent tag on the nucleotide (see, e.g., U.S. Pat. Nos. 7,211,414 and 7,329,492). Only nucleotides held in the active site of the polymerase show fluorescence. Incorporation is identified by the residence time of the fluorescence in the active site and the nucleotide identity is determined by the emission wavelength.
Other recently developed methods to sequence DNA, including the SOLiD and Complete Genomics technologies, rely on a combination of hybridization and ligation. The SOLiD system (Life Technologies) immobilizes short template strands via an adaptor. A primer and a pool of labeled oligonucleotides containing two fixed positions and six degenerate positions are hybridized to the template. The primer hybridizes to the adaptor. The pool consists of 65,536 (4^8) different sequences. Four fluorescent dyes are used to label the oligonucleotides in a fashion that creates four subsets based on the sixteen combinations at the two fixed positions. Thus, each fluorescent tag is associated with four of the sixteen possible combinations. Following hybridization, a ligase is added and any probes in the pool that hybridized contiguously with the primer are ligated to the primer. The fluorescence of the hybridized and ligated product is determined. The fluorescence defines which subset of sequences hybridized to the template and ligated to the primer. The terminal three bases and the associated fluorescent tag are cleaved from the hybridized and ligated oligonucleotide. Subsequent rounds of hybridization, ligation, and cleavage are then performed. In this first series of reactions, each cycle identifies a subset for the pair of nucleotides in the template that is five nucleotides downstream from the pair whose subset of four possible pairs was identified in the preceding cycle. After several cycles, the primer, and the oligonucleotides that have been ligated to it, are washed off the template.
The entire procedure is repeated starting with a primer that is one nucleotide shorter than the original primer, then with primers that are two, three, and four nucleotides shorter than the original primer. These subsequent rounds shift the frame of interrogation so that the bases that make up the template strand can be identified from the union between the two subsets of reactions that overlapped at that position.
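The way four dyes can cover sixteen dinucleotides, and the way overlapping dinucleotide calls resolve individual bases, can be sketched in Python. The text above states only that each dye is associated with four of the sixteen combinations; the specific XOR-style assignment used here is the commonly described two-base color code and is an assumption of this sketch, as are the function names and the premise that the first base is known (for example, from the adaptor).

```python
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def color(dinucleotide):
    """Map a dinucleotide to one of four colors (assumed XOR assignment)."""
    first, second = dinucleotide
    return CODE[first] ^ CODE[second]


def color_subsets():
    """Group the sixteen dinucleotides by color; each color receives four."""
    groups = {c: [] for c in range(4)}
    for a in BASES:
        for b in BASES:
            groups[color(a + b)].append(a + b)
    return groups


def encode(sequence):
    """Colors of all overlapping dinucleotides along a sequence."""
    return [color(sequence[i:i + 2]) for i in range(len(sequence) - 1)]


def decode(first_base, colors):
    """Recover bases from colors, given a known first base (assumption)."""
    bases = [first_base]
    for c in colors:
        # The previous base plus the color leaves exactly one candidate.
        bases.append(BASES[CODE[bases[-1]] ^ c])
    return "".join(bases)


if __name__ == "__main__":
    print(color_subsets())
    colors = encode("ATGGC")
    print(colors, decode("A", colors))
```

Because every base takes part in two overlapping dinucleotides, a single color narrows a pair to four candidates, and a known neighboring base narrows those four to one.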
Complete Genomics technology utilizes a similar hybridization and ligation method (see, e.g., US Patent Application Publication Nos. 20080234136; 20090005252; 20090011943; and 20090176652). In the Complete Genomics technology, a primer is hybridized to an adaptor that is attached to the end of the template. A series of pools of oligonucleotides is constructed. In each pool, the nucleotide at a single position is identified by using four-color fluorescence. The remaining positions are degenerate. The first pool is hybridized to the template. Oligonucleotides that hybridize adjacent to the primer are subsequently ligated. After washing excess oligonucleotides away, the fluorescence of the ligated oligonucleotide identifies the nucleotide at the defined position in that pool. The ligated primer and oligonucleotide are washed off the template and the process is repeated with the next pool of oligonucleotides that probe the next position down from the primer.
The SBS and hybridization-ligation methods generate short pieces or reads of DNA sequence. While the short reads can be used to re-sequence human genomes, they are not favorable for the de novo assembly of human genomes. With the recent realization that human genomes contain large numbers of inversions, translocations, duplications, and indels (i.e., mutations that include insertions, deletions, and combinations thereof), the quality of human genome data from short reads is even more suspect. Genetic rearrangements are even more prevalent in cancer.
While embodiments of the short read technologies that incorporate paired-end reads have been proposed and the length of the sequence data from these technologies has increased incrementally over the last two years, it is clear that longer read technologies are necessary for the accurate assembly of human genome data.
In addition to the undesirable nature of short reads, all of the extant DNA sequencing methods employ optical detection. The throughput of optical methods limits the ultimate performance characteristics of any of these sequencing technologies. Optical methods are capable of identifying single molecules. However, the time required to observe and accurately identify events will remain too slow to meet the need for higher throughput. While the current generation of sequencing technologies has lowered the cost of sequencing by orders of magnitude as compared to the methods used to sequence the first human genomes, the methods remain too slow and costly for routine analysis of human genomes.
A need therefore exists for efficient methods and devices capable of rapid and accurate nucleic acid sequencing for de novo assembly of human genomes. It is desirable to have long read lengths and to use as little nucleic acid template as possible. Moreover, single-molecule optical detection of DNA has limitations with respect to sensitivity and speed.
The use of electronic detection applied to DNA sequencing may help overcome the limitations associated with single-molecule detection. For example, Hybridization-Assisted Nanopore Sequencing (HANS), which uses nanopores to detect and locate the position of hybridization events (e.g., hybridized probes on a biopolymer), is expected to yield highly accurate DNA sequence information, with long read lengths. The HANS method relies on detecting the position of hybridized probes on single molecules of the biopolymer to be sequenced or characterized. The resulting positional hybridization data is used to reconstruct sequence information of the target strand. The process for sequence reconstruction is similar to that for reconstructing sequence data from Sequencing by Hybridization (SBH) experiments with the important difference that the addition of positional information removes the inherent mathematical limitations of SBH and results in successful reconstruction of extremely long sequences.
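The benefit of positional information can be illustrated with a small Python sketch. It assumes an idealized, noise-free experiment in which every probe's position along the target is known exactly, which is a simplification made here for illustration. The function names `spectrum`, `positional_hits`, and `reconstruct` are hypothetical.

```python
def spectrum(sequence, k):
    """SBH-style data: which k-mers hybridize, with no positions."""
    return sorted(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def positional_hits(sequence, k):
    """Idealized positional data: (position, k-mer) pairs (assumed exact)."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def reconstruct(length, hits):
    """Place each probe at its measured position and merge the overlaps."""
    bases = [None] * length
    for position, probe in hits:
        for offset, base in enumerate(probe):
            index = position + offset
            if bases[index] is not None and bases[index] != base:
                raise ValueError("conflicting probes at position %d" % index)
            bases[index] = base
    # Positions not covered by any probe are reported as N.
    return "".join(base if base is not None else "N" for base in bases)


if __name__ == "__main__":
    first, second = "ACAGA", "AGACA"
    # The two targets yield the same set of 2-mers, so position-free
    # hybridization data cannot tell them apart.
    print(spectrum(first, 2) == spectrum(second, 2))
    # With positions, each target is recovered unambiguously.
    print(reconstruct(5, positional_hits(first, 2)))
    print(reconstruct(5, positional_hits(second, 2)))
```

In this toy case the two targets share an identical probe spectrum, so hybridization data alone is ambiguous, while the positional data distinguishes them. Real measurements would carry positional uncertainty that this sketch does not model.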
The HANS method provides a number of benefits over other proposed sequencing technologies. For example, the inherent nature of reconstructing data from multiple overlapping hybridization events reduces errors. Further, the rapid nature of the sensing allows for higher accuracy since coverage can be extensive without significantly impacting the timely production of data. Thus, a significant benefit of the HANS approach is the long read lengths obtainable by the method, which may be used to identify genomic rearrangements and/or reconstruct haplotypes from diploid organisms or separate genomes of related mixtures of, for instance, viral or microbial species.
In the HANS method, two reservoirs of solution are separated by a nanometer sized hole, or nanopore, that serves as a fluidic constriction of known dimensions. The application of a constant DC voltage between the two reservoirs results in a baseline ionic current that is measured. If an analyte is introduced into a reservoir, it may pass through the fluidic channel and change the observed current, due to a difference in conductivity between the electrolyte solution and analyte. The magnitude of the change in current depends on the volume of electrolyte displaced by the analyte while it is in the fluidic channel. The duration of the current change is related to the amount of time that the analyte takes to pass through the nanopore constriction.
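The relationship between an analyte's passage and the measured current can be sketched as a simple threshold detector over a sampled current trace. This is a minimal illustration, not a description of any particular instrument: it assumes that the analyte reduces the current below a known baseline, and the function name `detect_events`, its parameters, and the units are hypothetical.

```python
def detect_events(current, baseline, threshold, dt):
    """Find intervals where the current drops below baseline by > threshold.

    current:   list of sampled current values (arbitrary units)
    baseline:  open-channel current (assumed known and constant)
    threshold: minimum drop that counts as an analyte in the constriction
    dt:        sampling interval in seconds
    Returns one dict per event with its start time, duration, and mean depth.
    """
    events = []
    start = None
    for i, value in enumerate(current):
        blocked = (baseline - value) > threshold
        if blocked and start is None:
            start = i
        elif not blocked and start is not None:
            events.append(_summarize(current, baseline, start, i, dt))
            start = None
    if start is not None:  # trace ended while an event was in progress
        events.append(_summarize(current, baseline, start, len(current), dt))
    return events


def _summarize(current, baseline, start, end, dt):
    segment = current[start:end]
    return {
        "start_time": start * dt,
        "duration": (end - start) * dt,  # related to the passage time
        "mean_depth": baseline - sum(segment) / len(segment),  # related to displaced volume
    }


if __name__ == "__main__":
    trace = [10, 10, 7, 7, 7, 10, 10, 8, 10]
    for event in detect_events(trace, baseline=10, threshold=1, dt=0.001):
        print(event)
```

The event depth stands in for the magnitude of the current change and the event duration for the time the analyte spends in the constriction, the two quantities discussed above.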
In the case of DNA translocation through a nanopore, the physical translocation is driven by the electrophoretic force generated by the applied DC voltage between the two reservoirs. This driving force and the detected signal are, typically, inseparably coupled. A higher signal-to-noise ratio may be obtained by using higher voltages, but this may also result in a faster translocation rate of the analyte through the nanopore. The faster translocation rate may reduce the duration of the current change when analyte passes through the pore and thus the current change may be harder to detect because of bandwidth limitations in the current sensing electronics.
DNA can also be translocated through nanochannels by applying a DC voltage. See, e.g., Riehn, R. et al. Proc. Nat. Acad. Sci. 2005, 102, 10012, which is incorporated herein by reference in its entirety. Detection of DNA molecules in a nanochannel has been accomplished by applying a current through electrodes that are perpendicular to the nanochannel. See Liang, X.; Chou, S. Y. Nano Lett. 2008, 8, 1472, which is incorporated herein by reference in its entirety. As DNA passes between the electrodes, the observed current passing between two electrodes disposed on opposite sides of the channel changes. The length of the DNA strand may be inferred from the time of passage of the strand past the electrodes. However, the spread in the data indicates that there may be significant error in the calculation of the length of the DNA if one only uses the duration of the signal.
As discussed above, the distance between locations of hybridization of sequence selective probes in the HANS method is inferred from the time between translocations of the hybridized portions through the nanopore. While extremely sensitive as single-molecule detectors, solid-state nanopores have a number of inherent limitations for the characterization of DNA strands. Translocation times are rapid through nanopores thus necessitating a tagging scheme for probes. The design of the nanopore generally precludes multiple measurements of a single molecule unless a capture-recapture technique is utilized (Gershow, M.; Golovchenko, J. A. Recapturing and trapping single molecules with a solid-state nanopore. Nature Nanotech 2007, 2, 775-779, which is incorporated by reference in its entirety). The translocation times of DNA fragments, all of the same length, have relatively large distributions. While each of these problems can be resolved for the implementation of HANS with nanopore detectors, it is apparent that improvements in the detector will make HANS development faster and will lead to higher sequencing accuracy and throughput.
Another limitation to solid-state nanopores is the fabrication of devices. Currently, each pore is fabricated by using a transmission electron microscope (TEM). From both a time and cost standpoint, this method would be prohibitive for the construction of large arrays of nanopores.
An alternative detector in a nanochannel utilizes a 4-point sensing element to separate the detection element and the electrophoretic driving elements. This detector also infers the relative positions of hybridized probes from the time between passage of subsequent probes through the detector. However, the difficulty of determining the biopolymer's passage rate lowers the resolution of sequencing data.
Thus, there remains a need for improved devices and methods for sequencing biopolymers.