1. Field of the Invention
The present invention relates to a single-molecule sequencing apparatus and methods.
More particularly, the present invention relates to a single-molecule sequencing apparatus and methods using tagged polymerizing agents and/or tagged monomers where the tagged polymerizing agent and/or the tagged monomers undergo a change in a detectable property before, during and/or after monomer insertion into a growing polymer chain. The apparatus and methods are ideally suited for sequencing DNA, RNA, polypeptide, carbohydrate or similar bio-molecular sequences under near real-time or real-time conditions. The present invention also relates to a single-molecule sequencing apparatus and methods using tagged depolymerizing agents and/or tagged depolymerizable polymer where the tagged depolymerizing agent and/or the tagged depolymerizable polymer undergo a change in a detectable property before, during and/or after monomer removal from the depolymerizable polymer chain. The apparatus and methods are ideally suited for sequencing DNA, RNA, polypeptide, carbohydrate or similar bio-molecular sequences. The present invention also relates to detecting a signal evidencing interactions between the tagged polymerizing agent or depolymerizing agent and a tagged or untagged polymer subunit such as a monomer or collection of monomers, where the detected signal provides information about monomer order. In a preferred embodiment, the methods are carried out in real-time or near real-time.
2. Description of the Related Art
Overview of Conventional DNA Sequencing
The development of methods that allow one to quickly and reliably determine the order of bases or ‘sequence’ in a fragment of DNA is a key technical advance, the importance of which cannot be overstated. Knowledge of DNA sequence enables a greater understanding of the molecular basis of life. DNA sequence information provides scientists with information critical to a wide range of biological processes. The order of bases in DNA specifies the order of bases in RNA, the molecule within the cell that directly encodes the informational content of proteins. DNA sequence information is routinely used to deduce protein sequence information. Base order dictates DNA structure and its function, and provides a molecular program that can specify normal development, manifestation of a genetic disease, or cancer.
Knowledge of DNA sequence and the ability to manipulate these sequences has accelerated development of biotechnology and led to the development of molecular techniques that provide the tools to ask and answer important scientific questions. The polymerase chain reaction (PCR), an important biotechnique that facilitates sequence-specific detection of nucleic acid, relies on sequence information. DNA sequencing methods allow scientists to determine whether a change has been introduced into the DNA, and to assay the effect of the change on the biology of the organism, regardless of the type of organism that is being studied. Ultimately, DNA sequence information may provide a way to uniquely identify individuals.
In order to understand the DNA sequencing process, one must recall several facts about DNA. First, a DNA molecule is composed of four bases, adenine (A), guanine (G), cytosine (C), and thymine (T). These bases interact with each other in very specific ways through hydrogen bonds, such that A interacts with T, and G interacts with C. These specific interactions between the bases are referred to as base-pairings. In fact, it is these base-pairings (and base stacking interactions) that stabilize double-stranded DNA. The two strands of a DNA molecule occur in an antiparallel orientation, where one strand is positioned in the 5′ to 3′ direction, and the other strand is positioned in the 3′ to 5′ direction. The terms 5′ and 3′ refer to the directionality of the DNA backbone, and are critical to describing the order of the bases. The convention for describing base order in a DNA sequence uses the 5′ to 3′ direction, and is written from left to right. Thus, if one knows the sequence of one DNA strand, the complementary sequence can be deduced.
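The deduction of a complementary strand from the base-pairing rules can be illustrated with a minimal Python sketch. The function name and the example strand are illustrative assumptions only; the pairing table and the 5′ to 3′ writing convention come from the description above.

```python
# Base-pairing rules from the text: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5' to 3'.

    The two strands are antiparallel, so the input is read backwards
    while each base is replaced by its pairing partner.
    """
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


# Hypothetical example strand, written 5' to 3'.
print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original strand, which reflects the fact that either strand fully determines the other.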
Sanger DNA Sequencing (Enzymatic Synthesis)
Sanger sequencing is currently the most commonly used method to sequence DNA (Sanger et al., 1977). This method exploits several features of a DNA polymerase: its ability to make an exact copy of a DNA molecule, its directionality of synthesis (5′ to 3′), its requirement of a DNA strand (a ‘primer’) from which to begin synthesis, and its requirement for a 3′ OH at the end of the primer. If a 3′ OH is not available, then the DNA strand cannot be extended by the polymerase. If a dideoxynucleotide (ddNTP; ddATP, ddTTP, ddGTP, ddCTP), a base analogue lacking a 3′ OH, is added into an enzymatic sequencing reaction, it is incorporated into the growing strand by the polymerase. However, once the ddNTP is incorporated, the polymerase is unable to add any additional bases to the end of the strand. Importantly, ddNTPs are incorporated by the polymerase into the DNA strand using the same base incorporation rules that dictate incorporation of natural nucleotides, where A specifies incorporation of T, and G specifies incorporation of C (and vice versa).
Fluorescent DNA Sequencing
A major advance in determining DNA sequence information occurred with the introduction of automated DNA sequencing machines (Smith et al., 1986). The automated sequencer is used to separate sequencing reaction products, detect and collect (via computer) the data from the reactions, and analyze the order of the bases to automatically deduce the base sequence of a DNA fragment. Automated sequencers detect extension products containing a fluorescent tag. Sequence read lengths obtained using an automated sequencer are dependent upon a variety of parameters, but typically range between 500 and 1,000 bases (3-18 hours of data collection). At maximum capacity an automated sequencer can collect data from 96 samples in parallel.
When dye-labeled terminator chemistry is used to detect the sequencing products, base identity is determined by the color of the fluorescent tag attached to the ddNTP. After the reaction is assembled and processed through the appropriate number of cycles (3-12 hours), the extension products are prepared for loading into a single lane on an automated sequencer (unincorporated, dye-labeled ddNTPs are removed and the reaction is concentrated; 1-2 hours). An advantage of dye-terminator chemistry is that extension products are visualized only if they terminate with a dye-labeled ddNTP; prematurely terminated products are not detected. Thus, reduced background noise typically results with this chemistry.
State-of-the-art dye-terminator chemistry uses four energy transfer fluorescent dyes (Rosenblum et al., 1997). These terminators include a fluorescein donor dye (6-FAM) linked to one of four different dichlororhodamine (dRhodamine) acceptor dyes. The dRhodamine acceptor dyes associated with the terminators are dichloro[R110], dichloro[R6G], dichloro[TAMRA] or dichloro[ROX], for the G-, A-, T- or C-terminators, respectively. The donor dye (6-FAM) efficiently absorbs energy from the argon ion laser in the automated sequencing machine and transfers that energy to the linked acceptor dye. The linker connecting the donor and acceptor portions of the terminator is optimally spaced to achieve essentially 100% efficient energy transfer. The fluorescence signals emitted from these acceptor dyes exhibit minimal spectral overlap and are collected by an ABI PRISM 377 DNA sequencer using 10 nm virtual filters centered at 540, 570, 595 and 625 nm, for G-, A-, T- or C-terminators, respectively. Thus, energy transfer dye-labeled terminators produce brighter signals and improve spectral resolution. These improvements result in more accurate DNA sequence information.
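The correspondence between virtual filter and terminator can be expressed as a lookup, as in the following Python sketch. Only the four filter centers and their base assignments are taken from the description above; the function name, the intensity values, and the rule of simply picking the brightest channel are assumptions made for illustration and do not represent the base-calling software of any actual sequencer.

```python
# Virtual filter center (nm) -> terminator base, as listed in the text.
FILTER_TO_BASE = {540: "G", 570: "A", 595: "T", 625: "C"}


def call_base(intensities: dict) -> str:
    """Call one base from per-filter intensities (hypothetical rule).

    `intensities` maps a filter center in nm to a measured signal.
    The brightest channel is taken to identify the terminator.
    """
    brightest = max(intensities, key=intensities.get)
    return FILTER_TO_BASE[brightest]


# Made-up signal values for a single band; the 595 nm channel dominates.
band = {540: 0.05, 570: 0.10, 595: 0.80, 625: 0.05}
print(call_base(band))  # T
```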
The predominant enzyme used in automated DNA sequencing reactions is a genetically engineered form of DNA polymerase I from Thermus aquaticus. This enzyme, AmpliTaq DNA Polymerase, FS, was optimized to more efficiently incorporate ddNTPs and to eliminate the 3′ to 5′ and 5′ to 3′ exonuclease activities. Replacing a naturally occurring phenylalanine at position 667 in T. aquaticus DNA polymerase with a tyrosine reduced the preferential incorporation of a dNTP, relative to a ddNTP (Tabor and Richardson, 1995; Reeve and Fuller, 1995). Thus, a single hydroxyl group within the polymerase is responsible for discrimination between dNTPs and ddNTPs. The 3′ to 5′ exonuclease activity, which enables the polymerase to remove a mis-incorporated base from the newly replicated DNA strand (proofreading activity), was eliminated because it also allows the polymerase to remove an incorporated ddNTP. The 5′ to 3′ exonuclease activity was eliminated because it removes bases from the 5′ end of the reaction products. Since the reaction products are size separated during gel electrophoresis, interpretable sequence data is only obtained if the reaction products share a common endpoint. More specifically, the primer defines the 5′ end of the extension product and the incorporated, color-coded ddNTP defines base identity at the 3′ end of the molecule. Thus, conventional DNA sequencing involves analysis of a population of DNA molecules sharing the same 5′ endpoint, but differing in the location of the ddNTP at the 3′ end of the DNA chain.
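The logic of reading a sequence from a population of products that share a 5′ endpoint but terminate at different positions can be sketched in Python as follows. This is a simplified model under stated assumptions: the function names and example template are hypothetical, every possible termination product is assumed to be present, and size separation is modeled as sorting by length.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def extension_products(template_3to5: str, primer: str) -> list:
    """Return every ddNTP-terminated product, each beginning with the primer.

    `template_3to5` is the template read 3' to 5', so position i of the
    template pairs with position i of the new strand written 5' to 3'.
    """
    new_strand = "".join(PAIRS[base] for base in template_3to5)
    if not new_strand.startswith(primer):
        raise ValueError("primer does not anneal at the 3' end of the template")
    # One product per position at which a ddNTP could have been incorporated.
    return [new_strand[:n] for n in range(len(primer) + 1, len(new_strand) + 1)]


def read_sequence(products: list) -> str:
    """Size-separate the products and read the terminal base of each one."""
    return "".join(product[-1] for product in sorted(products, key=len))


# Hypothetical template (3' to 5') and primer (5' to 3').
products = extension_products("TACGGATC", "ATG")
print(read_sequence(products))  # CCTAG
```

Because only the 3′ terminal base of each product is read, the products must share a common 5′ endpoint for length to correspond to position, which is the reason given above for eliminating the 5′ to 3′ exonuclease activity.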
Genome Sequencing
Very often a researcher needs to determine the sequence of a DNA fragment that is larger than the 500-1,000 base average sequencing read length. Not surprisingly, strategies to accomplish this have been developed. These strategies are divided into two major classes, random or directed, and strategy choice is influenced by the size of the fragment to be sequenced.
In random or shotgun DNA sequencing, a large DNA fragment (typically one larger than 20,000 base pairs) is broken into smaller fragments that are inserted into a cloning vector. It is assumed that the sum of information contained within these smaller clones is equivalent to that contained within the original DNA fragment. Numerous smaller clones are randomly selected, DNA templates are prepared for sequencing reactions, and primers that will base-pair with the vector DNA sequence bordering the insert are used to begin the sequencing reaction (2-7 days for a 20 kbp insert). Subsequently, the quality of each base call is examined (manually or automatically via software (PHRED, Ewing et al., 1998); 1-10 minutes per sequence reaction), and the sequence of the original DNA fragment is reconstructed by computer assembly of the sequences obtained from the smaller DNA fragments. Based on the time estimates provided, if a shotgun sequencing strategy is used, a 20 kbp insert is expected to be completed in 3-10 days. This strategy was extensively used to determine the sequence of ordered fragments that represent the entire human genome (http://www.nhgri.nih.gov/HGP/). However, this random approach is typically not sufficient to complete sequence determination, since gaps in the sequence often remain after computer assembly. A directed strategy (described below) is usually used to complete the sequence project.
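Computer assembly of overlapping reads can be illustrated with a minimal greedy sketch in Python. This is not the algorithm used by any particular assembly program; the function names, the minimum overlap of three bases, and the example reads are assumptions chosen for illustration, and sequencing errors, reverse-complement reads, and repeats are ignored.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the remaining contigs; more than one contig means that
    gaps remain after assembly.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads from one fragment; each overlaps its neighbor by 3 bases.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))  # ['ATGCGTACCTGA']
```

When two groups of reads share no overlap, the sketch returns more than one contig, which corresponds to the gaps that a directed strategy is then used to close.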
A directed or primer-walking sequencing strategy can be used to fill in gaps remaining after the random phase of large-fragment sequencing, and as an efficient approach for sequencing smaller DNA fragments. This strategy uses DNA primers that anneal to the template at a single site and act as a start site for chain elongation. This approach requires knowledge of some sequence information to design the primer. The sequence obtained from the first reaction is used to design the primer for the next reaction and these steps are repeated until the complete sequence is determined. Thus, a primer-based strategy involves repeated sequencing steps from known into unknown DNA regions; the process minimizes redundancy and does not require additional cloning steps. However, this strategy requires the synthesis of a new primer for each round of sequencing.
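The iterative nature of primer walking can be modeled with a short Python sketch. The sketch is a simulation under simplifying assumptions: the template is treated as a known string standing in for the unknown DNA, each read returns a fixed number of error-free bases, and the function name, primer length, and read length are hypothetical values chosen to keep the example small.

```python
def primer_walk(template: str, known: str, primer_len: int = 6, read_len: int = 10):
    """Extend `known` (a prefix of `template`) one sequencing round at a time.

    Each round designs a primer from the end of the known sequence,
    checks that it anneals at a single site, and reads `read_len` new bases.
    Returns the completed sequence and the number of rounds (primers) used.
    """
    sequence, rounds = known, 0
    while len(sequence) < len(template):
        primer = sequence[-primer_len:]
        site = template.find(primer)
        if site == -1 or template.find(primer, site + 1) != -1:
            raise ValueError("primer must anneal to the template at a single site")
        start = site + len(primer)
        sequence += template[start:start + read_len]
        rounds += 1
    return sequence, rounds


# Hypothetical 28-base template of which the first 8 bases are already known.
template = "ATGCGTACCTGAGGCTTAACGGATCCAT"
full_sequence, rounds = primer_walk(template, template[:8])
print(rounds)  # 2
```

The round count equals the number of primers that must be synthesized, which is the cost that limits routine use of this strategy on large fragments.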
The necessity of designing and synthesizing new primers, coupled with the expense and the time required for their synthesis, has limited the routine application of primer-walking for sequencing large DNA fragments. Researchers have proposed using a library of short primers to eliminate the requirement for custom primer synthesis (Studier, 1989; Siemieniak and Slightom, 1990; Kieleczawa et al., 1992; Kotler et al., 1993; Burbelo and Iadarola, 1994; Hardin et al., 1996; Raja et al., 1997; Jones and Hardin, 1998a,b; Ball et al., 1998; Mei and Hardin, 2000; Kraltcheva and Hardin, 2001). The availability of a primer library minimizes primer waste, since each primer is used to prime multiple reactions, and allows immediate access to the next sequencing primer.
One of the original goals of the Human Genome Project was to complete sequence determination of the entire human genome by 2005 (http://www.nhgri.nih.gov/HGP/). However, the plan is ahead of schedule and a ‘working draft’ of the human genome was published in February 2001 (Venter et al., 2001; International Human Genome Sequencing Consortium, 2001). Due to technological advances in several disciplines, the completed genome sequence is expected in 2003, two years ahead of schedule. Progress in all aspects involving DNA manipulation (especially manipulation and propagation of large DNA fragments), evolution of faster and better DNA sequencing methods (http://www.abrf.org), development of computer hardware and software capable of manipulating and analyzing the data (bioinformatics), and automation of procedures associated with generating and analyzing DNA sequences (engineering) are responsible for this accelerated time frame.
Single-Molecule DNA Sequencing
Conventional DNA sequencing strategies and methods are reliable, but time, labor, and cost intensive. To address these issues, some researchers are investigating fluorescence-based, single-molecule sequencing methods that use enzymatic degradation, followed by single-dNMP detection and identification. The DNA polymer containing fluorescently labeled nucleotides is digested by an exonuclease, and the labeled nucleotides are detected and identified by flow cytometry (Davis et al., 1991; Davis et al., 1992; Goodwin et al., 1997; Keller et al., 1996; Sauer et al., 1999; Werner et al., 1999). This method requires that the DNA strand be synthesized to contain the fluorescently labeled base(s). This requirement limits the length of sequence that can be determined, and increases the number of manipulations that must be performed before any sequence data is obtained. A related approach proposes to sequentially separate single (unlabeled) nucleotides from a strand of DNA, confine them in their original order in a solid matrix, and detect the spectroscopic emission of the separated nucleotides to reconstruct DNA sequence information (Ulmer, 1997; Mitsis and Kwagh, 1999; Dapprich, 1999). This is the approach that is being developed by Praelux, Inc., a company whose goal is to develop single-molecule DNA sequencing. Theoretically, this latter method should not be as susceptible to length limitations as the former enzymatic degradation method, but it does require numerous manipulations before any sequence information can be obtained.
Li-cor, Inc. is developing an enzyme synthesis based strategy for single-molecule sequencing as set forth in PCT application WO 00/36151. The Li-cor method involves multiply modifying each dNTP by attaching a fluorescent tag to the γ-phosphate and a quenching moiety to another site on the dNTP, preferably on the base. The quenching moiety is added to prevent emission from the fluorescent tag attached to an unincorporated dNTP. Upon incorporation the fluorescent tag and quenching moiety are separated, resulting in emission from the tag. The tag (contained on the pyrophosphate) flows away from the polymerase active site, but the modified (quenched) base becomes part of the DNA polymer.
Although some single-molecule sequencing systems have been disclosed, many of them anticipate or require base modification. See, e.g., Patent Application Serial Numbers WO 01/16375 A2, WO 01/23610 A2, WO 01/25480, WO 00/06770, WO 99/05315, WO 00/60114, WO 00/36151, WO 00/36512, and WO 00/70073, incorporated herein by reference. Base modifications may distort DNA structure (which normally consists of A-form DNA nearest the enzyme active site; Li et al., 1998a). Since the dNTP and approximately 7 of the 3′-nearest bases in the newly synthesized strand contact internal regions of the polymerase (Li et al., 1998a), the A-form DNA may be important for maximizing minor groove contacts between the enzyme and the DNA. If the DNA structure is affected due to base modification, enzyme fidelity and/or function may be altered. Thus, there is still a need in the art for a fast and efficient enzymatic DNA sequencing system for single-molecule DNA sequences.