Methods for rapidly sequencing DNA have become necessary for analyzing diseases and mutations in the population and for developing therapies. The most commonly observed form of human sequence variation is the single nucleotide polymorphism (SNP), which occurs in approximately 1-in-300 to 1-in-1000 base pairs of genomic sequence. Building upon the complete sequence of the human genome, efforts are underway to identify the underlying genetic link to common diseases by SNP mapping or direct association. Technology developments focused on rapid, high-throughput, and low-cost DNA sequencing would facilitate the understanding and use of genetic information, such as SNPs, in applied medicine.
In general, 10%-to-15% of SNPs will affect protein function by altering specific amino acid residues, will affect the proper processing of genes by changing splicing mechanisms, or will affect the normal level of expression of the gene or protein by varying regulatory mechanisms. It is envisioned that the identification of informative SNPs will lead to more accurate diagnosis of inherited disease, better prognosis of risk susceptibilities, or identification of sporadic mutations in tissue. One application of an individual's SNP profile would be to significantly delay the onset or progression of disease with prophylactic drug therapies. Moreover, an SNP profile of drug metabolizing genes could be used to prescribe a specific drug regimen to provide safer and more efficacious results. To accomplish these ambitious goals, genome sequencing will move into the resequencing phase, with the potential of partial sequencing of a large majority of the population. Obtaining the SNP profile for a given complex disease would involve sequencing, in parallel, specific regions or single base pairs that are distributed throughout the human genome.
Sequence variations underlying most common diseases are likely to involve multiple SNPs, which are dispersed throughout associated genes and exist at low frequency. Thus, DNA sequencing technologies that employ strategies for de novo sequencing are more likely to detect and/or discover these rare, widely dispersed variants than technologies targeting only known SNPs.
Traditionally, DNA sequencing has been accomplished by the “Sanger” or “dideoxy” method, which involves the chain termination of DNA synthesis by the incorporation of 2′,3′-dideoxynucleotides (ddNTPs) using DNA polymerase (Sanger, F., Nicklen, S., and Coulson, A. R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463-5467). The reaction also includes the natural 2′-deoxynucleotides (dNTPs), which extend the DNA chain by DNA synthesis. Balanced appropriately, competition between chain extension and chain termination results in the generation of a set of nested DNA fragments, which are uniformly distributed over thousands of bases and differ in size by single base pair increments. The ratio of dNTP/ddNTP in the sequencing reaction determines the frequency of chain termination, and hence the distribution of lengths of terminated chains. Electrophoresis is used to resolve the nested DNA fragments by their respective size. The fragments are then detected via the prior attachment of four different fluorophores to the four bases of DNA (i.e., A, C, G, and T), which fluoresce their respective colors when irradiated with a suitable laser source. To date, Sanger sequencing has been the most widely used method for discovery of SNPs by direct PCR sequencing (Gibbs, R. A., Nguyen, P.-N., McBride, L. J., Koepf, S. M., and Caskey, C. T. (1989) Identification of mutations leading to the Lesch-Nyhan syndrome by automated direct DNA sequencing of in vitro amplified cDNA. Proc. Natl. Acad. Sci. USA 86, 1919-1923) or genomic sequencing (Hunkapiller, T., Kaiser, R. J., Koop, B. F., and Hood, L. (1991) Large-scale and automated DNA sequence determination. Science 254, 59-67; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. (2001) Nature 409, 860-921).
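The dependence of fragment-length distribution on the dNTP/ddNTP ratio can be illustrated with a simple geometric model. The sketch below is an assumption made for illustration, not a description of any particular sequencing chemistry: it treats every template position as an independent trial, and it folds the polymerase's preference between dNTPs and ddNTPs into a single hypothetical `selectivity` factor. The function names are likewise illustrative.

```python
def termination_probability(dntp: float, ddntp: float, selectivity: float = 1.0) -> float:
    """Per-position chance that a ddNTP (terminator) is incorporated.

    Assumption: incorporation is proportional to concentration, with the
    ddNTP concentration weighted by a hypothetical polymerase selectivity
    factor (1.0 means no preference between dNTP and ddNTP).
    """
    if dntp <= 0 or ddntp <= 0 or selectivity <= 0:
        raise ValueError("concentrations and selectivity must be positive")
    weighted_ddntp = ddntp * selectivity
    return weighted_ddntp / (dntp + weighted_ddntp)


def fraction_terminated_at(position: int, p_term: float) -> float:
    """Fraction of chains that terminate exactly at `position` (1-based).

    The chain must extend through the first position - 1 bases and then
    terminate, which gives a geometric distribution.
    """
    if position < 1:
        raise ValueError("position must be at least 1")
    return (1.0 - p_term) ** (position - 1) * p_term


def mean_fragment_length(p_term: float) -> float:
    """Mean length of terminated chains under the geometric model."""
    return 1.0 / p_term


if __name__ == "__main__":
    # Illustrative ratios only; real reactions are tuned empirically.
    for ratio in (10, 100, 1000):
        p = termination_probability(dntp=float(ratio), ddntp=1.0)
        print(f"dNTP/ddNTP = {ratio}: mean fragment length ~ {mean_fragment_length(p):.0f} bases")
```

Under these assumptions, raising the dNTP/ddNTP ratio lowers the per-position termination probability and shifts the fragment population toward longer chains, which is the qualitative behavior described above.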
Another promising sequencing approach is cyclic reversible termination (CRT), which is a cyclic method of detecting the synchronous, single base additions of multiple templates. This approach differentiates itself from the Sanger method (Metzker, M. L. (2005) Genome Res. 15, 1767-1776) in that it can be performed without the need for gel electrophoresis, a major bottleneck in advancing this field. Like Sanger sequencing, however, longer read-lengths translate into fewer sequencing assays needed to cover the entire genome.
It has remained difficult to accomplish the goal of long CRT reads because reversible terminators typically act as poor substrates with commercially available DNA polymerases. Reversible terminators are structured with a 3′-O-blocking group and a fluorescent dye attached to the nucleobase via a linking group. Both the blocking group and the dye group must be removed prior to subsequent base additions. These nucleotide modifications are not well tolerated by DNA polymerases, although the polymerases can be mutated by numerous strategies to improve enzymatic performance. Upon deprotection, the nucleobase linker group is left behind, accumulating sequentially in the growing DNA duplex with subsequent CRT cycles. It is believed that poor enzyme kinetics and a sequentially modified DNA duplex limit longer read-lengths. The present invention describes novel, reversible nucleotide structures that require a single attachment of both terminating and fluorescent dye moieties, improving enzyme kinetics as well as deprotection efficiencies. These reversible terminators are incorporated efficiently by a number of commercially available DNA polymerases, with the deprotection step transforming the growing DNA duplex into its natural state.
DNA sequencing read-lengths of CRT technologies are governed by the overall efficiency of each nucleotide addition cycle. For example, if one considers the end-point of 50% of the original starting material as having an acceptable signal-to-noise ratio, the following equation can be applied to estimate the effect of the cycle's efficiency on read-length: $(C_{\mathrm{eff}})^{RL} = 0.5$, where $RL$ is the read-length in bases and $C_{\mathrm{eff}}$ is the overall cycle efficiency. In other words, a read-length of approximately 7 bases could be achieved with an overall cycle efficiency of 90%, approximately 70 bases with a cycle efficiency of 99%, and approximately 700 bases with a cycle efficiency of 99.9%. To achieve the goal of sequencing large stretches, the method must provide very high cycle efficiency or the recovery may fall below acceptable signal-to-noise ratios. Reversible terminators that exhibit higher incorporation and deprotection efficiencies will achieve higher cycle efficiencies, and thus longer read-lengths.
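The relationship between cycle efficiency and read-length can be checked numerically. The sketch below assumes the relation in which the cycle efficiency raised to the power of the read-length equals the end-point fraction, so that the read-length is ln(end-point) / ln(cycle efficiency). The function name `read_length` and its `endpoint` parameter are illustrative and do not come from the original text.

```python
import math


def read_length(cycle_efficiency: float, endpoint: float = 0.5) -> float:
    """Estimate how many bases can be read before the remaining signal
    falls to `endpoint` (a fraction of the starting material).

    Assumes every cycle has the same efficiency, so the remaining
    fraction after n cycles is cycle_efficiency ** n.
    """
    if not 0.0 < cycle_efficiency < 1.0:
        raise ValueError("cycle_efficiency must be strictly between 0 and 1")
    if not 0.0 < endpoint < 1.0:
        raise ValueError("endpoint must be strictly between 0 and 1")
    return math.log(endpoint) / math.log(cycle_efficiency)


if __name__ == "__main__":
    for eff in (0.90, 0.99, 0.999):
        print(f"{eff:.1%} cycle efficiency -> about {read_length(eff):.1f} bases")
```

With the 50% end-point, this gives roughly 6.6, 69.0, and 692.8 bases for cycle efficiencies of 90%, 99%, and 99.9%, in line with the rounded figures of 7, 70, and 700 bases quoted above. Each additional "nine" of cycle efficiency extends the achievable read-length by about a factor of ten.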
For CRT terminators to function properly, the protecting group must be efficiently cleaved under mild conditions. The removal of a protecting group generally involves either treatment with strong acid or base, catalytic or chemical reduction, or a combination of these methods. These conditions may react with the DNA polymerase, nucleotides, oligonucleotide-primed template, or the solid support, creating undesirable outcomes. The use of photochemical protecting groups is an attractive alternative to harsh chemical treatment and can be employed in a non-invasive manner.
A number of photoremovable protecting groups including 2-nitrobenzyl, benzyloxycarbonyl, 3-nitrophenyl, phenacyl, 3,5-dimethoxybenzoinyl, 2,4-dinitrobenzenesulphenyl, and their respective derivatives have been used for the syntheses of peptides, polysaccharides, and nucleotides (Pillai, V. N. R. (1980) Photoremovable Protecting Groups in Organic Synthesis. Synthesis, 1-26). Of these, the light sensitive 2-nitrobenzyl protecting group has been successfully applied to the 2′-OH of ribonucleosides for diribonucleoside synthesis (Ohtsuka, E., Tanaka, S., and Ikehara, M. (1974) Studies on transfer ribonucleic acids and related compounds. IX(1) Ribooligonucleotide synthesis using a photosensitive o-nitrobenzyl protection at the 2′-hydroxyl group. Nucleic Acids Res. 1, 1351-1357), the 2′-OH of ribophosphoramidites in automated ribozyme synthesis (Chaulk, S. G., and MacMillan, A. M. (1998) Caged RNA: photo-control of a ribozyme. Nucleic Acids Res. 26, 3173-3178), the 3′-OH of phosphoramidites for oligonucleotide synthesis in the Affymetrix chemistry (Pease, A. C., Solas, D., Sullivan, E. J., Cronin, M. T., Holmes, C. P., and Fodor, S. P. A. (1994) Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc. Natl. Acad. Sci. USA 91, 5022-5026), and to the 3′-OH group for DNA sequencing applications (Metzker, M. L., Raghavachari, R., Richards, S., Jacutin, S. E., Civitello, A., Burgess, K., and Gibbs, R. A. (1994) Termination of DNA synthesis by novel 3′-modified deoxyribonucleoside triphosphates. Nucleic Acids Res. 22, 4259-4267). Under deprotection conditions (ultraviolet light > 300 nm), the 2-nitrobenzyl group can be efficiently cleaved without affecting either the pyrimidine or purine bases (Pease, A. C., Solas, D., Sullivan, E. J., Cronin, M. T., Holmes, C. P., and Fodor, S. P. A. (1994) Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc. Natl. Acad. Sci. USA 91, 5022-5026) and (Bartholomew, D. G., and Broom, A. D. (1975) One-step Chemical Synthesis of Ribonucleosides bearing a Photolabile Ether Protecting Group. J. Chem. Soc. Chem. Commun., 38).
The need for developing new sequencing technologies has never been greater than today, with applications spanning diverse research sectors including comparative genomics and evolution, forensics, epidemiology, and applied medicine for diagnostics and therapeutics. Current sequencing technologies are too expensive, labor-intensive, and time-consuming for broad application in human sequence variation studies. Genome center cost is calculated on the basis of dollars per 1,000 Q20 bases and can be generally divided into the categories of instrumentation, personnel, reagents and materials, and overhead expenses. Currently, these centers are operating at less than one dollar per 1,000 Q20 bases, with at least 50% of the cost resulting from DNA sequencing instrumentation alone. Developments in novel detection methods, miniaturization in instrumentation, microfluidic separation technologies, and an increase in the number of assays per run will most likely have the biggest impact on reducing cost.
It is therefore an object of the invention to provide novel compounds that are useful in efficient sequencing of genomic information in high-throughput sequencing reactions.
It is another object of the invention to provide novel reagents and combinations of reagents that can efficiently and affordably provide genomic information.
It is yet another object of the invention to provide libraries and arrays of reagents for diagnostic methods and for developing targeted therapeutics for individuals.