I. Field of the Invention
The present invention relates generally to compositions and methods for DNA sequencing and other types of DNA analysis. More particularly, the invention relates in part to fast 3′-OH unblocked nucleotides and nucleosides with photochemically cleavable groups and methods for their use in a number of DNA sequencing methods, including applications in biomedical research.
II. Description of Related Art
Methods for rapidly sequencing DNA are needed for analyzing diseases and mutations in the population and developing therapies (Metzker, 2010, which is incorporated herein by reference). Commonly observed forms of human sequence variation are single nucleotide polymorphisms (SNPs), which occur in approximately 1-in-300 to 1-in-1000 base pairs of genomic sequence, and structural variants (SVs), which include block substitutions, insertion/deletions, inversions, segmental duplications, and copy number variants. Structural variants can account for 22% of all variable events and more variant bases than those contributed by SNPs (Levy et al., 2007, which is incorporated herein by reference). This finding is consistent with that of Scherer, Hurles, and colleagues, who analyzed 270 individuals using microarray-based methods (Redon et al., 2006, which is incorporated herein by reference). Building upon the complete sequence of the human genome, efforts are underway to identify the underlying genetic link to common diseases and cancer by SNP and SV mapping or direct association. Technology developments focused on rapid, high-throughput, and low-cost DNA sequencing would facilitate the understanding and use of genetic information, such as SNPs and SVs, in applied medicine.
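By way of illustration only, the SNP frequency quoted above can be converted into an expected count for a region of a given length. The following is a minimal sketch; the function name and parameters are hypothetical and are not part of the cited references, and the calculation assumes SNPs are spread evenly across the region, which is a simplification.

```python
def expected_snp_range(region_bp: int,
                       sparse_spacing_bp: int = 1000,
                       dense_spacing_bp: int = 300) -> tuple[float, float]:
    """Return (lower, upper) expected SNP counts for a region.

    The defaults mirror the "1-in-300 to 1-in-1000 base pairs" range in the
    text. Uniform spacing is an illustrative assumption, not a biological claim.
    """
    if region_bp < 0:
        raise ValueError("region_bp must be non-negative")
    lower = region_bp / sparse_spacing_bp   # 1 SNP per 1000 bp
    upper = region_bp / dense_spacing_bp    # 1 SNP per 300 bp
    return lower, upper


if __name__ == "__main__":
    low, high = expected_snp_range(300_000)
    print(f"Expected SNPs in a 300 kb region: {low:.0f} to {high:.0f}")
```

Under these assumptions, a 300,000 base pair region would be expected to carry between 300 and 1000 SNPs.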
In general, 10%-to-15% of SNPs will affect protein function by altering specific amino acid residues, will affect the proper processing of genes by changing splicing mechanisms, or will affect the normal level of expression of the gene or protein by varying regulatory mechanisms. SVs may also play an important role in human biology and disease (Iafrate et al., 2004; Sebat et al., 2004; Tuzun et al., 2005; Stranger et al., 2007, which are incorporated herein by reference). It is envisioned that the identification of informative SNPs and SVs will lead to more accurate diagnosis of inherited disease, better prognosis of risk susceptibilities, or identification of sporadic mutations in tissue. One application of an individual's SNP and SV profile would be to significantly delay the onset or progression of disease with prophylactic drug therapies. Moreover, an SNP and SV profile of drug metabolizing genes could be used to prescribe a specific drug regimen to provide safer and more efficacious results. To accomplish these ambitious goals, genome sequencing will move into the resequencing phase, with the potential of partial sequencing of a large majority of the population. This would involve sequencing, in parallel, specific regions distributed throughout the human genome to obtain the SNP and SV profile for a given complex disease.
Sequence variations underlying most common diseases are likely to involve multiple SNPs, SVs, and a number of combinations thereof, which are dispersed throughout associated genes and exist in low frequency. Thus, DNA sequencing technologies that employ strategies for de novo sequencing are more likely to detect and/or discover these rare, widely dispersed variants than technologies targeting only known SNPs.
One example of how next-generation sequencing (NGS) technologies can be applied in the detection of SNPs, SVs, single nucleotide variants (SNVs), and a number of combinations thereof is cancer diagnostics. These assays have traditionally been a single-marker, single-assay approach that has recently progressed to assaying multiple markers with a single experimental approach. However, each cancer is genetically complex, often with many mutations occurring simultaneously in numerous genes. Therefore, traditional methods lead to expensive and time-consuming testing, while providing information only on a select few known sequence variants. Recent advances in NGS technologies have allowed targeted approaches that center on many medically actionable gene targets associated with various cancer types (see Su et al., 2011; Beadling et al., 2012). Due to recent successes of sequencing efforts, such as The Cancer Genome Atlas (TCGA) project, the International Cancer Genome Consortium (ICGC) project, and the Catalogue of Somatic Mutations in Cancer (COSMIC) database, there is a large compendium of knowledge regarding these gene targets in many cancer types and the effect of therapeutics on cancers containing those mutations (see Futreal et al., 2004). Additional work, in part as a result of the Pediatric Cancer Genome Project, has shown that pediatric cancers have distinct genetic profiles marked by fewer mutations and a prevalence of mutations in alternative molecular pathways (see Wu et al., 2012; Meldrum et al., 2011). The largest current unmet need in cancer diagnostics is a fast, high-throughput technology with the needed accuracy and sensitivity for early-stage detection to identify rare sequence variants that belong to a limited subpopulation of cells undergoing a cancerous transformation.
Traditionally, DNA sequencing has been accomplished by the “Sanger” or “dideoxy” method, which involves the chain termination of DNA synthesis by the incorporation of 2′,3′-dideoxynucleotides (ddNTPs) using DNA polymerase (Metzker et al., 2005, which is incorporated herein by reference). Since 2005, there has been a fundamental shift away from the application of automated Sanger sequencing for genome analysis. Advantages of next-generation sequencing (NGS) technologies include the ability to produce an enormous volume of data cheaply, in some cases in excess of a hundred million short sequence reads per instrument run. Many of these approaches are commonly referred to as sequencing-by-synthesis (SBS), a term that does not clearly delineate the different mechanics of sequencing DNA (Metzker, 2010; Metzker, 2005, which are incorporated herein by reference). DNA polymerase-dependent strategies have been classified as cyclic reversible termination (CRT), single nucleotide addition (SNA, e.g., pyrosequencing), and real-time sequencing. An approach whereby DNA polymerase is replaced by DNA ligase is referred to as sequencing-by-ligation (SBL). These approaches have been described in Metzker (2010), which is incorporated herein by reference.
Sequencing technologies include a number of methods that are grouped broadly as (a) template preparation, (b) sequencing and imaging, and (c) data analysis. The unique combination of specific protocols distinguishes one technology from another and determines the type of data produced from each platform. These differences in data output present challenges when comparing platforms based on data quality and cost. Although quality scores and accuracy estimates are provided by each manufacturer, there is no consensus that a ‘quality base’ from one platform is equivalent to that from another platform.
Two methods are used in preparing templates for NGS reactions: clonally amplified templates originating from single DNA molecules, and single DNA molecule templates. Sequencing methods that use DNA polymerases are classified as cyclic reversible termination (CRT), single-nucleotide addition (SNA), and real-time sequencing (see Metzker, 2010). Sequencing by ligation (SBL), an approach in which DNA polymerase is replaced by DNA ligase, has also been used in NGS technologies (see, e.g., Shendure et al., 2005; Valouev et al., 2008). Imaging methods coupled with these sequencing strategies range from measuring bioluminescent signals to four-color imaging of single molecular events. The voluminous data produced by these NGS platforms place substantial demands on information technology in terms of data storage, tracking, and quality control (see Pop & Salzberg, 2008).
The need for robust methods that produce a representative, non-biased source of nucleic acid material from the genome under investigation remains an important goal. Current methods generally involve randomly breaking genomic DNA into smaller sizes from which either fragment templates or mate-pair templates are created. A common theme among NGS technologies is that the template is attached or immobilized to a solid surface or support. The immobilization of spatially separated template sites allows thousands to billions of sequencing reactions to be performed simultaneously.
Although clonally amplified methods offer certain advantages over bacterial cloning, some of the protocols are cumbersome to implement and require a large amount of genomic DNA material (3-20 μg). The preparation of single-molecule templates is more straightforward and requires less starting material (<1 μg). Moreover, these methods do not require PCR, which creates mutations in clonally amplified templates that masquerade as sequence variants. AT-rich and GC-rich target sequences may also show amplification bias in product yield, which results in their underrepresentation in genome alignments and assemblies. Quantitative applications, such as RNA-seq (see Wang et al., 2009), perform more effectively with non-amplified template sources, which do not alter the representational abundance of mRNA molecules.
An important aspect of the CRT method is the reversible terminator, of which there are two main types: 3′-O-blocked and 3′-OH unblocked (Metzker, 2010). The use of a ddNTP, which acts as a chain terminator in Sanger sequencing, provided the basis for the initial development of reversible blocking groups attached to the 3′-end of nucleotides (Metzker et al., 1994; Canard & Sarfati, 1994). Blocking groups such as 3′-O-allyl-dNTPs (Metzker et al., 1994; U.S. Pat. No. 6,664,079; Ju et al., 2006; U.S. Pat. No. 7,057,026; U.S. Pat. No. 7,345,159; U.S. Pat. No. 7,635,578; U.S. Pat. No. 7,713,698) and 3′-O-azidomethyl-dNTPs (U.S. Pat. No. 7,057,026; Guo et al., 2008; Bentley et al., 2008; U.S. Pat. No. 7,414,116; U.S. Pat. No. 7,541,444; U.S. Pat. No. 7,592,435; U.S. Pat. No. 7,556,537; U.S. Pat. No. 7,771,973) have been used in CRT. 3′-O-Blocked terminators require the cleavage of two chemical bonds to remove the fluorophore from the nucleobase and restore the 3′-OH group. A drawback in using these reversible terminators is that the blocking group attached to the 3′-end typically causes a bias against incorporation by DNA polymerase. Mutagenesis of DNA polymerase is often required to facilitate incorporation of 3′-O-blocked terminators. Large numbers of genetically engineered DNA polymerases, containing one or more amino acid substitutions, insertions, and/or deletions, have to be created by either site-directed or random mutagenesis and then identified by high-throughput screening with the goal of incorporating 3′-blocked nucleotides more efficiently.
The difficulty in identifying a modified enzyme that efficiently incorporates 3′-O-blocked terminators by screening large libraries of mutant DNA polymerases has led to the development of 3′-unblocked reversible terminators. It was demonstrated that a small photocleavable group attached to the base of a 3′-OH unblocked nucleotide can act as an effective reversible terminator and be efficiently incorporated by wild-type DNA polymerases (Wu et al., 2007; Metzker, 2010; Litosh et al., 2011; Gardner et al., 2012; U.S. Pat. Nos. 7,897,737, 7,964,352, and 8,148,503; U.S. Patent Appl. Publication 2011/0287427). For example, 5-hydroxymethyl-2′-deoxyuridine (HOMedU) is found naturally in the genomes of numerous bacteriophages and lower eukaryotes (Gommers-Ampt, 1995, which is incorporated herein by reference). Its hydroxymethyl group can serve as a molecular handle to attach a small photocleavable terminating group. Other naturally occurring hypermodified bases that can be further modified to function as reversible terminators include 5-hydroxymethyl-2′-deoxycytidine (HOMedC), which is found naturally in the genomes of T2, T4, and T6 bacteriophages (Wyatt & Cohen, 1953; Gommers-Ampt, 1995) and of mammals (Kriaucionis & Heintz, 2009; Tahiliani et al., 2009; Ito et al., 2010). The pyrrolopyrimidine ring structure (7-deazapurine) is also found naturally in nucleoside antibiotics (Carrasco & Vázquez, 1984, which is incorporated herein by reference) and tRNA bases (Limbach et al., 1994, which is incorporated herein by reference), and the compounds 7-deaza-7-hydroxymethyl-2′-deoxyadenosine (C7-HOMedA) (Rockhill et al., 1997) and 7-deaza-7-hydroxymethyl-2′-deoxyguanosine (C7-HOMedG) (McDougall et al., 2001) have been reported.
One aspect of the present invention is the use of a modified 2-nitrobenzyl group attached to the nucleobase of hydroxymethyl nucleosides and nucleotides. Described over a half century ago, solutions of 2-nitrotoluene (Wettermark, 1962) and its derivatives (Wettermark, 1962; Hardwick et al., 1960; Mosher et al., 1960; Sousa & Weinstein, 1962; Weinstein et al., 1966) were reported to exhibit the property of photochromism, a phenomenon considered to be the result of transient formation of an aci-nitro anion intermediate (Weinstein et al., 1966; Morrison, 1969). Without being bound by theory, it is generally accepted that absorption of a photon by the nitro group results in hydrogen abstraction from the α-carbon (Mosher et al., 1960; Berson & Brown, 1955; De Mayo, 1960), formation of the aci-nitro anion intermediate, and then release of the ‘caged’ effector molecule and creation of a nitrosocarbonyl by-product (Corrie, 2005). These early studies suggested that α-substitution of the benzylic carbon (Wettermark, 1962) or substitution of the 4-position of the benzene ring with an electron-donating group (Sousa & Weinstein, 1962; Weinstein et al., 1966) increased the rate of the photochromic effect. These findings led to the development of photosensitive 2-nitrobenzyl protecting groups (Barltrop et al., 1966; Patchornik, 1968; Patchornik et al., 1970). The degree to which the rate of photochemical cleavage is altered, however, typically depends on numerous factors that are reported to include substitution of the benzylic carbon (Walker et al., 1986; Hasan et al., 1997; Giegrich et al., 1998), functional group(s) attached to the benzyl ring (Wootton & Trentham, 1989; Hasan et al., 1997; Giegrich et al., 1998), and the leaving group (Walker et al., 1986), as well as pH (McCray et al., 1980; Walker et al., 1986; Wootton & Trentham, 1989), solvent (Sousa & Weinstein, 1962; McGall et al., 1997; Giegrich et al., 1998), and light intensity (McCray et al., 1980; McGall et al., 1997).
One property that has not been studied, however, is stereochemistry, whereby substitution of 2-nitrobenzyl's benzylic or α-carbon results in a chiral center. For the case of nucleotide synthesis, coupling of a racemic α-substituted 2-nitrobenzyl alcohol would result in two diastereomers, which differ only by the absolute configuration (R or S) at the benzylic carbon.
Another class of 3′-OH unblocked nucleotides has been described by Mitra et al. (2003) and Turcatti et al. (2008), which rely on steric hindrance of the bulky dye group to stop incorporation after the addition of the first nucleotide. It is noted that the substituted 2-nitrobenzyl nucleotide analogs described by Wu et al. (2007), Litosh et al. (2011), and Gardner et al. (2012) cause termination of DNA synthesis without the requirement of bulky substituents such as fluorescent dyes. A further class of 3′-unblocked nucleotides has been described by Helicos Biosciences. These nucleotides use a second nucleoside or nucleotide analog that acts as an inhibitor of DNA synthesis (Bowers et al., 2009; U.S. Pat. No. 7,476,734). A significant difference in termination properties is observed when comparing compounds of the present invention with those described by Bowers et al. For example, Bowers et al. described pre-steady-state kinetics employing two-base homopolymer templates, for which kpol(+2) rates were measured for all of their 3′-OH unblocked ‘virtual’ terminators. Bowers et al. conducted their termination experiments at submicromolar nucleotide concentrations (i.e., from 100 to 250 nM). In contrast, termination assays for several compounds of the present invention were performed at 10 μM over the time course of 0.5 to 20 min. Both compounds dU.V and dU.VI were rapidly incorporated at the first base position (100% by 2 min) and then terminated DNA synthesis at that position. No appreciable signal could be detected at the expected second-base position up to incubation times of 20 min. See Gardner et al. (2012) for more details.
3′-OH unblocked reversible terminators typically have several advantages over 3′-O-blocked nucleotides. For example, for many 3′-OH unblocked reversible terminators the cleavage of only a single bond removes both the terminating and fluorophore groups from the nucleobase. This in turn results in a more efficient strategy for restoring the nucleotide for the next CRT cycle. A second advantage of 3′-OH unblocked reversible terminators is that many of these compounds show more favorable enzymatic incorporation and, in some cases, can be incorporated as well as a natural nucleotide with wild-type DNA polymerases (Wu et al., 2007; Litosh et al., 2011; Gardner et al., 2012; U.S. Pat. No. 7,897,737; U.S. Pat. No. 7,964,352; U.S. Pat. No. 8,148,503; U.S. Patent Appl. Publication 2011/0287427), although in other cases this efficiency has not been observed (Bowers et al., 2009; U.S. Pat. No. 7,476,734). One challenge for 3′-OH unblocked terminators is creating the appropriate modifications to the base that lead to termination of DNA synthesis after a single base addition. This is important because an unblocked 3′-OH group is the natural substrate for incorporating the next incoming nucleotide.
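By way of illustration only, the cyclic reversible termination scheme discussed above (incorporate one terminator, image it, cleave, repeat) can be sketched as a toy model. The following is a simplified sketch and not a description of any actual instrument or chemistry: the class and function names are hypothetical, the template is assumed to be given in the order the polymerase traverses it, incorporation and cleavage are assumed to be complete in every cycle, and a single cleavage step is modeled as removing both the terminating group and the fluorophore, as described for many 3′-OH unblocked reversible terminators.

```python
from dataclasses import dataclass

# Watson-Crick pairing used to choose the incoming nucleotide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


@dataclass
class IncorporatedBase:
    """Hypothetical model of a base-modified, 3'-OH unblocked terminator."""
    base: str
    has_terminating_group: bool = True
    has_fluorophore: bool = True


def cleave(nucleotide: IncorporatedBase) -> None:
    """Model one cleavage event that removes both groups from the nucleobase."""
    nucleotide.has_terminating_group = False
    nucleotide.has_fluorophore = False


def crt_read(template: str, cycles: int) -> str:
    """Run idealized CRT cycles and return the base calls, one per cycle."""
    strand: list[IncorporatedBase] = []
    calls: list[str] = []
    for _ in range(cycles):
        if len(strand) == len(template):
            break  # end of template reached
        # Extension is only possible once the previous terminator is cleaved.
        if strand and strand[-1].has_terminating_group:
            raise RuntimeError("previous base still carries a terminating group")
        # Step 1: incorporate exactly one complementary terminator.
        incoming = IncorporatedBase(COMPLEMENT[template[len(strand)]])
        strand.append(incoming)
        # Step 2: image the fluorophore to call the base.
        calls.append(incoming.base)
        # Step 3: cleave to restore a natural base for the next cycle.
        cleave(incoming)
    return "".join(calls)


if __name__ == "__main__":
    print(crt_read("ATGC", cycles=4))
```

In this idealized model each cycle yields exactly one base call. Incomplete termination or incomplete cleavage, which the sketch deliberately omits, is what makes the termination and cleavage properties of real reversible terminators important.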
Next-generation sequencing (NGS) technologies have facilitated important biomedical discoveries, yet chemistry improvements are still needed for a number of reasons, including reducing error rates and shortening slow cycle times. To be effective in NGS assays, it is typically desirable for reversible terminators to exhibit a number of ideal properties including, for example, fast kinetics of nucleotide incorporation, single-base termination, high nucleotide selectivity, and/or rapid cleavage of the terminating group. Thus, there is a need for developing new nucleosides and nucleotides that meet these challenges.