I. Field of the Invention
The present invention relates generally to compositions and methods for DNA sequencing and other types of DNA analysis. More particularly, the invention relates in part to nucleotides and nucleosides with chemically cleavable, photocleavable, enzymatically cleavable, or non-photocleavable groups and methods for their use in a number of DNA sequencing methods and their applications in biomedical research.
II. Description of Related Art
Methods for rapidly sequencing DNA are needed for analyzing diseases and mutations in the population and for developing therapies. Commonly observed forms of human sequence variation are single nucleotide polymorphisms (SNPs), which occur in approximately 1-in-300 to 1-in-1000 base pairs of genomic sequence, and structural variants (SVs), including block substitutions, insertion/deletions, inversions, segmental duplications, and copy number variants. Structural variants can account for 22% of all variable events and more variant bases than those contributed by SNPs (Levy et al., 2007, which is incorporated herein by reference). This finding is consistent with that of Scherer, Hurles, and colleagues, who analyzed 270 individuals using microarray-based methods (Redon et al., 2006, which is incorporated herein by reference). Building upon the complete sequence of the human genome, efforts are underway to identify the underlying genetic link to common diseases by SNP mapping or direct association. Technology developments focused on rapid, high-throughput, and low-cost DNA sequencing would facilitate the understanding and use of genetic information, such as SNPs, in applied medicine.
In general, 10%-to-15% of SNPs will affect protein function by altering specific amino acid residues, will affect the proper processing of genes by changing splicing mechanisms, or will affect the normal level of expression of the gene or protein by varying regulatory mechanisms. SVs may also play an important role in human biology and disease (Iafrate et al., 2004; Sebat et al., 2004; Tuzun et al., 2005; Stranger et al., 2007, which are incorporated herein by reference). It is envisioned that the identification of informative SNPs and SVs will lead to more accurate diagnosis of inherited disease, better prognosis of risk susceptibilities, or identification of sporadic mutations in tissue. One application of an individual's SNP and SV profile would be to significantly delay the onset or progression of disease with prophylactic drug therapies. Moreover, an SNP and SV profile of drug metabolizing genes could be used to prescribe a specific drug regimen to provide safer and more efficacious results. To accomplish these ambitious goals, genome sequencing will move into the resequencing phase, with the potential of partial sequencing of a large majority of the population. This would involve sequencing, in parallel, specific regions distributed throughout the human genome to obtain the SNP and SV profile for a given complex disease.
Sequence variations underlying most common diseases are likely to involve multiple SNPs, SVs, and a number of combinations thereof, which are dispersed throughout associated genes and exist at low frequency. Thus, DNA sequencing technologies that employ strategies for de novo sequencing are more likely to detect and/or discover these rare, widely dispersed variants than technologies targeting only known SNPs.
Traditionally, DNA sequencing has been accomplished by the “Sanger” or “dideoxy” method, which involves the chain termination of DNA synthesis by the incorporation of 2′,3′-dideoxynucleotides (ddNTPs) using DNA polymerase (Sanger et al., 1977, which is incorporated herein by reference). The reaction also includes the natural 2′-deoxynucleotides (dNTPs), which extend the DNA chain by DNA synthesis. Balanced appropriately, competition between chain extension and chain termination results in the generation of a set of nested DNA fragments, which are uniformly distributed over thousands of bases and differ in size by one-base increments. Electrophoresis is used to resolve the nested DNA fragments by their respective size. The ratio of dNTP/ddNTP in the sequencing reaction determines the frequency of chain termination, and hence the distribution of lengths of terminated chains. The fragments are then detected via the prior attachment of four different fluorophores to the four bases of DNA (i.e., A, C, G, and T), which fluoresce their respective colors when irradiated with a suitable laser source. To date, Sanger sequencing has been the most widely used method for discovery of SNPs by direct PCR sequencing (Gibbs et al., 1989, which is incorporated herein by reference) or genomic sequencing (Hunkapiller et al., 1991; International Human Genome Sequencing Consortium, 2001, which are incorporated herein by reference).
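By way of illustration only, the dependence of fragment length on the dNTP/ddNTP ratio can be sketched with a simple geometric model. The sketch below assumes, as a simplification not stated above, that the polymerase incorporates dNTPs and ddNTPs with equal efficiency and that the termination probability is the same at every position; the function names are hypothetical.

```python
def termination_probability(dntp_conc, ddntp_conc):
    """Per-position chance of chain termination.

    Assumption: the polymerase incorporates dNTP and ddNTP with equal
    efficiency, so the chance is simply the ddNTP mole fraction.
    """
    return ddntp_conc / (dntp_conc + ddntp_conc)


def fragment_length_pmf(p, n):
    """Probability that a chain terminates exactly at position n (n >= 1).

    Geometric model: n - 1 extensions followed by one termination.
    """
    return (1.0 - p) ** (n - 1) * p


def mean_fragment_length(p):
    """Expected length of a terminated chain under the geometric model."""
    return 1.0 / p


if __name__ == "__main__":
    # Illustrative concentrations in arbitrary units (not from the text).
    p = termination_probability(dntp_conc=99.0, ddntp_conc=1.0)
    print("termination probability per base:", p)
    print("mean terminated-chain length:", mean_fragment_length(p))
    print("P(length = 50):", fragment_length_pmf(p, 50))
```

Under these assumptions, raising the dNTP/ddNTP ratio lowers the per-position termination probability and shifts the population of terminated chains toward longer fragments.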
Advantages of next-generation sequencing (NGS) technologies include the ability to produce an enormous volume of data cheaply, in some cases in excess of a hundred million short sequence reads per instrument run. Many of these approaches are commonly referred to as sequencing-by-synthesis (SBS), a term that does not clearly delineate the different mechanics of sequencing DNA (Metzker, 2005, which is incorporated herein by reference). Here, the DNA polymerase-dependent strategies are classified as cyclic reversible termination (CRT), single nucleotide addition (SNA, e.g., pyrosequencing), and real-time sequencing. An approach whereby DNA polymerase is replaced by DNA ligase is referred to as sequencing-by-ligation (SBL).
There is a great need for developing new sequencing technologies, with potential applications spanning diverse research sectors including comparative genomics and evolution, forensics, epidemiology, and applied medicine for diagnostics and therapeutics. Current sequencing technologies are often too expensive, labor intensive, and time consuming for broad application in human sequence variation studies. Genome center cost is calculated on the basis of dollars per 1,000 Q20 bases and can be generally divided into the categories of instrumentation, personnel, reagents and materials, and overhead expenses. Currently, these centers are operating at less than one dollar per 1,000 Q20 bases with at least 50% of the cost resulting from DNA sequencing instrumentation alone. Developments in novel detection methods, miniaturization in instrumentation, microfluidic separation technologies, and an increase in the number of assays per run will most likely have the biggest impact on reducing cost.
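For illustration, the cost metric described above can be sketched as follows. A Q20 base is taken here to mean a base call with a Phred quality score of at least 20, i.e., an estimated error probability of at most 1 in 100. The function names and the example figures are hypothetical and are not drawn from any genome center's accounting.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def count_q20_bases(quality_scores, threshold=20):
    """Count base calls whose Phred score meets the Q20 threshold."""
    return sum(1 for q in quality_scores if q >= threshold)


def cost_per_1000_q20(total_cost_dollars, quality_scores):
    """Dollars spent per 1,000 bases called at Q20 or better."""
    n_q20 = count_q20_bases(quality_scores)
    if n_q20 == 0:
        raise ValueError("no Q20 bases; cost per 1,000 Q20 bases is undefined")
    return total_cost_dollars / (n_q20 / 1000)


if __name__ == "__main__":
    # Hypothetical run: 1,500 high-quality calls and 500 low-quality calls.
    scores = [30] * 1500 + [10] * 500
    print("error probability at Q20:", phred_to_error_probability(20))
    print("Q20 bases:", count_q20_bases(scores))
    print("dollars per 1,000 Q20 bases:", cost_per_1000_q20(1.0, scores))
```

Because only bases meeting the quality threshold are counted in the denominator, low-quality reads raise the cost per 1,000 Q20 bases even when the total expenditure is unchanged.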