Provided are methods and devices for characterizing polynucleotides, such as determining nucleotide sequences in double stranded DNA.
DNA, the program for life, is encoded using four chemical bases called adenine (A), guanine (G), cytosine (C) and thymine (T), which are paired together in a complementary fashion (A to T and C to G) and ordered in a species-specific sequence. The aim of genomic science is to predict biological behavior using the information stored in the DNA sequence within each cell. But when the first draft sequence of the human genome emerged in early 2001,6 despite its enormous value to genetics, it quickly became apparent that our understanding of the relationship between the genetic code and cellular function was deficient. For example, only 5% of the human genome is conserved, and of that, only 30% lies within the exons of known protein-encoding genes.7 The rest lies in the so-called “dark matter” of the human genome, which has led to efforts such as the ENCyclopedia Of DNA Elements (ENCODE)8 that strive to identify regulatory components. Identifying genes and controlling regions, such as promoter sites, turns out to be a major undertaking in itself.
To glean more information about how genetics informs cellular function, and therefore its effect on development and disease, it is essential that we learn to sequence rapidly and economically using a minute amount of material. The tasks can be categorized as follows. On one hand, in terms of de novo sequencing, not only are there many species with unsequenced genomes, but the human microbiome, the collective genome sequence of the many different species of bacteria living in or on humans, still remains a mystery. The microbiome of the flora of our gut alone is estimated to contain ~300 billion base pairs (300 Gbp), or ~100× the human genome.9 On the other hand, the vast majority of the work ahead involves re-sequencing genomes with an already known base sequence. The first, obvious example is mutation sequencing, where recent work has shown that most human cancers do not consistently carry mutations in the same locations, or even in the same genes.10 Moreover, the mutations and genotype of the individual have been shown to be important for chemotherapeutic effectiveness; i.e., genomics can determine a drug's effectiveness on an individual.11
Sequencing can also provide clues to health and development beyond the actual genomic sequence itself. The proteins expressed by genes represent the machinery of the cell: they make things work. But an individual organism can express the same genes differently depending on the epigenetic profile. High-throughput sequencing aspires to determine this profile. For example, it can give information on DNA-binding protein interaction, using ChIP-seq to find the locations of occupied binding sites.12 With inexpensive, high-throughput sequencing, we will be able to determine the difference between these binding sites in different tissues and under different conditions. Moreover, using ChIP-seq we can also achieve single-base resolution of the genomic histone code, one of the epigenetic regulators of chromatin structure and gene expression.13 Sequencing can also be used to determine DNA methylation patterns, a reversible modification of cytosines (in mammals) that alters protein binding (see, e.g., U.S. Pat. App. No. 61/139,056, hereby incorporated by reference). Subsequent sequencing and alignment may be used to distinguish methylated from unmethylated cytosines, illuminating the methylation pattern.14 It may also be advantageous for gene expression studies to sequence the transcriptome, i.e., the sequence of the RNA extracted from cells or tissues. This can give detailed information about the levels of expression and the splicing variation, and can even allow for the identification of new non-coding RNAs, which may be involved in regulation and are part of the “dark matter” of the genome.15
The study of all of these “-omes” would be facilitated by technology that inexpensively and quickly determines sequence information from a genetic sample. For this reason, ultra-low-cost sequencing technologies have been identified as a scientific priority, and significant effort is being devoted to their study and development.
Since its development in 1977, the Sanger method of DNA sequencing has transformed biology; it is the standard to which all other methods of sequencing are compared.16 The basis for Sanger sequencing is the polymerase chain reaction (PCR), which is used in combination with dideoxynucleotide triphosphates to prematurely terminate the elongation reaction. The classical chain-termination method requires a DNA template, a DNA primer, a DNA polymerase, nucleotides, and fluorescently labeled nucleotides that terminate DNA strand elongation. By mixing fluorescently labeled dideoxynucleotides with deoxynucleotides, the PCR reaction is prematurely terminated, leading to fragmentary single-stranded copies of the template that differ in length, each with its last base labeled with a fluorescent moiety that depends on the base. After these fragments are separated by size through electrophoresis, the sequence can be determined from the color of fluorescence produced at a given length.
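The readout logic of the chain-termination method described above can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not a description of any actual instrument: the function names, the per-base termination probability, and the number of copies are all hypothetical, and the synthesized strand is modeled directly rather than derived from its complementary template.

```python
import random


def chain_termination_fragments(synthesized_strand, n_copies=2000,
                                p_terminate=0.05, seed=0):
    """Simulate chain-terminated copies of a strand (toy model).

    Each copy is extended base by base; at every position a labeled
    terminator is incorporated with probability `p_terminate` (an assumed
    value), which stops elongation. A fragment is recorded as
    (length, label), where the label stands in for the fluorescent color
    of the terminal base. Copies that never terminate are discarded.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_copies):
        for position, base in enumerate(synthesized_strand, start=1):
            if rng.random() < p_terminate:
                fragments.append((position, base))
                break
    return fragments


def read_sequence(fragments):
    """Order fragments by length (the electrophoresis step) and read the
    terminal label observed at each length."""
    label_at_length = {}
    for length, label in fragments:
        label_at_length[length] = label
    return "".join(label_at_length[n] for n in sorted(label_at_length))


if __name__ == "__main__":
    strand = "GATTACAGGCTA"
    fragments = chain_termination_fragments(strand)
    print(read_sequence(fragments))
```

In this sketch a length with no terminated fragment would simply be skipped in the readout, so enough copies are needed for every length to be represented; this loosely mirrors why the method needs many template copies.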
Though functional, this procedure is problematic for several reasons. The template read length using this method is limited to ~800 bp. This introduces significant challenges, especially for de novo sequencing, requiring that either chromosome walking or shotgun sequencing be used, both of which are time consuming and require re-assembly of the completed sequence. The chain termination reaction is also time consuming, as is electrophoretic separation, which has led to the development of massively parallel methods for sequencing.17 But the overarching problems with the Sanger sequencing method are the relatively large amount of DNA required (amplification leads to errors) and the expense of the reagents for labeling and separation.
There are emerging technologies that have the potential to supersede conventional Sanger sequencing and in some cases sequence the human genome for $1,000 or less. Shendure et al. have analyzed these technologies in detail.18 They can be loosely categorized as: bioMEMs, which is just an extension of conventional electrophoretic methods through miniaturization and integration; sequencing-by-hybridization, which uses the differential hybridization of oligonucleotide probes to decode the DNA sequence; massively parallel signature sequencing (MPSS), which is based on cycles of restriction digestion and ligation; and finally, non-enzymatic, real-time single-molecule sequencing.
BioMEMs has the advantage that it relies on the same tested principles as electrophoretic sequencing, which has already been used to sequence 10^11 nucleotides. Using variations of the Sanger process in conjunction with capillary array electrophoresis to separate deoxyribonucleotide triphosphate fragments, about 100 bp can be sequenced per minute at a cost of <$1 with an accuracy of about 99.99%, which is considered to be the gold standard, but it seems unlikely that a factor of 100,000× cost reduction will be achieved through scaling and integration alone. Hybridization sequencing has the advantage that the data collection method, i.e., scanning the fluorescence emitted by labeled DNA that has been hybridized to an array of probe sequences, is compatible with high throughput, but probes have to be designed that avoid cross-hybridization to the wrong target. This renders 50% of the chromosome inaccessible. Methods such as sequencing-by-synthesis, cyclic-array sequencing on amplified molecules, and MPSS, which all rely on some form of isolated clonal amplification, are costly and often problematic. For example, they may experience a low frequency of nucleotide misincorporation or non-incorporation, which manifests itself in signal decay through “dephasing”. In contrast, cyclic-array sequencing on single molecules eliminates the costly PCR-amplification step and requires less starting material with no risk of dephasing, but achieving the signal-to-noise ratio required for single-molecule detection is still a challenge.
According to Mardis,19 right now the Roche GS-FLX (454) sequencer uses emulsion PCR to produce 100 Mb of data in 7 h with a 250 bp read length (per bead) at a cost of $8,439, or $84.40 per Mb. In contrast, a run in an Applied Biosystems SOLiD (sequencing by oligo ligation) sequencer requires 5 days and produces 3-4 Gb of sequence data with an average read length of 25-35 bp, costing $5.81 per Mb. Applied Biosystems estimates that their SOLiD sequencer will be able to sequence an entire human genome for only $10,000 in just 2 weeks. Following Shendure's analysis,18 for re-sequencing, the error rate has to be less than the expected variation in the sequence. Since human chromosomes differ at approximately 1 in every 1,000 bp, an error rate of 1 per 100 kbp would be needed to ensure confidence. If the accuracy of a raw read is 99.7% (current state of the art), then 3× coverage of each base will yield approximately this error rate. Ensuring a minimum 3× coverage of >95% of the diploid human genome requires about 6.5× average coverage, or about 40 billion raw bases; at a total cost of <$10,000, this corresponds to about 4 million bases per $1. If an improvement over SOLiD performance is derived simply from an increase in the acquisition rate per device, we would therefore need to sequence at a rate of ~330,000 bp/s to reach a $1,000 genome. No assembly is required in re-sequencing a genome; a read need only be long enough to be matched to a unique location in an assembled reference genome and to show how it differs from the reference. In the mammalian genome, only ~73% of 20-bp genomic reads of the kind SOLiD uses can be assigned to a single unique location. Achieving >95% uniqueness will require reads >60 bp. Thus, a re-sequencing instrument that can deliver a $1,000 human genome with reasonable coverage and accuracy will need to achieve >60 bp reads with 99.7% raw-base accuracy, acquiring data at a rate of 330,000 bp/s, or 1 bp per 3 μs. A faster instrument with longer reads will be cheaper still.
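The throughput figures in the preceding paragraph can be reproduced with a few lines of arithmetic. The Python sketch below is one reading of the numbers, not a statement of how they were originally derived. It assumes a diploid genome of about 6 billion bp, rounds the raw-base requirement to 40 billion as the text does, and assumes that cost falls in proportion to the acquisition rate per device, so that a tenfold cost reduction needs a tenfold faster run. All constant and function names are hypothetical.

```python
# Figures quoted in the text, plus one labeled assumption.
DIPLOID_GENOME_BP = 6e9               # assumption: ~3 Gbp haploid genome x 2
AVERAGE_COVERAGE = 6.5                # average coverage quoted in the text
RAW_BASES_ROUNDED = 40e9              # "about 40 billion raw bases"
SOLID_GENOME_COST_USD = 10_000        # estimated SOLiD cost per genome
SOLID_RUN_SECONDS = 14 * 24 * 3600    # "2 weeks"
TARGET_GENOME_COST_USD = 1_000        # the $1,000 genome


def raw_bases_needed(genome_bp=DIPLOID_GENOME_BP, coverage=AVERAGE_COVERAGE):
    """Raw bases to acquire for a given average coverage."""
    return genome_bp * coverage


def bases_per_dollar(raw_bases, total_cost_usd):
    """Raw bases obtained per dollar at a given total cost."""
    return raw_bases / total_cost_usd


def required_rate_bp_per_s(raw_bases, run_seconds, cost_now_usd, cost_target_usd):
    """Acquisition rate needed if cost scales inversely with rate (assumed)."""
    speedup = cost_now_usd / cost_target_usd
    return raw_bases * speedup / run_seconds


if __name__ == "__main__":
    print(f"raw bases at 6.5x coverage: {raw_bases_needed():.3g}")
    print(f"bases per $1 at $10,000:    "
          f"{bases_per_dollar(RAW_BASES_ROUNDED, SOLID_GENOME_COST_USD):.3g}")
    rate = required_rate_bp_per_s(RAW_BASES_ROUNDED, SOLID_RUN_SECONDS,
                                  SOLID_GENOME_COST_USD, TARGET_GENOME_COST_USD)
    print(f"required rate:              {rate:.3g} bp/s")
    print(f"time per base:              {1e6 / rate:.2f} microseconds")
```

Under these assumptions the required rate comes out near the ~330,000 bp/s quoted above, which corresponds to roughly one base every 3 μs. Using the unrounded 39 billion bases instead gives a slightly lower rate.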
Single-molecule DNA sequencing represents the logical endpoint in the development of sequencing technology: it extracts the maximum amount of information from a minimum of material and pre-processing. When paired with a high-throughput, low-cost instrument, it would change the genomic flow of data from a trickle to a deluge. Specifically, the low material requirement coupled with quick results would allow for easy sequencing of precious primary samples from human patients, e.g., allowing doctors to sequence a biopsy from a tumor to determine the best chemotherapy. Moreover, it would represent a leap forward in determining the epigenome, the non-genetic marks on DNA that affect gene expression.