Recent developments in DNA sequencing technology have raised the possibility of highly personalized, preventive medicine on the genomic level. Additionally, the possibility of rapidly acquiring large amounts of sequence data from multiple individuals within one or more populations may usher in a new phase of the genomics revolution in biomedical science.
Single base differences between genotypes can have substantial phenotypic effects. For example, over 300 mutations have been identified in the gene encoding phenylalanine hydroxylase (PAH), the enzyme that converts phenylalanine to tyrosine in phenylalanine catabolism and in protein and neurotransmitter biosynthesis. These mutations result in deficient enzyme activity and lead to the disorders hyperphenylalaninaemia and phenylketonuria. See, e.g., Jennings et al., Eur J Hum Genet 8, 683-696 (2000).
Sequence data can be obtained using the Sanger sequencing method, in which labeled dideoxy chain terminator nucleotide analogs are incorporated in a bulk primer extension reaction and products of differing lengths are resolved and analyzed to determine the identity of the incorporated terminator. See, e.g., Sanger et al., Proc Natl Acad Sci USA 74, 5463-5467 (1977). Indeed, many genome sequences have been determined using this technology. However, the cost and speed of acquiring sequence data by Sanger sequencing can be limiting.
New sequencing technologies can produce sequence data at an astounding rate of hundreds of megabases per day, at a lower cost per base than Sanger sequencing. See, e.g., Kato, Int J Clin Exp Med 2, 193-202 (2009). However, the raw data obtained using these sequencing technologies can be more error prone than data from traditional Sanger sequencing. This higher error rate can result from obtaining information from individual DNA molecules instead of a bulk population.
For example, in single molecule sequencing by synthesis, a base could be skipped because the device misses a weak signal, because fluorescent dye bleaching leaves no signal, or because the polymerase acts too fast to be detected by the device. Each of these events results in a deletion error in the raw sequence. Similarly, mutation errors and insertion errors can also occur at a higher frequency, for the simple reason that signals are potentially weaker and reactions faster than in conventional methods.
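The three error types described above can be illustrated with a small simulation. The following is a minimal sketch, not a model of any particular instrument: the function name `simulate_raw_read` and the per-base probabilities `p_del`, `p_ins`, and `p_sub` are hypothetical, and each template base is assumed to suffer at most one error, independently of its neighbors.

```python
import random

BASES = "ACGT"


def simulate_raw_read(template, p_del=0.05, p_ins=0.02, p_sub=0.02, seed=None):
    """Return a noisy copy of `template` with deletions, insertions, and substitutions.

    All probabilities are illustrative assumptions, applied per template base.
    """
    rng = random.Random(seed)
    read = []
    for base in template:
        r = rng.random()
        if r < p_del:
            # Deletion: the base is skipped (e.g. missed weak signal, bleached dye).
            continue
        if r < p_del + p_ins:
            # Insertion: a spurious base is reported before the true base.
            read.append(rng.choice(BASES))
            read.append(base)
        elif r < p_del + p_ins + p_sub:
            # Substitution ("mutation" error): a different base is reported.
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            # Correct call.
            read.append(base)
    return "".join(read)


if __name__ == "__main__":
    template = "ACGTTGCAAGGCTTAACCGT"
    print(template)
    print(simulate_raw_read(template, p_del=0.15, p_ins=0.05, p_sub=0.05, seed=7))
```

Raising `p_del` relative to the other two rates mimics the deletion-dominated error profile described above; the specific values shown are placeholders, not measured rates.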
Low accuracy sequence data is more difficult to assemble. In large scale sequencing, such as sequencing a complete eukaryotic genome, the DNA molecules are fragmented into smaller pieces. These pieces are sequenced in parallel, and then the resultant reads are assembled to reconstruct the whole sequence of the original sample DNA molecules. The fragmentation can be achieved, for example, by mechanical shearing or enzymatic cleavage.
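The fragment-then-assemble workflow can be sketched as follows. This is a toy illustration under strong assumptions: reads are error-free, fragmentation is idealized as fixed-length windows taken at a fixed step rather than random shearing or enzymatic cleavage, the sequence contains no repeats longer than the minimum overlap, and assembly uses a simple greedy suffix-prefix merge. The names `fragment`, `overlap`, and `greedy_assemble` are hypothetical.

```python
def fragment(sequence, read_len, step):
    """Cut overlapping reads of length `read_len`, one every `step` bases."""
    reads = [sequence[i:i + read_len]
             for i in range(0, len(sequence) - read_len + 1, step)]
    # Add a final read if the regular windows did not reach the end.
    if (len(sequence) - read_len) % step != 0:
        reads.append(sequence[-read_len:])
    return reads


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the contigs cannot be joined
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


if __name__ == "__main__":
    genome = "ATGGCGTACCTTAGCAAGTC"
    reads = fragment(genome, read_len=8, step=4)
    print(reads)
    print(greedy_assemble(reads, min_len=3))
```

Because the merge step relies on exact suffix-prefix matches, even a single error inside an overlap region prevents two reads from being joined, which is one way to see why low accuracy reads are harder to assemble.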
Assembly of small reads of sequence into a large genome requires that the fragmented reads be accurate enough to be correctly grouped together. This is generally true for the raw sequencing data generated from the Sanger method, which can have a raw data accuracy of higher than 95%. Accurate single molecule sequencing technology could be applied to detect single-base modifications or mutations in nucleic acid samples. However, the raw data accuracy for single molecule sequencing technologies may be lower due to the limitations discussed above. The accuracy of individual reads of raw sequence data could be as low as 60 to 80%. See, e.g., Harris et al., Science 320, 106-109 (2008). Thus, it would be useful to provide accurate single molecule sequencing methods.
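To make the accuracy figures above concrete, one common way to score a raw read against a known reference is through edit distance. The sketch below assumes accuracy is defined as one minus the Levenshtein distance divided by the reference length; that definition is an assumption for illustration and is not necessarily how the cited figures were computed. The function names `edit_distance` and `read_accuracy` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def read_accuracy(reference, read):
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return max(0.0, 1.0 - edit_distance(reference, read) / len(reference))


if __name__ == "__main__":
    reference = "ACGTACGTACGTACGTACGT"
    raw_read = "ACGACGTTACGTCGTACG"  # made-up noisy read, for illustration only
    print(f"accuracy = {read_accuracy(reference, raw_read):.2f}")
```

Under this definition, a read with one deleted base out of four scores 0.75, which falls within the 60 to 80% range quoted above for individual raw reads.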
Additionally, DNA methylation plays a critical role in the regulation of gene expression; for example, methylation at promoters often leads to transcriptional silencing. Methylation is also known to be an essential mechanism in genomic imprinting and X-chromosome inactivation. However, progress in deciphering complex whole genome methylation profiles has been limited. Therefore, methods of determining DNA methylation profiles in a high-throughput manner could be useful, more so should the methods also provide for accurate determination of sequence.