A number of initiatives are currently underway to obtain sequence information directly from millions of individual molecules of DNA in parallel.
The real-time single molecule sequencing-by-synthesis technologies rely on the detection of fluorescent nucleotides as they are incorporated into a nascent strand of DNA that is complementary to the template being sequenced. An example of asynchronous single molecule sequencing by synthesis is illustrated in FIG. 1. As shown, oligonucleotides 30-50 bases in length are covalently anchored at the 5′ end to glass cover slips. These anchored strands perform two functions. First, they act as capture sites for the target template strands, provided the templates are configured with capture tails complementary to the surface-bound oligonucleotides. Second, they act as primers for the template-directed primer extension that forms the basis of the sequence reading. The capture primers thus provide a fixed-position site for sequence determination. Each cycle consists of adding the polymerase/labeled nucleotide analog mixture, rinsing, optically imaging the field containing millions of active primer-template duplexes, and chemically cleaving the dye-linker to remove the dye. The cycle (synthesis, detection, and dye removal) is repeated up to 100 times, and possibly more.
Four major high-throughput sequencing platforms are currently available: the Genome Sequencers from Roche/454 Life Sciences (Margulies et al. (2005) Nature, 437:376-380; U.S. Pat. Nos. 6,274,320; 6,258,568; 6,210,891), the 1G Analyzer from Illumina/Solexa (Bennett et al. (2005) Pharmacogenomics, 6:373-382), the SOLiD system from Applied Biosystems (solid.appliedbiosystems.com), and the Heliscope™ system from Helicos Biosciences (see, e.g., U.S. Patent App. Pub. No. 2007/0070349 and the illustration in FIG. 1). Although these new technologies are significantly cheaper than traditional methods, such as gel/capillary Gilbert-Sanger sequencing, the sequence reads they produce are generally much shorter (˜25-40 vs. ˜500-700 bases). For example, the average read lengths on the four major platforms are currently as follows: Roche/454, 250 bases (depending on the organism); Illumina/Solexa, 25 bases; SOLiD, 35 bases; Heliscope, 25 bases. While such short reads (also referred to as “microreads”) are sufficient for the resequencing of ˜80% of normal human genomes, for which there is a reliable reference sequence, microreads are limiting for a number of other applications. First, short reads are not optimal for the de novo assembly of genomes. Second, the detection and proper placement of amplifications, inversions, and translocations using short reads are severely limited. The proper detection and placement of short indels are also difficult. Short reads may therefore be problematic for the resequencing of highly polymorphic or highly aberrant genomes. For example, the occurrence of Large-scale Copy-number Variations (LCVs) in normal (non-disease) individuals is an indication that acquiring an accurate description of human genetic variation may require more than the detection of single-nucleotide polymorphisms.
Genetic rearrangements are even more heterogeneous and prevalent in cancer genomes, underscoring the importance of their proper detection and characterization.
To mitigate the drawbacks of short-read sequencing methods, several groups have proposed using paired reads, an approach originally developed by Edwards et al. (Genomics (1990) 6: 593-608) for traditional sequencing methods. Linking two reads positioned a known distance apart from each other (thus also referred to as “paired,” “paired-end,” “mate,” or “matched” reads) provides considerable informatic leverage for resolving repeats, insertions, deletions, and inversions, which are important mutation types, for example, in tumors.
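The leverage that paired reads provide in repeated regions can be illustrated with a minimal sketch. The reference string, the read sequences, and the function names `find_all` and `place_pair` below are hypothetical and chosen only for illustration; exact string matching stands in for a real alignment method. One read falls in a repeat and matches two positions, while its mate matches a unique position at a known distance, which selects the correct placement.

```python
def find_all(reference, read):
    """Return every 0-based start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def place_pair(reference, read_a, read_b, distance, tolerance=0):
    """Return (pos_a, pos_b) placements whose start-to-start spacing
    equals the known distance, within the given tolerance."""
    return [
        (a, b)
        for a in find_all(reference, read_a)
        for b in find_all(reference, read_b)
        if abs((b - a) - distance) <= tolerance
    ]


# Toy reference (hypothetical): the motif TTGACCA occurs twice.
reference = "ACGTACGT" + "TTGACCA" + "GGGCCCAAAT" + "TTGACCA" + "GCGC" + "CATCATCGGA"
read_a = "TTGACCA"  # falls in the repeat, so it is ambiguous on its own
read_b = "CATCAT"   # mate read, unique in this reference

single_hits = find_all(reference, read_a)                          # two candidates
paired_hits = place_pair(reference, read_a, read_b, distance=11)   # one candidate
print(single_hits, paired_hits)
```

In this toy case the unpaired read has two candidate positions, and the distance constraint to its mate leaves only one.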
In addition to the above limitations of short reads, another inherent disadvantage of single molecule sequencing technologies is a high per-read error rate. This is due to the all-or-none signal detection during an incorporation event and the increased susceptibility to contaminating nucleotides. For instance, the incorporation of an unlabeled nucleotide contaminant in a single nascent strand of complementary DNA will produce a failed detection event or a deletion in the read relative to the reference. Sequencing errors in short reads are especially problematic because they complicate proper alignment of the reads onto a reference sequence. Thus, techniques that allow re-sequencing of the original DNA template are preferable because they drastically reduce sequencing errors. See, e.g., WO 2007/12089 and U.S. patent application Ser. No. 11/404,675 for “melt-and-resequence” methods.
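The benefit of re-sequencing the same template can be made concrete with a simple calculation. The sketch below is an illustration under stated assumptions, not a description of the cited methods: it assumes that each pass miscalls a given position independently with the same probability, and that the passes are combined by majority vote. The function name `majority_error` and the 5% per-pass error rate are hypothetical.

```python
from math import comb


def majority_error(per_pass_error, passes):
    """Probability that a majority of independent passes miscall one position.

    Assumes independent, identically distributed errors and an odd number
    of passes so that a majority always exists (illustrative model only).
    """
    if passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    needed = passes // 2 + 1  # smallest number of wrong passes that outvotes the rest
    return sum(
        comb(passes, k) * per_pass_error**k * (1 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


# Hypothetical per-pass error rate of 5% at a given position.
for n in (1, 3, 5):
    print(n, majority_error(0.05, n))
```

Under these assumptions, three passes reduce a 5% per-position error rate to roughly 0.7%, and additional passes reduce it further.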
Therefore, there is a need for methods of obtaining paired reads in sequencing-by-synthesis technologies, particularly those with short read lengths, such as those produced by single molecule sequencing.