Recent advances in sequencing technology have rapidly driven down the cost of DNA sequence data and yielded an unrivalled resource of genetic information. Individual genomes can be characterized, while genetic variation may be studied in populations and disease. Until recently, the scope of sequencing projects was limited by the cost and throughput of Sanger sequencing. The raw data for the three billion base (3 gigabase (Gb)) human genome sequence was generated over several years for ~$300 million using several hundred capillary sequencers. International Human Genome Sequencing Consortium, “Finishing the euchromatic sequence of the human genome” Nature 431:931-945 (2004). More recently, an individual human genome sequence has been determined for ~$10 million by capillary sequencing. Levy et al., “The diploid genome sequence of an individual human” PLoS Biol. 5:e254 (2007). Several new approaches at varying stages of development aim to increase sequencing throughput and reduce cost. Margulies et al., “Genome sequencing in microfabricated high-density picoliter reactors” Nature 437:376-380 (2005); Shendure et al., “Accurate multiplex polony sequencing of an evolved bacterial genome” Science 309:1728-1732 (2005); Harris et al., “Single-molecule DNA sequencing of a viral genome” Science 320:106-109 (2008); and Lundquist et al., “Parallel confocal detection of single molecules in real time” Opt. Lett. 33:1026-1028 (2008). These techniques increase parallelization markedly by imaging many DNA molecules simultaneously. One instrument run typically produces thousands or millions of sequences that are shorter than capillary reads. Another human genome sequence was recently determined using one of these approaches. Wheeler et al., “The complete genome of an individual by massively parallel DNA sequencing” Nature 452:872-876 (2008).
Moreover, an international consortium is currently in the process of determining the genome sequences of at least a thousand different human individuals (1000genomes.org/page.php?page=home). These human genome sequences are typically based on the pre-existing human reference sequence and are not assembled de novo (i.e., without prior knowledge of the reference sequence).
However, further improvements in the efficiency of these massively parallel sequencing systems are necessary to enable routine sequencing and assembly of complex genomes de novo (i.e., without a pre-existing reference sequence). Essentially all methods for assembling genomes de novo require pairs of sequencing reads that have an a priori defined orientation and spacing in the underlying genome. Short-distance read pairs (e.g., 25-500 bp) are usually employed, even to provide information regarding long-range contiguity of genome assemblies. Using such short-distance read pairs, genome assemblies remain highly fragmented. Approaches that improve amplification yield and sequencing efficiency of massively parallel sequencers using short-distance read pairs would greatly improve the quality of genome assemblies.
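One reason assemblies built from short-distance read pairs can remain fragmented may be illustrated by a minimal sketch. It assumes a simplified model in which a read pair can link the sequences flanking a gap (for example, a repeat) only if both reads fit in the flanking sequence within the pair's total span. The function name, the 36 bp read length, and the gap lengths are hypothetical values chosen for illustration; only the 500 bp span comes from the range given above.

```python
def pair_can_bridge(gap_bp: int, pair_span_bp: int, read_len_bp: int) -> bool:
    """Return True if one read pair can anchor on both sides of a gap.

    Simplified assumption: each read must lie entirely in the sequence
    flanking the gap, so the gap plus both reads must fit inside the
    outer span of the pair.
    """
    return gap_bp + 2 * read_len_bp <= pair_span_bp


# Hypothetical gap (e.g., repeat) lengths in base pairs.
gap_lengths_bp = [100, 300, 1000, 6000]

# Upper end of the short-distance range mentioned above, with an
# assumed (hypothetical) read length of 36 bp.
PAIR_SPAN_BP = 500
READ_LEN_BP = 36

bridgeable = [g for g in gap_lengths_bp
              if pair_can_bridge(g, PAIR_SPAN_BP, READ_LEN_BP)]
unbridged = [g for g in gap_lengths_bp if g not in bridgeable]

print("bridgeable gaps:", bridgeable)
print("unbridged gaps:", unbridged)
```

Under this model, any gap longer than the pair span minus two read lengths is left unbridged, which is consistent with the observation that short-distance pairs alone provide limited long-range contiguity.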
The ability to produce sequence reads from the distal ends of a single DNA fragment (paired-end sequencing) is extremely useful for many downstream analyses. Currently, there are no commercially available sequencing-by-polymerase-synthesis methods for effective paired-end sequencing from beads on any of the established bead-based sequencing technologies (AB SOLiD, Roche/454 and now Ion Torrent).