Genomics holds great promise for substantial improvements in human healthcare. Despite major advances in high-throughput sequencing, however, genomics still faces several practical challenges. Accurate de novo genome assembly of sequence reads and structural variant analysis using “short read” shotgun sequencing remain challenging and represent the weak link in genome projects (Blakesley et al., BMC Genomics 11: 21, 2010; Chain et al., Science 326: 236-237, 2009). Most re-sequencing projects rely on mapping the sequencing data to the reference sequence to identify variants of interest (Ley et al., Nature 456, 66-72, 2008). When whole genome assembly is attempted, it is done by paired-end sequencing of cloned genomic DNA fragments to provide scaffolds for assembly (Siegel et al., Genomics 68, 237-246, 2000). Because cloning of large DNA fragments is difficult, small-insert libraries of varying sizes have been prepared for paired-end sequencing instead, which limits the resolution of haplotypes and increases the complexity, time, and cost of the sequencing project. In addition, complex genomic loci, such as the major histocompatibility complex (MHC) region, are important for infectious and autoimmune diseases (Fernando et al., PLoS Genet 4, e1000024, 2008). These regions contain highly repetitive sequences and are particularly challenging for sequence assembly. As such, robust technologies that can aid in de novo sequence assembly are sorely needed as whole genome sequencing becomes more widely adopted.
Emerging whole genome scanning techniques reveal the prevalence and importance of structural variation. Detection of copy number variation often relies on relative signal intensities measured by array-based or quantitative PCR-based technologies. Array-based methods, such as array-based comparative genomic hybridization (aCGH), have been used extensively to interrogate copy number variation in the human genome (Sebat et al., Science 305, 525-528, 2004; Iafrate et al., Nat Genet 36, 949-951, 2004). Except for deletions, however, these methods do not provide positional information regarding the locations of copy number variants (CNVs), and they cannot detect balanced structural variation, such as inversions or translocations (Carter, Nat Genet 39(7 Suppl): S16-21, 2007). Paired-end mapping techniques, traditionally by Sanger sequencing and now by next-generation sequencing (Medvedev et al., Nat Meth 6, S13-S20, 2009), generally have low sensitivity in repetitive regions, where most of the structural variation lies (Feuk et al., Nat Rev Genet 7, 85-97, 2006). Recent efforts to characterize CNVs in human genomes at high resolution involve paired-end mapping of clones. This approach, while useful for exploratory studies in a small sample set, is too labor-intensive and time-consuming to be applicable to the analysis of large numbers of individuals. Furthermore, the resolution is no better than 8 kb (Kidd et al., Nature 453, 56-64, 2008).
Restriction mapping was instrumental in the Human Genome Project. One approach to address the drawbacks of traditional restriction mapping is optical mapping (Jing et al., Proceedings of the National Academy of Sciences 95, 8046-8051, 1998). In this approach, large DNA fragments are stretched and immobilized on glass slides and cut in situ with restriction enzymes. Optical mapping has been used to construct ordered restriction maps for whole genomes (Zhou et al., BMC Genomics 8, 278, 2007; Zhou et al., PLoS Genet 5, e1000711, 2009; Church et al., PLoS Biol 7, e1000112, 2009; Teague et al., PNAS 107, 10848-10853), and it has provided scaffolds for shotgun sequence assembly and validation (Wu et al., BMC Genomics 10, 25, 2009; Latreille et al., BMC Genomics 8, 321, 2007). This method, however, is limited by its low throughput, non-uniform DNA stretching, imprecise DNA length measurement, and high error rates.
Therefore, despite all developments in high-throughput sequencing, there remains a need in the art for a method of sequencing the whole genome with great accuracy, at low cost, and within a reasonable timeline.