Two of the major challenges in genome analysis are de novo genome sequence assembly based on ‘short read’ shotgun sequencing and structural variation analysis. Several approaches, and combinations of approaches, have been attempted to meet these challenges. The most widely adopted strategy relies on deep sequencing of shotgun libraries and sequencing of mate-pair libraries, which increases the sequence contiguity of short-read sequencing (See Siegel, A. F., et al. (2000) “Modeling the feasibility of whole genome shotgun sequencing using a pairwise end strategy.” Genomics 68(3): 237-246). The paired sequencing approach includes conventional mate-pair libraries, labor-intensive fosmid or BAC clone libraries (See Gnerre, S., et al. (2011) “High-quality draft assemblies of mammalian genomes from massively parallel sequence data.” Proceedings of the National Academy of Sciences of the United States of America 108(4): 1513-1518), Hi-C read-pairs for chromosome-scale scaffolding (See Burton, J. N., et al. (2013) “Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions.” Nature Biotechnology 31(12): 1119-1125) and transposase-mediated libraries (See Adey, A., et al. (2014) “In vitro, long-range sequence information for de novo genome assembly via transposase contiguity.” Genome Research 24(12): 2041-2049). Another approach relies on the stochastic separation of corresponding genomic or polymerase chain reaction (PCR) fragments into physically distinct pools, followed by fragmentation to generate shorter sequencing templates (See Kaper, F., et al. (2013) “Whole-genome haplotyping by dilution, amplification, and sequencing.” Proceedings of the National Academy of Sciences of the United States of America 110(14): 5552-5557; Kuleshov, V., et al. (2014) “Whole-genome haplotyping using long reads and statistical methods.” Nature Biotechnology 32(3): 261-266; Peters, B. A., et al. (2012) “Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells.” Nature 487(7406): 190-195; and Voskoboynik, A., et al. (2013) “The genome sequence of the colonial chordate, Botryllus schlosseri.” eLife 2: e00569). With appropriate high-throughput reaction handling and barcoding, this strategy reduces the complexity, and thus can improve the quality, of assemblies. Longer-read sequencing technologies such as PacBio®'s SMRT and Oxford Nanopore sequencing promise to eventually further improve assembly contiguity. For example, SMRT sequencing has been successfully applied to closing some gaps and detecting some structural variations in the human reference genome (see, for example, Chaisson, M. J. P., et al. (2015) “Resolving the complexity of the human genome using single-molecule sequencing.” Nature 517(7536): 608-611). However, the high error rate, low throughput and high cost of these longer-read technologies have thus far prevented their widespread adoption.
None of the aforementioned approaches, however, adequately addresses the problems of long-range de novo assembly contiguity and validation, sequence mis-assembly in complex segmentally duplicated and repetitive regions, and structural variant detection and delineation. Whole genome mapping technologies have been developed for these purposes as complementary tools to provide scaffolds for genome assembly and structural variation analysis. Optical mapping, pioneered by David Schwartz and colleagues, has been used to construct restriction maps for various genomes and has proven to be very useful in providing scaffolds for shotgun sequence assembly and detection of structural variations (See Samad, A., et al. (1995) “Optical mapping: a novel, single-molecule approach to genomic analysis.” Genome Research 5(1): 1-4; and Teague, B., et al. (2010) “High-resolution human genome structure by single-molecule analysis.” Proceedings of the National Academy of Sciences of the United States of America 107(24): 10848-10853). Furthermore, Ming Xiao and colleagues developed a highly automated method for whole genome mapping in a nanochannel array (See Hastie, A. R., et al. (2013) “Rapid Genome Mapping in Nanochannel Arrays for Highly Complete and Accurate De Novo Sequence Assembly of the Complex Aegilops tauschii Genome.” PLoS ONE 8(2); Lam, E. T., et al. (2012) “Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly.” Nature Biotechnology 30(8): 771-776; and US 2016/0168621 A1). Each of these documents is incorporated herein by reference.
The above-described genome mapping strategies are based on mapping the distribution of short (6 bp to 8 bp) sequence motifs across the genome. However, these sequence motifs are distributed unevenly across genomic regions. Repetitive genomic regions often contain no suitable sequence motifs, which leaves large segments of the genome that cannot be mapped (Feuk, L., et al. (2006) “Structural variation in the human genome.” Nature Reviews Genetics 7(2): 85-97). Another challenge lies in detecting and typing structural variations and in the clinical diagnosis of specific structural variants. Obtaining accurate breakpoints requires target sequence-specific labeling of the structural variations, which cannot be achieved by sequence-motif mapping.
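By way of illustration only, the uneven motif distribution described above can be sketched in silico. The following minimal Python example is an assumption-laden toy and not any of the cited mapping methods: the 7 bp motif, the toy sequence, and the helper names `motif_sites` and `largest_unlabeled_gap` are all hypothetical, and only the forward strand is scanned. It locates motif sites and reports the longest stretch without a site, which here falls in a simple dinucleotide repeat.

```python
def motif_sites(sequence, motif):
    """Return 0-based start positions of every motif occurrence (forward strand only)."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        # Resume one base later so overlapping occurrences are not missed.
        start = sequence.find(motif, start + 1)
    return sites


def largest_unlabeled_gap(sequence, motif):
    """Return the longest distance between consecutive motif sites.

    The two sequence ends are treated as boundaries, so a motif-free
    region at either end also counts as a gap.
    """
    boundaries = [0] + motif_sites(sequence, motif) + [len(sequence)]
    return max(b - a for a, b in zip(boundaries, boundaries[1:]))


# Hypothetical 7 bp motif, chosen only for illustration.
MOTIF = "GCTCTTC"

# Toy sequence: a motif-containing region followed by a CA repeat with no motif.
unique_region = "ACGT" + MOTIF + "TTGACA" + MOTIF + "CCGTA"
repeat_region = "CA" * 20
toy_genome = unique_region + repeat_region + MOTIF + "GG"

print(motif_sites(toy_genome, MOTIF))
print(largest_unlabeled_gap(toy_genome, MOTIF))
```

In this toy sequence the largest site-free interval spans the repeat, mirroring at small scale how repetitive regions can remain unmapped when labeling depends solely on a short sequence motif.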