Haplotype assembly from experimental data obtained from human genomes sequenced using massively parallelized sequencing methodologies has emerged as a prominent source of genetic data. Such data serves as a cost-effective way of implementing genetics-based diagnostics as well as human disease study, detection, and personalized treatment.
The long-range information provided by such massively parallelized sequencing methodologies is disclosed, for example, in U.S. Patent Application No. 62/072,214, filed Oct. 29, 2014, entitled “Analysis of Nucleic Acid Sequences.” Such techniques greatly facilitate the detection of large-scale structural variations of the genome, such as translocations, large deletions, or gene fusions. Other examples include, but are not limited to, the sequencing-by-synthesis platform (ILLUMINA), Bentley et al., 2008, “Accurate whole human genome sequencing using reversible terminator chemistry,” Nature 456:53-59; sequencing-by-ligation platforms (POLONATOR; ABI SOLiD), Shendure et al., 2005, “Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” Science 309:1728-1732; pyrosequencing platforms (ROCHE 454), Margulies et al., 2005, “Genome sequencing in microfabricated high-density picoliter reactors,” Nature 437:376-380; and single-molecule sequencing platforms (HELICOS HELISCOPE), Pushkarev et al., 2009, “Single-molecule sequencing of an individual human genome,” Nature Biotech 27:847-850, and (PACIFIC BIOSCIENCES), Eid et al., 2009, “Real-time DNA sequencing from single polymerase molecules,” Science 323:133-138, each of which is hereby incorporated by reference in its entirety.
With the availability of haplotype data spanning large portions of the human genome, the need has arisen for ways in which to efficiently work with this data in order to advance the above-stated objectives of diagnosis, discovery, and treatment, particularly as the cost of whole genome sequencing for a personal genome drops below $1,000. To computationally assemble haplotypes from such data, it is necessary to disentangle the reads from the two haplotypes present in the sample and infer a consensus sequence for both haplotypes. Such a problem has been shown to be NP-hard. See Lippert et al., 2002, “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem,” Brief. Bioinform. 3:23-31, which is hereby incorporated by reference.
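By way of non-limiting illustration only, the disentangling of reads into two haplotypes can be sketched as an exhaustive search under a minimum-error-correction style objective, which is one of several formulations discussed in the literature and is not asserted to be the method of any program cited herein. The sketch assumes that every site is heterozygous and biallelic (alleles encoded 0 and 1), and that each read is a mapping from site index to observed allele; the function names `mec_cost` and `assemble_haplotypes` are hypothetical. The search enumerates 2^(n-1) candidates, which is consistent with the hardness of the general problem and is practical only for a handful of sites.

```python
from itertools import product


def mec_cost(reads, haplotype):
    """Total mismatches when each read is assigned to the closer of
    the candidate haplotype and its complement (assumed encoding: 0/1)."""
    complement = [1 - allele for allele in haplotype]
    cost = 0
    for read in reads:
        # Mismatches against each of the two candidate haplotypes.
        d_hap = sum(1 for pos, allele in read.items() if haplotype[pos] != allele)
        d_comp = sum(1 for pos, allele in read.items() if complement[pos] != allele)
        cost += min(d_hap, d_comp)
    return cost


def assemble_haplotypes(reads, n_sites):
    """Brute-force search over all candidate haplotypes of length n_sites.

    Returns (haplotype_1, haplotype_2, cost). Exponential in n_sites;
    intended only as an illustration for very small inputs.
    """
    best_cost, best_hap = None, None
    for candidate in product((0, 1), repeat=n_sites):
        # A haplotype and its complement describe the same pair,
        # so only candidates whose first allele is 0 are examined.
        if candidate[0] != 0:
            continue
        cost = mec_cost(reads, candidate)
        if best_cost is None or cost < best_cost:
            best_cost, best_hap = cost, candidate
    hap_1 = list(best_hap)
    hap_2 = [1 - allele for allele in hap_1]
    return hap_1, hap_2, best_cost


# Hypothetical reads over four heterozygous sites; the last read
# conflicts with the first and therefore must carry one error.
example_reads = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 2: 0},
    {1: 0, 2: 1, 3: 1},
    {2: 0, 3: 0},
    {0: 0, 1: 1},
]
hap_1, hap_2, cost = assemble_haplotypes(example_reads, 4)
print(hap_1, hap_2, cost)
```

Practical assemblers replace the exhaustive enumeration with heuristics or restricted problem variants, since the number of candidates doubles with each additional site.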
The assembly viewer Consed supports visualization of reads obtained from the above-identified sequencing methods. See Gordon et al., 1998, “Consed: A graphical tool for sequence finishing,” Genome Research 8:195-202.
Another visualization tool is EagleView. See Huang and Marth, 2008, “EagleView: A genome assembly viewer for next-generation sequencing technologies,” Genome Research 18:1538-1543.
Still another such viewer is HapEdit. See Kim et al., 2011, “HapEdit: an accuracy assessment viewer for haplotype assembly using massively parallel DNA-sequencing technologies,” Nucleic Acids Research, 1-5. HapEdit provides tools for assessing the accuracy of haplotype assemblies and permits a user to fit the composition rates of reads sequenced by numerous different sequencing technologies.
While the above-disclosed programs are each significant advancements in their own right, they do not adequately address the need in the art for tools for visually assessing structural variants (e.g., deletions, duplications, copy-number variants, insertions, inversions, translocations, long terminal repeats (LTRs), short tandem repeats (STRs), and a variety of other useful characterizations) in sequencing data.