Haplotype assembly from experimental data obtained from human genomes sequenced using massively parallelized sequencing methodologies has emerged as a prominent source of genetic data. Such data serves as a cost-effective way of implementing genetics-based diagnostics as well as human disease study, detection, and personalized treatment.
The long-range information provided by platforms such as those disclosed in U.S. Patent Application No. 62/072,214, filed Oct. 29, 2014, entitled “Analysis of Nucleic Acid Sequences,” greatly facilitates the detection of large-scale structural variations of the genome, such as translocations, large deletions, or gene fusions. Other examples include, but are not limited to, the sequencing-by-synthesis platform (ILLUMINA), Bentley et al., 2008, “Accurate whole human genome sequencing using reversible terminator chemistry,” Nature 456:53-59; sequencing-by-ligation platforms (POLONATOR; ABI SOLiD), Shendure et al., 2005, “Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” Science 309:1728-1732; pyrosequencing platforms (ROCHE 454), Margulies et al., 2005, “Genome sequencing in microfabricated high-density picoliter reactors,” Nature 437:376-380; and single-molecule sequencing platforms (HELICOS HELISCOPE), Pushkarev et al., 2009, “Single-molecule sequencing of an individual human genome,” Nature Biotech 27:847-850, and (PACIFIC BIOSCIENCES), Eid et al., 2009, “Real-time DNA sequencing from single polymerase molecules,” Science 323:133-138, each of which is hereby incorporated by reference in its entirety.
Several algorithms have been developed for detecting such events from whole genome sequencing (WGS) data. See, for example, Chen et al., 2009, “BreakDancer: an algorithm for high-resolution mapping of genomic structural variation,” Nature Methods 6(9), pp. 677-681 and Layer et al., 2014, “LUMPY: A probabilistic framework for structural variant discovery,” Genome Biology 15(6):R84. The goal of these algorithms is to detect the endpoints of structural variants (e.g., the endpoints of a deletion or a gene fusion). These endpoints are also referred to as “breakpoints,” and the terms endpoints and breakpoints are used interchangeably. In order to detect breakpoints, existing algorithms rely on the detection of read pairs that are mapped to the genome at unexpected orientations with respect to each other or at unexpected distances (too far from each other or too close to each other relative to the insert size). This implies that, in order for a breakpoint to be detected by conventional algorithms, it must be spanned by read pairs. This limitation makes existing algorithms inapplicable to targeted sequencing data, such as whole exome sequencing (WES) data, because the breakpoints would be spanned by read pairs only if they were very close to the target regions. This is usually not the case. For example, many gene fusions in cancer occur within introns rather than exons, so they would not be detectable with WES.
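By way of illustration only, the read-pair test described above may be sketched as follows. This is a minimal sketch and not the method of BreakDancer or LUMPY. The `ReadPair` type, the `classify_pair` function, the label strings, and the default insert-size parameters (mean 400, standard deviation 50, three standard deviations of tolerance) are hypothetical, and the sketch assumes a library in which the left read of a properly mapped pair lies on the forward strand and the right read on the reverse strand. The distance between leftmost mapped positions is used as a rough stand-in for insert size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical record for one mapped read pair."""
    chrom1: str
    pos1: int        # leftmost mapped position of read 1
    forward1: bool   # True if read 1 maps to the forward strand
    chrom2: str
    pos2: int        # leftmost mapped position of read 2
    forward2: bool   # True if read 2 maps to the forward strand


def classify_pair(pair, mean_insert=400, sd_insert=50, n_sd=3):
    """Label a read pair as concordant or as one kind of discordant pair.

    Assumes forward/reverse orientation is expected and that the
    insert-size parameters were estimated elsewhere (illustrative defaults).
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the two reads by mapped position so orientation can be checked.
    (left_pos, left_fwd), (right_pos, right_fwd) = sorted(
        [(pair.pos1, pair.forward1), (pair.pos2, pair.forward2)]
    )

    # Expected: left read on the forward strand, right read on the reverse.
    if not (left_fwd and not right_fwd):
        return "unexpected_orientation"

    distance = right_pos - left_pos
    if distance > mean_insert + n_sd * sd_insert:
        return "too_far"
    if distance < mean_insert - n_sd * sd_insert:
        return "too_close"
    return "concordant"
```

Under these assumptions, a pair mapped 49,000 bases apart on the same chromosome is labeled "too_far", which is the kind of evidence a conventional caller would cluster to propose a deletion breakpoint. A breakpoint that no sequenced pair spans yields no such label at all, which is the limitation noted above for targeted data.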
With the availability of haplotype data spanning large portions of the human genome, the need has arisen for ways in which to efficiently work with this data in order to advance the above-stated objectives of diagnosis, discovery, and treatment, particularly as the cost of whole genome sequencing for a personal genome drops below $1000. To computationally assemble haplotypes from such data, it is necessary to disentangle the reads from the two haplotypes present in the sample and infer a consensus sequence for both haplotypes. Such a problem has been shown to be NP-hard. See Lippert et al., 2002, “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem,” Brief. Bioinform. 3:23-31, which is hereby incorporated by reference.
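To make the difficulty concrete, the disentangling step may be sketched under one common formulation, minimum error correction (MEC), in which the haplotype pair is chosen to minimize the number of read alleles that must be flipped for every read to agree with one of the two haplotypes. The sketch below is illustrative only and makes several assumptions not found in the foregoing: every site is heterozygous and biallelic (alleles coded 0 and 1), so the second haplotype is the complement of the first; each read is a dictionary mapping site index to observed allele; and the names `mec_cost` and `brute_force_phase` are hypothetical. The exhaustive search examines 2^(n-1) candidates for n sites and is therefore usable only for very small inputs.

```python
from itertools import product


def mec_cost(haplotype, reads):
    """Count the allele flips needed for every read to match `haplotype`
    or its complement (illustrative MEC objective)."""
    total = 0
    for read in reads:
        mismatches = sum(
            1 for site, allele in read.items() if haplotype[site] != allele
        )
        # Mismatches against the complement are the covered sites that matched.
        total += min(mismatches, len(read) - mismatches)
    return total


def brute_force_phase(reads, n_sites):
    """Exhaustively search for the haplotype pair with minimum MEC cost.

    The first site is fixed to allele 0, because a haplotype and its
    complement describe the same pair.
    """
    best, best_cost = None, None
    for rest in product((0, 1), repeat=n_sites - 1):
        candidate = (0,) + rest
        cost = mec_cost(candidate, reads)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    complement = tuple(1 - allele for allele in best)
    return best, complement, best_cost


# Five error-free reads covering three heterozygous sites.
reads = [{0: 0, 1: 1}, {1: 1, 2: 0}, {0: 1, 1: 0}, {1: 0, 2: 1}, {0: 0, 2: 0}]
hap_a, hap_b, cost = brute_force_phase(reads, n_sites=3)
```

In this small example the search recovers the pair (0, 1, 0) and (1, 0, 1) at a cost of zero. Because the number of candidates doubles with each added site, exhaustive search of this kind does not scale to genome-sized inputs.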
Given the above background, what is needed in the art are improved systems and methods for determining the integrity of a first query string and a second query string with respect to a ground truth string (e.g., haplotype phasing and structural variant detection using sequencing data) from parallelized sequencing methodologies.