A person's genetic information has the potential to reveal much about their health and life. The sequences of a person's genes may reveal a risk of cancer or a genetic disease, as well as the possibility that their children could inherit a genetic disorder. Genetic information can also be used to identify unknown organisms, such as potentially infectious agents discovered in samples from public food or water supplies. Next-generation sequencing (NGS) technologies are available that can sequence entire genomes quickly. Sequencing by NGS produces a very large number of short sequence reads, each of which represents a short segment of the genome of an organism. Unfortunately, analyzing such short sequences is not an easy task.
Some approaches to analyzing sequence reads involve mapping the reads to a reference genome, which can be done by aligning each read to a relevant portion of the reference. The amount and nature of reference genome data present significant obstacles to successfully mapping sequence reads. Not only have many complete genomes been sequenced, but even where the sequence reads are from a known species, there may be a great amount of known genetic variation in that species, i.e., a great diversity of genotypes among its members. For example, the 1,000 Genomes Project is seeking to sequence the genomes of 1,000 humans. It quickly becomes computationally intractable to perform an exhaustive alignment of even a single sequence read against the entire length of every known human genome. In fact, for some organisms it is not a trivial problem even to store and represent the complete sequences of all of the known genomes. Thus, whether screening a patient for a genetic disease or probing an unknown sample for a pathogen, existing analytical approaches are not up to the task of making all of the comparisons needed to analyze sequence reads with confidence. Moreover, even where additional information is known, existing methods do not always fully exploit that information for read mapping. For example, for a pair of paired-end reads, existing methods compare the location of each match for one mate against the location of each match for the other mate. Thus, even after a first filtering stage has identified a set of candidate matches, analyzing those matches with respect to the distance between the elements of a pair is O(n^2) in the number of candidates identified in that first filtering stage. Analyses that require resources on the order of O(n^2) pose a significant challenge and may be infeasible or cost-prohibitive.
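The quadratic cost of the all-against-all mate comparison described above can be illustrated with a minimal sketch. It is offered only as an illustration and is not taken from any particular read mapper. The function name pair_mates_naive, the representation of each candidate match as an integer position on a single reference, and the max_distance threshold are all assumptions made for the example.

```python
from typing import List, Tuple


def pair_mates_naive(
    hits_a: List[int],
    hits_b: List[int],
    max_distance: int,
) -> Tuple[List[Tuple[int, int]], int]:
    """Compare every candidate position of one mate against every candidate
    position of the other mate (hypothetical helper, for illustration only).

    Returns the position pairs lying within max_distance of each other,
    together with the number of comparisons that were performed.
    """
    pairs = []
    comparisons = 0
    for a in hits_a:          # each candidate match of the first mate
        for b in hits_b:      # each candidate match of the second mate
            comparisons += 1
            if abs(a - b) <= max_distance:
                pairs.append((a, b))
    return pairs, comparisons


# Assumed example data: three candidate positions per mate.
pairs, comparisons = pair_mates_naive(
    [100, 5000, 90000], [350, 5400, 42000], max_distance=500
)
# pairs == [(100, 350), (5000, 5400)]; comparisons == 9
```

With n candidate positions for each mate, the nested loops perform n * n comparisons, so doubling the number of candidates that survive filtering quadruples the work.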