Examining a person's genes can reveal whether that person has a genetic disease or is a latent carrier of a disease, at risk of passing the disease on to his or her children. The information in the person's genes can be revealed by DNA sequencing. The DNA sequencing technologies known as next-generation sequencing (NGS) are capable of sequencing an entire human genome in under a day and for under $1,000. See Clark, Illumina announces landmark $1,000 human genome sequencing, Wired, 15 Jan. 2014.
However, the most informative view of a person's genes requires knowing his or her “haplotype”. Humans have two copies of each of their chromosomes, one inherited from the father and the other from the mother. There is variation between the two chromosomes of a pair. Most of that variation appears in the form of single nucleotide polymorphisms (SNPs). For any heterozygous SNP, a person has a different allele on each chromosome at the location of the SNP. Those alleles that appear on the same chromosome can be said to belong to the same haplotype. Unfortunately, standard methods such as sequencing collect only genotype information and do not assign the alleles to haplotypes. Existing attempts to do so require comparison to a reference haplotype (e.g., Yang, 2013, Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data, Bioinformatics 29(18):2245-2252) or construction of a graph based on a SNP-fragment matrix that artificially reduces the two alleles of a SNP to a binary 0 or 1 score (e.g., Aguiar and Istrail, 2012, HapCompass: A fast cycle basis algorithm for accurate haplotype assembly of sequence data, J Comp Biol 19(6):577-590). Reducing the alleles of a SNP to a binary code is unsatisfactory for at least two reasons. First, the resulting matrix and graph require an extrinsic “key” to decode the entries back into the haplotype, which adds a separate lookup step that is very time-consuming if a computer is to generate reports of the haplotypes in a high-throughput environment. Second, even though a heterozygous SNP will typically have only two alleles in a diploid genome, a SNP may have as many as four alleles within a population, and each allele may have independent medical significance. If SNP alleles are encoded as a binary 1 or 0, there is no basis for comparison among data sets.
For example, if Smith has two heterozygous SNPs for which the first haplotype is AC and the second haplotype is GC, and Jones, at the homologous locations, has the first haplotype GT and the second haplotype TT, then once Smith's haplotypes are defined as 11 and 01, respectively, there remains no way to define Jones' haplotypes that provides for a meaningful comparison to Smith's.
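The loss of comparability in the Smith and Jones example can be illustrated with a minimal Python sketch. The encoding rule used here (at each site, the allele on the first haplotype is scored 1 and any other allele is scored 0) is an assumption chosen only because it reproduces the 11 and 01 values given for Smith; the function name binary_encode is hypothetical and is not taken from any of the cited methods.

```python
def binary_encode(hap1, hap2):
    """Assumed rule: per site, the allele on hap1 scores '1', any other allele '0'.

    Returns the two bit strings plus the per-site key needed to decode them.
    """
    bits1, bits2, key = [], [], []
    for a, b in zip(hap1, hap2):
        bits1.append("1")                      # hap1 defines the '1' allele
        bits2.append("1" if b == a else "0")   # hap2 is scored relative to hap1
        key.append({"1": a} if a == b else {"1": a, "0": b})
    return "".join(bits1), "".join(bits2), key


smith_h1, smith_h2, smith_key = binary_encode("AC", "GC")
jones_h1, jones_h2, jones_key = binary_encode("GT", "TT")

# Identical binary codes for two people who share no first-haplotype alleles.
print(smith_h1, smith_h2)  # 11 01
print(jones_h1, jones_h2)  # 11 01

# The alleles are recoverable only through each person's separate key.
print(smith_key)
print(jones_key)
```

Under this assumed rule, Smith and Jones receive the same binary codes even though their alleles differ, so the codes alone support no comparison between the two data sets, and each must be decoded through its own key before the nucleotides can be reported.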