Field
The disclosed embodiments relate to models for phasing genomic samples into haplotypes. In particular, the disclosed embodiments relate to phasing algorithms that efficiently and accurately phase genomic samples.
Description of Related Art
Although humans are, genetically speaking, almost entirely identical, small differences in human DNA are responsible for much of the variation between individuals. For example, a sequence variation at one position in DNA between individuals is known as a single-nucleotide polymorphism (SNP). SNPs can serve as biomarkers for heredity and disease studies. Stretches of DNA inherited together from a single chromosome are referred to as haplotypes. Haplotypes are identified based on runs of consecutive SNPs, and these runs vary in length.
Traditional phasing algorithms separate each diploid genotype into a pair of haplotypes. These algorithms are capable of phasing many genomic samples simultaneously, comparing the genotypes and candidate haplotypes to others in the input and refining the phase over many iterations. However, known models (Browning S. R. and Browning B. L., Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering, American Journal of Human Genetics, 81:1084-1097, 2007; Ron D., Singer Y., and Tishby N., On the learnability and usage of acyclic probabilistic finite automata, J. Comput. Syst. Sci., 56:133-152, 1998) use phasing algorithms that become intractable when the input contains hundreds of thousands of samples. In addition, new samples can be phased only by rebuilding the model from the existing (reference) phased samples together with the new unphased samples. These approaches are therefore not practical for batches of hundreds of thousands of samples to which new samples are continuously being added. New models and algorithms are needed to efficiently and accurately phase large numbers of genomic samples.
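By way of illustration only, the ambiguity that phasing resolves can be sketched as follows. The sketch assumes a genotype encoded as the count of alternate alleles at each SNP (0, 1, or 2); this encoding and the function name `compatible_haplotype_pairs` are hypothetical and are not taken from the cited models. It enumerates every unordered pair of haplotypes consistent with one unphased genotype and does not attempt to choose among them.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of alternate-allele counts per SNP (0, 1, or 2).
    This encoding is an assumption made for the illustration.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are unambiguous: both haplotypes carry the same allele.
    base = [g // 2 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]

    pairs = []
    # Fix the allele at the first heterozygous site on the first haplotype so
    # that (a, b) and (b, a) are not both listed.
    for rest in product((0, 1), repeat=len(het_sites) - 1):
        alleles = (0,) + rest
        hap_a, hap_b = list(base), list(base)
        for site, allele in zip(het_sites, alleles):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs
```

A genotype with k heterozygous SNPs (k at least 1) admits 2^(k-1) such pairs, so exhaustive enumeration is feasible only for short stretches. Phasing algorithms instead use information from other samples to select a likely pair.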