The term haplotype refers to the combination of alleles at multiple loci along a chromosome. In linkage disequilibrium (LD) mapping, haplotype-based tests are thought to improve power for detecting untyped variants. In population genetic studies of evolutionary history, haplotype data have been used to detect recombination hotspots as well as regions that have undergone recent positive selection (Myers, S., Bottolo, L., Freeman, C., McVean, G. & Donnelly, P. A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321-324 (2005); Sabeti, P. C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832-7 (2002); these and all other references cited herein are incorporated by reference for all purposes). Despite their usefulness, haplotypes cannot be directly assayed using existing high-throughput genomic or sequencing technologies, which instead generate genotype data, that is, unordered pairs of alleles. While reconstructing haplotypes from genotypes is straightforward in some special settings (e.g., in the presence of relatives, in sperm, or for X chromosomes in males), statistical inference of haplotypes from autosomal genotype data with no known relatives is challenging.
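To make the ambiguity concrete, the following minimal sketch enumerates every unordered pair of haplotypes consistent with one genotype. It assumes biallelic markers coded as alternate-allele counts (0, 1, or 2); the function name `compatible_pairs` and this coding are illustrative assumptions, not taken from the cited works.

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype.

    genotype: sequence of alternate-allele counts (0, 1, or 2), one per
    biallelic marker. This coding is an assumption made for illustration.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed (0 -> allele 0, 2 -> allele 1);
    # heterozygous sites get a placeholder that is overwritten below.
    base = [g // 2 for g in genotype]
    pairs = []
    # Pin the first heterozygous site so each unordered pair appears once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het, (0,) + bits):
            h1[site], h2[site] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    for pair in compatible_pairs([1, 2, 1, 0]):
        print(pair)
```

A genotype with k heterozygous markers (k ≥ 1) is consistent with 2^(k-1) haplotype pairs, so the number of candidate phasings doubles with each additional heterozygous marker.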
The paper by Kong et al. (Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genetics 40, 1068-75 (2008)) showed that distantly related individuals can be used to infer haplotype phase from genotypes accurately. Individuals sharing long haplotypes, “pseudo parents,” could be used to phase each other just as the two parents in a trio can be used to phase the unordered genotypes of their offspring. The authors showed an impressive performance improvement over an existing algorithm, FastPHASE (Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics 78, 629-644 (2006)), but, due to the high computational burden, could not compare against the related PHASE 2.1.1 algorithm (Stephens, M., Smith, N. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68, 978-89 (2001)).
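The trio analogy can be illustrated with a short sketch. This is only the textbook parent-offspring case, not the long-range phasing procedure of Kong et al.; it assumes biallelic markers coded as alternate-allele counts, Mendelian-consistent genotypes, and no genotyping errors. The function name `phase_child_with_parents` is hypothetical.

```python
def phase_child_with_parents(child, father, mother):
    """Phase a child's genotypes using both parents' genotypes.

    Inputs are sequences of alternate-allele counts (0, 1, or 2) per marker.
    Returns (paternal, maternal) haplotypes; None marks sites left unphased.
    Assumes Mendelian consistency and no genotyping error (illustration only).
    """
    paternal, maternal = [], []
    for c, f, m in zip(child, father, mother):
        if c != 1:
            # Homozygous child: both transmitted alleles are known.
            p_allele = m_allele = c // 2
        elif f != 1:
            # Homozygous father can only transmit one allele.
            p_allele = f // 2
            m_allele = 1 - p_allele
        elif m != 1:
            # Homozygous mother can only transmit one allele.
            m_allele = m // 2
            p_allele = 1 - m_allele
        else:
            # Child and both parents heterozygous: phase is ambiguous.
            p_allele = m_allele = None
        paternal.append(p_allele)
        maternal.append(m_allele)
    return paternal, maternal


if __name__ == "__main__":
    print(phase_child_with_parents([1, 1, 0, 1], [0, 1, 0, 1], [1, 2, 1, 1]))
```

Only sites at which the child and both parents are all heterozygous remain unresolved. This is why a pair of individuals who share long haplotypes with the target can act as surrogate parents.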
Numerous methods have been developed to infer haplotypes. The method of Clark (Clark, A. G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7, 111-122 (1990)) begins by identifying a pool of unambiguous (homozygous) individuals, and phases the remaining individuals based on a parsimony heuristic that seeks to minimize the total number of distinct haplotypes in the sample. For a small number of linked markers, multinomial-based models fit by the Expectation-Maximization (EM) algorithm can be quite effective (Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12, 921-927 (1995); Hawley, M. E. & Kidd, K. K. Haplo: a program using the em algorithm to estimate the frequencies of multi-site haplotypes. J Hered 86, 409-411 (1995); Long, J. C., Williams, R. C. & Urbanek, M. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet 56, 799-810 (1995)). The partition-ligation (PL-EM) algorithm (Niu, T., Qin, Z. S., Xu, X. & Liu, J. S. Bayesian haplotype inference for multiple linked single nucleotide polymorphisms. American Journal of Human Genetics 70, 157-169 (2002)) was proposed to accelerate computation and to keep the EM algorithm from becoming trapped in poor local modes. These methods are effective at identifying common haplotypes. However, the multinomial model is inappropriate for rare haplotypes. This is a serious weakness because, for any fixed sample size, a majority of haplotypes become rare or unique as the number of markers increases, whether through greater marker density or a larger genomic region.
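The multinomial EM approach can be sketched as follows. This is a minimal illustration, not the implementation of any cited program. It assumes random mating (Hardy-Weinberg proportions), biallelic markers coded as alternate-allele counts, no missing data, and a fixed number of iterations; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype (0/1/2 coding)."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het, (0,) + bits):
            h1[site], h2[site] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM under a multinomial model."""
    pairs_per_ind = [compatible_pairs(g) for g in genotypes]
    haps = {h for pairs in pairs_per_ind for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pairs_per_ind:
            # E-step: posterior weight of each compatible pair. The factor
            # of 2 for heterozygous pairs is constant within an individual
            # and cancels on normalization, so it is omitted.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0.0:  # guard against numerical underflow
                weights, total = [1.0] * len(pairs), float(len(pairs))
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: frequencies are expected counts over total chromosomes.
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


if __name__ == "__main__":
    sample = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
    for hap, f in sorted(em_haplotype_frequencies(sample).items()):
        print(hap, round(f, 4))
```

In this toy sample the double heterozygote is resolved in favor of the two haplotypes already seen in homozygous individuals. The number of candidate haplotypes grows exponentially with the number of markers, so with many markers most candidates have very low expected counts. This illustrates why the approach is limited to small numbers of linked markers.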