Haplotypes provide a powerful tool for today's geneticists and gene hunters. The use of single nucleotide polymorphisms (“SNPs”) for haplotype-based disease gene mapping and identification is becoming increasingly popular. A significant practical problem associated with SNP haplotype-based gene mapping is related to phase determination. A first step in determining a haplotype from SNP data is to obtain raw sequence (or genotype) data from an individual or collection of individuals by sequencing (or genotyping) their DNA. When proceeding from genomic DNA, two alleles of the target locus are normally present, so the sequencing procedure yields signal for both alleles. An individual who is homozygous at the locus will have one sequence present because the sequence for the two alleles is identical. If the individual is heterozygous at the locus, the sequences at the two alleles will not be identical and there will be ambiguity at polymorphic sites within the locus. For illustration purposes, consider two alleles having the sequence of SEQ ID NO: 1 ATTTGGCCTA and SEQ ID NO: 2 ATCTGGGCTT, at the third position there is either a T or a C, at the seventh position there is either a G or a C, and at the tenth position there is either an A or a T. Thus, the haplotype of the first allele is characterized by TGA at the third, seventh, and tenth positions and the haplotype at the second allele is CCT at the respective positions. The phase problem stems from the fact that the sequencing procedure cannot distinguish between the T and C at the third position, the G or C at the seventh position, and the A and T at the tenth position. There is ambiguity in the raw data. Based only on the raw data, any permutation of the genotypes for the alleles such as TCT, TGT, CGA, TCA, and CCA, can potentially be the naturally occurring haplotype. Resolution of the naturally occurring haplotype can be accomplished by a variety of molecular haplotyping methods including haplotyping of close relatives, allele specific long range PCR, and the such, but these procedures are typically difficult to implement in a cost effective, efficient manner.
Alternatively, in silico haplotype-determination methods have found increasing utility in view of the difficulties related to the empirical experimental determinations. Three general classes of computer implemented haplotype-determination methods are currently available based on parsimony, Bayesian, and expectation-maximum strategies. A parsimony based method relies on a known phased haplotype to generate the haplotypes from a group of genotypes. Parsimony type methods are based on the assumption that the “correct” set of haplotypes that explain a dataset of genotypes is the one that requires the least number of haplotypes. Such a parsimony based method can be implemented according to Clark's algorithm (Clark Mol. Bio. Evo. 7:111-122 (1990)). Clark's method is only feasible when the dataset contains a sufficient number of homozygotes. A Bayesian type approach involves a stochastic search strategy to determine haplotypes (Stephens et al. Am. J. Hum. Gen. 68:978-989 (2001); Niu et al. Am. J. Hum. Gen. 70:157-169 (2002)). Another method involves an expectation-maximum analysis (EM). The EM analysis involves an initial guess at the haplotype frequencies and an optimization of the log-likelihood function through iterative updates of the frequency estimation.
The EM method relates back to a gene counting method devised by Ceppellini et al. (Annals Hum. Gen. 20:97-115 (1955)). Ceppellini et al. solved the problem of obtaining maximum likelihood estimates, “MLEs”, for allele frequencies when the alleles are not expressed codominantly. In the canonical example, frequencies for the alleles of the ABO blood group system are obtained despite O being recessive to both A and B. This is done by allocating the count of individuals with blood type a partially to each of the genotypes AA and AO, with the share being determined by an initial guess at the relative frequencies of the A and O alleles.
Given the inferred genotype counts, it is straightforward to get MLEs of allele frequencies. With these new allele frequency estimates the count for type a individuals is reallocated, which leads to new allele frequency estimates and so on. The method is performed similarly with type b individuals. By iterating through this process the MLEs of the allele frequencies are obtained, given the incomplete information provided by the blood group phenotypes. Dempster et al. (J. Roy. Stat. Soc., Ser. B, 39:1-38 (1977)) provided considerable generalization of this method to what is now the well-known EM algorithm. Gene counting was applied to the problem of estimating haplotype frequencies from observations of genotypes at proximal loci. See, e.g., Excoffier & Slatkin Mol. Bio. Evo. 12:921-927 (1995); Hawley & Kidd J: Hered. 86:772-789 (1995), Long et al. Am. J. Hum. Gen. 56:799-810 (1995).
A substantial problem with the EM method is that the computational requirements of current implementations grow exponentially as the number of loci grows. For instance, in considering k diallelic loci there are 2k possible haplotypes, 2k−1(2k+1) possible unordered haplotype pairs, and 3k possible sets of genotype observations. In some cases, like a very large sample and a few loci, all haplotypes might be observed, or at least cannot be shown not to occur given the incomplete information provided by the genotypes. However, even with moderately sized samples and a moderate number of loci, it is clear that many haplotypes do not accumulate any of the partial counts derived from the genotype counts. The method presented herein makes use of this by building up the set of haplotypes that have non-zero counts or counts above a minimum threshold and then carrying out the EM algorithm using only these haplotypes, and the haplotype pairs and genotype lists derived from them. In most cases the number of haplotypes required is very small compared with the number of all possible haplotypes. Clearly, the total number of haplotypes that can actually exist in a sample cannot be more than twice the number of individuals and will typically be very much smaller. Even with the extra haplotypes that have to be considered due to incomplete information, the total that is needed grows approximately linearly with the size of the sample, rather than exponentially with the number of loci.