Single nucleotide polymorphisms (SNPs) are genetic polymorphisms that occur every 250-350 base pairs in the human genome (Beaudet et al. 2001). SNPs are useful, for example, for mapping the genetic components of complex diseases and drug responses. Because SNPs are typically biallelic, SNP genotyping is more amenable to automation and miniaturization than the genotyping of microsatellite loci. SNP genotyping can use a variety of high-throughput genotyping platforms, such as mass spectrometry (Ross et al. 1998), molecular beacons (Tyagi and Kramer 1996), the TaqMan™ assay (Ranade et al. 2001), and high-density oligonucleotide microchips (Hacia et al. 1999), as well as other methods such as denaturing high performance liquid chromatography and fluorescence-based DNA sequencing (Niu et al. 2001a) or in silico SNP screening (Cox et al. 2001).
A variety of genotyping techniques, including many implementations of the above examples, are available to determine, for a given locus, whether an individual has a particular allele. Frequently, these techniques provide only unphased genetic information. In other words, the methods can indicate whether a particular allele is present in an individual at a given locus, but not whether it is on the same chromosome as other alleles. In contrast, phased genotypic information includes information about whether a particular allele is on the same chromosome as an allele of another locus. Resolving an individual's haplotype (in other words, “phasing” the genotypic information) requires determining or inferring whether an allele is present on the maternal chromosome, the paternal chromosome, both chromosomes, or neither. Haplotypic information includes the results of such a determination for multiple linked alleles.
In the following example, the presence of a particular allele of a biallelic marker is indicated by a “1”; its absence is indicated by a “0.” There is usually more than one possible solution to the phasing problem, as is evident from the following example, which considers five linked biallelic loci:
UNPHASED: 1 1 0 1 0
MATERNAL: 1 1 0 1 0
PATERNAL: 1 0 0 0 0
Absent other information, alternate solutions are possible. One alternate solution is:
UNPHASED: 1 1 0 1 0
MATERNAL: 0 1 0 1 0
PATERNAL: 1 1 0 0 0
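The ambiguity in the two solutions above can be made concrete with a short enumeration. The following is a minimal sketch, assuming the presence/absence coding used in the example (a “1” in the unphased row means the allele is present on at least one chromosome); the function name `consistent_phasings` is hypothetical and is not taken from any cited method.

```python
from itertools import product


def consistent_phasings(unphased):
    """Yield every (maternal, paternal) assignment consistent with `unphased`.

    Assumed coding: unphased[i] == 1 means the allele is present on at least
    one chromosome at locus i; 0 means it is absent from both.
    """
    per_locus = []
    for present in unphased:
        if present:
            # Allele on the maternal copy only, the paternal copy only, or both.
            per_locus.append([(1, 0), (0, 1), (1, 1)])
        else:
            # Allele absent from both chromosomes.
            per_locus.append([(0, 0)])
    for combo in product(*per_locus):
        maternal = tuple(m for m, _ in combo)
        paternal = tuple(p for _, p in combo)
        yield maternal, paternal


unphased = (1, 1, 0, 1, 0)
phasings = list(consistent_phasings(unphased))
```

Under this coding, three of the five loci carry the allele and each admits three assignments, so the enumeration contains 27 labeled (maternal, paternal) assignments, including both solutions shown above.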
However, the tremendous amount of SNP data presents a challenge for haplotype determination. The challenge arises in part because (1) a single SNP has a relatively low information content, and (2) for a gene with multiple tightly linked SNPs, not only would the linkage disequilibrium (LD) information contained in flanking markers be ignored in the single SNP-based approach, but also a Bonferroni correction is often required to protect against an inflated type I error. Thus, the “haplotype-centric” approach, which combines the information of adjacent SNPs into composite multi-locus haplotypes, is more desirable. Haplotypes not only are more informative, but also capture the regional LD information, which is arguably more robust and powerful (Pritchard 2001; Akey et al. 2001; Daly et al. 2001).
For autosomal loci, if only the multilocus phenotypes (“phenotype,” in this context, denotes unphased genotype configurations) for each individual are provided, the phase information for those individuals with multiply heterozygous phenotypes is inherently ambiguous. Whenever a particular individual has no more than one heterozygous site, the situation is simple and the individual's haplotype phase can be resolved with certainty. True resolution for the ambiguous (i.e., multiply heterozygous) phenotypes depends on molecular haplotyping or typing of close biological relatives. For molecular haplotyping, existing methods include single molecule dilution (Ruano et al. 1990), allele-specific long-range PCR (Michalatos-Beloin et al. 1996), isothermal rolling circle amplification (Lizardi et al. 1998), long-insert cloning (Bradshaw et al. 1995; Ruano et al. 1990), carbon nanotube probing (Woolley et al. 2000), and the diploid-to-haploid conversion method (Douglas et al. 2001). See also Judson and Stephens (2001) for a discussion.
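The statement that phase is certain with at most one heterozygous site, and ambiguous otherwise, can be illustrated by enumerating the compatible haplotype pairs. This is a minimal sketch under an assumed coding that does not appear in the text: each locus is recorded as 0, 1, or 2 copies of the “1” allele, so a value of 1 marks a heterozygous site. The function name `haplotype_pairs` is hypothetical.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with an unphased genotype.

    Assumed coding: genotype[i] is the number of copies (0, 1, or 2) of the
    "1" allele at locus i, so genotype[i] == 1 marks a heterozygous site.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are identical on both haplotypes.
    base = [1 if g == 2 else 0 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]
    pairs = []
    # Pin the first heterozygous site to haplotype A so that (A, B) and
    # (B, A) are not counted as two different solutions.
    for bits in product((0, 1), repeat=len(het_sites) - 1):
        hap_a, hap_b = base[:], base[:]
        for site, bit in zip(het_sites, (1,) + bits):
            hap_a[site] = bit
            hap_b[site] = 1 - bit
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs


one_het = haplotype_pairs((2, 1, 0, 0, 2))    # a single heterozygous site
three_het = haplotype_pairs((1, 1, 0, 1, 0))  # three heterozygous sites
```

With k heterozygous sites (k at least 1) the sketch returns 2^(k-1) pairs, so `one_het` holds a single resolved pair while `three_het` holds four candidates.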
The typing of close relatives can reduce the phase ambiguity, but the phase determination can still be problematic when the number of loci is only moderately large (Hodge et al. 1999).
Existing in silico haplotype determination methods can be used to phase commonly occurring haplotypes in a reasonably sized sample of individuals even when some of the model assumptions are strongly violated. There are primarily three categories of algorithms for inferring haplotype phases of individual genotype data: Clark's algorithm (Clark 1990), the expectation-maximization (EM) algorithm (Excoffier and Slatkin 1995; Chiano and Clayton 1998; Hawley and Kidd 1995; Long et al. 1995), and a pseudo-Bayesian algorithm (Stephens et al. 2001a).
Clark's parsimony approach attempts to assign the smallest number of haplotypes for the observed genotype data through convoluted updating of the haplotype list starting from phase-unambiguous individuals. Clark's algorithm has been used to delineate gene-based haplotype variations (Stephens et al. 2001b) and the genome-wide LD in populations with different histories (Reich et al. 2001).
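The general idea of starting from phase-unambiguous individuals and growing the haplotype list can be sketched as follows. This is a simplified illustration and not a reproduction of the published algorithm: it assumes genotypes coded as 0, 1, or 2 copies of the “1” allele per locus, it scans individuals in input order, and the function name `clark_phase` is hypothetical.

```python
def clark_phase(genotypes):
    """Simplified parsimony-style phasing seeded from unambiguous individuals.

    Assumed coding: each genotype is a tuple of 0/1/2 allele counts.
    Returns (resolved, known), where `resolved` maps an individual's index to
    a haplotype pair and `known` is the ordered list of haplotypes found.
    """
    known = []
    resolved = {}
    # Seed: individuals with at most one heterozygous site have a certain phase.
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            hap_a = tuple(1 if x >= 1 else 0 for x in g)
            hap_b = tuple(1 if x == 2 else 0 for x in g)
            resolved[idx] = (hap_a, hap_b)
            for h in (hap_a, hap_b):
                if h not in known:
                    known.append(h)
    # Repeatedly explain ambiguous genotypes as (known haplotype, complement).
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in resolved:
                continue
            for h in known:
                complement = tuple(x - y for x, y in zip(g, h))
                if all(c in (0, 1) for c in complement):
                    resolved[idx] = (h, complement)
                    if complement not in known:
                        known.append(complement)
                    progress = True
                    break
    return resolved, known


genotypes = [(2, 1, 0), (1, 1, 1), (1, 2, 1)]
resolved, known = clark_phase(genotypes)
```

In this sketch the result depends on the order in which individuals and known haplotypes are visited, and genotypes that match no known haplotype are left unresolved.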
The EM algorithm starts with an initial guess of haplotype frequencies and iteratively updates the frequency estimates so as to maximize the log-likelihood function. EM-based haplotype estimation has been used in transmission disequilibrium tests (Zhao et al. 2000), and can function under a wide range of parameter settings (Fallin and Schork 2001).
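The iteration described above can be sketched in a few lines. This is a minimal illustration rather than any cited implementation: it assumes genotypes coded as 0, 1, or 2 copies of the “1” allele per locus, random mating (Hardy-Weinberg proportions) so that a pair's probability is the product of its haplotype frequencies, a uniform initial guess, and a fixed iteration count in place of a convergence test. The names `haplotype_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def haplotype_pairs(genotype):
    """Unordered haplotype pairs compatible with a 0/1/2-coded genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]
    pairs = []
    for bits in product((0, 1), repeat=len(het_sites) - 1):
        hap_a, hap_b = base[:], base[:]
        for site, bit in zip(het_sites, (1,) + bits):
            hap_a[site] = bit
            hap_b[site] = 1 - bit
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming random mating."""
    expansions = [haplotype_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform initial guess
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: posterior weight of each compatible pair.
            weights = [
                freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs
            ]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts over 2N chromosomes.
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


sample = [(1, 1), (2, 2), (2, 2), (0, 0), (0, 0)]
freq = em_haplotype_frequencies(sample)
```

In the toy sample, the doubly heterozygous individual is compatible with either the pair (1,1)/(0,0) or the pair (1,0)/(0,1); because the homozygous individuals carry only the first two haplotypes, the estimates shift almost all weight onto that pair.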
Stephens et al. (2001a) employed an iterative stochastic sampling strategy (the Pseudo-Gibbs Sampler, or PGS henceforth) for the assignment of haplotype phases. The performance of the PGS is likely due to both the employment of a stochastic search strategy and the incorporation of the coalescence theory in its iteration steps. The coalescence model is appropriate, for example, for describing a stable population that has evolved for a long period of time.