Variation in the human genome sequence is an important determinative factor in the etiology of many common medical conditions. Heterozygosity in the human population is attributable to common variants of a given genetic sequence, and those skilled in the art have sought to comprehensively identify common genetic variations and to link such variations to medical conditions [Lander, Science 274:536, 1996; Collins et al., Science 278:1580, 1997; Risch, Science 273:1516, 1996]. Recently, it has been estimated that 4 million [Sachidanandam et al., Nature 409:928, 2001; Venter et al., Science 291:1304, 2001] of the estimated 10 million [Kruglyak, Nature Genet 27:234, 2001] common single nucleotide polymorphisms (SNPs) are already known. Developments in the field of DNA sequence analysis are thus leading to a rapid accumulation of partially and completely sequenced genomes. The next challenge involves obtaining an inventory of sequence variations (genetic polymorphisms) found in population samples, and using that information to unravel the genetic basis of the phenotypic variation observed among the individuals of that population. Ideally, such analyses would directly reveal the causative genetic variants that biochemically determine the phenotype.
In practice, the identification of loci/polymorphisms that have important phenotypic effects involves searching through a large set of sequence variations to find surrogate markers that are statistically associated with the phenotypic differences through linkage disequilibrium (LD) with variation(s) (at other sites) that are directly causative. LD is the non-random association of alleles at adjacent polymorphisms. When a particular allele at one site is found to be co-inherited with a specific allele at a second site more often than expected if the sites were segregating independently in the population, the loci are in disequilibrium. LD has recently become the focus of intense study in the belief that it might offer a shortcut to the mapping of functionally important loci through whole-genome association studies.
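The notion of two sites being co-inherited "more often than expected" can be made concrete with the standard pairwise statistics D (the observed haplotype frequency minus the product of the allele frequencies) and r2 (D squared, normalized by the allele-frequency variances). The following is a minimal sketch only; the function name `ld_stats` and the representation of each phased chromosome as a two-character string are assumptions made for illustration.

```python
import math
from collections import Counter

def ld_stats(haplotypes):
    """Return (D, r2) for two bi-allelic sites.

    `haplotypes` holds one 2-character string per phased chromosome,
    e.g. "AB" = allele A at site 1 and allele B at site 2 (hypothetical encoding).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    # Use the alleles of the first chromosome as the reference allele at each site.
    ref1, ref2 = haplotypes[0][0], haplotypes[0][1]
    p1 = sum(c for h, c in counts.items() if h[0] == ref1) / n
    p2 = sum(c for h, c in counts.items() if h[1] == ref2) / n
    p12 = counts[ref1 + ref2] / n
    # D is zero when the two sites segregate independently.
    d = p12 - p1 * p2
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    # r2 is undefined if either site is monomorphic in the sample.
    r2 = d * d / denom if denom > 0 else math.nan
    return d, r2

# Complete association between the two sites: D = 0.25, r2 = 1.0
print(ld_stats(["AB"] * 5 + ["ab"] * 5))
```

When all four allele combinations occur at equal frequency, the same function returns D = 0 and r2 = 0, which corresponds to independent segregation.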
Unfortunately, LD is not a simple function of distance and the patterns of genetic polymorphisms, shaped by the various genomic processes and demographic events, appear complex. Gene-mapping studies critically depend on knowledge of the extent and spatial structure of LD because the number of genetic markers should be kept as small as possible so that such studies can be applied in large cohorts at an affordable cost. Thus, an important analytical challenge is to identify the minimal set of SNPs with maximum total relevant information and to balance any reduction in the variation that is examined against the potential reduction in utility/efficiency of the genome-wide survey. Any SNP selection algorithm that is ultimately used should also account for the cost and difficulty of designing an assay for a given SNP on a given platform—a particular SNP may be the most informative in a region but it may also be difficult to measure.
Apart from humans, few species have thus far been surveyed extensively for SNPs. One study [Tenaillon et al., Proc. Natl. Acad. Sci. USA 98: 9161-9166, 2001] investigated the sequence diversity in 21 loci distributed along chromosome 1 of maize (Zea mays ssp. mays L.). The sample consisted of 25 individuals representing 16 exotic landraces and nine U.S. inbred lines. The first and most apparent conclusion from this study was that maize is very diverse, containing on average one SNP every 28 bp in the sample. This level of diversity is higher than that of either humans or Drosophila melanogaster. A second major conclusion from the study was that extended regions of high LD may be uncommon in maize and that genome-wide surveys for association analyses in maize require marker densities of one SNP every 100 to 200 bp.
Multi-SNP haplotypes have been proposed as more efficient and informative genetic markers than individual SNPs [Judson et al., Pharmacogenomics 1: 15-26, 2000; Judson et al., Pharmacogenomics 3: 379-391, 2002; Stephens et al., Science 293: 489-493, 2001; Drysdale et al., Proc. Natl. Acad. Sci. USA 97: 10483-10488, 2000; Johnson et al., Nat. Genet. 29: 233-237, 2001]. Haplotypes capture the organization of variation in the genome and provide a record of a population's genetic history. Therefore, disequilibrium tests based on haplotypes have greater power than tests based on single markers to track an unobserved, but evolutionarily linked, variable site.
Recent studies in human genetics [Daly et al., Nat. Genet. 29: 229-232, 2001; Daly et al., patent application US 2003/0170665 A1; Patil et al., Science 294: 1719-1723, 2001; Gabriel et al., Science 296: 2225-2229, 2002; Dawson et al., Nature 418: 544-548, 2002; Phillips et al., Nat. Genet. 33: 382-387, 2003; reviewed by Wall & Pritchard, Nature Rev. Genet. 4: 587-597, 2003] have shown that at least part of the genome can be parsed into blocks: sizeable regions over which there is little evidence for recombination and within which only a few common haplotypes are observed, i.e. the sequence variants observed in a block often appear in the same allelic combinations in the majority of individuals. The major attraction of the ‘haplotype block’ model is that it may simplify the analysis of genetic variation across a genomic region: the idea is that a limited number of common haplotypes capture most of the genetic variation across sizeable regions and that these prevalent haplotypes (and the undiscovered variants contained in these haplotypes) can be diagnosed with the use of a small number of ‘haplotype tag’ SNPs (htSNPs). The ‘haplotype block’ concept has fuelled the International HapMap Project [http://www.hapmap.org; Dennis C., Nature 425: 758-759 (2003)]. So far, the haplotype block structure has only been investigated in humans.
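The idea that a few common haplotypes in a block can be diagnosed with a small number of htSNPs can be illustrated by searching for the smallest set of sites whose alleles distinguish every common haplotype. The sketch below uses an exhaustive search, which is only practical for short blocks, and is not the selection procedure of any of the cited studies; the function name `min_tag_snps` and the 0/1 string encoding of haplotypes are assumptions.

```python
from itertools import combinations

def min_tag_snps(common_haplotypes):
    """Return the smallest list of site indices that tells all haplotypes apart.

    `common_haplotypes` is a list of equal-length strings, one character per SNP
    (hypothetical encoding, e.g. "0011").
    """
    distinct = set(common_haplotypes)
    n_sites = len(common_haplotypes[0])
    # Try subsets of increasing size so the first hit is a minimal one.
    for k in range(n_sites + 1):
        for sites in combinations(range(n_sites), k):
            # Project each haplotype onto the candidate sites.
            signatures = {tuple(h[i] for i in sites) for h in distinct}
            if len(signatures) == len(distinct):
                return list(sites)
    return list(range(n_sites))

# Three common haplotypes over four SNPs; sites 0 and 2 suffice to diagnose them.
print(min_tag_snps(["0000", "0011", "1101"]))  # [0, 2]
```

In this example sites 0 and 1 carry identical information, as do sites 2 and 3 for the first two haplotypes, so two of the four SNPs need to be genotyped.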
Others have reported that a large proportion (75-85%) of the human and Drosophila melanogaster genomes are spanned by so-called “yin-yang haplotypes”, i.e. a pair of high-frequency haplotypes that are completely opposed in that they differ at every SNP [Zhang et al., Am. J. Hum. Genet. 73: 1073-1081, 2003].
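The defining property of a yin-yang pair, namely that the two haplotypes differ at every SNP, can be checked directly. This is a minimal sketch; the function name `is_yin_yang` and the string encoding of haplotypes are illustrative assumptions.

```python
def is_yin_yang(hap_a, hap_b):
    """Return True if two haplotypes carry different alleles at every site."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must span the same SNPs")
    # Completely opposed: no site at which the two haplotypes share an allele.
    return all(x != y for x, y in zip(hap_a, hap_b))

print(is_yin_yang("ACGT", "GTAC"))  # True
print(is_yin_yang("ACGT", "ATAC"))  # False (same allele at the first site)
```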
Most recently, Carlson and coworkers [Carlson et al., Am. J. Hum. Genet. 74: 106-120, 2004] developed an algorithm to select the maximally informative subset of SNPs (referred to as tagSNPs) for assay in association studies. The selection algorithm is based on the pattern of LD rather than the ‘haplotype block’ concept. It makes use of the r² LD statistic to group SNPs into bins of associated sites. Within a bin, any SNP that exceeds an adequately stringent r² threshold with all other sites in the bin may serve as a tagSNP, and only one tagSNP needs to be genotyped per bin. SNPs that do not exceed the threshold with any other SNP in the region under study are placed in singleton bins.
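A simplified greedy version of such a binning procedure might look as follows. This is a sketch under stated assumptions and not the published implementation: it takes a precomputed symmetric matrix of pairwise r2 values, omits any allele-frequency filtering, and uses an illustrative threshold of 0.8; the function name `greedy_bins` is hypothetical.

```python
def greedy_bins(r2, threshold=0.8):
    """Group SNP indices into bins from a symmetric pairwise r2 matrix.

    Returns a list of (bin_members, tag_snps) tuples, in the order the bins are formed.
    """
    unbinned = set(range(len(r2)))
    bins = []
    while unbinned:
        def partners(i):
            # Unbinned SNPs whose r2 with SNP i exceeds the threshold.
            return {j for j in unbinned if j != i and r2[i][j] > threshold}
        # Seed the bin with the SNP that has the most partners (ties: lowest index).
        seed = max(sorted(unbinned), key=lambda i: len(partners(i)))
        members = partners(seed) | {seed}
        # A tagSNP must exceed the threshold with every other member of the bin.
        tags = sorted(i for i in members
                      if all(r2[i][j] > threshold for j in members if j != i))
        bins.append((sorted(members), tags))
        unbinned -= members
    return bins

# SNP 0 is associated with SNPs 1 and 2; SNP 3 is associated with nothing.
r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.50, 0.10],
    [0.85, 0.50, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
print(greedy_bins(r2))  # [([0, 1, 2], [0]), ([3], [3])]
```

In the example, SNPs 1 and 2 belong to the first bin but cannot tag it because their mutual r2 of 0.50 is below the threshold, and SNP 3 ends up in a singleton bin.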
The determination of haplotypes from diploid unrelated individuals, heterozygous at multiple loci, is difficult. Conventional genotyping techniques do not permit determination of the phase of several different markers. For example, a genomic region with N bi-allelic SNPs can theoretically yield 2^N haplotypes in the case of complete equilibrium, whereas the actual number should be less than the number of SNPs in the absence of recombination events and recurrent mutations [Harding et al., Am. J. Hum. Genet. 60: 772-789, 1997; Fullerton et al., Am. J. Hum. Genet. 67: 881-900, 2000]. Large-scale studies [Stephens et al., Science 293: 489-493, 2001] indicate that the haplotype variation is slightly greater than the number of SNPs.
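The phase ambiguity can be illustrated by enumerating the haplotype pairs that are consistent with an unphased genotype: an individual heterozygous at k sites is compatible with 2^(k-1) distinct pairs when k is at least 1. The sketch below assumes a hypothetical coding in which each site is recorded as 0 or 2 for the two homozygotes and 1 for a heterozygote; the function name `compatible_pairs` is likewise illustrative.

```python
from itertools import product

def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a string over {'0', '1', '2'} giving the count of the '1' allele
    at each bi-allelic site (assumed coding).
    """
    het = [i for i, g in enumerate(genotype) if g == "1"]
    # Homozygous sites are fixed: both haplotypes carry the same allele.
    base = ["1" if g == "2" else "0" for g in genotype]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, bits):
            # At a heterozygous site the two haplotypes carry opposite alleles.
            h1[i] = b
            h2[i] = "0" if b == "1" else "1"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

print(compatible_pairs("11"))        # [('00', '11'), ('01', '10')]
print(len(compatible_pairs("111")))  # 4
```

A genotype with at most one heterozygous site has a single compatible pair, so its phase is unambiguous.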
One approach for determining haplotypes is the use of molecular techniques to separate the two homologous genomic DNAs. DNA cloning, somatic cell hybrid construction [Douglas et al., Nat. Genet. 28: 361-364, 2001], allele-specific PCR [Ruano & Kidd, Nucl. Acids Res. 17: 8392, 1989], and single molecule PCR [Ruano et al., Proc. Natl. Acad. Sci. USA 87: 6296-6300, 1990; Ding & Cantor, Proc. Natl. Acad. Sci. USA 100: 7449-7453, 2003] have all been used. Alternatively, haplotypes may be resolved (partially) when the genotypes of first-degree relatives are available, e.g. father-mother-offspring trios [Wijsman E. M., Am. J. Hum. Genet. 41: 356-373, 1987; Daly et al., Nat. Genet. 29: 229-232, 2001].
To avoid the difficulties and cost of experimental and pedigree-based approaches, several computational algorithms have been developed to predict the phase from unrelated individuals or to estimate the population-haplotype frequencies. The approaches include Clark's parsimony method [Clark A. G., Mol. Biol. Evol. 7: 111-121, 1990], maximum likelihood methods such as the EM algorithm [Excoffier & Slatkin, Mol. Biol. Evol. 12: 921-927, 1995], methods based on Bayesian statistics such as PHASE [Stephens et al., Am. J. Hum. Genet. 68: 978-989, 2001] and HAPLOTYPER [Niu et al., Am. J. Hum. Genet. 52: 102-109, 2002], and perfect phylogeny-based methods [Bafna et al., J. Comput. Biol. 10: 323-340, 2003]. These computational methods all have limitations in accuracy (dependent on the number of SNPs being handled and the size of the population being examined) and scalability.
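As an illustration of the maximum likelihood approach, the following sketch estimates haplotype frequencies from unphased genotypes with a basic expectation-maximization loop. It is a simplified sketch and not the published implementation: it assumes Hardy-Weinberg proportions, a 0/1/2 genotype coding (count of the '1' allele per site), a uniform starting point and a fixed number of iterations, and the names `phase_options` and `em_haplotype_freqs` are hypothetical. The enumeration of phase options grows exponentially with the number of heterozygous sites, which reflects the scalability limitation noted above.

```python
from collections import defaultdict
from itertools import product

def phase_options(genotype):
    """Unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == "1"]
    base = ["1" if g == "2" else "0" for g in genotype]
    options = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "0" if b == "1" else "1"
        options.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(options)

def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate population haplotype frequencies by EM (assumes Hardy-Weinberg)."""
    options = [phase_options(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        expected = defaultdict(float)
        for opts in options:
            # E-step: probability of each phase given the current frequencies.
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in opts]
            total = sum(weights)
            for (a, b), w in zip(opts, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M-step: new frequencies are the expected haplotype counts per chromosome.
        freq = {h: expected[h] / n_chrom for h in haps}
    return freq

# Eight unambiguous homozygotes inform the phase of the two double heterozygotes.
genotypes = ["00"] * 4 + ["22"] * 4 + ["11"] * 2
freqs = em_haplotype_freqs(genotypes)
print({h: round(f, 3) for h, f in freqs.items()})
```

In this example the estimate converges towards a frequency of 0.5 for each of the haplotypes 00 and 11, with the alternative phase (01 with 10) receiving essentially no weight.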
A number of recent empirical studies [supra] have greatly augmented the knowledge of the overall structure of genetic variation. It should be noted, however, that the haplotype block concept, for example, remains to be validated, that not all regions of the human genome may fit the concept, and that the concept may have limited value in other species. Irrespective of the outcome, the complexities of genetic variation data are such that the art would greatly benefit from novel breakthroughs that advance the understanding of the organization of a population's genetic variation, which would eventually lead to the identification/development of the most informative markers. Discoveries about the structure of genetic variations would be useful in different areas, including (i) genome-wide association studies, (ii) clinical diagnosis, (iii) plant and animal breeding, and (iv) the identification of micro-organisms.