The DNA that makes up human chromosomes provides the instructions that direct the production of all proteins in the body. These proteins carry out the vital functions of life. Variations in the sequence of DNA encoding a protein produce variations or mutations in the encoded protein, thus affecting the normal function of cells. Although environment often plays a significant role in disease, variations or mutations in the DNA of an individual are directly related to almost all human diseases, including infectious disease, cancer, and autoimmune disorders. Moreover, knowledge of genetics, particularly human genetics, has led to the realization that many diseases result either from complex interactions among several genes or their products or from any number of mutations within one gene. For example, Type I and Type II diabetes have been linked to multiple genes, each with its own pattern of mutations. In contrast, cystic fibrosis can be caused by any one of over 300 different mutations in a single gene.
Additionally, knowledge of human genetics has led to a limited understanding of how individuals vary in their response to drugs, which is the field of pharmacogenetics. Over half a century ago, adverse drug responses were correlated with amino acid variations in two drug-metabolizing enzymes, plasma cholinesterase and glucose-6-phosphate dehydrogenase. Since then, careful genetic analyses have linked sequence polymorphisms (variations) in over 35 drug metabolism enzymes, 25 drug targets and 5 drug transporters with compromised levels of drug efficacy or safety (Evans and Relling, Science 286:487-91 (1999)). In the clinic, such information is being used to prevent drug toxicity; for example, patients are screened routinely for genetic differences in the thiopurine methyltransferase gene that cause decreased metabolism of 6-mercaptopurine or azathioprine. Yet only a small percentage of observed drug toxicities have been explained adequately by the set of pharmacogenetic markers validated to date. Even more common than toxicity issues may be cases where drugs demonstrated to be safe and/or efficacious for some individuals have been found to have either insufficient therapeutic efficacy or unanticipated side effects in other individuals.
Beyond variations in the genetic makeup of humans, variations in the genetic makeup of non-human organisms, particularly pathogens, are important for understanding the effect of those organisms on humans and their interaction with humans. For example, the expression of virulence factors by pathogenic bacteria or viruses greatly affects the rate and severity of infection in humans who come into contact with such organisms. In addition, a detailed understanding of the genetic makeup of experimental animals, e.g., mice, rats, etc., is also of great value. For example, understanding the variations in the genetic makeup of animals used as model systems for evaluation of therapeutics is important for understanding the test results obtained using these systems and their predictive value for human use.
Because any two humans are 99.9% similar in their genetic makeup, most of the sequence of the DNA of their genomes is identical. However, there are variations in DNA sequence between individuals. For example, there are deletions of multi-base stretches of DNA, insertions of stretches of DNA, variations in the number of repetitive DNA elements in non-coding regions, and changes at single nitrogenous base positions in the genome, called “single nucleotide polymorphisms” (SNPs). Human DNA sequence variation accounts for a large fraction of observed differences between individuals, including susceptibility to disease.
Although most SNPs are rare, it has been estimated that there are 5.3 million common SNPs, each with a frequency of 10-50%, that account for the bulk of the DNA sequence difference between humans. Such SNPs are present in the human genome once every 600 base pairs (Kruglyak and Nickerson, Nature Genet. 27:235 (2001)). Alleles (variants) making up blocks of such SNPs in close physical proximity are often correlated, resulting in reduced genetic variability and defining a limited number of “SNP haplotypes”, each of which reflects descent from a single, ancient ancestral chromosome (Fullerton, et al., Am. J. Hum. Genet. 67:881 (2000)).
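The idea that correlated alleles at neighboring SNPs define only a limited number of haplotypes can be illustrated with a minimal sketch. The chromosome strings below are hypothetical toy data, and the function names count_haplotypes and max_possible_haplotypes are illustrative assumptions rather than anything taken from the cited studies. The sketch simply counts the distinct allele combinations observed across four biallelic SNPs and compares that count with the number that would be possible if the alleles were uncorrelated.

```python
from collections import Counter

# Hypothetical toy data (an assumption for illustration only):
# each string is one chromosome copy, and each character is the
# allele observed at one of four nearby biallelic SNPs.
chromosomes = [
    "ACGT", "ACGT", "ACGT", "ACGT",
    "GTAC", "GTAC", "GTAC",
    "ACAC",
]


def count_haplotypes(copies):
    """Count how often each distinct allele combination occurs."""
    return Counter(copies)


def max_possible_haplotypes(n_snps, alleles_per_snp=2):
    """Upper bound on distinct haplotypes if alleles were uncorrelated."""
    return alleles_per_snp ** n_snps


observed = count_haplotypes(chromosomes)
possible = max_possible_haplotypes(n_snps=4)

print(f"{len(observed)} haplotypes observed of {possible} possible")
for haplotype, n in observed.most_common():
    # Report each haplotype with its frequency in the sample.
    print(haplotype, n / len(chromosomes))
```

In this toy sample only three of the sixteen possible allele combinations appear, which is the kind of reduced variability the paragraph above describes. Real haplotype inference is considerably more involved, since genotype data are usually unphased.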
The complexity of local haplotype structure in the human genome, and the distance over which individual haplotypes extend, is poorly defined. Empirical studies investigating different segments of the human genome in different populations have revealed tremendous variability in local haplotype structure. These studies indicate that the relative contributions of mutation, recombination, selection, population history, and stochastic events to haplotype structure vary in an unpredictable manner, resulting in some haplotypes that extend for only a few kilobases (kb), and others that extend for greater than 100 kb (A. G. Clark et al., Am. J. Hum. Genet. 63:595 (1998)).
These findings suggest that any comprehensive description of the haplotype structure of the human genome, defined by common SNPs, will require empirical analysis of a dense set of SNPs in many independent copies of the human genome. Such whole-genome analyses would provide a fine degree of genetic mapping and pinpoint specific regions of linkage. Until the present invention, however, the practical difficulty and cost of genotyping over 3,000,000 SNPs in each individual of a reasonably sized population have made this endeavor impractical. The present invention enables, among a wide variety of applications, whole-genome association analysis of populations using SNP haplotypes.