This application describes methods for the genetic analysis of biologically, medically and economically significant traits in mammals and other organisms, including humans. Genetic analysis refers to the determination of the nucleotide sequence of a gene or genes of interest in a subject organism, including methods for analysis of one site of sequence variation (i.e. genotyping methods) and methods for analysis of a collection of sequence variations (haplotyping methods). Genetic analysis further includes methods for correlating sequence variation with disease risk, diagnosis, prognosis or therapeutic management.
The use of novel genotyping and haplotyping methods for genetic analysis of the apolipoprotein E (ApoE) gene is described. These methods entail use of novel ApoE DNA sequence polymorphisms and haplotypes. The ApoE alleles and genetic analysis methods of this application will allow more sensitive measurement of the contribution of ApoE genetic variation to medically important phenotypes such as risk of heart disease, risk of Alzheimer's disease and response to various therapeutic interventions, including pharmacotherapy.
This application also describes new methods for genotyping a DNA sample based on analysis of the mass of cleaved DNA fragments using mass spectrometry. These genotyping methods are better suited to the present and future requirements of DNA testing than current genotyping methods as a result of improved accuracy, decreased set-up and reagent costs, reduced complexity and excellent compatibility with automation.
At present, DNA diagnostic testing is largely concerned with identification of rare polymorphisms related to Mendelian traits. These tests have been in use for well over a decade. In the future genetic testing will come into much wider clinical and research use, as a means of making predictive, diagnostic, prognostic and pharmacogenetic assessments. These new genetic tests will in many cases involve multigenic conditions, where the correlation of genotype and phenotype is significantly more complex than for Mendelian phenotypes. To produce genetic tests with the requisite accuracy will require new methods that can simultaneously track multiple DNA sequence variations at low cost and high speed, without compromising accuracy. Many tests will be evaluated in the clinical research setting but only a small fraction will become major diagnostic tests; the clinical research process will reveal that most polymorphisms lack significant functional effects. The genetic analysis methods described in this application are relatively inexpensive to set up and run, while providing extremely high accuracy, and, most important, enabling sophisticated genetic analysis. They are therefore optimally suited to the exigencies of genetic test development in coming years.
The association of specific genotypes with disease risk, prognosis, and diagnosis as well as selection of optimal therapy for disease are some of the benefits expected to ensue from the human genome project. At present, the most common type of genetic study design for testing the association of genotypes with medically important phenotypes is a case control study where allele frequencies are measured in one or more phenotypically defined groups of cases and compared to allele frequencies in controls. (Alternatively, phenotype frequencies in two or more genotypically defined groups are compared.) The majority of such published genetic association studies have focused on measuring the contribution of a single polymorphic site (usually a single nucleotide polymorphism, abbreviated SNP) to variation in a medically important phenotype or phenotypes. In these studies one polymorphism serves as a proxy for all variation in a gene (or even a cluster of adjacent genes).
The limitations of such single polymorphism association analysis are becoming increasingly apparent. Recent articles (e.g. Terwilliger, J. and K. M. Weiss. Linkage disequilibrium mapping of complex disease: fantasy or reality? Current Opinion in Biotechnology 9: 578-594, 1998) have drawn attention to the low quality of most association studies using single polymorphic sites (evidenced by their low degree of reproducibility). Some of the reasons for the lack of reproducibility of many association studies are clear. In particular, the extent of human DNA polymorphism (most genes contain 10 or more polymorphic sites, and many genes contain over 100 polymorphic sites) is such that a single polymorphic site can only rarely serve as a reliable proxy for all variation in a gene (which typically covers at least several thousand nucleotides and can extend over 1,000,000 nucleotides). Even in cases where one polymorphic site is responsible for significant biological variation, there is currently no reliable method for identifying such a site. The haplotyping and genetic analysis methods described in this application provide a systematic way to identify such polymorphic sites.
Several recent studies have begun to outline the extent of human molecular genetic variation. For example, a comprehensive survey of genetic variation in the human lipoprotein lipase (LPL) gene (Nickerson, D. A., et al. Nature Genetics 19: 233-240, 1998; Clark, A. G., et al. American Journal of Human Genetics 63: 595-612, 1998) compared 71 human subjects and found 88 varying sites in a 9.7 kb region. On average any two versions of the gene differed at 17 sites. This and other studies show that sequence variation may be present at approximately 1 in 100 nucleotides when 50 to 100 unrelated subjects are compared. The implication of these data is that, in order to create genetic diagnostic tests of sufficient specificity and selectivity to justify widespread medical use, more sophisticated methods are needed for measuring human genetic variation.
Beyond tests that measure the status of a single polymorphic site, the next level of sophistication in genetic testing is to genotype two or more polymorphic sites and keep track of the genotypes at each of the polymorphic sites when calculating the association between genotypes and phenotypes (e.g. using multiple regression methods). However, this approach, while an improvement on the single polymorphism method in terms of considering possible interactions between polymorphisms, is limited in power as the number of polymorphic sites increases. The reason is that the number of genetic subgroups that must be compared increases exponentially as the number of polymorphic sites increases. In a medical study of fixed size this has the effect of dramatically increasing the number of groups that must be compared, while reducing the size of each subgroup to a small number. The consequence of these effects is an unacceptable loss of statistical power. Consider, for example, a clinical study of a gene that contains 10 variable sites. If each site is biallelic then there are 2^10 = 1,024 possible combinations of alleles at the polymorphic sites. If the study population is 500 subjects then it is likely that many genetically defined subgroups will contain only a small number of subjects. Thus, consideration of multiple polymorphisms (as can be determined from DNA sequence data, for example) does not address the problem that the DNA sequence from a diploid subject does not sufficiently constrain the sequence of the subject's two chromosomes to be very useful for statistical analysis. Only direct determination of the DNA sequence on each chromosome (a haplotype) can constrain the number of genetic variables in each subject to two (allele 1 and allele 2), while accounting for all, or preferably at least a substantial subset of, the polymorphisms.
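The arithmetic behind this loss of power can be illustrated with a short calculation. The following is only a minimal sketch: the function name `subgroup_summary` is hypothetical, and it assumes that subjects are spread evenly across all allele combinations, which real populations are not.

```python
def subgroup_summary(n_sites, n_subjects):
    """Return (number of allele combinations, mean subjects per combination).

    Assumes every site is biallelic, so n_sites sites give 2 ** n_sites
    combinations, and (unrealistically) that subjects are evenly distributed.
    """
    combinations = 2 ** n_sites
    return combinations, n_subjects / combinations


if __name__ == "__main__":
    # Study size of 500 subjects, as in the example in the text.
    for sites in (1, 2, 5, 10):
        combos, mean_size = subgroup_summary(sites, 500)
        print(f"{sites:2d} sites: {combos:5d} combinations, "
              f"{mean_size:.2f} subjects per combination on average")
```

With 10 sites and 500 subjects the average falls below one subject per combination, which is why comparisons between such subgroups carry little statistical weight.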
A much more powerful measure of variation in a DNA segment, then, is a haplotype, that is, the set of polymorphisms that are found on a single chromosome. Because of the evolutionary history of human populations, only a small fraction of all possible haplotypes (given a set of polymorphic sites at a locus) actually occur at appreciable frequency. For example, in a gene with 10 polymorphic sites only a small fraction (perhaps in the range of 1%) of the 1,024 possible haplotypes is likely to exist at a frequency greater than 5% in a human population. Further, as described below, haplotypes can be clustered in groups of related sequences to facilitate genetic analysis. Thus determination of haplotypes is a simplifying step in performing a genetic association study (compared to the analysis of multiple polymorphisms), particularly when applied to DNA segments characterized by many polymorphic sites. There is also a potent biological rationale for sorting genes by haplotype, rather than by genotype at one polymorphic site: polymorphic sites on the same chromosome may interact in a specific way to determine gene function. For example, consider two sites of polymorphism in a gene, both of which encode amino acid changes. The two polymorphic residues may lie in close proximity in three dimensional space (i.e. in the folded structure of the encoded protein). If one of the polymorphic amino acids encoded at each of the two sites has a bulky side chain and the other a small side chain then one can imagine a situation in which proteins that have either [bulky-small], [small-bulky] or [small-small] pairs of polymorphic residues are fully functional, but proteins with [bulky-bulky] residues at the two sites are impaired, on account of a disruptive shape change caused by the interaction of the two bulky side groups. Now consider a subject whose genotype is heterozygous bulky/small at both polymorphic sites. The possible haplotype pairs in such a subject are [bulky-small]/[small-bulky], or [small-small]/[bulky-bulky]. The functional implications of these two haplotype pairs are quite different: active/active or active/inactive, respectively. A genotype test would simply reveal that the subject is doubly heterozygous. Only a haplotype test would reveal the biologically consequential structure of the variation. The interaction of polymorphic sites need not involve amino acid changes, of course, but could also involve virtually any combination of polymorphic sites.
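The ambiguity of the doubly heterozygous genotype can be made concrete by enumerating the haplotype pairs it is consistent with. The sketch below is illustrative only; the function names `haplotype_pairs` and `is_active` are hypothetical, and the activity rule simply encodes the bulky/small example from the text.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Return the distinct unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a list of 2-tuples, one per site, holding the two alleles
    observed at that site (identical for a homozygous site).
    """
    pairs = set()
    # For each site, choose which of its two alleles lies on the first chromosome;
    # the other allele then lies on the second chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        first = tuple(site[c] for site, c in zip(genotype, choice))
        second = tuple(site[1 - c] for site, c in zip(genotype, choice))
        # Sort so that swapping the two chromosomes does not create a new pair.
        pairs.add(tuple(sorted((first, second))))
    return pairs


def is_active(haplotype):
    """Rule from the example in the text: only the bulky-bulky protein is impaired."""
    return haplotype != ("bulky", "bulky")


if __name__ == "__main__":
    double_het = [("bulky", "small"), ("bulky", "small")]
    for pair in sorted(haplotype_pairs(double_het)):
        status = "/".join("active" if is_active(h) else "inactive" for h in pair)
        print(pair, "->", status)
```

The same unphased genotype yields two haplotype pairs with different functional consequences, which is the information a genotype test alone cannot supply.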
The genetic analysis of complex traits can be made still more powerful by use of schemes to cluster haplotypes into related groups based on parsimony, for example. Templeton and coworkers have demonstrated the power of cladograms for analysis of haplotype data. (Templeton, A. R., Boerwinkle, E. and C. F. Sing. A Cladistic Analysis of Phenotypic Associations With Haplotypes Inferred From Restriction Endonuclease Mapping. I. Basic Theory and an Analysis of Alcohol Dehydrogenase Activity in Drosophila. Genetics 117: 343-351, 1987. Templeton, A. R., Crandall, K. A. and C. F. Sing. A Cladistic Analysis of Phenotypic Associations With Haplotypes Inferred From Restriction Endonuclease Mapping and DNA Sequence Data. III. Cladogram Estimation. Genetics 132: 619-633, 1992. Templeton, A. R. and C. F. Sing. A Cladistic Analysis of Phenotypic Associations With Haplotypes Inferred From Restriction Endonuclease Mapping. IV. Nested Analyses with Cladogram Uncertainty and Recombination. Genetics 134: 659-669, 1993. Templeton, A. R., Clark, A. G., Weiss, K. M., Nickerson, D. A., Boerwinkle, E. and C. F. Sing. Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am J Hum Genet. 66: 69-83, 2000). These analyses describe a set of rules for clustering haplotypes into hierarchical groups based on their presumed evolutionary relatedness. The phylogenetic trees can be constructed using standard software packages for phylogenetic analysis such as PHYLIP or PAUP (Felsenstein, J. Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet. 22:521-65, 1988; Retief, J. D. Phylogenetic analysis using PHYLIP. Methods Mol. Biol. 132:243-58, 2000), and hierarchical haplotype clustering can be accomplished using the rules described by Templeton and coworkers. The methods described by Templeton and colleagues further provide for a nested analysis of variance between different haplotype groups at each level of clustering. The results of this analysis can lead to identification of polymorphic sites responsible for phenotypic variation, or at a minimum narrow the possible phenotypically important sites. Thus, methods for determination of haplotypes have great utility in studies designed to test association between genetic variation and variation in phenotypes of medical interest, such as disease risk and prognosis and response to therapy.
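Any clustering of haplotypes by relatedness starts from some measure of how much two haplotypes differ. The sketch below counts pairwise site differences between haplotypes written as equal-length strings. It is only an illustrative distance calculation, not the cladogram estimation rules of Templeton and coworkers and not the algorithms implemented in PHYLIP or PAUP; the function names and the example haplotypes are hypothetical.

```python
from itertools import combinations


def site_differences(hap_a, hap_b):
    """Count the sites at which two equal-length haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return sum(a != b for a, b in zip(hap_a, hap_b))


def pairwise_differences(haplotypes):
    """Map each pair of haplotype names to its number of differing sites."""
    return {
        (name_a, name_b): site_differences(haplotypes[name_a], haplotypes[name_b])
        for name_a, name_b in combinations(sorted(haplotypes), 2)
    }


if __name__ == "__main__":
    # Made-up haplotypes over five polymorphic sites (illustrative only).
    example = {"h1": "ACGTA", "h2": "ACGTG", "h3": "TCGCG"}
    for pair, diff in pairwise_differences(example).items():
        print(pair, diff)
```

Haplotypes separated by a single difference are natural candidates for the same low-level group, with larger groups formed at successively higher levels.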
Currently available methods for the experimental determination of haplotypes are unsatisfactory, particularly methods for the determination of haplotypes over long distances (e.g. >5 kb). One of the few experimental haplotyping methods currently in use outside the research group that devised it is based on allele specific amplification using oligonucleotide primers that terminate at polymorphic sites (Newton, C. R. et al. Amplification refractory mutation system for prenatal diagnosis and carrier assessment in cystic fibrosis. Lancet. December 23-30; 2 (8678-8679):1481-3, 1989; Newton, C. R. et al., Analysis of any point mutation in DNA. The amplification refractory mutation system (ARMS) Nucleic Acids Res. Vol. 17, 2503-2516, 1989). The method is referred to by the acronym ARMS (for amplification refractory mutation system). The ARMS system was subsequently further developed (Lo, Y. M. et al., Direct haplotype determination by double ARMS: specificity, sensitivity and genetic applications. Nucleic Acids Research July 11; 19 (13):3561-7, 1991) and has since been used in a number of other studies. ARMS is the subject of U.S. Pat. Nos. 5,595,890 and 5,853,989. The drawbacks of this method are that (i) the usual limitations of PCR apply in terms of the difficulty of amplifying long DNA segments; (ii) during amplification cycles, an incompletely extended primer extension product may switch (between one or more cycles) from one allelic template strand to the other, resulting in artifactual hybrid haplotypes; and (iii) because different DNA samples will be heterozygous at different combinations of nucleotides, different primers and assay conditions for allele specific amplification must be established for each polymorphic site that is to be haplotyped. For example, consider a locus with five polymorphic sites. Subject A is heterozygous at sites 1, 2 and 4; subject B at sites 2 and 3, and subject C at sites 3 and 5. To haplotype A requires allele specific amplification conditions from sites 1 or 4; to haplotype B requires allele specific amplification conditions from sites 2 or 3, and to haplotype C requires allele specific amplification conditions from sites 3 or 5 (with the allele specific primer from site 3 on the opposite strand from that used to haplotype B).
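The five-site example can be worked through programmatically. The sketch below rests on an assumption made here for illustration: the amplified segment must span every heterozygous site of a subject, so the allele specific primer is placed at the outermost heterozygous site on either side and points inward. The function name and the "forward"/"reverse" strand labels are hypothetical.

```python
def allele_specific_primer_options(heterozygous_sites):
    """Return candidate (site, strand) anchors for allele specific amplification.

    Assumption for illustration: the primer sits at the leftmost heterozygous
    site (extending rightward, "forward") or at the rightmost heterozygous
    site (extending leftward, "reverse"), so the product covers all of them.
    """
    if len(heterozygous_sites) < 2:
        raise ValueError("haplotyping needs at least two heterozygous sites")
    leftmost, rightmost = min(heterozygous_sites), max(heterozygous_sites)
    return [(leftmost, "forward"), (rightmost, "reverse")]


if __name__ == "__main__":
    # Heterozygous sites of the three subjects in the example from the text.
    subjects = {"A": {1, 2, 4}, "B": {2, 3}, "C": {3, 5}}
    needed = set()
    for name, sites in sorted(subjects.items()):
        options = allele_specific_primer_options(sites)
        needed.update(options)
        print(name, options)
    print("distinct primer designs across subjects:", len(needed))
```

Under this assumption site 3 is a reverse-strand anchor for subject B but a forward-strand anchor for subject C, and three subjects already call for several distinct allele specific assays, which illustrates drawback (iii).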
A similar method for achieving allele specific amplification takes advantage of some thermostable polymerases' ability to proofread and remove a mismatch at the 3′ end of a primer. Again, primers are designed with the 3′ terminal base positioned opposite to the variant base in the template. In this case the 3′ base of the primer is modified in a way that prevents it from being extended by the 5′-3′ polymerase activity of a DNA polymerase. Upon hybridization of the end-blocked primer to the complementary template sequence, the 3′ base is either matched or mismatched, depending on which alleles are present in the sample. If the 3′ base of the primer is properly base paired, the polymerase does not remove it from the primer; the blocked 3′ end therefore remains intact and the primer cannot be extended. However, if there is a mismatch between the 3′ end of the primer and the template, then the 3′-5′ proofreading activity of the polymerase removes the blocked base, the primer can be extended, and amplification occurs. This method suffers from the same limitations described above for the ARMS procedure.
Other allele specific PCR amplification methods include further methods in which the 3′ terminal base of the primer forms a match with one allele and a mismatch with the other allele (U.S. Pat. No. 5,639,611), PCR amplification and analysis of intron sequences (U.S. Pat. No. 5,612,179 and U.S. Pat. No. 5,789,568), or amplification and identification of polymorphic markers in a chromosomal region of DNA (U.S. Pat. No. 5,851,762). Further, methods for allele-specific reverse transcription and PCR amplification to detect mutations (U.S. Pat. No. 5,804,383), and a primer-specific and mispair extension assay to detect mutations or polymorphisms (PCT/CA99/00733) have been described. Several of these methods are directed to genotyping, not to haplotyping.
Other haplotyping methods that have been described are based on analysis of single sperm cells (Hubert R., Stanton, V. P. Jr, Aburatani H, et al. Sperm typing allows accurate measurement of the recombination fraction between D3S2 and D3S3 on the short arm of human chromosome 3. Genomics. 1992 April; 12(4):683-687); on limiting dilution of a DNA sample (until only one template molecule is present in each test tube, on average) (Ruano, G., Kidd, K. K. and J. C. Stephens. Haplotype of multiple polymorphisms resolved by enzymatic amplification of single DNA molecules. Proc Natl Acad Sci USA 1990 August; 87(16):6296-6300), or on cloning DNA into various vectors and host microorganisms (U.S. Pat. No. 5,972,614). These methods are not practical for clinical studies of human subjects, and generally have not been used in studies of human disease risk or drug response. For example, sperm based haplotyping methods are not generally useful for clinical studies because no sperm has the same haplotype as its host. Limiting dilution methods are technically challenging—two rounds of PCR amplification are required, with stringent controls for preventing contamination by exogenous DNA—and not compatible with the high throughput, accuracy and reliability required in human clinical studies.