A major challenge for biology and medicine today is the identification of genes implicated in common, complex human diseases such as asthma, type 2 diabetes mellitus, and obesity. Such genes are usually identified by performing linkage and/or association studies in large family or patient samples. These studies can be performed using a variety of genetic markers (sequences in the genome which differ between individuals, i.e., polymorphisms). The most widely used polymorphisms are microsatellite markers, which consist of short, specific repeat sequences, and single nucleotide polymorphisms (SNPs), which differ in just one nucleotide. Different analysis technologies have been developed to genotype these markers, such as gel-based electrophoresis, DNA hybridization to an ordered array, and identification using mass spectrometry.
A major goal of genetic analysis is to link a phenotype (i.e., a qualitative or quantitative measurable feature of an organism) to a gene or a number of genes. Historically, two genetic approaches have been applied to identify genetic loci responsible for a phenotype: familial linkage studies and association studies. Whatever the approach, the genetic studies are based on polymorphisms, i.e., base differences in the DNA sequence between two individuals at the same genetic locus. The existence of sequence differences for the same genetic locus is called allelic variation, and different alleles of a gene can result in different expression of a given phenotype.
Linkage analysis has been the preferred method to identify genes implicated in many diseases, both monogenic and multigenic, provided that only one gene is implicated in each patient. Linkage analysis follows the inheritance of alleles in a family and attempts to link certain alleles to a phenotype (e.g., a disease). In other words, one looks for shared alleles between individuals with the same phenotype that are identical by descent (IBD), i.e., are derived from the same ancestor. In order to be reasonably powerful for statistical analysis, the studied polymorphisms have to fulfill several criteria, such as high heterozygosity (this increases informativity), genome-wide representation, and detectability with standard laboratory methods.
One type of polymorphism fulfilling most of these criteria is the microsatellite marker. Microsatellite markers are repetitive sequence elements of two (e.g., CA), three or four bases. The number of repetitions is variable for a given locus, resulting in a high number of possible alleles, i.e., high heterozygosity (70–90%). Microsatellites are widely distributed over the genome, and presently, almost 20,000 microsatellite markers have been identified and mapped (coverage ~0.5–2 megabases).
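The link between the number of alleles and heterozygosity can be illustrated with the standard expected-heterozygosity formula, H = 1 - sum(p_i^2), which assumes Hardy-Weinberg proportions. The following is only a minimal sketch: the function name and the allele frequencies are hypothetical and are not taken from any particular marker set.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2), assuming Hardy-Weinberg proportions."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical bi-allelic marker with two equally frequent alleles.
snp = expected_heterozygosity([0.5, 0.5])

# Hypothetical microsatellite with ten equally frequent repeat-length alleles.
microsatellite = expected_heterozygosity([0.1] * 10)

print(f"bi-allelic marker: {snp:.2f}")
print(f"ten-allele microsatellite: {microsatellite:.2f}")
```

Under these assumed frequencies, a bi-allelic marker cannot exceed a heterozygosity of 0.5, whereas the ten-allele locus reaches about 0.9, which is in line with the 70–90% range quoted for microsatellites.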
Microsatellite markers are the preferred genetic markers for linkage analysis. Genotyping of these markers may be performed by amplifying the alleles by PCR followed by size separation in a gel matrix (slab gel or capillary). For the study of complex human diseases, usually about 400–600 microsatellite markers are used that are distributed at regular intervals over the whole genome (about every 10–15 megabases).
There are many advantages associated with familial linkage studies, such as established, well-mapped marker systems (e.g., microsatellite markers); well-developed statistical analysis tools; high informativity; the ability to dissect several loci involved in a genotype in parallel (meta-analysis); and the existence of well-developed comparative maps between species.
There are, however, many disadvantages of familial linkage studies. Costs are high, owing to the multiple polymerase chain reactions, allele scoring, and fluorescent marker labelling involved. The approach is generally slow because, although some multiplexing can be achieved, high parallelization is not possible (there are no microsatellite DNA chips). Statistical power for dissecting small effects is limited, and results depend on allele frequencies and heterozygosity. Extensive family collections with affected individuals are necessary (200–2000 individuals). Finally, IBD regions usually extend over large intervals, often 10–15 megabases, which are unsuitable for direct gene cloning (low resolution).
The other approach to genetic analysis relies on association studies. In contrast to linkage studies, which follow alleles in families, association studies follow the evolution of a given allele in a population. The underlying assumption is that at a given time in evolutionary history one polymorphism became fixed to a phenotype, either because the polymorphism is itself responsible for a change in phenotype or because the polymorphism is physically very close to such an event and is therefore rarely separated from the causative sequence element by recombination (i.e., the polymorphism is in linkage disequilibrium with the causative event). This is a fundamental difference between linkage and association.
In a genetically acquired trait, however, there must be linkage of a sequence to the causative allele. Even if one could perform an infinitely dense linkage experiment, there is no a priori reason to expect a single (or very few) causative allele(s) in the population (i.e., that there is association). This has major implications for statistical analysis. Many monogenic diseases, such as maturity onset diabetes of the young (MODY), where almost each family carries a different mutation in the same gene, are examples of linkage without association. In this case, association studies would have failed to identify the locus. As association studies postulate the existence of one given allele for a trait of interest, one wants the markers for an association study to be simple. Accordingly, the markers of choice for these studies are single nucleotide polymorphisms (SNPs). These polymorphisms show a single base exchange at a given locus (i.e., they are bi-allelic, rarely tri-allelic). Association studies can be carried out either in population samples (cases vs. controls) or family samples (parents and one offspring, where the transmitted alleles constitute the “cases” and the non-transmitted the “controls”).
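A comparison of allele counts between “cases” and “controls” for a bi-allelic SNP can be sketched as a Pearson chi-square test on a two by two table (one degree of freedom, no continuity correction). This is only an illustration under stated assumptions: the function names and the allele counts below are hypothetical, and the text does not prescribe this particular statistic.

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square for a 2x2 table of allele counts (no continuity correction)."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    return numerator / denominator


def p_value_one_df(chi_square):
    """Upper-tail probability of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical counts: alleles A/B among "cases" (or transmitted alleles)
# and among "controls" (or non-transmitted alleles).
chi2 = allelic_chi_square(60, 40, 40, 60)
p = p_value_one_df(chi2)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

The same table layout applies to both sample designs mentioned above; only the source of the two rows differs.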
The main advantages of using SNPs for association studies are that SNPs are relatively easy to type (any technology allowing single base discrimination can be used, e.g., DNA chips or mass spectrometry), that SNPs are very abundant in the human genome (on average one SNP every 300–1000 bases), and that association allows for defining a relatively well-delimited genetic interval (usually several kilobases).
There are many disadvantages, however, associated with using SNPs for association studies. First, associations may only be detected at very high resolutions (an unsuitably high number of SNPs must be screened, probably >100,000). Second, as association cannot be postulated to exist a priori, the statistical rules for multiple testing apply (i.e., the result must be corrected for each additional SNP tested), resulting in an unsuitably high threshold for positive association when thousands of markers are tested; in other words, an inflation of false positive results is observed at nominal significance levels. Therefore, new statistical tools may be needed. Third, association tests are usually carried out as two by two tests (i.e., polymorphisms at a given locus are tested against a phenotype). Fourth, meta-analyses are difficult if not impossible to carry out for thousands of markers. Fifth, like linkage, association analysis is influenced by allele frequency. Sixth, integrated genetic maps for SNPs do not presently exist. Seventh, large sample collections are needed. And finally, current technology is too expensive to genotype thousands of samples for thousands of SNPs (PCR, costs of chip technology, instrumentation), and discrimination is still not reliable enough (e.g., Affymetrix SNP chip).
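The multiple-testing problem described in the second point can be made concrete with a small calculation. The sketch below uses the Bonferroni correction, which is only one common way of correcting for multiple tests and is an assumption here (the text does not name a specific procedure); the function names and the nominal level of 0.05 are likewise illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every test is null and none is corrected."""
    return alpha * n_tests


nominal_alpha = 0.05   # assumed nominal significance level
n_snps = 100_000       # order of magnitude quoted in the text

threshold = bonferroni_threshold(nominal_alpha, n_snps)
false_hits = expected_false_positives(nominal_alpha, n_snps)

print(f"corrected per-SNP threshold: {threshold:.1e}")
print(f"expected false positives without correction: {false_hits:.0f}")
```

With these assumed numbers, an uncorrected screen would be expected to yield on the order of 5,000 spurious associations, while the corrected per-SNP threshold falls to roughly 5 × 10^-7, which illustrates both sides of the trade-off described above.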
Accordingly, there is a need for improved or alternative genetic analysis methods that would overcome the drawbacks of these prior art technologies. In this regard, the ideal genotyping technology should be capable of detecting both linkage and association and, at the same time, avoid the disadvantages of these methods. It should be reliable, allow genome-wide analysis, be capable of restricting phenotype-linked loci to small intervals, be simple to perform and analyze, and be cost effective.
The genomic mismatch scanning (“GMS”) method appears to fulfill most of these requirements. Genomic mismatch scanning was developed in the “mismatch repair community”, which had little to do with the human linkage community trying to find the genes involved in human traits. More particularly, in 1993, Nelson et al. described a method that allowed for the detection and quantification of the relationship between different strains of yeast. Nelson et al., 61 Am J Hum Genet., 111–119 (1993). This method consists of mixing the DNAs from different yeast strains and destroying everything that is not identical using a set of mismatch repair enzymes. Apart from the research community working on mismatch repair, the article had no major impact. It seemed logical, however, that this technology could also be applied to detect identical regions in humans. In this regard, McAllister et al. published a proof-of-principle article in which they described the identification of a human disease locus on chromosome 11 using GMS. McAllister et al., 47 Genomics, 7–11 (1998).
Briefly, the method consists of the following steps: (1) the DNA from two individuals is restricted and one of the DNAs is labeled by methylation; (2) the two DNAs are mixed, thereby creating a mixture of heteroduplexes between the two DNAs, which are hemimethylated, and homoduplexes of the original DNAs derived through renaturation of each individual's DNA with itself. As the DNA of one individual was completely methylated and the other non-methylated, the resulting homoduplexes are also methylated or non-methylated; (3) the non-informative homoduplexes are eliminated by several enzymatic steps involving restriction enzymes that only digest fully methylated or fully unmethylated DNA and a final digestion of the DNA by Exo III nuclease; (4) the remaining heteroduplexes, which were formed between the DNAs from the two individuals, consist of a few fragments which are 100% identical in their sequence composition (the fragments of interest) and those which, due to the heterogeneity between individuals, show sequence differences (i.e., bases are mismatched at those sites); (5) the mismatched DNA fragments are eliminated by using an enzymatic DNA mismatch repair system consisting of three proteins (mut S, mut H, mut L) which recognize these mismatches and cut the DNA strands at a specific recognition sequence (GATC); and (6) the remaining 100% identical DNA heterohybrids can then be identified by specific PCR amplification, where the presence or absence of an amplification product is scored.
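The selection logic of steps (4) to (6) can be sketched as a toy model. This is a deliberate simplification and not a model of the enzymology: each individual is represented by a single hypothetical sequence per locus, homoduplex removal is assumed to have already taken place, and a mismatched heterohybrid is assumed to be eliminated only if it contains the GATC recognition sequence. All names and sequences below are invented for illustration.

```python
def score_heterohybrids(frags_a, frags_b, site="GATC"):
    """Toy digital read-out: True means a PCR product would be expected for the locus."""
    scores = {}
    for locus in sorted(frags_a.keys() & frags_b.keys()):
        seq_a, seq_b = frags_a[locus], frags_b[locus]
        identical = seq_a == seq_b
        # Assumption: the mismatch repair proteins can cut a mismatched
        # duplex only when the recognition sequence is present.
        cleavable = site in seq_a or site in seq_b
        # A fragment survives if it is identical, or if it is mismatched
        # but cannot be cut (the latter would be background signal).
        scores[locus] = identical or not cleavable
    return scores


# Hypothetical fragments for two individuals.
individual_a = {"locus1": "AAGATCTT", "locus2": "CCGATCGG", "locus3": "TTTTAAAA"}
individual_b = {"locus1": "AAGATCTT", "locus2": "CCGATCGA", "locus3": "TTTTAAAT"}

scores = score_heterohybrids(individual_a, individual_b)
for locus, present in scores.items():
    print(locus, "product" if present else "no product")
```

In this sketch, locus1 is scored as present because the fragments are identical, locus2 is eliminated because it is mismatched and carries GATC, and locus3 is scored as present despite a mismatch because it lacks the recognition sequence.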
There are many advantages of the method over the classical linkage and association studies. First, the method allows unambiguous detection of IBD fragments between individuals, as it is not dependent on allele frequencies or marker heterozygosity. Second, the method is not limited to the use of polymorphic markers. Any sequence can be used for scoring as long as some sequence and mapping information is available. No allele discrimination is necessary. The detection signal is digital (i.e., presence or absence of a fragment). Third, the detection method can be scaled to any density. Finally, due to the unambiguous IBD detection and independence of allele frequency, fewer individuals have to be screened (e.g., 100 sib-pairs give the same power to detect regions of linkage as 400–600 sib-pairs in the classical linkage analysis).
The classical GMS methodology, however, has some disadvantages that make its use as a routine tool for genetic screening difficult. First, the amount of DNA required for a single experiment is large due to the loss of material throughout the procedure. Usually 5 μg of DNA are needed. Depending on the extraction method, this often constitutes more than half the DNA available in a collection. Second, the methylation of one of the DNAs is not 100% efficient, i.e., some of the heteroduplexes cannot be distinguished and are lost, and some of the homoduplexes of the “methylated” individual's DNA will actually be hemimethylated after the hybridization step and therefore result in background at the detection level (as the DNA from one individual is a priori 100% identical with itself). Third, as Exo III nuclease digestion plays a central part in the technology, only restriction enzymes creating 3′ sticky ends can be used for the initial digestion of the DNA (typically Pst I is employed). These enzymes are rare, which restricts the choice of enzyme for the restriction of the DNA and therefore the constitution of the created fragments. Fourth, the procedure described involves multiple handling, tube changing and DNA precipitation steps. The precipitation steps in particular make the procedure cumbersome, error prone and unsuitable for automation, thereby restricting its routine use for the large sample cohorts typically needed for disease gene identification studies. Fifth, efficient recognition of non-identical, mismatched DNA sequences by the mut SHL system relies on the presence of the recognition sequence GATC in a given fragment. Absence of the sequence results in background signal due to non-eliminated mismatched DNA. Finally, the labeling of one of the DNAs by methylation allows only a two by two pair-wise comparison between different DNAs.
Indeed, there is a need in the art for genetic analysis techniques and compounds that are more convenient, easy to perform, reliable and applicable to broader populations of genetic material. Other objects, features and advantages of the present invention will become apparent from the following detailed description. The detailed description and the specific examples, however, indicate only preferred embodiments of the invention. Various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.