A general method is described for screening cDNAs, genes or genome segments to directly isolate and characterize sequences associated with particular phenotypes. In the case of the human genome, a simplification of the starting material is needed, and a specific method to generate highly polymorphic genome subsets for this purpose is presented. The general screening method identifies DNA sequences containing allele frequency differences when groups with dissimilar phenotypes are compared. The approach is based on mathematical principles of inequality. A change in the abundance ratio of homoduplexes of perfectly matched sequences to heteroduplexes of perfectly matched sequences, or, conversely, of mismatched homoduplexes to mismatched heteroduplexes, serves as an indicator of allele frequency difference.
This invention relates to screening complex reannealed DNA preparations to identify sequences exhibiting differences in allele frequency when phenotypically different groups are compared. The DNA material can originate from genes, genomes, or cDNA. For human applications, simplification of genomic DNA is needed, and a way to generate genome subsets is described. Genome subsets generated in this manner are enriched for polymorphic sequences and sufficiently reduced in complexity to allow reannealing, a prerequisite for the invention.
Most common diseases of humans are not inherited as single gene traits but rather result from complex interactions between one or several genes and the environment. Current methods of identifying disease genes under these circumstances are inefficient, and this is one of the major limitations of modern medical genetics. Promising methods for identifying genes affecting complex diseases are affected relative analysis, linkage disequilibrium analysis, and association studies. All require genotyping a very high density of markers for a genome-wide search, which makes these methods impractical with current genotyping techniques (1-3). These requirements make it difficult to rapidly identify genes affecting complex diseases.
The term “phenotype cloning” was developed to describe the isolation of genes by virtue of their effect and without requiring prior knowledge of their biochemical function or map position (4). Phenotype cloning methods are based on inferences about characteristics of the unknown gene(s), and these characteristics then form the basis to directly isolate the gene in question. An excellent example of an inference that can be made about disease genes to allow their direct isolation is the prediction that somatic mutations in tumors will sometimes result in the loss or generation of genomic restriction fragments. A method called representational difference analysis (RDA) has been developed by combining genomic representation, subtractive enrichment, and kinetic enrichment to detect short restriction fragments present in a “target” genome but not in another “driver” genome (5). This method has been used to directly isolate genetic elements associated with tumor formation (6). In addition, RDA has been used to detect autosomal recessive loci in F2 progeny from crossing two inbred strains of laboratory mice (7), but it lacks sufficient power to isolate fragments associated with inherited traits in outbred humans.
A second example of phenotype cloning, which would have many important applications, especially in identifying genes in linkage disequilibrium and hence genes affecting complex diseases, is the direct isolation of genomic sequences that are identical-by-descent (IBD) in a group of patients with the same disease. Sanda and Ford outlined the genetic basis for such methods in the case of autosomal dominant disease (8). They pointed out that genomic segments from two unrelated individuals should contain sequence differences due to polymorphisms. In contrast, IBD sequences that two relatives have in common would be identical, since the mutation rate in humans is very low. Lastly, segregation and recombination result in genomic IBD sequences among relatives becoming fewer and shorter with an increasing number of meioses separating the individuals. IBD sequences shared by distant relatives affected with the same genetic disease should therefore contain the disease gene.
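The shortening of IBD segments with genealogical distance can be illustrated with a rough calculation. The sketch below is not taken from the cited work; it assumes a textbook simplification in which crossovers fall along a chromosome as a Poisson process with one expected crossover per Morgan per meiosis, with no interference and with chromosome ends ignored. Under that assumption the distance from a given locus to the nearest crossover on either side, accumulated over m meioses, has a mean of 100/m centimorgans, so the shared segment surrounding the locus has an expected length of about 200/m centimorgans. The function name is hypothetical.

```python
def expected_ibd_segment_cm(meioses: int) -> float:
    """Approximate expected length (in cM) of the IBD segment around a locus.

    Assumption (not from the source text): crossovers follow a Poisson
    process with rate 1 per Morgan per meiosis, so the flank on each side
    of the locus is exponentially distributed with mean 100/meioses cM.
    """
    if meioses < 1:
        raise ValueError("meioses must be a positive integer")
    # Two independent flanks, each with mean 100/meioses cM.
    return 2.0 * 100.0 / meioses


if __name__ == "__main__":
    for m in (2, 4, 10, 20):
        print(m, "meioses:", expected_ibd_segment_cm(m), "cM")
```

In this simplified model the expected segment length is halved each time the number of separating meioses doubles, which is why sequences shared by distant affected relatives localize a disease gene more narrowly than sequences shared by close relatives.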
An example where isolation of IBD sequences could be used to identify disease genes is in autosomal recessive disease, if the patients come from a small, isolated, homogeneous population and the disease is unusually frequent in that population. In that setting one can assume that there is a founder effect and that the disease gene is IBD (FIG. 1). A common reason why disease genes are in linkage disequilibrium is that they are IBD, so an extension of this approach might allow identification of genes in linkage disequilibrium that affect common complex diseases in isolated homogeneous populations. However, not all patients with complex diseases would be homozygous for the disease gene. Rather, patients would be more likely to be heterozygotes or homozygotes for a gene predisposing to disease and, conversely, less likely to carry genes conveying a protective effect. Methods capable of detecting quantitative differences in allele frequencies, i.e., allele frequency differences (AFD) between patients and normal controls, are therefore essential in studying the genetics of complex disease.
In 1993 Nelson and associates described a “genomic mismatch scanning” method to directly identify IBD sequences in yeast (9). They used S. cerevisiae hybrids as a model system and showed that sequences shared by two independently generated hybrids from the same parent strains could be identified in many instances. Experiments of this kind are much easier to do in yeast than in humans. The yeast genome is 250 times simpler than the human genome, it contains far fewer repetitive sequences, and the genomic sequences of two yeast strains differ more than the genomes of unrelated humans. It has thus far not been possible to do comparable experiments with human genomic DNA. To do so, one needs methods that reproducibly generate simplified but highly polymorphic representations of the human genome. Pooling techniques based on the mathematical principles outlined below are also essential to identify IBD sequences as well as other sequences showing allele frequency differences (AFD).
The human genome is enormously long (3×10^9 base pairs), and it is far too complex for efficient reannealing of homologous DNA strands after denaturation. The rate of annealing of a mixture of nucleic acid fragments in liquid phase is inversely proportional to their complexity. Efforts have therefore been made to generate simplified representations of the genome for genetic methods based on cross-hybridization of homologous sequences from different genomes. The exact degree of simplification of human genomic DNA needed to achieve efficient annealing depends on the conditions of hybridization, including total DNA concentration, hybridization buffer, and temperature. In general, a 10- to 100-fold simplification is needed for efficient annealing to occur at high DNA concentrations in high-salt aqueous solutions (5).
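The effect of simplification on annealing can be illustrated with ideal second-order reassociation kinetics, in which the fraction of DNA remaining single-stranded after time t is 1/(1 + k·C0·t) and the rate constant k is inversely proportional to sequence complexity. The sketch below is an illustration under that idealization only; the rate constant, concentration, and function names are arbitrary hypothetical values chosen to show the scaling, not measured parameters from the method.

```python
def rate_constant(k_full_genome: float, simplification_fold: float) -> float:
    """Rate constant after reducing complexity by `simplification_fold`.

    Because k is inversely proportional to complexity, an N-fold
    simplification multiplies k by N (at the same total DNA concentration).
    """
    return k_full_genome * simplification_fold


def fraction_single_stranded(c0: float, k: float, t: float) -> float:
    """Ideal second-order kinetics: C/C0 = 1 / (1 + k * C0 * t)."""
    return 1.0 / (1.0 + k * c0 * t)


def half_reannealing_time(c0: float, k: float) -> float:
    """Time at which half of the strands have reannealed: t_1/2 = 1 / (k * C0)."""
    return 1.0 / (k * c0)


if __name__ == "__main__":
    c0 = 1.0      # hypothetical total DNA concentration (arbitrary units)
    k_full = 1.0  # hypothetical rate constant for unsimplified DNA
    for fold in (1, 10, 100):
        k = rate_constant(k_full, fold)
        print(f"{fold}-fold simplification: t_1/2 =", half_reannealing_time(c0, k))
```

Under these assumptions a 100-fold simplification shortens the half-reannealing time 100-fold at a fixed total DNA concentration, which is consistent with the 10- to 100-fold range cited above.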
Ideal representations for cross-hybridization studies on human material should therefore simplify genomic DNA at least 10- to 100-fold. They should contain sequences representing many thousands of different loci that are evenly distributed throughout the genome. In addition, the representations should be enriched for highly polymorphic sequences to facilitate genetic studies. Lastly, one should be able to easily and reproducibly generate equivalent representations from genomes of different individuals.
It is an object of the invention to provide highly polymorphic representations of the human genome.
It is another object of the invention to provide a widely applicable method for phenotype cloning based on allele frequency differences.
These and other objects are accomplished by the present invention, which provides genomic DNA fragments that are enriched for polymorphic sequences and sufficiently reduced in overall complexity to permit effective reannealing of homologous segments so that they can be used in detecting allele frequency differences as well as in genomic mismatch scanning. In a typical method of the invention, DNA sequences in cDNAs, genes or genomic segments are screened to isolate and characterize sequences associated with particular phenotypes by comparing the abundance of homoduplexes of perfectly matched sequences in the sample with heteroduplexes of perfectly matched sequences, or comparing the abundance of mismatched homoduplexes in the sample with mismatched heteroduplexes. As described hereafter, other genomic subsets are also suitable for allele frequency difference screening.
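The inequality behind this comparison can be illustrated with a minimal numerical sketch. It rests on simplifying assumptions that are not stated in the text: a single locus with two alleles, two pools mixed in equal amounts, and random pairing of strands on reannealing. Here a homoduplex is a duplex whose strands come from the same pool and a heteroduplex is one whose strands come from different pools. With allele frequencies p1 and p2 in the two pools, the ratio of perfectly matched homoduplexes to perfectly matched heteroduplexes equals 1 when p1 = p2 and exceeds 1 otherwise, because p1² + p2² ≥ 2·p1·p2. The function name and the frequencies used are hypothetical.

```python
def matched_duplex_ratio(p1: float, p2: float) -> float:
    """Ratio of perfectly matched homoduplexes to perfectly matched heteroduplexes.

    Assumptions (illustrative only): one biallelic locus, equal mixing of
    two pools, random strand pairing. p1 and p2 are the frequencies of
    allele A in pool 1 and pool 2.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    q1, q2 = 1.0 - p1, 1.0 - p2
    # Both strands from pool 1 (prob. 1/4) or both from pool 2 (prob. 1/4),
    # and both strands carry the same allele.
    homo_matched = 0.25 * (p1 * p1 + q1 * q1) + 0.25 * (p2 * p2 + q2 * q2)
    # Strands from different pools (prob. 1/2) carrying the same allele.
    hetero_matched = 0.5 * (p1 * p2 + q1 * q2)
    if hetero_matched == 0.0:
        return float("inf")  # pools fixed for different alleles
    return homo_matched / hetero_matched


if __name__ == "__main__":
    for p1, p2 in ((0.5, 0.5), (0.3, 0.3), (0.6, 0.4), (0.9, 0.1)):
        print(p1, p2, matched_duplex_ratio(p1, p2))
```

A ratio above 1 in this toy model flags a locus whose allele frequency differs between the pools, and the departure from 1 grows with the size of the difference.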
In the practice of a method of the invention, DNA sequences from complex DNA sample pools, where allele frequency differs between the pools, are identified by mixing at least two different complex DNA samples together to generate a pool, annealing specific adaptors to DNA fragments in the pools, removing excess adaptors that are not ligated to the DNA fragments, mixing at least two different pools together, denaturing the mixed pools of DNA samples, reannealing the denatured pools of DNA samples to obtain DNA duplexes containing homologous strands, separating perfectly matched DNA duplexes in the pools from duplexes containing mismatched base pairs or insertion/deletion loops, and selectively amplifying either perfectly matched or mismatched DNA homoduplexes and heteroduplexes.
In one embodiment, a highly polymorphic subset of human genome DNA is generated by (a) digesting a genomic DNA sample with a restriction enzyme to obtain genomic DNA fragments; (b) ligating adaptors capable of binding to ends of the genomic fragments; (c) removing excess adaptors that are not ligated to genomic DNA fragments obtained in step (b); (d) subjecting the genomic DNA-adaptor preparation so produced to a controllable initiating reaction for a PCR reaction in the presence of a DNA primer complementary to the consensus sequence for the 3′ end of Alu repeat sequences in the sample, and then a PCR amplification with Alu 3′ end primer and 5′ adaptor primer; (e) digesting the products produced in the amplification reaction of step (d) with restriction enzymes with cognate sequences built into the primer sequences to generate DNA fragments with asymmetric overhangs; and (f) isolating DNA subsets exhibiting selectively amplified sequences flanking 3′ ends of Alu repeats produced in step (e).