This invention relates to methods for reducing the complexity of DNA mixtures, subsequent analysis of genetic variations, and isolation of probes or clones of regions of interest.
In 1993 Nelson and associates described a “genomic mismatch scanning” (GMS) method to directly identify identical-by-descent (IBD) sequences in yeast (Nelson, S. F., et al., Nature Genetics, 1993, 4:11-18; this and other papers, books and patents cited herein are expressly incorporated in their entireties by reference). The method allows DNA fragments from IBD regions between two relatives to be isolated based on their ability to form mismatch-free hybrid molecules. The method consists of digesting DNA fragments from two sources with a restriction endonuclease that produces protruding 3′-ends. The protruding 3′-ends provide some protection from exonuclease III (Exo III), which is used in later steps. The two sources are distinguished by methylating the DNA from only one source. Molecules from both sources are denatured and reannealed, resulting in the formation of four types of duplex molecules: homohybrids formed from strands derived from the same source and heterohybrids consisting of DNA strands from different sources. Heterohybrids can either be mismatch-free or contain base-pair mismatches, depending on the extent of identity of homologous regions.
Homohybrids are distinguished from heterohybrids by use of restriction endonucleases that cleave fully methylated or unmethylated GATC sites. Homohybrids are cleaved into smaller duplex molecules. Heterohybrids containing a mismatch are distinguished from mismatch-free molecules by use of the E. coli methyl-directed mismatch repair system. The combination of three proteins of the methyl-directed mismatch repair system, MutS, MutL, and MutH (herein collectively called MutSLH), along with ATP, introduces a single-strand nick on the unmethylated strand at GATC sites in duplexes that contain a mismatch (Welsh, et al., J. Biol. Chem., 1987, 262:15624). Heterohybrids that do not contain a mismatch are not nicked. All molecules are then subjected to digestion by Exo III, which can initiate digestion at a nick, a blunt end, or a recessed 3′-end, to produce single-stranded gaps. Only mismatch-free heterohybrids are not subject to attack by Exo III; all other molecules have single-stranded gaps introduced by the enzyme. Molecules with single-stranded regions are removed by adsorption to benzoylated naphthoylated DEAE cellulose. The remaining molecules consist of mismatch-free heterohybrids which may represent regions of IBD.
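The selection logic of GMS described above can be summarized as a small decision function. The following is only an illustrative sketch of the stated rules, not a model of the biochemistry; the `Duplex` class, its field names, and the source labels "A" and "B" are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Duplex:
    """A reannealed duplex (hypothetical representation for illustration)."""
    top_source: str      # "A" = methylated source, "B" = unmethylated source
    bottom_source: str
    has_mismatch: bool


def survives_gms(duplex: Duplex) -> bool:
    """Return True if the duplex is expected to remain fully double-stranded."""
    is_heterohybrid = duplex.top_source != duplex.bottom_source
    if not is_heterohybrid:
        # Homohybrids are fully methylated or fully unmethylated at GATC sites,
        # so they are cleaved by the endonucleases and then attacked by Exo III.
        return False
    if duplex.has_mismatch:
        # MutSLH nicks the unmethylated strand; Exo III digests from the nick.
        return False
    # Mismatch-free heterohybrid: no nick and protected 3'-ends.
    return True


pool = [
    Duplex("A", "A", False),
    Duplex("B", "B", False),
    Duplex("A", "B", True),
    Duplex("A", "B", False),
]
survivors = [d for d in pool if survives_gms(d)]
```

Under these rules, only the last duplex in `pool` is retained, corresponding to a candidate IBD fragment.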
Nelson, et al., used S. cerevisiae hybrids as a model system and showed that sequences shared by two independently generated hybrids from the same parent strains could be identified in many instances. Experiments of this kind are much easier to do in yeast than in humans. The yeast genome is 250 times simpler than the human genome, it contains far fewer repetitive sequences, and genomic sequences of two yeast strains differ more than genomes of unrelated humans. It has thus far not been possible to do comparable experiments with human genomic DNA. In order to do so one needs to use methods to reproducibly generate simplified but highly polymorphic representations of the human genome. Pooling techniques based on mathematical principles are also essential to identify IBD sequences as well as other sequences showing allele frequency differences (AFD) (Shaw, S. H., et al., Genome Research, Cold Spring Harbor Laboratory Press, 1998, 8:111-123).
The human genome is enormously long, at 3×10^9 base pairs, and it is far too complex for efficient reannealing of homologous DNA strands after denaturation. The rate of annealing of a mixture of nucleic acid fragments in liquid phase is inversely proportional to the square of their complexity. Efforts have therefore been made to generate simplified representations of the genome for genetic methods based on cross hybridization of homologous sequences from different genomes. The exact degree of simplification of human genomic DNA needed to achieve efficient annealing depends on the conditions of hybridization including total DNA concentration, hybridization buffer, and temperature. In general a 10-100 fold simplification is needed for efficient annealing to occur at high DNA concentrations in high salt aqueous solutions (Lisitsyn, N. A., et al., Science, 1993, 259:946-951).
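To make the arithmetic concrete, the sketch below takes the relationship stated above (annealing rate inversely proportional to the square of complexity) at face value and computes the reduced complexity and the relative rate gain for a given fold simplification. The function names are hypothetical, and actual rates also depend on the hybridization conditions noted above.

```python
HUMAN_GENOME_BP = 3e9  # genome size quoted in the text


def reduced_complexity(genome_bp: float, fold: float) -> float:
    """Complexity (in base pairs) remaining after a fold-simplification."""
    return genome_bp / fold


def relative_annealing_rate(fold: float) -> float:
    """Rate gain relative to the unsimplified genome.

    Assumes rate is proportional to 1 / complexity**2, as stated in the
    text, so a fold-simplification multiplies the rate by fold**2.
    """
    return fold ** 2


for fold in (10, 100):
    print(fold, reduced_complexity(HUMAN_GENOME_BP, fold),
          relative_annealing_rate(fold))
```

Under this assumption, a 10-fold simplification leaves 3×10^8 base pairs of complexity and corresponds to a 100-fold rate gain.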
In some embodiments of the invention, DNA sequences of interest are replicated in rolling circle amplification reactions (RCA). RCA is an isothermal amplification reaction in which a DNA polymerase extends a primer on a circular template (Kornberg, A. and Baker, T. A., DNA Replication, W. H. Freeman, New York, 1991). The product consists of tandemly linked copies of the complementary sequence of the template. RCA can be used as a DNA amplification method (Fire, A. and Si-Qun Xu, Proc. Natl. Acad. Sci. USA, 1991, 92:4641-4645; Lui, D., et al., J. Am. Chem. Soc., 1995, 118:1587-1594; Lizardi, P. M., et al., Nature Genetics, 1998, 19:225-232). RCA can also be used in a detection method using a probe called a “padlock probe” (Nilsson, M., et al., Nature Genetics, 1997, 16:252-255).
It would be useful to have superior ways of analyzing human DNA and other complex DNA samples.
A general method for screening genomic or cDNA, or fragments and mixtures thereof, involves sample simplification by the generation of subsets and then subjecting the subsets to mismatch scanning procedures. Any given DNA sequence will be represented in one and only one subset, minimizing the number of subsets required to detect a sequence of interest and guaranteeing that all possible sequences can potentially be covered by analyzing all possible subsets. The complexity of DNA sequences is reduced by attaching adapters to the ends of DNA fragments that allow the specific subsets of DNA to be selected and amplified. In some procedures, subsets are generated by replicating DNA in a polymerase chain reaction (PCR) or single primer extension reactions using primers that are complementary to sequences in the adapter and which, at the 3′-end, are complementary to a subset of sequences in the genomic or cDNA.
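The partitioning idea can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: each fragment is represented only by the insert sequence immediately adjacent to one adapter, and a primer with `n` selective 3′ bases is treated as amplifying exactly those fragments whose first `n` insert bases match. The function names and sequences are hypothetical.

```python
from collections import defaultdict
from itertools import product


def all_subset_keys(n: int) -> list[str]:
    """Every possible run of n selective bases, i.e. 4**n subsets."""
    return ["".join(bases) for bases in product("ACGT", repeat=n)]


def assign_subsets(fragments: list[str], n: int) -> dict[str, list[str]]:
    """Group fragments by the n insert bases next to the adapter.

    Each fragment has exactly one such prefix, so it lands in exactly
    one subset and the subsets together cover every fragment.
    """
    subsets = defaultdict(list)
    for fragment in fragments:
        subsets[fragment[:n]].append(fragment)
    return dict(subsets)


fragments = ["ACGTTT", "ACCCCC", "TGAAAA"]  # made-up insert sequences
subsets = assign_subsets(fragments, 2)
```

With two selective bases there are 16 possible subsets, and each subset is on average 16-fold less complex than the starting mixture if the prefixes are evenly distributed.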
In another version of this method, DNA fragments are generated by cutting with a restriction endonuclease, such as BslI, that generates variable overhangs for which some of the nucleotides can have any of 2 to 4 of the bases A, C, G, or T. In this case, subsets are generated by ligating adapters to the fragment ends that have a specific sequence in the overhang and a primer binding site unique for each adapter. For either of the above methods, Y-shaped adapters can be used having a region of non-complementary single-stranded DNA at the end. Therefore, following ligation, the DNA fragment-plus-adapter construct has the non-complementary region at its ends. Use of Y-shaped adapters makes it possible to generate non-overlapping subsets such that a given DNA fragment will only be represented in one of the possible subsets.
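The number of distinct adapters needed for such variable overhangs follows directly from how many bases are allowed at each overhang position. The sketch below enumerates the possible overhangs and finds the adapter overhang that would pair with a given fragment overhang. The overhang length, the allowed bases per position, and the function names are illustrative assumptions rather than properties of any particular enzyme.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def adapter_overhangs(allowed_per_position: list[str]) -> list[str]:
    """Enumerate every overhang given the bases allowed at each position."""
    return ["".join(bases) for bases in product(*allowed_per_position)]


def matching_adapter_overhang(fragment_overhang: str) -> str:
    """Adapter overhang that base-pairs with the fragment overhang."""
    return reverse_complement(fragment_overhang)


# Hypothetical three-base overhang with any base at each position.
overhangs = adapter_overhangs(["ACGT", "ACGT", "ACGT"])
```

A fully degenerate three-base overhang would call for 64 adapters, while restricting a position to two bases halves the count for that position.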
Procedures are given for isolating selected subsets from other, contaminating DNAs by using primers that have attached chemical moieties that can be captured on beads, columns, and the like. In some cases, the DNA is then released by cutting specifically designed sequences in the primers with restriction endonucleases. Fragment DNA is protected from these restriction endonucleases by methylation. The DNA subsets obtained are sufficiently reduced in complexity to allow improved analysis of sequence polymorphism by mismatch scanning procedures. Procedures are given for selecting DNA fragments representing regions of low polymorphism or for generating fragments depleted for regions of low polymorphism.
In some embodiments, the DNA fragments are replicated in a rolling circle amplification procedure (RCA; see reviews by Hingorani, M. M., and O'Donnell, M., Current Biology, 1998, 8:R83-86 and by Kelman, Z., et al., Structure, 1998, 6:121-5). The DNA polymerase III holoenzyme (hereafter sometimes denoted DNA pol III) is used in most of these methods to increase the rate and processivity of primer extension. DNA pol III also improves the ability to replicate through a DNA region of high GC content or other obstructions that tend to block DNA polymerases.
A method is also given for selecting heterohybrid DNA that contains one DNA strand derived from each of two different samples or homohybrids in which the DNA strands from different samples have not been recombined. Each DNA sample may consist of some concentration of a unique DNA fragment, or a mixture of fragments, and each sample may be derived from a single individual or more than one individual. The different DNA samples are mixed together, denatured, and then reannealed. Some of the DNA strands will reanneal back together with another strand from the same DNA sample forming a homohybrid. Other DNA strands will reanneal with a DNA strand from a different sample forming a heterohybrid. Adapters attached to the ends of the fragments are designed to allow the selective isolation of homohybrid or heterohybrid DNA. In one method, restriction endonuclease recognition sites are present in the adapters such that homohybrid or heterohybrid DNA can be selectively eliminated depending on the ability of the restriction endonuclease to cut the DNA.
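As a rough guide to how much heterohybrid DNA to expect after mixing, denaturing, and reannealing two samples, the sketch below assumes that strands of a given fragment pair at random regardless of their sample of origin. That random-pairing assumption and the function name are introduced here for illustration only; actual yields depend on the annealing conditions.

```python
def expected_hybrid_fractions(fraction_a: float) -> dict[str, float]:
    """Expected duplex fractions after reannealing a two-sample mixture.

    fraction_a is the share of the fragment's strands contributed by
    sample A; sample B contributes the rest. Random pairing is assumed.
    """
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must be between 0 and 1")
    fraction_b = 1.0 - fraction_a
    return {
        "homohybrid_A": fraction_a * fraction_a,
        "homohybrid_B": fraction_b * fraction_b,
        "heterohybrid": 2.0 * fraction_a * fraction_b,
    }


equal_mix = expected_hybrid_fractions(0.5)
```

For equal amounts of the two samples, half of the reannealed duplexes are expected to be heterohybrids under this assumption, which is the maximum attainable for any mixing ratio.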