The following is a discussion of relevant art, none of which is admitted to be prior art to the appended claims.
Mapping of genetic differences between individuals is of growing importance for both forensic and medical applications. For example, DNA "fingerprinting" methods are being applied for identification of perpetrators of crimes where even small amounts of blood or sperm are available for analysis. Biological parents can also be identified by comparing DNAs of a child and a suspected parent using such means. Further, a number of inherited pathological conditions may be diagnosed before onset of symptoms, even in utero, using methods for structural analyses of DNA. Finally, it is notable that a major international effort to physically map and, ultimately, to determine the sequence of bases in the DNA encoding the entire human genome is now underway and gaining momentum in both institutional and commercial settings.
DNA molecules are linear polymers of subunits called nucleotides. Each nucleotide comprises a common cyclic sugar molecule, which in DNA is linked by phosphate groups on opposite sides to the sugars of adjoining nucleotides, and one of several cyclic substituents called bases. The four bases commonly found in DNAs from natural sources are adenine, guanine, cytosine and thymine, hereinafter referred to as A, G, C and T, respectively. The linear sequence of these bases in the DNA of an individual encodes the genetic information that determines the heritable characteristics of that individual.
In double-stranded DNA, such as occurs in the chromosomes of all cellular organisms, the two DNA strands are entwined in a precise helical configuration with the bases projecting inward and so aligned as to allow interactions between bases from opposing strands. The two strands are held together in precise alignment mainly by hydrogen bonds which are permitted between bases by a complementarity of structures of specific pairs of bases. This structural complementarity is determined by the chemical natures and locations of substituents on each of the bases. Thus, in double-stranded DNA, normally each A on one strand pairs with a T from the opposing strand, and, likewise, each G with an opposing C.
When a cell undergoes reproduction, its DNA molecules are replicated and precise copies are passed on to its descendants. The linear base sequence of a DNA molecule is maintained in the progeny during replication in the first instance by the complementary base pairings which allow each strand of the DNA duplex to serve as a template to align free nucleotides with its polymerized nucleotides. The complementary nucleotides so aligned are biochemically polymerized into a new DNA strand with a base sequence that is entirely complementary to that of the template strand.
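By way of illustration only, the templating rule described above can be sketched in a few lines of Python. The names `PAIRING` and `synthesize_complement` are hypothetical and are introduced solely for this sketch. The sketch also assumes the standard convention, not stated above, that the two strands of a duplex run in opposite directions, so that the new strand is written in reverse order relative to its template.

```python
# Illustrative sketch only; all names here are hypothetical.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize_complement(template: str) -> str:
    """Return the strand templated by `template`.

    Assumes antiparallel strands, so the paired bases are emitted in
    reverse order relative to the template.
    """
    return "".join(PAIRING[base] for base in reversed(template.upper()))


new_strand = synthesize_complement("ATGCCG")
# Copying the new strand in turn regenerates the original sequence,
# which is why each strand can serve as a template for the other.
regenerated = synthesize_complement(new_strand)
```

In this sketch, two successive rounds of templating return the starting sequence, which mirrors the way the base sequence is maintained in the progeny.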
Occasionally, an incorrect base pairing does occur during replication, which, after further replication of the new strand, results in a double-stranded DNA offspring with a sequence containing a heritable single base difference from that of the parent DNA molecule. Such heritable changes are called genetic mutations, or more particularly in the present case, "single base pair" or "point" mutations. The consequences of a point mutation may range from negligible to lethal, depending on the location and effect of the sequence change in relation to the genetic information encoded by the DNA.
The bases A and G are of a class of compounds called purines, while T and C are pyrimidines. Whereas the normal base pairings in DNA (A with T, G with C) involve one purine and one pyrimidine, the most common single base mutations involve substitution of one purine or pyrimidine for the other (e.g., A for G or C for T or vice versa), a type of mutation referred to as a "transition". Mutations in which a purine is substituted for a pyrimidine, or vice versa, occur less frequently and are called "transversions". Still less common are point mutations comprising the addition or loss of a small number (1, 2 or 3) of nucleotides arising in one strand of a DNA duplex at some stage of the replication process. Such mutations are called small "insertions" or "deletions", respectively, and are also known as "frameshift" mutations in the case of insertion or deletion of one or two nucleotides, due to their effects on translation of the genetic code into proteins. Mutations involving larger sequence rearrangements also occur and can be important in medical genetics, but they are relatively rare compared to the classes summarized above.
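The classification just described may be sketched as follows, purely for illustration. The function names are hypothetical. The frameshift test assumes the standard three-base codon of the genetic code, which the paragraph implies but does not state.

```python
# Illustrative sketch only; function names are hypothetical.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_substitution(original: str, replacement: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    original, replacement = original.upper(), replacement.upper()
    bases = PURINES | PYRIMIDINES
    if original not in bases or replacement not in bases:
        raise ValueError("bases must be one of A, G, C, T")
    if original == replacement:
        raise ValueError("identical bases are not a substitution")
    # Transition: both bases belong to the same chemical class.
    same_class = (original in PURINES) == (replacement in PURINES)
    return "transition" if same_class else "transversion"


def shifts_reading_frame(indel_length: int) -> bool:
    """True if an insertion or deletion of this length alters the reading frame.

    Assumes three-base codons (an assumption of this sketch).
    """
    if indel_length <= 0:
        raise ValueError("indel length must be positive")
    return indel_length % 3 != 0
```

For example, `classify_substitution("A", "G")` gives a transition and `classify_substitution("A", "C")` gives a transversion, while an insertion of one or two nucleotides shifts the reading frame and an insertion of three does not.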
Mapping of genetic mutations involves both the detection of sequence differences between DNA molecules comprising substantially identical (i.e., homologous) base sequences, and also the physical localization of those differences within some subset of the sequences in the molecules being compared. In principle, it is possible to both detect and localize limited genetic differences, including point mutations within genetic sequences of two individuals, by directly comparing the sequences of the bases in their DNA molecules.
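A minimal sketch of such a direct comparison is given here for illustration. It assumes that the two sequences have already been aligned and are of equal length, so that only base substitutions are reported; the function name `find_differences` is hypothetical.

```python
# Illustrative sketch only; assumes pre-aligned sequences of equal length.
from typing import List, Tuple


def find_differences(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (position, base_in_a, base_in_b) for every differing position.

    Positions are 0-based. Insertions and deletions are not handled.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(seq_a, seq_b))
        if base_a != base_b
    ]


differences = find_differences("GATTACA", "GACTACA")
```

The returned list both detects the difference (it is non-empty) and localizes it (each entry carries its position), which are the two aspects of mapping distinguished above.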
Other methods for detecting differences between DNA sequences have been developed. For example, some pairs of single-stranded DNA fragments with sequences differing in a single base may be distinguished by their different migration rates in electric fields, as in denaturing gradient gel electrophoresis.
DNA restriction systems found in bacteria, for example, comprise proteins which generally recognize specific sequences in double-stranded DNA composed of 4 to 6 or more base pairs. In the absence of certain modifications (e.g., a covalently attached methyl group) at definite positions within the restriction recognition sequence, endonuclease components of the restriction system will cleave both strands of a DNA molecule at specific sites within or near the recognition sequence. Such short recognition sequences occur by chance in all natural DNA sequences, once in every few hundred or thousand base pairs, depending on the recognition sequence length. Thus, digestion of a DNA molecule with various restriction endonucleases, followed by analyses of the sizes of the resulting fragments (e.g., by gel electrophoresis), may be used to generate a physical map ("fingerprint") of the locations in a DNA molecule of selected short sequences.
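The following sketch, offered only as an illustration, shows how a fragment-size pattern arises from the positions of a recognition sequence and why the expected spacing depends on its length. The function names are hypothetical. The sketch assumes a linear molecule represented by one strand, exact matching of the recognition sequence, and, for the spacing estimate, a random sequence in which the four bases are equally frequent. The example uses the well-known EcoRI site GAATTC, which is cut after the first G.

```python
# Illustrative sketch only; names and the toy sequence are hypothetical.
from typing import List


def expected_site_spacing(site_length: int) -> int:
    """Average spacing, in base pairs, of one specific site of this length.

    Assumes random sequence with equal frequencies of A, G, C and T.
    """
    return 4 ** site_length


def digest(sequence: str, site: str, cut_offset: int) -> List[int]:
    """Return fragment lengths after cutting a linear molecule at every site.

    `cut_offset` is the number of bases from the start of the site to the cut.
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


fragments = digest("AAGAATTCGGGAATTCTT", "GAATTC", cut_offset=1)
```

Under these assumptions a 4-base site is expected about once per 256 base pairs and a 6-base site about once per 4096 base pairs, consistent with the range of a few hundred to a few thousand base pairs mentioned above.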
Comparisons of such restriction maps of two homologous DNA sequences can reveal differences within those specific sequences that are recognized by those restriction enzymes used in the available maps. Restriction map comparisons may localize any detectable differences within limits defined ultimately by the resolving power of DNA fragment size determination, essentially within about the length of the restriction recognition sequence under certain conditions of gel electrophoresis.
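As a hedged illustration of such a comparison, the sketch below contrasts two lists of fragment sizes and reports the fragments found in only one of them. The function name and the fragment sizes are hypothetical values chosen for the example, and the sketch assumes that fragment sizes are measured exactly, which gel electrophoresis does not achieve in practice.

```python
# Illustrative sketch only; fragment sizes below are invented for the example.
from collections import Counter
from typing import List, Tuple


def compare_fragment_patterns(
    fragments_a: List[int], fragments_b: List[int]
) -> Tuple[List[int], List[int]]:
    """Return the fragment sizes unique to each of two digests."""
    count_a, count_b = Counter(fragments_a), Counter(fragments_b)
    only_in_a = sorted((count_a - count_b).elements())
    only_in_b = sorted((count_b - count_a).elements())
    return only_in_a, only_in_b


# Molecule B has lost one recognition site, so two adjacent fragments
# of molecule A (8 and 7) appear in B as a single fragment of 15.
only_in_a, only_in_b = compare_fragment_patterns([3, 8, 7], [3, 15])
```

A difference of this kind indicates that the two molecules differ somewhere within one recognition sequence, but it does not by itself identify which base within that sequence has changed.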
In practice, selected heritable differences in restriction fragment lengths (i.e., restriction fragment length polymorphisms, "RFLP"s) have been extremely useful, for instance, for generating physical maps of the human genome on which genetic defects may be located with a relatively low precision of hundreds or, sometimes, tens of thousands of base pairs. Typically, RFLPs are detected in human DNA isolated from small tissue or blood samples by using radioactively labeled DNA fragments complementary to the genes of interest. These "probes" are allowed to form DNA duplexes with restriction fragments of the human DNA after separation by electrophoresis, and the resulting radioactive duplex fragments are visualized by exposure to photographic (e.g., X-ray sensitive) film, thereby allowing selective detection of only the relevant gene sequences amid the myriad of others in the genomic DNA.
When the search for DNA sequence differences can be confined to specific regions of known sequence, the recently developed "polymerase chain reaction" ("PCR") technology can be used. Briefly, this method utilizes short DNA fragments complementary to sequences on either side of the location to be analyzed to serve as points of initiation for DNA synthesis (i.e., "primers") by purified DNA polymerase. The resulting cyclic process of DNA synthesis results in massive biochemical amplification of the sequences selected for analysis, which then may be easily detected and, if desired, further analyzed, for example, by restriction mapping or direct DNA sequencing methods. In this way, selected regions of a human gene comprising a few kbp may be amplified and examined for sequence variations.
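The selection of a region by a primer pair can be sketched as follows, for illustration only. The function names and the toy sequences are hypothetical. The sketch assumes exact primer matching against a single given strand, and the copy-number estimate assumes ideal doubling in every cycle, an idealization that is not stated in the text and is not reached in practice.

```python
# Illustrative sketch only; names and sequences are hypothetical.
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the complementary strand, written in the opposite direction."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def amplified_region(
    template: str, forward_primer: str, reverse_primer: str
) -> Optional[str]:
    """Return the stretch of `template` bounded by the two primers, or None.

    The forward primer matches the given strand; the reverse primer is
    complementary to it, so its binding site is its reverse complement.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    site_start = template.find(reverse_site, start + len(forward_primer))
    if site_start == -1:
        return None
    return template[start:site_start + len(reverse_site)]


def ideal_copies(starting_copies: int, cycles: int) -> int:
    """Copy number after `cycles` rounds, assuming perfect doubling."""
    return starting_copies * 2 ** cycles


product = amplified_region("TTTACGTACCCGGGTTTAAACCCTTT", "ACGTAC", "GGGTTT")
```

Under the ideal-doubling assumption, 30 cycles starting from a single molecule would yield about one billion copies, which conveys the scale of the amplification described above.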
Another known method for detecting and localizing single base differences within homologous DNA molecules involves the use of a radiolabeled RNA fragment with base sequence complementary to one of the DNAs and a nuclease that recognizes and cleaves single-stranded RNA. The structure of RNA is highly similar to DNA, except for a different sugar and the presence of uracil (U) in place of T; hence, RNA and DNA strands with complementary sequences can form helical duplexes ("DNA:RNA hybrids") similar to double-stranded DNA, with base pairing between A's and U's instead of A's and T's. It is known that the enzyme ribonuclease A ("RNase A") can recognize some single pairs of mismatched bases (i.e., "base mispairs") in DNA:RNA hybrids and can cleave the RNA strand at the mispair site. Analysis of the sizes of the products resulting from RNase A digestion allows localization of single base mismatches, potentially to the precise sequence position, within lengths of homologous sequences determined by the limits of resolution of the RNA sizing analysis (Myers, R. M. et al., 1985, Science, 230, 1242-1246). RNA sizing is performed in this method by standard gel electrophoresis procedures used in DNA sequencing.
S1 nuclease, an endonuclease specific for single-stranded nucleic acids, can recognize and cleave limited regions of mismatched base pairs in DNA:DNA or DNA:RNA duplexes. In practice, however, a mismatch of at least about 4 consecutive base pairs is generally required for recognition and cleavage of a duplex by S1 nuclease.
Ford et al. (U.S. Pat. No. 4,794,075) disclose a chemical modification procedure to detect and localize mispaired guanines and thymidines and to fractionate a pool of hybrid DNA from two samples obtained from related individuals. Carbodiimide is used to specifically derivatize unpaired G's and T's, which remain covalently associated with the DNA helix.
The present invention concerns use of proteins that function biologically to recognize mismatched base pairs in double-stranded DNA (and, therefore, are called "mispair recognition proteins") and their application in defined systems for detecting and mapping point mutations in DNAs. Accordingly, it is an object of the present invention to provide methods for using such mispair recognition proteins, alone or in combination with other proteins, for detecting and localizing base pair mismatches in duplex DNA molecules, particularly those DNAs comprising several kbp, and manipulating molecules containing such mismatches. Additionally, it is an object of this invention to develop modified forms of mispair recognition proteins to further simplify methods for identifying specific bases which differ between DNAs. The following is a brief outline of the art regarding mispair recognition proteins and systems, none of which is admitted to be prior art to the present invention.
Enzymatic systems capable of recognition and correction of base pairing errors within the DNA helix have been demonstrated in bacteria, fungi and mammalian cells, but the mechanisms and functions of mismatch correction are best understood in Escherichia coli. One of the several mismatch repair systems that have been identified in E. coli is the methyl-directed pathway for repair of DNA biosynthetic errors. The fidelity of DNA replication in E. coli is enhanced 100-1000 fold by this post-replication mismatch correction system. This system processes base pairing errors within the helix in a strand-specific manner by exploiting patterns of DNA methylation. Since DNA methylation is a post-synthetic modification, newly synthesized strands temporarily exist in an unmethylated state, with the transient absence of adenine methylation on GATC sequences directing mismatch correction to new DNA strands within the hemimethylated duplexes.
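The strand-discrimination logic described above may be sketched as follows, purely as an illustration. The function names and the "top"/"bottom" strand labels are hypothetical. Only the hemimethylated case described in the paragraph is modeled; for duplexes in which both strands or neither strand is methylated, the sketch returns no answer rather than guessing at behavior the paragraph does not describe.

```python
# Illustrative sketch only; names and strand labels are hypothetical.
from typing import List, Optional


def gatc_positions(strand: str) -> List[int]:
    """Return the 0-based start position of every GATC in one strand."""
    positions = []
    start = strand.find("GATC")
    while start != -1:
        positions.append(start)
        start = strand.find("GATC", start + 1)
    return positions


def strand_to_correct(
    top_methylated: bool, bottom_methylated: bool
) -> Optional[str]:
    """Name the strand treated as newly synthesized in a hemimethylated duplex.

    Returns None when the duplex is not hemimethylated, because that case
    is outside what this sketch models.
    """
    if top_methylated and not bottom_methylated:
        return "bottom"
    if bottom_methylated and not top_methylated:
        return "top"
    return None
```

For instance, `strand_to_correct(True, False)` returns "bottom", reflecting that the transiently unmethylated strand is the one to which correction is directed.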
In vivo analyses in E. coli have shown that selected examples of each of the different mismatches are subject to correction with different efficiencies. G-T, A-C, G-G and A-A mismatches are typically subject to efficient repair. A-G, C-T, T-T and C-C are weaker substrates, but well repaired exceptions exist within this class. The sequence environment of a mismatched base pair may be an important factor in determining the efficiency of repair in vivo. The mismatch correction system is also capable in vivo of correcting differences between duplexed strands involving a single base insertion or deletion. Further, genetic analyses have demonstrated that the mismatch correction process requires intact genes for several proteins, including the products of the mutH, mutL and mutS genes, as well as DNA helicase II and single-stranded DNA binding protein (SSB). The following are further examples of art discussing this subject matter.
Lu et al., 80 Proc. Natl. Acad. Sci. USA 4639, 1983 disclose the use of a soluble E. coli system to support mismatch correction in vitro.
Pang et al., 163 J. Bact. 1007, 1985 disclose cloning of the mutS and mutL genes of Salmonella typhimurium.
The specific components of the E. coli mispair correction system have been isolated and the biochemical functions determined. Preparation of MutS protein substantially free of other proteins has been reported (Su and Modrich, 1986, Proc. Nat. Acad. Sci. U.S.A., 84, 5057-5061, which is hereby incorporated herein by reference). The isolated MutS protein was shown to recognize four of the eight possible mismatched base pairs (specifically, G-T, A-C, A-G and C-T mispairs).
Su et al., 263 J. Biol. Chem. 6829, 1988 disclose that the mutS gene product binds to each of the eight base pair mismatches and does so with differential efficiency.
Jiricny et al., 16 Nucleic Acids Research 7843, 1988 disclose binding of the mutS gene product of E. coli to synthetic DNA duplexes containing mismatches to correlate recognition of mispairs and efficiency of correction in vivo. Nitrocellulose filter binding assays and band-shift assays were utilized.
Welsh et al., 262 J. Biol. Chem. 15624, 1987 purified the product of the mutH gene to near homogeneity and demonstrated the mutH gene product to be responsible for d(GATC) site recognition and to possess a latent endonuclease that incises the unmethylated strand of hemimethylated DNA 5' to the G of d(GATC) sequences.
Au et al., 267 J. Biol. Chem. 12142, 1992 indicate that activation of the MutH endonuclease requires MutS, MutL and ATP.
Grilley et al., 264 J. Biol. Chem. 1000, 1989 purified the E. coli mutL gene product to near homogeneity and indicate that the mutL gene product interacts with the MutS-heteroduplex DNA complex.
Lahue et al., 245 Science 160, 1989 delineate the components of the E. coli methyl-directed mismatch repair system that function in vitro to correct seven of the eight possible base pair mismatches. Such a reconstituted system consists of MutH, MutL, and MutS proteins, DNA helicase II, single-strand DNA binding protein, DNA polymerase III holoenzyme, exonuclease I, DNA ligase, ATP, and the four deoxyribonucleoside triphosphates.
Su et al., 31 Genome 104, 1989 indicate that under conditions of restricted DNA synthesis, or limiting concentration of dNTPs, or by supplementing a reaction with a ddNTP, there is the formation of excision tracts consisting of single-stranded gaps in the region of the molecule containing a mismatch and a d(GATC) site.
Grilley et al., 268 J. Biol. Chem. 11830, 1993, indicate that excision tracts span the shorter distance between a mismatch and the d(GATC) site, indicating a bidirectional capacity of the methyl-directed system.
Holmes et al., 87 Proc. Natl. Acad. Sci. USA, 5837, 1990, disclose that nuclear extracts derived from HeLa and Drosophila melanogaster Kc cell lines support strand-specific mismatch correction in vitro.
Cooper et al., 268 J. Biol. Chem., 11823, 1993, describe a role for RecJ and Exonuclease VII as a 5' to 3' exonuclease in a mismatch repair reaction. In reconstituted systems such a 5' to 3' exonuclease function had been provided by certain preparations of DNA polymerase III holoenzyme.
Au et al., 86 Proc. Natl. Acad. Sci. USA 8877, 1989 describe purification of the MutY gene product of E. coli to near homogeneity, and state that the MutY protein is a DNA glycosylase that hydrolyzes the glycosyl bond linking a mispaired adenine (G-A) to deoxyribose. The MutY protein, an apurinic endonuclease, DNA polymerase I, and DNA ligase were shown to reconstitute G-A to G-C mismatch correction in vitro.
A role for the E. coli mismatch repair system in controlling recombination between related but non-allelic sequences has been indicated (Feinstein and Low, 113 Genetics 13, 1986; Rayssiguier, 342 Nature 396, 1989; Shen, 218 Mol. Gen. Genetics 358, 1989; Petit, 129 Genetics 327, 1991). Crossovers between sequences which differ by a few percent or more at the base pair level are rare. In bacterial mutants deficient in methyl-directed mismatch repair, the frequency of such events increases dramatically. The largest increases are observed in MutS and MutL deficient strains (Rayssiguier, supra; and Petit, supra).
Nelson et al., 4 Nature Genetics 11, 1993, disclose a genomic mismatch scanning (GMS) method for genetic linkage analysis. The method allows DNA fragments from regions of identity-by-descent between two relatives to be isolated based on their ability to form mismatch-free hybrid molecules.
The method consists of digesting DNA from the two sources with a restriction endonuclease that produces protruding 3' ends. The protruding 3' ends provide some protection from exonuclease III, which is used in later steps. The two sources are distinguished by methylating the DNA from only one source. Molecules from both sources are denatured and reannealed, resulting in the formation of four types of duplex molecules: homohybrids formed from strands derived from the same source and heterohybrids consisting of DNA strands from different sources. Heterohybrids can either be mismatch free or contain base-pair mismatches, depending on the extent of identity of homologous regions.
Homohybrids are distinguished from heterohybrids by use of restriction endonucleases that cleave at fully methylated or unmethylated GATC sites. Homohybrids are cleaved to smaller duplex molecules, while heterohybrids are resistant to cleavage. Heterohybrids containing one or more mismatches are distinguished from mismatch-free molecules by use of the E. coli methyl-directed mismatch repair system. The combination of three proteins of the methyl-directed mismatch repair system (MutH, MutL, and MutS), along with ATP, introduces a single-strand nick on the unmethylated strand at GATC sites in duplexes that contain a mismatch. Heterohybrids that do not contain a mismatch are not nicked. All molecules are then subjected to digestion by Exonuclease III (Exo III), which can initiate digestion at a nick, a blunt end or a 5' overhang, to produce single-stranded gaps. Only mismatch-free heterohybrids are not subject to attack by Exo III; all other molecules have single-stranded gaps introduced by the enzyme. Molecules with single-stranded regions are removed by adsorption to benzoylated naphthoylated DEAE cellulose. The remaining molecules consist of mismatch-free heterohybrids, which may represent regions of identity by descent.
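The selection logic of the method of Nelson et al. may be summarized, for illustration only, by the following sketch. The class `Duplex`, the function names, and the source labels "A" and "B" are hypothetical. The sketch assumes that every duplex contains at least one GATC site, and it represents each enzymatic step only by its outcome rather than by its mechanism.

```python
# Illustrative sketch only; names and source labels are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Duplex:
    strand_sources: Tuple[str, str]  # source of each strand, e.g. ("A", "B")
    has_mismatch: bool


def gms_outcome(duplex: Duplex) -> str:
    """Return the fate of one reannealed duplex.

    Assumes the duplex contains at least one GATC site.
    """
    first, second = duplex.strand_sources
    if first == second:
        # Fully methylated or fully unmethylated: cut at GATC, then gapped.
        return "removed (homohybrid)"
    if duplex.has_mismatch:
        # Hemimethylated with a mismatch: nicked, then gapped by Exo III.
        return "removed (mismatched heterohybrid)"
    return "retained (mismatch-free heterohybrid)"


def select_retained(duplexes: List[Duplex]) -> List[Duplex]:
    """Keep only the duplexes that survive every step of the selection."""
    return [d for d in duplexes if gms_outcome(d).startswith("retained")]


pool = [
    Duplex(("A", "A"), False),
    Duplex(("B", "B"), False),
    Duplex(("A", "B"), True),
    Duplex(("A", "B"), False),
]
retained = select_retained(pool)
```

Of the four kinds of duplex in this toy pool, only the mismatch-free heterohybrid is retained, corresponding to the fraction that may represent regions of identity by descent.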