Mapping of genetic differences between individuals is of growing importance for both forensic and medical applications. For example, DNA "fingerprinting" methods are being applied for identification of perpetrators of crimes, even when only small amounts of blood or sperm are available for analysis. Biological parents can also be identified by comparing the DNAs of a child and a suspected parent using such means. Further, a number of inherited pathological conditions may be diagnosed before onset of symptoms, even in utero, using methods for structural analyses of DNA. Finally, it is notable that a major international effort to physically map and, ultimately, to determine the sequence of bases in the DNA encoding the entire human genome is now underway and gaining momentum in both institutional and commercial settings.
DNA molecules are linear polymers of subunits called nucleotides. Each nucleotide comprises a common cyclic sugar molecule, which in DNA is linked by phosphate groups on opposite sides to the sugars of adjoining nucleotides, and one of several cyclic substituents called bases. The four bases commonly found in DNAs from natural sources are adenine, guanine, cytosine and thymine, hereinafter referred to as A, G, C and T, respectively. The linear sequence of these bases in the DNA of an individual encodes the genetic information that determines the heritable characteristics of that individual.
In double-stranded DNA, such as occurs in the chromosomes of all cellular organisms, the two DNA strands are entwined in a precise helical configuration with the bases projecting inward and so aligned as to allow interactions between bases from opposing strands. The two strands are held together in precise alignment mainly by hydrogen bonds which are permitted between bases by a complementarity of structures of specific pairs of bases. This structural complementarity is determined by the chemical natures and locations of substituents on each of the bases. Thus, in double-stranded DNA, normally each A on one strand pairs with a T from the opposing strand, and, likewise, each G with an opposing C.
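The pairing rule just described (A with T, G with C) can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for illustration. The sketch also assumes the standard antiparallel arrangement of the two strands, which the text above does not spell out, so the opposing strand is written in the reverse direction.

```python
# Watson-Crick pairing rules described above: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand):
    """Return the opposing strand for `strand`.

    Assumption: the two strands run antiparallel, so the result is the
    reverse complement of the input.
    """
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


def is_fully_paired(strand_a, strand_b):
    """True if strand_b pairs with strand_a at every position."""
    return complement_strand(strand_a) == strand_b.upper()


if __name__ == "__main__":
    print(complement_strand("AAGC"))
    print(is_fully_paired("AAGC", "GCTT"))
```

A single position at which the second strand departs from the expected complement is enough to make `is_fully_paired` return False, which corresponds to a duplex containing one mismatched base pair.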
When a cell undergoes reproduction, its DNA molecules are replicated and precise copies are passed on to its descendants. The linear base sequence of a DNA molecule is maintained in the progeny during replication in the first instance by the complementary base pairings which allow each strand of the DNA duplex to serve as a template to align free nucleotides with its polymerized nucleotides. The complementary nucleotides so aligned are biochemically polymerized into a new DNA strand with a base sequence that is entirely complementary to that of the template strand.
Occasionally, an incorrect base pairing does occur during replication, which, after further replication of the new strand, results in a double-stranded DNA offspring with a sequence containing a heritable single base difference from that of the parent DNA molecule. Such heritable changes are called genetic mutations, or more particularly in the present case, "single base pair" or "point" mutations. The consequences of a point mutation may range from negligible to lethal, depending on the location and effect of the sequence change in relation to the genetic information encoded by the DNA.
The bases A and G are of a class of compounds called purines, while T and C are pyrimidines. Whereas the normal base pairings in DNA (A with T, G with C) involve one purine and one pyrimidine, the most common single base mutations involve substitution of one purine or pyrimidine for the other (e.g., A for G or C for T), a type of mutation referred to as a "transition". Mutations in which a purine is substituted for a pyrimidine, or vice versa, are less frequently occurring and are called "transversions". Still less common are point mutations comprising the addition or loss of a single base arising in one strand of a DNA duplex at some stage of the replication process. Such mutations are called single base "insertions" or "deletions", respectively, and are also known as "frameshift" mutations, due to their effects on translation of the genetic code into proteins. Larger mutations affecting multiple base pairs also do occur and can be important in medical genetics, but their occurrences are relatively rare compared to point mutations.
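The distinction between transitions and transversions drawn above reduces to whether the original and substituted bases belong to the same chemical class. A minimal Python sketch follows; the function name `classify_substitution` is hypothetical and used only for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(original, mutant):
    """Classify a single base substitution as a transition or a transversion."""
    original, mutant = original.upper(), mutant.upper()
    valid = PURINES | PYRIMIDINES
    if original not in valid or mutant not in valid:
        raise ValueError("bases must be one of A, G, C, T")
    if original == mutant:
        return "no change"
    pair = {original, mutant}
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition;
    # any exchange between the two classes is a transversion.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for before, after in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(before, "->", after, ":", classify_substitution(before, after))
```

Single base insertions and deletions are outside the scope of this sketch, since they change the length of the sequence rather than the identity of one base.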
Mapping of genetic mutations involves both the detection of sequence differences between DNA molecules comprising substantially identical (i.e., homologous) base sequences, and also the physical localization of those differences within some subset of the sequences in the molecules being compared. In principle, it is possible to both detect and localize limited genetic differences, including point mutations within genetic sequences of two individuals, by directly comparing the sequences of the bases in their DNA molecules. In practice, however, direct DNA sequencing has highly restricted usefulness for mapping mutations due to the major time and effort required to determine the sequence of even one DNA fragment comprising a few hundred base pairs. Typically, a single functional unit of genetic information, a gene, may be encoded in tens of thousands of base pairs of human chromosomal DNA. Thus comparing the sequence of a complete gene from one individual with that of another by direct DNA sequencing involves analyses of multiple short fragments of that gene, requiring many months if not years of effort. It may also be noted that there are estimated to be hundreds of thousands of genes in the entire human gene complement or genome, as it is called, any one of which may be involved in some genetically determined disease.
Accordingly, several simpler methods for detecting differences between DNA sequences have been developed which although providing less direct information about base sequence differences, nevertheless do yield useful observations under limited circumstances. For example, some pairs of single-stranded DNA fragments with sequences differing in a single base may be distinguished by their different migration rates in electric fields, as in denaturing gradient gel electrophoresis. This method does not detect all the possible single-base differences between DNA fragments and is restricted to fragments comprising at most a few hundred base pairs. Further, it is technically difficult to generate consistent analyses using this method. Thus this approach has extremely limited utility for detection and localization of single base sequence differences between DNAs encoding whole genes.
DNA restriction systems found in bacteria, for example, comprise proteins which generally recognize specific sequences in double-stranded DNA composed of 4 to 6 or more base pairs. In the absence of certain modifications (e.g., a covalently attached methyl group) at definite positions within the restriction recognition sequence, endonuclease components of the restriction system will cleave both strands of a DNA molecule at specific sites within or near the recognition sequence. Such short recognition sequences occur by chance in all natural DNA sequences, once in every few hundred or thousand base pairs, depending on the recognition sequence length. Thus, digestion of a DNA molecule with various restriction endonucleases, followed by analyses of the sizes of the resulting fragments (e.g., by gel electrophoresis), may be used to generate a physical map ("fingerprint") of the locations in a DNA molecule of selected short sequences.
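The fragment-size analysis described above can be sketched in Python under simplifying assumptions: a linear molecule, a single enzyme, a cut on one strand only, and no methylation. The function names are hypothetical. The example uses the well-known EcoRI recognition sequence GAATTC, which is cut after the first G. The helper `expected_site_spacing` assumes a random sequence with equal base frequencies, which gives the "few hundred or thousand base pairs" spacing mentioned in the text for sites of 4 to 6 base pairs.

```python
def expected_site_spacing(site_length):
    """Average spacing of one specific site in random DNA (equal base frequencies assumed)."""
    return 4 ** site_length


def digest(sequence, site, cut_offset):
    """Return fragment lengths after cutting a linear sequence at every `site`.

    `cut_offset` is the number of bases from the start of the site to the cut.
    """
    sequence, site = sequence.upper(), site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    normal = "AAGAATTCTTTTGAATTCGG"
    variant = "AAGAATTCTTTTGAGTTCGG"  # point mutation inside the second site
    print(digest(normal, "GAATTC", 1))
    print(digest(variant, "GAATTC", 1))
    print(expected_site_spacing(4), expected_site_spacing(6))
```

In this toy example a single base change inside the second recognition site removes one cut, so two of the fragments merge into one longer fragment. A change outside any recognition site would leave the fragment pattern unchanged and so would go undetected.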
It is well known in the art that comparisons of such restriction maps of two homologous DNA sequences can reveal differences within those specific sequences that are recognized by those restriction enzymes used in the available maps. Restriction map comparisons may localize any detectable differences within limits defined ultimately by the resolving power of DNA fragment size determination, essentially within about the length of the restriction recognition sequence under certain conditions of gel electrophoresis. To achieve such resolution in location of a point mutation by restriction mapping, however, all fragments resulting from digestion with each restriction nuclease must be within a range of distinguishable sizes, usually below an upper limit of between 10 and 20 thousand base pairs (kbp), and preferably less than one kbp, using standard gel electrophoresis techniques. Since each different restriction enzyme scans only a fraction of a percent of all the sequences in any DNA molecule, literally thousands of analyses with thousands of different enzymes would be needed to completely compare two DNAs encoding even one gene, assuming that enzymes recognizing all possible 4 to 6 base sequences were known, which they are not.
In practice, selected heritable differences in restriction fragment lengths (i.e., restriction fragment length polymorphisms, "RFLP"s) have been extremely useful, for instance, for generating physical maps of the human genome on which genetic defects may be located with a relatively low precision of hundreds or, sometimes, tens of thousands of base pairs. Typically, RFLPs are detected in human DNA isolated from small tissue or blood samples by using radioactively labeled DNA fragments complementary to the genes of interest. These "probes" are allowed to form DNA duplexes with restriction fragments of the human DNA after separation by electrophoresis, and the resulting radioactive duplex fragments are visualized by exposure to photographic (e.g., X-ray sensitive) film, thereby allowing selective detection of only the relevant gene sequences amid the myriad of others in the genomic DNA.
When the search for DNA sequence differences can be confined to specific regions of known sequence, the recently developed "polymerase chain reaction" ("PCR") technology can be used to reduce the amount of effort needed to detect and locate a single base difference as compared to the usual DNA sequencing approach which requires molecular cloning of the DNA fragment of interest. Briefly, this method utilizes short DNA fragments complementary to sequences on either side of the location to be analyzed to serve as points of initiation for DNA synthesis (i.e., "primers") by purified DNA polymerase. The resulting cyclic process of DNA synthesis results in massive biochemical amplification of the sequences selected for analysis, which then may be easily detected and, if desired, further analyzed, for example, by restriction mapping or direct DNA sequencing methods. In this way, selected regions of a human gene comprising a few kbp may be amplified and examined for sequence variations, but only in cases where sequences spanning a particular location of interest are known.
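The selection of a region by two flanking primers, as described above, can be illustrated with a small in silico sketch in Python. It is only an approximation of the laboratory procedure: it assumes exact primer matches, a single template strand given as text, and antiparallel pairing, and it ignores reaction conditions entirely. The function names are hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the complementary strand written in the opposite direction."""
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


def amplified_region(template, forward_primer, reverse_primer):
    """Return the stretch of `template` bounded by the two primers, or None.

    The forward primer matches the template as written; the reverse primer
    pairs with the template downstream, so its reverse complement is what
    appears in the template text.
    """
    template = template.upper()
    forward = forward_primer.upper()
    downstream = reverse_complement(reverse_primer)
    start = template.find(forward)
    if start == -1:
        return None
    end = template.find(downstream, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(downstream)]


if __name__ == "__main__":
    print(amplified_region("TTTACGTGGGGGGCCATAAA", "ACGT", "ATGG"))
```

The sketch makes the stated limitation concrete: if either flanking sequence is unknown or altered, no primer match is found and no region is returned.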
In clinical practice, the PCR method is useful mainly for detecting known heritable variants of selected human genes which differ by only one or a few specific base pairs (i.e., allelic forms of a gene). For example, the human β-globin gene comprises several alleles that can be distinguished by this approach. The overall utility of the method is highly limited, however, particularly when there is a need to detect sequence differences which may be scattered over large stretches of a gene, as in the diagnosis of conditions resulting from frequent new mutational events in human populations, such as the Lesch-Nyhan syndrome.
Another known method for detecting and localizing single base differences within homologous DNA molecules involves the use of a radiolabeled RNA fragment with base sequence complementary to one of the DNAs and a nuclease that recognizes and cleaves single-stranded RNA. The structure of RNA is highly similar to DNA, except for a different sugar and the presence of uracil (U) in place of T; hence, RNA and DNA strands with complementary sequences can form helical duplexes ("DNA:RNA hybrids") similar to double-stranded DNA, with base pairing between A's and U's instead of A's and T's. It is known that the enzyme ribonuclease A ("RNase A") can recognize some single pairs of mismatched bases (i.e., "base mispairs") in DNA:RNA hybrids and can cleave the RNA strand at the mispair site. Analysis of the sizes of the products resulting from RNase A digestion allows localization of single base mismatches, potentially to the precise sequence position, within lengths of homologous sequences determined by the limits of resolution of the RNA sizing analysis (Myers, R. M. et al., 1985, Science, 230, 1242-1246). RNA sizing is performed in this method by standard gel electrophoresis procedures used in DNA sequencing, an approach which limits the practical resolution to mapping of single base mispairs in a DNA:RNA hybrid comprising an RNA of only several hundred nucleotides. Moreover, this RNase A method requires preparing complementary RNA probes from each DNA sequence to be examined, which requires more work and is more technically demanding than methods using only DNA (such as restriction mapping). Further, RNase A does not efficiently recognize all possible mispairings of DNA and RNA bases, so that some point differences between DNA sequences go undetected.
It has also been reported that S1 nuclease, an endonuclease specific for single-stranded nucleic acids, can recognize and cleave limited regions of mismatched base pairs in DNA:DNA or DNA:RNA duplexes. Therefore, it has been suggested that S1 nuclease could be used to map single base pair differences between DNA molecules by sizing of cleavage fragments. However, more extensive analysis of this enzyme has established that a mismatch of at least about 4 consecutive base pairs actually is generally required for recognition and cleavage of a duplex by S1 nuclease, thus precluding its use for detection of any point mutations.
Thus, none of the available methods for comparing the base sequences of DNAs, other than direct sequencing, can efficiently detect and localize all possible single base differences. Further, all of these methods, including especially DNA sequencing, require substantial labor and repetitive analyses with various sequence specific reagents (e.g., multiple nucleases or short nucleic acid strands) to detect all single base differences within two specimens of a single human gene.
Hence, there is a need for simpler and more efficient approaches, both for detecting and for localizing genetic differences between DNA sequences to facilitate both clinical diagnoses and forensic investigations. In particular, the observations above indicate a specific need for simpler and more efficient methods and reagents for detection of any possible single base differences between long DNA sequences, for example, between a complete gene from one individual and the entire genome of another. There is also a further need for simpler methods for localization of any possible single base differences within the sequences of homologous regions of long DNA molecules such as those encoding one or more complete genes and comprising several kbp of DNA.
The present invention contemplates the use of certain proteins that recognize mismatched base pairs in double-stranded DNA (and, therefore, are called "mispair recognition proteins") in defined systems for detecting and mapping point mutations in DNAs. Accordingly, it is an object of the present invention to provide methods for using such mispair recognition proteins, alone or in combination with other proteins, for detecting and localizing single base differences between DNA molecules, particularly those DNAs comprising several kbp. Additionally, it is an object of this invention to develop modified forms of mispair recognition proteins to further simplify methods for identifying specific bases which differ between DNAs.
Enzymatic systems capable of recognition and correction of base pairing errors within the DNA helix have been demonstrated in bacteria, fungi and mammalian cells, but the mechanisms and functions of mismatch correction are best understood in Escherichia coli. Of the several mismatch repair systems that have been identified in E. coli, the most relevant here is the methyl-directed pathway for repair of DNA biosynthetic errors. The fidelity of DNA replication in E. coli is enhanced 100-1000 fold by this postreplication mismatch correction system. This system processes base pairing errors within the helix in a strand-specific manner by exploiting patterns of DNA methylation. Since DNA methylation is a postsynthetic modification, newly synthesized strands temporarily exist in an unmethylated state, with the transient absence of adenine methylation on GATC sequences directing mismatch correction to new DNA strands within the hemimethylated duplexes.
In vivo analyses in E. coli have shown that selected examples of each of the different mismatches are subject to correction with different efficiencies. G-T, A-C, G-G and A-A mismatches are typically subject to efficient repair. A-G, C-T, T-T and C-C are weaker substrates, but well repaired exceptions exist within this class. It is thought that the sequence environment of a mismatched base pair may be an important factor in determining the efficiency of repair in vivo. The mismatch correction system is also capable in vivo of correcting differences between duplexed strands involving a single base insertion or deletion. Further, genetic analyses have demonstrated that the mismatch correction process requires intact genes for several proteins, including the products of the mutH, mutL and mutS genes, as well as DNA helicase II and single-stranded DNA binding protein (SSB).
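The eight mismatches named above are exactly the unordered pairs of bases other than the two normal pairs, A-T and G-C. A short Python sketch, given only as an illustration with hypothetical names, enumerates them. Note that the labels it produces follow its own base order, so some differ in order from the text (for example, "G-T" and "A-C" are printed as such, while the text's "A-G" and "C-T" appear identically).

```python
from itertools import combinations_with_replacement

BASES = "AGCT"
# The two normal (Watson-Crick) pairings, stored without regard to order.
WATSON_CRICK = {frozenset("AT"), frozenset("GC")}


def possible_mismatches():
    """List every unordered base pair that is not a normal A-T or G-C pair."""
    return [
        first + "-" + second
        for first, second in combinations_with_replacement(BASES, 2)
        if frozenset((first, second)) not in WATSON_CRICK
    ]


if __name__ == "__main__":
    mismatches = possible_mismatches()
    print(len(mismatches), mismatches)
```

Four bases give ten unordered pairs in total; removing the two normal pairs leaves the eight mismatches discussed in the text.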
The present inventors have been seeking to identify and isolate specific proteins that are required for correction of mismatched base pairs and to understand the specific biochemical functions of these mispair correction system components. The products of the mutH and mutS genes have been purified to near homogeneity in biologically active form. Analysis of the MutH protein has suggested that it functions in strand discrimination by incising the unmethylated DNA strand at GATC sites. The isolated MutS protein has been shown to recognize four of the eight possible mismatched base pairs (specifically, G-T, A-C, A-G and C-T mispairs; Su, S.-S. and Modrich, P., 1986, Proc. Nat. Acad. Sci. U.S.A., 83, 5057-5061). The hierarchy of apparent affinities of isolated MutS protein for the particular examples of the four mispairs tested in these studies did not correlate well with in vivo efficiencies of mismatch correction. Hence, these studies left undetermined whether or not additional proteins, acting alone or in concert with MutS, are required for or influence the recognition of other base mispairs.