The present invention relates to the preparation of novel DNA probes. Each probe is a synthetic 15 nucleotide sequence comprising 8 G's, 3 T's, 1 C, 1 A and 2 N's (where N is A, C, G or T). The DNA probes thus may comprise a 15 nucleotide sequence that is GGTGGNGATGGCTNG or a randomized variant of that sequence, for example CTGGTGTGGAGGAGG, excluding the M13 consensus sequence GAGGGTGGNGGNTCT, where N is A, C, G or T. The DNA probes are useful in probing human, animal or plant genomes. The invention further relates to an improved method of identifying genomic DNA using such probes, in particular by detecting polymorphisms. The DNA probes of the present invention are useful in many different areas, including the following:
1. paternity and maternity testing;
2. zygosity testing in twins;
3. cell chimerism studies, e.g., detection of donor versus recipient cells after bone marrow transplantation;
4. forensic medicine, e.g., identification of human remains, fingerprinting semen samples from rape victims or blood or hair samples from victim's clothing;
5. family group verification, e.g., in immigration or inheritance disputes;
6. tests for inbreeding;
7. general pedigree analysis;
8. identification of loci of genetic disease, to enable the construction of specific probes to detect a genetic defect;
9. animal or plant breeding and pedigree analysis authentication, e.g., routine control and checking of pure strains, checking pedigrees for litigation, providing genetic markers for economically important traits, checking for genetic relationships in order to prevent inbreeding of strains maintained in zoos;
10. quality control of cell lines, e.g., checking for contamination and for routine identification; and
11. analysis of tumor cells for molecular genetic abnormalities.
Two principal methods of identifying genetic variation in genomic DNA are currently available: (1) detection of restriction fragment length polymorphisms (RFLPs), and (2) detection of hypervariable regions (HVRs) of DNA. RFLPs generally result from small-scale changes in DNA, usually base substitutions or microdeletion/insertion, which create or destroy specific restriction endonuclease cleavage sites. Many examples of RFLPs detected by human gene probes or random cloned DNA segments have been reported. (See Cooper and Schmidtke, 1984, Hum. Genet. 66:1-16).
Since the overall variability in human DNA is low, with a mean heterozygosity per base pair of approximately 0.001-0.002 (Jeffreys et al., 1987, Biochem. Soc. Symp. 53:165-180; UK Patent Application GB 2166445A published May 8, 1986), restriction endonucleases will seldom detect an RFLP at a given locus. Variable sites are not uniformly dispersed; some regions (e.g., the HLA gene cluster) are rich in RFLPs, whereas other genes (e.g., thyroglobulin) are markedly deficient in DNA variants. Even when detected, RFLPs are generally dimorphic (i.e., presence or absence of a restriction endonuclease cleavage site) and their usefulness as genetic markers is limited by their low heterozygosity. For a given diallelic marker, the maximum frequency of heterozygotes obtainable in a population in the absence of selection is 50% (Jeffreys et al., 1987, supra). Thus, in pedigree analysis, all such RFLPs will be uninformative whenever critical individuals are homozygous.
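The 50% ceiling for a diallelic marker can be checked numerically. The following is a minimal sketch that assumes Hardy-Weinberg proportions (random mating, no selection), under which the expected heterozygote frequency is one minus the sum of the squared allele frequencies. The function name and the example allele frequencies are illustrative assumptions and are not taken from the cited references.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygote frequency, 1 - sum(p_i^2).

    Assumes Hardy-Weinberg proportions; this is an assumption of the sketch.
    """
    total = sum(allele_freqs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Diallelic marker: scan the frequency p of one allele from 0 to 1.
best_p = max(
    (i / 100 for i in range(101)),
    key=lambda p: expected_heterozygosity([p, 1 - p]),
)
print(best_p, expected_heterozygosity([best_p, 1 - best_p]))

# Hypothetical multiallelic locus with 20 equally frequent alleles.
print(expected_heterozygosity([1 / 20] * 20))
```

Under these assumptions the diallelic maximum falls at equal allele frequencies, whereas a locus with many alleles can have an expected heterozygosity well above 50%.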
Despite these limitations, RFLPs have provided numerous human genetic markers, which are useful in mapping human chromosomes (Botstein et al., 1980, Am. J. Hum. Genet. 32:314-331). Most recently, RFLPs have also been used to detect markers linked to disease loci when the gene product of the locus is unknown, e.g., markers linked to Huntington's disease (Gusella et al., 1983, Nature 306:234-238), adult polycystic disease of the kidney (Reeders et al., 1985, Nature 317:542-544), and cystic fibrosis (Tsui et al., 1985, Science 230:1054-1057). Despite this progress, the logistics of detecting linkage with randomly selected markers are formidable. Since the human genome is approximately 3300 map units (cM) long, at least 115 uniformly spaced markers would have to be screened before there would be even a 50/50 chance that one marker would be linked within 10 cM of a defined disease locus (Jeffreys et al., 1987, supra). Because most RFLPs are diallelic and would be uninformative in most pedigrees, the prior odds of detecting linkage between a disease locus and a random marker in a given pedigree are even lower. This second problem might be circumvented by using more highly polymorphic markers.
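A figure of about 115 markers can be reproduced with a short calculation. The sketch below rests on an assumption that the passage does not state: each marker is treated as falling independently and at random along a 3300 cM genome, so that a single marker has a 20/3300 chance of lying within 10 cM on either side of the disease locus (chromosome-end effects are ignored). The function name is illustrative.

```python
def markers_needed(genome_cm, window_cm, target_prob):
    """Smallest number of random markers for which the chance that at least
    one lies within window_cm of the locus reaches target_prob.

    Assumes independent, randomly placed markers (an assumption of this sketch).
    """
    p_hit = 2 * window_cm / genome_cm  # window extends on both sides of the locus
    n = 0
    p_none = 1.0  # probability that no marker screened so far is linked
    while 1.0 - p_none < target_prob:
        n += 1
        p_none *= 1.0 - p_hit
    return n


print(markers_needed(3300, 10, 0.5))  # prints 115
```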
Recently, localized regions of high variability termed hypervariable regions (HVRs) have been identified in and isolated from human DNA. The availability of probes for such HVRs that show multiallelic variation and correspondingly high heterozygosities would simplify genetic analysis. The chance discovery by Wyman and White, 1980, Proc. Natl. Acad. Sci. USA 77:6754-6758, of a random human DNA segment which defined a multiallelic locus was the first direct demonstration that HVRs exist in human DNA. Recently this variable DNA region itself has been cloned (Wyman et al., 1985, Proc. Natl. Acad. Sci. USA 82:2880-2884). Since the initial discovery by Wyman and White, 1980, supra, a number of other HVRs have been discovered by chance in human DNA, including: (1) a region 5' to the human insulin gene (Bell et al., 1982, Nature 295:31-35); (2) a region 3' to the c-Ha-ras1 oncogene (Capon et al., 1983, Nature 302:33-37); (3) at least 3 HVRs in and around the α-globin gene cluster (Higgs et al., 1981, Nucleic Acids Res. 9:4213-4224; Proudfoot et al., 1982, Cell 31:553-563; Goodbourn et al., 1983, Proc. Natl. Acad. Sci. USA 80:5022-5026; Jarman et al., 1986, EMBO J. 5:1857-1863); and (4) a region in the collagen type II gene (Stoker et al., 1985, Nucleic Acids Res. 13:4613). In each example listed above, the HVR consists of tandem repeats of a short sequence (a "minisatellite"). More recently, other minisatellite elements have been discovered in the human factor VII gene (Murray et al., Nucleic Acids Res. 16:4166), 3' to the human apolipoprotein B-100 (Apo B) gene (Huang et al., 1987, J. Biol. Chem. 262:8952-55), in the human apolipoprotein C-II (Apo C-II) gene (Das et al., 1987, J. Biol. Chem. 262:4787-93), and in two loci from the pseudoautosomal region of the human X and Y chromosomes (Simmler et al., 1987, EMBO J. 6:963-69).
According to Jeffreys et al., 1987, supra, the hypervariability at minisatellites results from changes in the number of repeats, presumably driven either by unequal recombination between misaligned minisatellites or by slippage at replication forks leading to the gain or loss of repeat units. The resulting length variability may be high. In some cases, a multiplicity of different length alleles may be observed, and the frequency of heterozygotes may approach 100% provided that the restriction enzyme does not cleave the minisatellite unit itself. Detection of the minisatellite length variation (i.e., RFLPs at these HVRs) is independent of the restriction enzyme used, and these loci, therefore, provide ideal markers for human genetics (Reeders et al., 1985, Nature 317:542-544). Nakamura et al., 1987, Science 235:1616-22, have used the term variable number of tandem repeats (VNTR) locus to designate a single locus which comprises a genetic sequence that contains tandem repeats of that sequence. Thus, the terms HVRs, minisatellites and VNTRs may be used interchangeably to indicate polymorphic regions of DNA in which the polymorphisms are due to variation in the number of tandem repeats of a short DNA sequence.
The total number of hypervariable loci in human DNA is not known, but appears to be large. From a screening of 1680 different recombinants from a human genomic library (Knowlton et al., 1986, Blood 68:378-385), at least 12 clones contained highly polymorphic regions. This would suggest that the human genome could contain ≥1500 HVRs. These HVRs may provide highly informative markers for the human linkage map, if they can be isolated.
Weller et al., 1984, EMBO J. 3:439-446, have described a small minisatellite comprised of four repeats of a 33 bp sequence found within one of the introns of a human myoglobin gene. A DNA probe comprising tandem repeats of this 33 bp sequence from the myoglobin gene was used to probe the human genome. Polymorphic variation was observed at several different regions in the genomic DNA of 3 related individuals (mother, father, daughter). The length variation was observed in fragments 2-6 kb in size, and was thought to be due to length variation of more than one minisatellite region.
More recently, Jeffreys has described a region of DNA, termed a common core region, which has a high degree of homology with several minisatellites. In PCT Application WO 86/02948 (published May 22, 1986) and European Patent Application 0238329A2 (published Sept. 23, 1987), Jeffreys has disclosed and claimed a DNA probe (and methods of preparing such a probe) which has as its essential constituent this short core sequence (approximately 16 nucleotides) tandemly repeated at least 3 times. According to Jeffreys, a probe having a tandem repeat of such a core sequence is able to detect many different minisatellite regions in genomic DNA. Because the probe detects many minisatellite regions, a fingerprint is obtained which is in essence unique for an individual. Further, according to Jeffreys, previously known probes were only capable of detecting a single minisatellite region and thus incapable of such individual fingerprinting. In particular, the probes disclosed and claimed by Jeffreys contain tandem repeats (at least 3 repeats are required) of a "core" sequence of 6 to 16 nucleotides having a high degree of homology with a nucleotide sequence of the general structure 5'-H.(J.core.K)_n.L-3', where n is at least 3 and the core includes any of the following:
GGAGGTGGGCAGGAXG (2)
AGAGGTGGGCAGGTGG (3)
GGAGGYGGGCAGGAGG (4)
T(C)_m GGAGGAXGG(G)_p C (5A)
T(C)_m GGAGGA(A)_Q GGGC (5B)
where X = A or G; Y = C or T; m = 0, 1 or 2; p = 0 or 1; and Q = 0 or 1.
In order to produce "an operable probe," Jeffreys states that the core sequence in itself is insufficient. What is required is to produce a polynucleotide containing tandem repeats (at least 3) of the core sequence or derivatives thereof. Jeffreys' probes are thus segments of minisatellite DNA and may be isolated as minisatellite fragments from human genomic DNA and cloned or may be synthetically prepared minisatellite sequences. Jeffreys, supra, has used these probes for twin zygosity studies. In addition, Min et al., 1988, British J. Haematol. 68:195-201, recently described the use of some of the minisatellite DNA probes described and claimed by Jeffreys in European Patent Application 0238329, supra, to identify cell origin after bone marrow transplantation.
Most recently, Vassart et al., Science 235:683-684 (1987) and European Patent Application 0264305 (published Apr. 20, 1988) have described a DNA probe derived from a sequence from wild-type M13 bacteriophage that identifies hypervariable minisatellite regions present in the human genome, provided that no competitor DNA is used during hybridization. Fish DNA (e.g. salmon sperm or herring) which is typically used during hybridizations will block or compete with the hybridization of the M13 DNA.
The effective sequence in M13 was identified as two clusters of 15 bp tandem repeats within the protein III gene of M13. The probe disclosed and claimed by Vassart et al. has the following M13 consensus sequence:
(GAGGGTGGNGGNTCT)_n or (Glu-Gly-Gly-Gly-Ser)_n
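The peptide repeat given alongside the consensus follows from reading the 15-mer as five codons. The sketch below is an illustration only: it uses a deliberately partial codon table covering just the codons that occur here (standard genetic code), and it resolves a codon containing N only when all four substitutions encode the same amino acid. The table and function names are assumptions made for this example.

```python
from itertools import product

# Partial standard-code table: only the codons needed for this illustration.
CODON_TABLE = {
    "GAG": "Glu",
    "TCT": "Ser",
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
}


def expand_codon(codon):
    """List every concrete codon obtained by replacing each N with A, C, G or T."""
    choices = ["ACGT" if base == "N" else base for base in codon]
    return ["".join(combo) for combo in product(*choices)]


def translate(seq):
    """Translate in frame 1; an ambiguous codon yields 'Xaa' unless all
    expansions encode the same residue."""
    seq = seq.upper()
    residues = []
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        encoded = {CODON_TABLE.get(c, "???") for c in expand_codon(codon)}
        residues.append(encoded.pop() if len(encoded) == 1 else "Xaa")
    return "-".join(residues)


print(translate("GAGGGTGGNGGNTCT"))  # prints Glu-Gly-Gly-Gly-Ser
```

Both GGN codons resolve to glycine regardless of N, which is why the degenerate consensus still corresponds to a single peptide repeat.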
One of the repeat clusters was isolated as an approximately 280 bp HaeIII-ClaI fragment and was used as a probe on Southern blots of HaeIII-digested human or animal DNA. Vassart et al., 1987, supra, showed that the pattern obtained with the approximately 280 bp M13 probe was clearly different from that obtained with Jeffreys' minisatellite DNA probe. Westneat et al., 1988, Nucleic Acids Res. 16:4161, have recently described improved hybridization conditions for the Vassart et al., supra, M13 probe and the Jeffreys, supra, minisatellite probes, to eliminate inconsistent hybridization and often high levels of background hybridization.
In addition to cloned DNA probes such as those described by Jeffreys, supra, or Vassart et al., supra, oligonucleotide probes have also been used to detect HVRs or VNTR loci. Oligonucleotide probes have several advantages over cloned minisatellite probes since they may be readily synthesized and reduce hybridization and exposure times. However, it is difficult to discover what sequences will function as probes to detect HVRs. For example, Schafer et al., 1988, Nucleic Acids Res. 16:5196, investigated the use of 8 different simple repetitive oligonucleotides to screen human DNA. Six of the 8 synthetic oligonucleotides tested were ineffective as probes. The two probes that were effective were the repetitive synthetic oligonucleotides (CAC)_5 and (GACA)_4. The (GACA)_4 probe reported previously by Ali et al., 1986, Hum. Genet. 74:239-43, was less polymorphic than the (CAC)_5 probe but was one order of magnitude more sensitive than the (CAC)_5 probe.
In contrast to the simple repetitive oligonucleotides used by Schafer et al., 1988, supra, and Ali et al., 1986, supra, several groups have started with known sequences of HVRs to prepare synthetic oligonucleotide probes corresponding to these known sequences. For example, in Nakamura et al., 1987, supra, 16-20 oligonucleotide probes were synthesized based on the previously reported sequences of HVRs from myoglobin (Jeffreys et al., 1985, Nature 314:67-73), zeta-globin (Proudfoot et al., 1982, Cell 31:553), insulin (Bell et al., 1982, Nature 295:31-35) and the X-gene region of HBV. The probes contained a somewhat variable core sequence GGGGTGGGG and the almost invariant sequence GTGGG.
In subsequent studies, Nakamura et al., 1988, Am. J. Hum. Genet. 43:854-59, prepared pools of synthetic 18-base oligonucleotides based on the previously reported sequences from the zeta-globin (Proudfoot et al., 1982, supra), insulin (Bell et al., 1982, supra), myoglobin (Jeffreys et al., 1985, supra), Harvey-ras (Capon et al., 1983, Nature 302:33-37) genes and other loci known to contain HVRs (Nakamura et al., 1987, supra). All 18-base oligonucleotides included GNNGTGGG as a core sequence and the 12 bases outside this core sequence were chosen randomly. In particular, each oligonucleotide used as a probe was actually a pool of 256-1,024 different sequences, because each included 4 or 5 N's (where N is A, G, C or T). In both Nakamura et al., 1987, supra, and Nakamura et al., 1988, supra, the probes were used to screen human genomic libraries for the purpose of identifying locus-specific DNA markers for human gene linkage studies.
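The pool sizes of 256 and 1,024 follow from the number of N positions, since each N stands for four possible bases (4^4 = 256 and 4^5 = 1,024). The sketch below counts and enumerates such a pool. The function names are illustrative, and the 18-base sequence is a made-up example that merely contains the GNNGTGGG core; it is not one of the oligonucleotides of Nakamura et al.

```python
from itertools import product


def pool_size(oligo):
    """Number of distinct sequences represented by an oligo with N positions."""
    return 4 ** oligo.upper().count("N")


def expand_pool(oligo):
    """Enumerate every concrete sequence in the degenerate pool."""
    choices = ["ACGT" if base == "N" else base for base in oligo.upper()]
    return ["".join(combo) for combo in product(*choices)]


core = "GNNGTGGG"
print(pool_size(core), len(expand_pool(core)))

# Hypothetical 18-mer with 4 N's in total (2 in the core, 2 outside it).
hypothetical_18mer = "NNGNNGTGGGACTGACTG"
print(len(hypothetical_18mer), pool_size(hypothetical_18mer))
```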
The present invention relates to novel DNA probes derived from the M13 consensus sequence, which are randomized variants of the M13 sequence. It has now been unexpectedly found that a 15-nucleotide sequence GGTGGNGATGGCTNG, or a randomized variant of this sequence which is not the M13 consensus sequence, detects HVRs in genomic DNA with such precision as to enable individuals to be identified or fingerprinted by reference to variations in their DNA in these regions. A variety of different restriction endonucleases (including HaeIII, AluI, HinfI, MboI, or SauC13AI) may be used to digest the DNA that is to react with the DNA probes of the present invention. Such an excellent result is highly unexpected, since there was no suggestion that a sequence other than the precise M13 consensus sequence itself, tandemly repeated as defined by Vassart et al., supra, would be capable of detecting HVRs. In particular, it is impossible to predict whether such a randomized sequence would function as a probe better, worse, the same or not at all as compared with the M13 consensus sequence. The discovery of such randomized non-repetitive sequences and demonstration of such unexpected and excellent results using such sequences lends an unusual degree of unobviousness to the inventive art.
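Whether a candidate 15-mer has the base composition described for these probes (8 G's, 3 T's, 1 C, 1 A and 2 N's) while not being the M13 consensus sequence can be checked mechanically. The sketch below is one possible way to express that check; the function names are assumptions, and treating N as a free slot that any base may fill is this example's reading of the composition rule.

```python
from collections import Counter

FIXED_BASES = Counter({"G": 8, "T": 3, "C": 1, "A": 1})  # 13 fixed bases + 2 N's
M13_CONSENSUS = "GAGGGTGGNGGNTCT"


def has_probe_composition(seq):
    """True if seq is a 15-mer whose bases cover the 13 fixed bases; the two
    remaining positions may be A, C, G, T or N."""
    seq = seq.upper()
    if len(seq) != 15 or set(seq) - set("ACGTN"):
        return False
    counts = Counter(seq)
    return all(counts[base] >= need for base, need in FIXED_BASES.items())


def matches_m13(seq):
    """True if seq fits the M13 consensus, with N in the consensus as a wildcard."""
    seq = seq.upper()
    return len(seq) == len(M13_CONSENSUS) and all(
        c == "N" or c == s for c, s in zip(M13_CONSENSUS, seq)
    )


def is_randomized_variant(seq):
    """Composition matches, and the sequence is not the M13 consensus itself."""
    return has_probe_composition(seq) and not matches_m13(seq)


for candidate in ("GGTGGNGATGGCTNG", "CTGGTGTGGAGGAGG", M13_CONSENSUS):
    print(candidate, has_probe_composition(candidate), is_randomized_variant(candidate))
```

The M13 consensus sequence has the same base composition as the prototype, which is consistent with the probes being described as randomized variants of the M13 sequence and with the consensus being excluded explicitly.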
The prototype sequence GGTGGNGATGGCTNG [or Gly-Gly-Asp-Gly-Trp] of the present invention is a significantly different and unique sequence as compared with the prior art M13 consensus sequence. For example, a search of the gene and DNA sequences contained in the computer database of genetic sequences known as GenBank revealed 1120 matches in 934 loci (a locus may contain more than one match) for the M13 consensus sequence, and 1378 matches in 1171 loci for the prototype sequence. These numbers were obtained from a search of GenBank 60.0 (6/89) allowing 2 base mismatches, with all parameters identical for the M13 sequence and the prototype sequence. When the loci containing the M13 sequence or the prototype sequence of the present invention were further analyzed, only 150 loci were found in common between the M13 and prototype sequences. Surprisingly, there were 784 M13-unique loci and 1021 prototype-unique loci. Similar results may be obtained by searching a GenBank version other than GenBank 60.0 (6/89). The discovery that 1021 gene loci contain only the prototype sequence while 784 different gene loci contain only the M13 sequence suggests that a novel probe of the present invention which comprises the prototype sequence is significantly different from the prior art M13 sequence. These differences have been confirmed in comparative hybridization studies with an M13 probe. The restriction fragment patterns obtained with probes of the present invention are significantly different from patterns obtained with an M13 probe.
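A search "allowing 2 base mismatches" can be pictured as sliding the 15-mer along a target sequence and counting disagreements at each offset. The sketch below is a toy illustration of that idea and is not the search program used for the GenBank analysis, which is not identified here; the function names and the short target sequence are made up, and treating N in the probe as matching any base is an assumption. The last lines simply recompute the unique-locus counts from the totals reported above.

```python
PROTOTYPE = "GGTGGNGATGGCTNG"


def count_mismatches(pattern, window):
    """Mismatches between pattern and window; N in the pattern matches any base."""
    return sum(1 for p, w in zip(pattern, window) if p != "N" and p != w)


def find_matches(pattern, target, max_mismatches=2):
    """Offsets in target where pattern fits with at most max_mismatches."""
    k = len(pattern)
    return [
        i
        for i in range(len(target) - k + 1)
        if count_mismatches(pattern, target[i:i + k]) <= max_mismatches
    ]


# Made-up target: the preferred 15-mer flanked by runs of T.
toy_target = "TTTTT" + "GGTGGAGATGGCTGG" + "TTTTT"
print(find_matches(PROTOTYPE, toy_target))

# Locus bookkeeping from the reported totals.
m13_loci, prototype_loci, shared_loci = 934, 1171, 150
print(m13_loci - shared_loci, prototype_loci - shared_loci)  # prints 784 1021
```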
A probe according to the present invention is unusually effective as a probe for DNA fingerprinting, and gives clean, clear fingerprints when one of any number of restriction endonucleases is used to digest the DNA sample to be probed. It is also unusually effective in light of the present discovery that the 15-nucleotide sequence alone, and not a tandem repeat of that sequence, is effective as a probe of HVRs in genomic DNA. This is particularly unexpected in light of PCT Application WO 86/02948 (PCT/GB85/00477), which teaches that the mere recognition or identification of a particular 16-nucleotide core sequence is insufficient in itself for the production of an operable probe to detect HVRs; an operable probe requires a polynucleotide containing tandem repeats of the core sequence or a derivative of the core sequence. Similarly, European Patent Application 0264305A2 describes a probe comprising an approximately 280 bp HaeIII-ClaI fragment from M13 which contains multiple tandem repeats of the preferred M13 consensus core sequence. It is also unusually effective in that a 15-nucleotide sequence which itself contains no simple repetitive units is effective as a probe of HVRs in genomic DNA. This is particularly unexpected in light of prior art probes (CAC)_5 and (GACA)_4 comprising simple repetitive sequences as described by Schafer et al., 1988, supra, and Ali et al., 1986, supra.
In a particularly preferred embodiment of the present invention, the DNA probe is a 15-nucleotide sequence that is GGTGGAGATGGCTGG. This sequence was itself randomized 35 times, producing 35 related sequences with identical base content but permuted sequences. Thus, a DNA probe according to the present invention is a simple and extremely sensitive detector of DNA polymorphisms, that works by detecting RFLPs of HVRs. In contrast to the probes of Jeffreys, supra, and Vassart et al., supra, it is effective as a 15-nucleotide (15-mer) and does not require a tandem repeat of the 15-mer to be operative as a probe that detects DNA polymorphisms.
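One way to picture "randomizing" the preferred 15-mer is to shuffle its bases, which keeps the base content identical while permuting the order. The sketch below is only an illustration under that assumption: the randomization procedure actually used is not described in this passage, and the function names, the seed, and the choice to discard shuffles that reproduce the starting sequence or fit the M13 consensus are all assumptions of this example.

```python
import random
from collections import Counter

PREFERRED = "GGTGGAGATGGCTGG"
M13_CONSENSUS = "GAGGGTGGNGGNTCT"


def matches_consensus(seq, consensus=M13_CONSENSUS):
    """True if seq fits the consensus, with N in the consensus as a wildcard."""
    return len(seq) == len(consensus) and all(
        c == "N" or c == s for c, s in zip(consensus, seq)
    )


def randomized_variants(seq, count, seed=0):
    """Return `count` distinct shuffles of seq with identical base content."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    variants = set()
    while len(variants) < count:
        candidate = "".join(rng.sample(list(seq), len(seq)))
        if candidate != seq and not matches_consensus(candidate):
            variants.add(candidate)
    return sorted(variants)


variants = randomized_variants(PREFERRED, 35)
print(len(variants))
print(all(Counter(v) == Counter(PREFERRED) for v in variants))
```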
For purposes of the present invention as disclosed and claimed herein, the following terms are defined below:
Base pair (bp) or Nucleotide (nt) - used synonymously. Both can refer to DNA or RNA. The abbreviations A,C,G,T or U refer conventionally to the nucleotides (deoxy)adenosine, (deoxy)cytidine, (deoxy)guanosine, and thymidine or uridine monophosphates. In double-stranded DNA, base pair may refer to a partnership of A with T or C with G.
Consensus core sequence - a sequence which can be identified as the closest match among a number of repeat sequences (e.g., among the repeat units of two or more different minisatellites).
Hypervariable Regions (HVRs) - a region of human, animal or plant DNA at a recognized locus or site which occurs in many different forms, for example, as to length or sequence.
Minisatellite - a variable region of human, animal or plant DNA which is comprised of tandem repeats of a short DNA sequence, in which all repeats may not necessarily show perfect identity of sequence and in which the number of repeats may vary among different individuals.
M13 consensus sequence - a sequence tandemly repeated in M13 genomic DNA that is GAGGGTGGNGGNTCT, where N is A, C, G, or T.
% Similarity - in comparing two sequences A and B (e.g. two tandem repeats or repeat sequences), the percentage similarity is given by the number of base pairs in A, less the number of base pair substitutions, additions or deletions in B, which would be necessary in order to give the sequence of A, expressed as a percentage. For example, the % similarity between two sequences ATGC and AGC is 75% (4-1=3 and 3/4=75%).
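The definition of % similarity can be expressed as a small calculation. The sketch below assumes that "the number of substitutions, additions or deletions which would be necessary" means the minimum such number, i.e. the Levenshtein edit distance between the two sequences; that reading, and the function names, are assumptions of this example.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, additions or deletions turning b into a
    (Levenshtein distance, computed row by row)."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # addition
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def percent_similarity(a, b):
    """(length of A minus edits needed in B to give A) / length of A, as a percentage."""
    return 100.0 * (len(a) - edit_distance(a, b)) / len(a)


print(percent_similarity("ATGC", "AGC"))  # prints 75.0
```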
Polymorphic/Polymorphisms - a gene or other segment of DNA which shows variability from individual to individual or between a given individual's paired chromosomes (e.g., a heterozygous individual).
Restriction Fragment - any linear DNA molecule generated by the action of one or more restriction enzymes.
Restriction Fragment Length Polymorphism(s) (RFLPs) - a polymorphism revealed by digestion of DNA with a restriction enzyme.
Tandem Repeat or Repeat Sequence - a polynucleotide sequence which is perfectly or imperfectly repeated in series.