The ability to map eukaryotic genomes has become an essential tool for the diagnosis of genetic diseases, and for plant breeding and forensic medicine. An absolute requirement for elucidation of any genetic linkage map is the ability to identify DNA sequence variation. The realization that genetic (DNA) polymorphisms between phenotypically identical individuals are present and can be used as markers for genetic mapping has produced major advances in the art of developing eukaryotic linkage maps.
Techniques for identifying genetic polymorphisms are relatively few and to date have been time consuming and labor intensive. One of the most common techniques is referred to as restriction fragment length polymorphism or RFLP (Botstein et al., Am. J. Hum. Genet. 32, 314, (1980)). Using RFLP technology, genetic markers based on single or multiple point mutations in the genome may be detected by differentiating DNA banding patterns from restriction enzyme analysis. As restriction enzymes cut DNA at specific target site sequences, a point mutation within this site may result in the loss or gain of a recognition site, giving rise in that genomic region to restriction fragments of different length. Mutations caused by the insertion, deletion or inversion of DNA stretches will also lead to a length variation of DNA restriction fragments. Genomic restriction fragments of different lengths between genotypes can be detected with region-specific probes on Southern blots (Southern, E. M., J. Mol. Biol. 98, 503, (1975)). The genomic DNA is typically digested with nearly any restriction enzyme of choice. The resulting fragments are electrophoretically size-separated, transferred to a membrane, and then hybridized against a suitably labelled probe for detection of fragments corresponding to a specific region of the genome. RFLP genetic markers are particularly useful in detecting genetic variation in phenotypically silent mutations and serve as highly accurate diagnostic tools. RFLP analysis is a useful tool in the generation of codominant genetic markers but suffers from the need to separate restriction fragments electrophoretically and often requires a great deal of optimization to achieve signal-to-background ratios at which significant polymorphic markers can be detected. In addition, the RFLP method relies on DNA polymorphisms existing within actual restriction sites. Any other point mutations in the genome usually go undetected.
This is a particularly difficult problem when assaying genomes with inherently low levels of DNA polymorphism. Thus, RFLP differences often are difficult to identify.
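The dependence of RFLP on changes within the recognition site itself can be illustrated with a minimal in silico sketch. The two allele sequences and the helper `digest` are hypothetical and serve only as an illustration; the enzyme modelled is EcoRI, which recognizes GAATTC and cuts after the first G. Only one strand is modelled, and no probe hybridization step is simulated.

```python
def digest(seq: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the position inside the recognition site at which the
    enzyme cuts (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical alleles: allele_b differs from allele_a by a single A-to-G
# change inside the first EcoRI site, which destroys that site.
allele_a = "TTACGGAATTCAGGCTAGCTAGGAATTCTTGACA"
allele_b = allele_a.replace("GAATTCAGG", "GAGTTCAGG")

fragments_a = digest(allele_a, "GAATTC", 1)  # two sites, three fragments
fragments_b = digest(allele_b, "GAATTC", 1)  # one site, two fragments
print(fragments_a, fragments_b)
```

A point mutation anywhere outside the two recognition sites would leave both fragment lists identical, which is why such mutations go undetected by RFLP.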
Another method of identifying polymorphic genetic markers employs DNA amplification using short primers of arbitrary sequence. These primers have been termed "random amplified polymorphic DNA", or "RAPD", primers (Williams et al., Nucl. Acids Res., 18, 6531 (1990); U.S. Pat. No. 5,126,239; see also EP 0 543 484 A2, WO 92/07095, WO 92/07948, WO 92/14844, and WO 92/03567). The RAPD method amplifies either double or single stranded nontargeted, arbitrary DNA sequences using standard amplification buffers, dATP, dCTP, dGTP and TTP nucleotides, and a thermostable DNA polymerase such as Taq polymerase. The primers are typically about 9 to 13 bases in length, have a G+C content between 50 and 80%, and contain no palindromic sequences. Differences as small as single nucleotides between genomes can affect the RAPD primer's binding/target site, and a PCR product may be generated from one genome but not from another. RAPD detection of genetic polymorphisms represents an advance over RFLP in that it is less time consuming, more informative, and readily adaptable to automation. The use of the RAPD assay is limited, however, in that only dominant polymorphisms can be detected; this method does not offer the ability to examine simultaneously all the alleles at a locus in a population. Nevertheless, because of its sensitivity for the detection of polymorphisms, RAPD analysis and variations based on RAPD/PCR methods have become the methods of choice for analyzing genetic variation within species or closely related genera, both in the animal and plant kingdoms.
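The primer criteria stated above (about 9 to 13 bases, 50 to 80% G+C, no palindromic sequences) can be expressed as a simple screening function. This is a sketch only: the function names are hypothetical, and the minimum palindrome length of 4 bases is an assumption, since the text does not say how short a palindrome must be before it disqualifies a primer.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_fraction(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)


def has_palindrome(seq: str, min_len: int = 4) -> bool:
    """True if any even-length window of at least min_len bases equals its
    own reverse complement. min_len=4 is an assumed threshold."""
    for size in range(min_len, len(seq) + 1, 2):
        for i in range(len(seq) - size + 1):
            window = seq[i:i + size]
            if window == reverse_complement(window):
                return True
    return False


def is_candidate_rapd_primer(seq: str) -> bool:
    """Apply the length, G+C and palindrome criteria described in the text."""
    seq = seq.upper()
    return (
        9 <= len(seq) <= 13
        and 0.5 <= gc_fraction(seq) <= 0.8
        and not has_palindrome(seq)
    )


print(is_candidate_rapd_primer("AGGTCACTGC"))  # 10 bases, 60% G+C
print(is_candidate_rapd_primer("GGAATTCCGA"))  # contains the palindrome AATT
```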
A third method more recently introduced for identifying and mapping genetic polymorphisms is termed amplified fragment length polymorphism or AFLP (M. Zabeau, EP 534,858). AFLP is similar in concept to RFLP in that restriction enzymes are used to specifically digest the genomic DNA to be analyzed. The primary difference between these two methods is that the amplified restriction fragments produced in AFLP are modified by the addition of specific, known adaptor sequences which serve as the target sites for PCR amplification with adaptor-directed primers. Briefly, restriction fragments are generated from genomic DNA by complete digestion with a single or double restriction enzyme combination, the latter using a "frequent" cutter combined with a "rare" cutter. Optimal results are obtained when one of these enzymes has a tetranucleotide recognition site, and the other enzyme a hexanucleotide site. Such a double enzyme digestion generates a mixture of single- and double-digested genomic DNA fragments. Next, double-stranded adaptors composed of synthetic oligonucleotides of moderate length (10-30 bases) are specifically ligated to the ends of the restriction fragments. The individual adaptors corresponding to the different restriction sites all carry distinct DNA sequences.
One of the adaptors, usually the one corresponding to the hexanucleotide-site restriction enzyme, carries a biotin moiety. The application of biotin-streptavidin capture methodology leads to the selective removal of all nonbiotinylated restriction fragments (those bordered at both ends by the tetranucleotide restriction site), and thus effectively enriches the population for fragments carrying the biotinylated adaptor at one or both ends. As a result, the DNA fragment mixture is also enriched for asymmetric fragments, those carrying a different restriction site at each end. The selected fragments serve as templates for PCR amplification using oligonucleotide primers that correspond to the adaptor/restriction site sequences. These adaptor-directed primers also can include at their 3' ends from 1 to 10 arbitrary nucleotides, which will anneal to and prime from the genomic sequence directly adjacent to the restriction site on the DNA fragment.
A PCR reaction using this type of pooled fragment template and adaptor-directed primers results in the co-amplification of multiple genomic fragments. Any DNA sequence differences between genomes in the region of the restriction sites or in the 1-10 nucleotides directly adjacent to the restriction sites lead to differences, or polymorphisms (dominant and codominant), in the PCR products generated. Multiple fragments are simultaneously co-amplified, and some proportion of these will be polymorphic between genomes.
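The selective effect of the arbitrary 3' nucleotides on adaptor-directed primers can be sketched with a toy model. Here each restriction fragment is represented only by the genomic sequence lying between its two restriction sites, and a fragment is treated as amplifiable when the bases adjacent to each site match the selective nucleotides of the corresponding primer exactly. The function names, the fragment sequences and the two-base selective extensions are all hypothetical; real annealing tolerates some mismatch and is not modelled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def amplifies(fragment: str, sel_fwd: str, sel_rev: str) -> bool:
    """True if both primers find their selective bases next to the site.

    sel_fwd must match the start of the top strand; sel_rev must match the
    start of the bottom strand (the reverse complement of the fragment).
    """
    return (fragment.startswith(sel_fwd)
            and reverse_complement(fragment).startswith(sel_rev))


# Two hypothetical genotypes that differ by one base directly adjacent to
# the left-hand restriction site.
genotype_1 = "ACGGTTCAGGATCCAT"
genotype_2 = "AGGGTTCAGGATCCAT"

print(amplifies(genotype_1, "AC", "AT"))  # selective bases match at both ends
print(amplifies(genotype_2, "AC", "AT"))  # mismatch next to the left site
```

In this model the single-base difference produces a band in one genotype and none in the other, which corresponds to a dominant polymorphism.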
A fourth method of assaying polymorphisms has involved utilizing the high degree of length variation resulting from certain repeating nucleotide sequences found in most genomes. Most if not all eukaryotic genomes are populated with repeating base sequences variously termed simple sequence repeats (SSR), simple sequence length polymorphisms (SSLP), dinucleotide, trinucleotide, tetranucleotide, or pentanucleotide repeats, and microsatellites. Simple sequence repeats have been demonstrated to be useful as genetic markers (DE 38 34 636 A, Jackle et al.; Weber et al., Am J Hum Genet 44, 388, (1989); Litt et al., Am J Hum Genet 44, 397, (1989)). Weber et al. (Genomics 11, 695, (1991)) have successfully used SSRs for comparative analysis and mapping of mammalian genomes, and several groups (Akkaya et al., Genetics 132, 1131 (1992); Morgante & Olivieri, Plant J 3, 175 (1993); Wu & Tanksley, Mol. Gen. Genet. 241, 225, (1993)) have demonstrated similar results with plant genomes. SSR polymorphisms can be detected by PCR using minute amounts of genomic DNA and, unlike RAPDs, they provide codominant markers and can detect a high degree of genetic polymorphism (Weber, J. L. (1990) In Tilghman, S., Davies, K., (eds) "Genome Analysis vol. 1: Genetic and physical mapping", Cold Spring Harbor Laboratory Press, pp 159-181).
Although SSR-directed PCR primers are highly effective for detecting polymorphism, their use suffers from a variety of practical drawbacks. Typically, markers generated by these methods are obtained by first constructing a genomic library, screening the library with probes representing the core elements of a particular repeat sequence, purifying and sequencing the positive clones, and synthesizing the primers specific for the flanking sequences for each cloned SSR locus. Genomic DNA is then amplified to screen for polymorphisms, and mapping of the genome is then carried out. The entire process is time consuming, expensive and technically demanding, and as a result has been somewhat limited in its application.
At least one method has been developed as an attempt to circumvent these limitations and allow the use of polymorphic SSR markers more directly, with no a priori knowledge of particular SSR locus sequences. Zietkiewicz et al. (Genomics, 20, 176, (1994)), for example, demonstrate that a single-primer PCR amplification can be used to detect length polymorphisms between adjacent (CA).sub.n repeats in animal and plant genomes. The PCR primers used for this assay each contain a particular SSR sequence that is flanked immediately 5' or 3' by 2 to 4 nucleotides of known or arbitrary sequence; these anchor sequences anneal to the non-SSR genomic sequences that flank the SSR sequence in the genome and serve to "anchor" the primer to a single position at each matching SSR locus. Radiolabeled SSR-to-SSR amplification products, generated when adjacent SSR sequences are oppositely oriented and spaced closely enough in the genome, are analysed by gel electrophoresis followed by autoradiography. This approach eliminates the need for cloning and sequencing SSRs from the genome, and reveals an enriched polymorphic banding pattern relative to single-locus SSR. Zietkiewicz et al. attribute the enriched pattern to use of the arbitrary sequence anchor, which allows the SSR primer to anneal and prime from many SSR target loci simultaneously.
In a concept similar to that of Zietkiewicz, Wu et al. (Nucleic Acids Res. 22, 3257, (1994)) teach a method for the detection of polymorphisms where genomic DNA is amplified by asymmetric thermally cycled PCR using radiolabeled 5' anchored primers consisting of microsatellite repeats in the presence of RAPD primers of arbitrary sequence. The method of Wu et al. is useful for the generation of genetic markers that incorporate many features of microsatellite repeats. Wu et al. do not disclose the use of compound microsatellite repeat primers for amplification.
Simple sequence repeats may be classified in several ways. In categorizing and characterizing human (CA).sub.n or (GT).sub.n repeats, Weber, J. L. (Genome Analysis Vol. 1, 159, 1990, Cold Spring Harbor Laboratory Press, NY) defines at least three types of (CA).sub.n SSR: simple perfect SSR, simple imperfect SSR, and compound SSR. Each perfect SSR is considered to be a simple (CA).sub.n tandem sequence, with no interruptions within the repeat. Imperfect SSR are defined as those repeating sequences with one or more interruptions of up to 3 nonrepeat bases within the run of the repeat. Compound SSR are defined as those sequences with a CA or GT repeat stretch adjacent to or within 3 nucleotides of a block of short tandem repeats of a different sequence. Weber notes that perfect sequence repeats in humans comprise about 65% of the total (dC-dA).sub.n (dG-dT).sub.n sequences cloned from the genome, imperfect repeats about 25%, and compound repeats about 10%. Weber theorizes that because perfect repeats contain the longest uninterrupted repeat blocks, they appear to provide the most useful information. Weber also teaches that repeats composed of 12 or more uninterrupted units are consistently more polymorphic than are shorter repeat stretches. Because imperfect repeats generally contain shorter repeat stretches, they appear to be less useful as indicators of polymorphism. Compound repeats in general have not been well characterized, and their potential informativeness has not been clearly established.
Others have used the polymorphisms detectable within perfect, imperfect and compound SSR loci to build genetic linkage maps. Buchanan et al. (Mammalian Genome, 4, 258, (1993)) teach that there is little difference in the utility of the different SSR types in the ovine genome with respect to their absolute polymorphism levels; the perfect, imperfect and compound repeats, although likely present in the genome at differing frequencies (perfect and imperfect simple SSRs are more frequent than compound), were found to have similar average Polymorphism Information Content (PIC) values as defined by Botstein et al. (Am. J. Hum. Genet., 32, 314, (1980)). In a study of (GT).sub.n SSR in the Atlantic salmon genome, Slettan et al. (Animal Genetics, 24, 195, (1993)) found both perfect and imperfect simple SSR but no compound repeats. In an examination of the equine genome, Ellegren et al. (Animal Genetics, 23, 133, (1993)) identified the highest levels of polymorphism involving (TG).sub.n and (TC).sub.n repeats among horse genotypes using primers designed to amplify perfect or imperfect simple repeats; although two of eight cloned (GT).sub.n repeats were identified to be compound in structure (one perfect, one imperfect), neither was characterized further. Condit & Hubbell (Genome 34, 66 (1991)), in characterizing large-insert clones carrying (AC).sub.n and (AG).sub.n repeats from tropical trees and maize, found that 10-20% of inserts carrying one type of repeat also carried the other, and that many (AC).sub.n sites also had other two-base repeats adjacent or nearby. Finally, Browne et al. (Nucl. Acids Res., 20, 141, (1991)), in an attempt to characterize (CA).sub.n SSR sequences in the human genome by DNA sequencing with degenerate (CA).sub.n primers, disclose that 88% of their (CA).sub.n repeats carried AT base pairs at one or both ends of the CA repeat.
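The Polymorphism Information Content referred to above is commonly computed from the allele frequencies p.sub.i at a locus as one minus the sum of the squared frequencies, minus the sum over all allele pairs of 2 p.sub.i.sup.2 p.sub.j.sup.2. A minimal sketch of that calculation follows; the function name and the example frequencies are illustrative and are not taken from any of the studies cited.

```python
def pic(freqs: list[float]) -> float:
    """Polymorphism Information Content for one locus.

    freqs holds the allele frequencies, which are expected to sum to 1.
    PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    n = len(freqs)
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 1.0 - homozygosity - cross_term


# Illustrative loci: two equally frequent alleles versus four.
print(pic([0.5, 0.5]))
print(pic([0.25, 0.25, 0.25, 0.25]))
```

A locus with more alleles at more even frequencies gives a higher value, which is why highly variable SSR loci tend to score well on this measure.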
To date, the record in the literature would indicate that although it varies with each type of genome, the incidence of compound SSRs in a genome is lower than that of either perfect or imperfect simple SSR sequences. Nevertheless, the information content (PIC value) of compound SSR sequences has been shown to be generally high. In addition, the literature would indicate that the detection of genetic polymorphisms by way of specifically isolating compound SSR loci generally would have marginal success; the use of probes or primers designed to recognize and thus specifically isolate individual compound SSR loci would be less efficient for generating large numbers of new SSR markers as compared to the isolation of the more numerous simple SSR sequences. Applicants have, however, unexpectedly discovered that compound SSRs, particularly those containing (AT).sub.n repeats, are highly polymorphic in eukaryotic genomes, and that oligonucleotides designed to anneal specifically to a specific type of SSR, termed herein a perfect in-phase compound SSR, are particularly useful in the generation and detection of polymorphisms between eukaryotic genomes. A "perfect" compound SSR is one in which two different repeating sequences, each of which could be composed of di-, tri-, tetra-, or penta-nucleotide units, are located very near each other, with no more than 3 intervening bases between the two repeat blocks. One category of perfect compound repeat is one in which the two constituent repeats are immediately adjacent to one another, with no intervening bases. Further, a perfect compound SSR can be classified to be "in-phase" if both of the component simple repeats share a common nucleotide whose spacing is conserved across the repeat junction and over the length of the two repeat blocks. For example, the in-phase perfect compound SSR, (AT).sub.n (AG).sub.n, maintains the adenosine base "in-phase" across both components of this perfect compound structure.
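The "in-phase" definition can be checked mechanically for the simplest case, in which the two repeat blocks are immediately adjacent and built from units of equal length. The sketch below makes exactly those assumptions, which are narrower than the definition in the text; the function name and the default repeat counts are hypothetical.

```python
def is_in_phase(unit1: str, unit2: str, n1: int = 3, n2: int = 3) -> bool:
    """Check whether (unit1)n1(unit2)n2, with no intervening bases, keeps
    one nucleotide at a fixed spacing across the junction.

    Assumption: both repeat units have the same length. Units of unequal
    length are not handled by this sketch.
    """
    if len(unit1) != len(unit2):
        raise ValueError("this sketch only handles units of equal length")
    unit_len = len(unit1)
    compound = unit1 * n1 + unit2 * n2
    # In phase if, at some offset within the unit, every repeat of both
    # blocks carries the same base.
    return any(
        len(set(compound[offset::unit_len])) == 1
        for offset in range(unit_len)
    )


print(is_in_phase("AT", "AG"))  # A is conserved at the first position
print(is_in_phase("CA", "TA"))  # A is conserved at the second position
print(is_in_phase("AT", "GA"))  # A changes position at the junction
```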
Applicants have discovered that in-phase perfect compound SSR sequences such as (dC-dA).sub.n (dT-dA).sub.n are abundant in both animal and plant genomes. Although the frequency of occurrence of each type of perfect compound SSR sequence varies within, as well as between, species, those sequences that are in-phase are of sufficiently high frequency in all eukaryotic genomes examined, and appear to be both well dispersed and highly polymorphic. Based upon their observation that the junction spanned by such directly adjacent, in-phase perfect repeats is absolutely predictable, Applicants have developed methodology which utilizes synthetic oligonucleotides containing in-phase compound sequences as self-anchoring primers in new variations of polymerase chain reaction-based multiplexed genome assays, including inter-repeat amplification and amplified fragment length polymorphism assays. Applicants have found that the 5' end of the compound SSR primer serves as an extremely efficient anchor base for primer extension that occurs from the 3'-end repeat. This primer extension initiates from inside the compound SSR target sequence, such that any length variation between different alleles at a target SSR locus is detectable as a corresponding length variation in the resulting amplification products. Because such use of perfect compound SSRs as amplification primers generates multiple products wherein a high proportion are polymorphic (as high as 80%), Applicants believe that the method of their invention greatly facilitates the simultaneous identification of multiple genomic polymorphisms, both codominant and dominant. Thus, the present method offers great advantage in identifying polymorphic markers linked to genetic traits of interest, and also offers an efficient and convenient generic technique for genome fingerprinting and whole-genome comparisons.