To study polymorphisms in genomes, reliable allele determination of genetic markers is required for accurate genotyping. A genetic marker corresponds to a relatively unique location on a genome, with normal mammalian individuals having two (possibly identical) alleles 104 for a marker on an autosomal chromosome 102, referring to FIG. 1A. (Though there are other cases of 0, 1, or many alleles that this invention addresses, this characterization suffices for the background introduction.) One important class of markers is the CA-repeat loci. This class is abundantly represented throughout the genomes of many species, including humans.
A CA-repeat marker allele is comprised of a nucleic acid word 106
PQRST,
where P is the left PCR primer, T defines the right PCR primer, Q and S are relatively fixed sequences, and the primary variation occurs in the sequence R, which is a tandemly repeated sequence 108 of the dinucleotide CA, i.e.,
$R = (\mathrm{CA})_n$,
where n is an integer that generally ranges between ten and fifty. Thus, the length of the allele sequence uniquely determines the content of the sequence, since the only polymorphism is in the length of R.
One can therefore obtain genomic DNA, perform PCR amplification of a CA-repeat genetic marker location, and then assay the length of the allele sequences by differential sizing, typically done by differential migration of DNA molecules using gel electrophoresis. The resulting gel 110 should, in principle, clearly show the alleles of the marker for each individual's genome. Further, these sizes can be quantitated by using reference markers 112.
However, the PCR amplification of a CA-repeat location produces an artifact, often termed "PCR stutter". Most likely due to slippage of the polymerase molecule on the nucleic acid polymer in the highly repetitive CA-repeat region, the result is that PCR products are produced that correspond to deletions of tandem CA dinucleotide units in the repeat region. Thus, instead of a single band on a gel corresponding to the one molecule
$PQ\,(\mathrm{CA})_n\,ST$,
an entire population of different size bands
$\{\, PQ\,(\mathrm{CA})_n\,ST,\; PQ\,(\mathrm{CA})_{n-1}\,ST,\; PQ\,(\mathrm{CA})_{n-2}\,ST,\; \ldots \,\}$
in varying concentrations is observed. This PCR stuttering 114 can be viewed as a spatial pattern p(x), or, alternatively, as a response function r(t) of an impulse signal corresponding to the assayed allele.
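The population of stutter products described above can be sketched in a few lines of Python. This is only an illustration: the flank sequences standing in for PQ and ST, the repeat count, and the `max_deletions` cutoff are hypothetical values, not taken from any real marker.

```python
def stutter_products(left_flank, right_flank, n, max_deletions=3):
    """Enumerate the full-length allele PQ(CA)_n ST and its stutter products.

    left_flank stands in for PQ and right_flank for ST (both hypothetical).
    Returns a list of (repeat_count, length_in_bp, sequence) tuples, starting
    with the true allele and followed by products missing 1, 2, ... CA units.
    """
    products = []
    for deleted in range(max_deletions + 1):
        repeats = n - deleted
        if repeats < 0:
            break  # cannot delete more repeat units than the allele contains
        sequence = left_flank + "CA" * repeats + right_flank
        products.append((repeats, len(sequence), sequence))
    return products


# Hypothetical 5 bp flanks and a (CA)_12 repeat.
bands = stutter_products("GATTC", "TTGAC", n=12)
for repeats, length, _ in bands:
    print(repeats, length)  # lengths 34, 32, 30, 28: each band is 2 bp shorter
```

Because the only variation is the number of CA units, each stutter band sits exactly 2 bp below the previous one, which is why length alone identifies each product.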
The stutter artifact can be extremely problematic when the two alleles of an autosomal CA-repeat marker are close in size. Then, their two stutter patterns overlap, producing a complex signal 116. In the presence of background measurement noise, this complexity often precludes unambiguous determination of the two alleles. To date, this has prevented reliable automated (or even manual) genotyping of CA-repeat markers from differential sizing assays.
This overlap of stutter patterns can be modeled as a superposition of two corrupted signals. Importantly, (1) the corrupting response function is roughly identical for two closely sized alleles of the same CA-repeat marker, and (2) this response function is largely determined by the specific CA-repeat marker, the PCR conditions, and possibly the relative size of the allele. Thus, the response functions can be assayed separately from the genotyping experiment. By combining 118 the corrupted signal together with the determined response functions of the CA-repeat marker, the true uncorrupted allele sizes can be determined, and reliable genotyping can be performed.
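The superposition model in this paragraph can be illustrated with a minimal Python sketch. The stutter amounts in `pattern` are invented for illustration, sizes are measured in repeat units, and the same response function is assumed for both alleles, as point (1) above suggests.

```python
def superpose(allele_sizes, pattern, length):
    """Sum one shifted copy of the stutter pattern per allele.

    pattern[k] is the relative amount of product with k repeat units deleted
    (hypothetical values). The returned list is indexed by product size in
    repeat units.
    """
    signal = [0.0] * length
    for size in allele_sizes:
        for deleted, amount in enumerate(pattern):
            position = size - deleted
            if 0 <= position < length:
                signal[position] += amount
    return signal


pattern = [1.0, 0.5, 0.25]          # assumed response function
signal = superpose([10, 11], pattern, 13)
print(signal[8:12])                 # [0.25, 0.75, 1.5, 1.0]
```

With alleles of 10 and 11 repeats, the tallest band falls at size 10 rather than at the larger allele, so simply picking the strongest peaks would misread the genotype. This is the ambiguity that knowledge of the response function is meant to resolve.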
A primary goal of the NIH/DOE Human Genome Project during its initial 5 year phase of operation was to develop a genetic map of humans with markers spaced 2 to 5 cM apart (E. P. Hoffman, "The Human Genome Project: Current and future impact," Am. J. Hum. Genet., vol. 54, pp. 129-136, 1994), incorporated by reference. This task has already been largely accomplished in half the time anticipated, with markers that are far more informative than originally hoped for. In these new genetic maps, restriction fragment length polymorphism (RFLP) loci have been entirely replaced by CA repeat loci (dinucleotide repeats, also termed "microsatellites") (J. Weber and P. May, "Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction," Am J Hum Genet, vol. 44, pp. 388-396, 1989; J. Weber, "Length Polymorphisms in dC-dA . . . dG-dT Sequences," Marshfield Clinic, Marshfield, Wis., assignee code 354770, U.S. Pat. No. 5075217, 1991), incorporated by reference, and other short tandem repeat markers (STRs). It is expected that at least 30,000 CA-repeat markers will be made available in public databases in the form of PCR primer sequences and reaction conditions. One of the advantages of CA repeat loci is their high density in the genome, with about 1 informative CA repeat every 50,000 bp: this permits a theoretical density of approximately 20 loci per centimorgan. Another advantage of CA repeat polymorphisms is their informativeness, with most loci in common use having PIC values of over 0.70 (J. Weissenbach, G. Gyapay, C. Dib, A. Vignal, J. Morissette, P. Millasseau, G. Vaysseix, and M. Lathrop, "A second generation linkage map of the human genome," Nature, vol. 359, pp. 794-801, 1992), incorporated by reference. Finally, these markers are PCR-based, permitting rapid genotyping using minute quantities of input genomic DNA. 
Taken together, these advantages have facilitated linkage studies by orders of magnitude: a single full-time scientist can cover the entire genome at a 10 cM resolution and map a disease gene in an autosomal dominant disease family in about 1 year (D. A. Stephan, N. R. M. Buist, A. B. Chittenden, K. Ricker, J. Zhou, and E. P. Hoffman, "A rippling muscle disease gene is localized to 1q41: evidence for multiple genes," Neurology, vol. (in press), 1994), incorporated by reference.
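For readers unfamiliar with the PIC values cited above, the following Python sketch computes polymorphism information content from allele frequencies. It uses the commonly cited definition, PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2; this formula is an assumption here, since the text does not state which definition was used, and the frequencies are illustrative.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphism information content for allele frequencies summing to 1."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


two_equal = pic([0.5, 0.5])     # a two-allele marker tops out at 0.375
four_equal = pic([0.25] * 4)    # four equally frequent alleles give about 0.703
print(two_equal, four_equal)
```

Under this definition, a marker needs roughly four or more well-balanced alleles to exceed a PIC of 0.70, which multi-allelic CA-repeat loci can reach and two-allele markers cannot.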
The CA repeat-based genetic maps are not without disadvantages. First, alleles are detected by size differences in PCR products, which often differ by as little as 2 bp in a 300 bp PCR product. Thus, these alleles must be distinguished using high resolution sequencing gels, which are more labor intensive and technically demanding to use than most other electrophoresis systems. Second, referring to FIG. 2, CA repeat loci often show secondary "stutter" or "shadow" bands in addition to the band corresponding to the primary allele, thereby complicating allele interpretation. These stutter bands may be due to errors in Taq polymerase replication during PCR, secondary structure in PCR products, or somatic mosaicism for allele size in a patient. Allele interpretation is further complicated by the differential mobility of the two complementary DNA strands of the PCR products when both are labelled. Finally, sequencing gels often show inconsistencies in mobility of DNA fragments, making it difficult to compare alleles of individuals between gels and often within a single gel. The most common experimental approach used for typing CA repeat alleles involves incorporation of radioactive nucleotide precursors into both strands of the PCR product. The combined consequence of stutter peaks and visualization of both strands of alleles differing by 2 bp often leads to considerable "noise" on the resulting autoradiograph "signals", referring to FIG. 2, which then requires careful subjective interpretation by an experienced scientist in order to determine the true underlying two alleles.
The stuttered signals of di-, tri-, tetra-, and other multi-nucleotide repeats can be modeled as the convolution of the true allele sizes with a stutter pattern p(x). Under this model, the complex quantitative banding signal q(x) observed on a gel can be understood as the summation of shifted patterns p(x), with one shifted pattern for each allele size. A key fact is that generally only one p(x) function is associated with a given genetic marker, its PCR primers and conditions, and the allele size. In the important case of two alleles, where the two allele sizes are denoted by s and t, one can write the expression
$q(x) = (x^s + x^t)\,p(x)$.
The multiplication of the polynomial expressions $(x^s + x^t)$ and p(x) is one implementation of the underlying (shift and add) convolution process. Given the observed data q(x) and the known stutter pattern p(x), one can therefore determine the unknown allele sizes s and t via a deconvolution procedure. (Note that this convolution/deconvolution model extends to analyses with more than two alleles.)
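The two-allele model q(x) = (x^s + x^t) p(x) can be made concrete with a short Python sketch. It is not the deconvolution procedure of the invention. It is a minimal exhaustive least-squares search under stated assumptions: polynomials are stored as exponent-to-coefficient dictionaries, exponents count repeat units, stutter toward shorter products is encoded with negative exponents in p(x), and the coefficients of `p` are invented.

```python
from itertools import combinations_with_replacement


def poly_mul(a, b):
    """Multiply two polynomials stored as {exponent: coefficient} dicts."""
    out = {}
    for i, a_i in a.items():
        for j, b_j in b.items():
            out[i + j] = out.get(i + j, 0.0) + a_i * b_j
    return out


def squared_error(a, b):
    """Sum of squared coefficient differences over all exponents."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def deconvolve_two_alleles(q, p, candidate_sizes):
    """Return the pair (s, t), s <= t, whose model (x^s + x^t) p(x) best fits q."""
    best = None
    for s, t in combinations_with_replacement(candidate_sizes, 2):
        alleles = {}
        for size in (s, t):  # a homozygote (s == t) yields the term 2 x^s
            alleles[size] = alleles.get(size, 0.0) + 1.0
        err = squared_error(q, poly_mul(alleles, p))
        if best is None or err < best[0]:
            best = (err, s, t)
    return best[1], best[2]


p = {0: 1.0, -1: 0.5, -2: 0.25}         # assumed stutter pattern
q = poly_mul({10: 1.0, 11: 1.0}, p)     # simulated observation for s=10, t=11
print(deconvolve_two_alleles(q, p, range(8, 14)))  # (10, 11)
```

The search recovers both alleles even though the strongest band in `q` lies at size 10. Exhaustive search was chosen only for clarity; for noisy data or more than two alleles, a different fitting method would likely be needed.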
A corollary of highly dense and informative genetic maps is the need to accurately acquire, analyze and store large volumes of data on each individual or family studied. For example, a genome-wide linkage analysis on a 30 member pedigree at 10 cM resolution would generate data for approximately 30,000 alleles, with many markers showing five or more alleles. Currently, alleles are visually interpreted and then manually entered into spreadsheets for analysis and storage. This approach requires a large amount of time and effort, and introduces a high likelihood of human error. Moreover, future studies of complex multifactorial disease loci will require large-scale genotyping on hundreds or thousands of individuals. Finally, manual genotyping is arduous, boring, time consuming, and highly error prone. Each of these features suggests that automation of genotype data generation, acquisition, interpretation, and storage is required to fully utilize the developing genetic maps. Some effort has been made to assist in allele identification and data storage (ABI Genotyper manual and software, Applied Biosystems Inc.), incorporated by reference. However, this software still requires substantial user interaction to place manually assigned alleles into a spreadsheet, and is unable to deconvolve (hence cannot accurately genotype) closely spaced alleles or perform other needed analyses. Importantly, no essential use is made of a CA-repeat marker's PCR stutter response pattern by the ABI software or by any other disclosed method or system for genotyping.
The Duchenne/Becker muscular dystrophy (DMD/BMD) gene locus (dystrophin gene) (A. P. Monaco, R. L. Neve, C. Colletti-Feener, C. J. Bertelson, D. M. Kurnit, and L. M. Kunkel, "Isolation of candidate cDNAs for portions of the Duchenne muscular dystrophy gene," Nature, vol. 323, pp. 646-650, 1986; M. Koenig, E. P. Hoffman, C. J. Bertelson, A. P. Monaco, C. Feener, and L. M. Kunkel, "Complete cloning of the Duchenne muscular dystrophy cDNA and preliminary genomic organization of the DMD gene in normal and affected individuals," Cell, vol. 50, pp. 509-517, 1987), incorporated by reference, is a useful experimental system for illustrating the automation of genetic analysis. The dystrophin gene can be considered a mini-genome: it is by far the largest gene known to date (2.5 million base pairs); it has a high intragenic recombination rate (10 cM, i.e., 10% recombination between the 5' and 3' ends of the gene); and it has a considerable spontaneous mutation rate ($10^{-4}$ per meiosis). Mutation of the dystrophin gene results in one of the most common human lethal genetic diseases, and the lack of therapies for DMD demands that molecular diagnostics be optimized. The gene is very well characterized, with both precise genetic maps (C. Oudet, R. Heilig, and J. Mandel, "An informative polymorphism detectable by polymerase chain reaction at the 3' end of the dystrophin gene," Hum Genet, vol. 84, pp. 283-285, 1990), incorporated by reference, and physical maps (M. Burmeister, A. Monaco, E. Gillard, G. van Ommen, N. Affara, M. Ferguson-Smith, L. Kunkel, and H. Lehrach, "A 10-megabase physical map of human Xp21, including the Duchenne muscular dystrophy gene," Genomics, vol. 2, pp. 189-202, 1988), incorporated by reference. Finally, approximately one dozen CA repeat loci distributed throughout the dystrophin gene have been isolated and characterized (A. Beggs and L. Kunkel, "A polymorphic CACA repeat in the 3' untranslated region of dystrophin," Nucleic Acids Res, vol. 18, pp. 1931, 1990; C. Oudet, R. Heilig, and J. Mandel, "An informative polymorphism detectable by polymerase chain reaction at the 3' end of the dystrophin gene," Hum Genet, vol. 84, pp. 283-285, 1990; P. Clemens, R. Fenwick, J. Chamberlain, R. Gibbs, M. de Andrade, R. Chakraborty, and C. Caskey, "Linkage analysis for Duchenne and Becker muscular dystrophies using dinucleotide repeat polymorphisms," Am J Hum Genet, vol. 49, pp. 951-960, 1991; C. Feener, F. Boyce, and L. Kunkel, "Rapid detection of CA polymorphisms in cloned DNA: application to the 5' region of the dystrophin gene," Am J Hum Genet, vol. 48, pp. 621-627, 1991), incorporated by reference.
Many of the problems with interpretation of dystrophin gene CA repeat allele data can be overcome by single or multiplex fluorescent PCR and data acquisition on automated sequencers (L. S. Schwartz, J. Tarleton, B. Popovich, W. K. Seltzer, and E. P. Hoffman, "Fluorescent Multiplex Linkage Analysis and Carrier Detection for Duchenne/Becker Muscular Dystrophy," Am. J. Hum. Genet., vol. 51, pp. 721-729, 1992), incorporated by reference. This approach uses fluorescently labeled PCR primers to simultaneously amplify four CA repeat loci in a single reaction. By visualizing only a single strand of the PCR product, and by reducing the cycle number, much of the noise associated with these CA repeat loci was eliminated. Moreover, the production of fluorescent multiplex reaction kits provides a standard source of reagents, which have not deteriorated 3 years after the fluorescent labeling reactions were performed. In this previous report, referring to FIG. 2, alleles were manually interpreted from the automated sequencer traces.
This invention pertains to automating data acquisition and interpretation for any STR genetic marker. In the preferred embodiment, the invention: identifies each of the marker alleles at an STR locus in an organism; deconvolves complex "stuttered" alleles which differ by as few as two bp (i.e., at the limits of signal/noise); and makes this genotyping information available for further genetic analysis. For example, to establish DMD diagnosis by linkage analysis in pedigrees, the application system: identifies each of the dystrophin gene alleles in pedigree members; deconvolves complex "stuttered" alleles which differ by only two bp, where signal/noise is a particular problem; reconstructs the pedigrees from lane assignment information; sets phase in females; propagates haplotypes through the pedigree; identifies female carriers and affected males in the pedigree based on computer derivation of an at-risk haplotype; and detects and localizes recombination events within the pedigree. Other uses of automatically acquired STR genetic marker data are the construction of genetic maps (T. C. Matise, M. W. Perlin, and A. Chakravarti, "Automated construction of genetic linkage maps using an expert system (MultiMap): application to 1268 human microsatellite markers," Nature Genetics, vol. 6, no. 4, pp. 384-390, 1994), incorporated by reference, the localization of genetic traits onto chromosomes (J. Ott, Analysis of Human Genetic Linkage, Revised Edition. Baltimore, Md.: The Johns Hopkins University Press, 1991), incorporated by reference, and the positional cloning of genes derived from such localizations (B.-S. Kerem, J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox, A. Chakravarti, M. Buchwald, and L.-C. Tsui, "Identification of the cystic fibrosis gene: genetic analysis," Science, vol. 245, pp. 1073-1080, 1989; J. R. Riordan, J. M. Rommens, B.-S. Kerem, N. Alon, R. Rozmahel, Z. Grzelczak, J. Zielenski, S. Lok, N. Plavsic, J.-L. Chou, M. L. Drumm, M. C. Iannuzzi, F. S. Collins, and L.-C. Tsui, "Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA," Science, vol. 245, pp. 1066-1073, 1989), incorporated by reference.