The present invention pertains to a process for analyzing a DNA molecule. More specifically, the present invention is related to performing experiments that produce quantitative data, and then analyzing these data to characterize a DNA fragment. The invention also pertains to systems related to this DNA fragment information.
With the advent of high-throughput DNA fragment analysis by electrophoretic separation, many useful genetic assays have been developed. These assays have application to genotyping, linkage analysis, genetic association, cancer progression, gene expression, pharmaceutical development, agricultural improvement, human identity, and forensic science.
However, these assays inherently produce data that have significant error with respect to the size and concentration of the characterized DNA fragments. Much calibration is currently done to help overcome these errors, including the use of in-lane molecular weight size standards. In spite of these improvements, the variability of these properties (between different instruments, runs, or lanes) can exceed the desired tolerance of the assays.
Recently, advances have been made in the automated scoring of genetic data. Many naturally occurring artifacts in the amplification and separation of nucleic acids can be eliminated through calibration and mathematical processing of the data on a computing device (M W Perlin, M B Burks, R C Hoop, and E P Hoffman, "Toward fully automated genotyping: allele assignment, pedigree construction, phase determination, and recombination detection in Duchenne muscular dystrophy," Am. J. Hum. Genet., vol. 55, no. 4, pp. 777-787, 1994; M W Perlin, G Lancia, and S-K Ng, "Toward fully automated genotyping: genotyping microsatellite markers by deconvolution," Am. J. Hum. Genet., vol. 57, no. 5, pp. 1199-1210, 1995; S-K Ng, "Automating computational molecular genetics: solving the microsatellite genotyping problem," Carnegie Mellon University, Doctoral dissertation CMU-CS-98-105, Jan. 23, 1998), incorporated by reference.
This invention pertains to the novel use of calibrating data and mathematical analyses to computationally eliminate undesirable data artifacts in a nonobvious way. Specifically, the use of allelic ladders and coordinate transformations can help an automated data analysis system better reduce measurement variability to within a desired assay tolerance. This improved reproducibility is useful in that it results in greater accuracy and more complete automation of the genetic assays, often taking less time at a lower cost with fewer people.
Genotyping Technology
Genotyping is the process of determining the alleles at an individual's genetic locus. Such loci can be any inherited DNA sequence in the genome, including protein-encoding genes and polymorphic markers. These markers include short tandem repeat (STR) sequences, single-nucleotide polymorphism (SNP) sequences, restriction fragment length polymorphism (RFLP) sequences, and other DNA sequences that express genetic variation (G Gyapay, J Morissette, A Vignal, C Dib, C Fizames, P Millasseau, S Marc, G Bernardi, M Lathrop, and J Weissenbach, "The 1993-94 Genethon Human Genetic Linkage Map," Nature Genetics, vol. 7, no. 2, pp. 246-339, 1994; P W Reed, J L Davies, J B Copeman, S T Bennett, S M Palmer, L E Pritchard, S C L Gough, Y Kawaguchi, H J Cordell, K M Balfour, S C Jenkins, E E Powell, A Vignal, and J A Todd, "Chromosome-specific microsatellite sets for fluorescence-based, semi-automated genome mapping," Nature Genet., vol. 7, no. 3, pp. 390-395, 1994; L Kruglyak, "The use of a genetic map of biallelic markers in linkage studies," Nature Genet., vol. 17, no. 1, pp. 21-24, 1997; D Wang, J Fan, C Siao, A Berno, P Young, R Sapolsky, G Ghandour, N Perkins, E Winchester, J Spencer, L Kruglyak, L Stein, L Hsie, T Topaloglou, E Hubbell, E Robinson, M Mittmann, M Morris, N Shen, D Kilburn, J Rioux, C Nusbaum, S Rozen, T Hudson, and E Lander, "Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome," Science, vol. 280, no. 5366, pp. 1077-82, 1998; P Vos, R Hogers, M Bleeker, M Reijans, T van de Lee, M Hornes, A Frijters, J Pot, J Peleman, M Kuiper, and M Zabeau, "AFLP: a new technique for DNA fingerprinting," Nucleic Acids Res, vol. 23, no. 21, pp. 4407-14, 1995; J Sambrook, E F Fritsch, and T Maniatis, Molecular Cloning, Second Edition. Plainview, N.Y.: Cold Spring Harbor Press, 1989), incorporated by reference.
The polymorphism assay is typically done by characterizing the length and quantity of DNA from an individual at a marker. For example, STRs are assayed by polymerase chain reaction (PCR) amplification of an individual's STR locus using a labeled PCR primer, followed by size separation of the amplified PCR fragments. Detection of the fragment labels, together with in-lane size standards, generates a signal that permits characterization of the size and quantity of the DNA fragments. From this characterization, the alleles of the STR locus in the individual's genome can be determined (J Weber and P May, "Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction," Am. J. Hum. Genet., vol. 44, pp. 388-396, 1989; J S Ziegle, Y Su, K P Corcoran, L Nie, P E Mayrand, L B Hoff, L J McBride, M N Kronick, and S R Diehl, "Application of automated DNA sizing technology for genotyping microsatellite loci," Genomics, vol. 14, pp. 1026-1031, 1992), incorporated by reference.
The labels can use radioactivity, fluorescence, infrared, or other nonradioactive labeling methods (F M Ausubel, R Brent, R E Kingston, D D Moore, J G Seidman, J A Smith, and K Struhl, ed., Current Protocols in Molecular Biology. New York, N.Y.: John Wiley and Sons, 1995; N J Dracopoli, J L Haines, B R Korf, C C Morton, C E Seidman, J G Seidman, D T Moir, and D Smith, ed., Current Protocols in Human Genetics. New York: John Wiley and Sons, 1995; L J Kricka, ed., Nonisotopic Probing, Blotting, and Sequencing, Second Edition. San Diego, Calif.: Academic Press, 1995), incorporated by reference.
Size separation of fragment molecules is typically done using gel or capillary electrophoresis (CE); newer methods include mass spectrometry and microchannel arrays (R A Mathies and X C Huang, "Capillary array electrophoresis: an approach to high-speed, high-throughput DNA sequencing," Nature, vol. 359, pp. 167-169, 1992; K J Wu, A Stedding, and C H Becker, "Matrix-assisted laser desorption time-of-flight mass spectrometry of oligonucleotides using 3-hydroxypicolinic acid as an ultraviolet-sensitive matrix," Rapid Commun. Mass Spectrom., vol. 7, pp. 142-146, 1993), incorporated by reference.
The label detection method is contingent on both the labels used and the size separation mechanism. For example, with automated DNA sequencers such as the PE Biosystems ABI/377 gel, ABI/310 single capillary, or ABI/3700 capillary array instruments, the detection is done by laser scanning of the fluorescently labeled fragments, imaging on a CCD camera, and electronic acquisition of the signals from the CCD camera. Flatbed laser scanners, such as the Molecular Dynamics Fluorimager or the Hitachi FMBIO/II, acquire fluorescent signals similarly. Li-Cor's infrared automated sequencer uses a detection technology modified for the infrared range. Radioactivity can be detected using film or phosphor screens. In mass spectrometry, the atomic mass can be used as a sensitive label. See (A. J. Kostichka, Bio/Technology, vol. 10, pp. 78, 1992), incorporated by reference.
Size characterization is done by comparing the sample fragment's signal with the signals of the size standards. By separate calibration of the size standards used, the relative molecular size can be inferred. This size is usually only an approximation to the true size in base pair units, since the size standards and the sample fragments generally have different chemistries and electrophoretic migration patterns (S-K Ng, "Automating computational molecular genetics: solving the microsatellite genotyping problem," Carnegie Mellon University, Doctoral dissertation CMU-CS-98-105, Jan. 23, 1998), incorporated by reference.
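As an illustration of sizing against in-lane standards, the following minimal sketch interpolates a fragment size between the two standard peaks that bracket it. Piecewise-linear interpolation is only one of several possible calibration curves and is assumed here for simplicity; the function name, argument names, and example numbers are hypothetical and are not taken from any instrument software.

```python
from bisect import bisect_right


def size_from_standards(migration, standard_migrations, standard_sizes):
    """Estimate a fragment size (bp) from its migration position.

    standard_migrations: ascending migration positions (e.g. scan numbers)
        of the in-lane size standard peaks (hypothetical input format).
    standard_sizes: known sizes (bp) of those same standard peaks.
    """
    if len(standard_migrations) != len(standard_sizes) or len(standard_sizes) < 2:
        raise ValueError("need at least two matched standard peaks")
    if not standard_migrations[0] <= migration <= standard_migrations[-1]:
        raise ValueError("migration lies outside the calibrated range")
    # Index of the standard peak at or just below the observed migration,
    # capped so that a bracketing pair (i, i + 1) always exists.
    i = min(bisect_right(standard_migrations, migration) - 1,
            len(standard_migrations) - 2)
    t0, t1 = standard_migrations[i], standard_migrations[i + 1]
    s0, s1 = standard_sizes[i], standard_sizes[i + 1]
    # Linear interpolation between the two bracketing standards.
    return s0 + (s1 - s0) * (migration - t0) / (t1 - t0)


# A peak halfway between the 100 bp and 150 bp standards.
estimate = size_from_standards(1200, [1000, 1400, 2000], [100, 150, 200])
```

The estimate is a size relative to the standards, so it inherits any mobility difference between the standards and the sample fragments.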
Quantitation of the DNA signal is usually done by examining peak heights or peak areas. One inexact peak area method simply records the area under the curve; this approach does not account for band overlap between different peaks. It is often useful to determine the quality (e.g., error, accuracy, concordance with expectations) of the size or quantity characterizations. See (D R Richards and M W Perlin, "Quantitative analysis of gel electrophoresis data for automated genotyping applications," Amer. J. Hum. Genet., vol. 57, no. 4 Supplement, pp. A26, 1995), incorporated by reference.
The actual genotyping result depends on the type of genotype, the technology used, and the scoring method. For example, with STR data, following size separation and characterization, the sizes (exact, rounded, or binned) of the two tallest peaks might be used as the alleles. Alternatively, PCR artifacts (e.g., stutter, relative amplification) can be accounted for in the analysis, and the alleles determined after mathematical corrections have been applied. See (M W Perlin, "Method and system for genotyping," U.S. Pat. No. 5,541,067, Jul. 30, 1996; M W Perlin, "Method and system for genotyping," U.S. Pat. No. 5,580,728, Dec. 3, 1996), incorporated by reference.
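The simple quantitation measures described above can be sketched as follows. The sketch assumes a uniformly sampled trace and a known, constant baseline, and it uses the trapezoid rule for the area; these are illustrative assumptions, and the function name is hypothetical. As the text notes, an area computed this way does not separate overlapping bands.

```python
def peak_height_and_area(signal, start, end, baseline=0.0):
    """Return (height, area) for the peak spanning signal[start..end].

    signal: list of detector intensities sampled at unit spacing.
    baseline: constant background level subtracted before measuring
        (an assumption; real traces may need a fitted baseline).
    """
    window = [max(v - baseline, 0.0) for v in signal[start:end + 1]]
    if len(window) < 2:
        raise ValueError("peak window must cover at least two samples")
    height = max(window)
    # Trapezoid rule with unit sample spacing.
    area = sum((a + b) / 2.0 for a, b in zip(window, window[1:]))
    return height, area


height, area = peak_height_and_area([0, 2, 6, 2, 0], 0, 4)
```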
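The two-tallest-peaks rule mentioned above can be sketched in a few lines. This is a deliberately naive illustration, not the corrected analysis of the cited patents: it ignores stutter and relative amplification, rounds sizes to the nearest integer, and uses a hypothetical `min_ratio` threshold (an assumption introduced here) to decide whether the second peak is strong enough to be a second allele.

```python
def call_alleles(peaks, min_ratio=0.5):
    """Call a genotype from (size_bp, height) peaks using the two tallest.

    min_ratio: hypothetical cutoff; if the second-tallest peak is below
        this fraction of the tallest, the sample is called homozygous.
    """
    if not peaks:
        raise ValueError("no peaks to score")
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    first = ranked[0]
    alleles = [round(first[0])]
    if len(ranked) > 1 and ranked[1][1] >= min_ratio * first[1]:
        alleles.append(round(ranked[1][0]))   # heterozygous call
    else:
        alleles.append(round(first[0]))       # homozygous call
    return tuple(sorted(alleles))


heterozygote = call_alleles([(120.1, 900), (118.2, 300), (124.0, 850)])
homozygote = call_alleles([(120.1, 900), (118.2, 300)])
```

In the first call the 118.2 peak (a plausible stutter band) is ignored only because two taller peaks exist; in the second it is rejected by the ratio threshold.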
Genotyping Applications
Genotyping data can be used to determine how mapped markers are shared between related individuals. By correlating this sharing information with phenotypic traits, it is possible to localize a gene associated with that inherited trait. This approach is widely used in genetic linkage and association studies (J Ott, Analysis of Human Genetic Linkage, Revised Edition. Baltimore, Md.: The Johns Hopkins University Press, 1991; N Risch, "Genetic Linkage and Complex Diseases, With Special Reference to Psychiatric Disorders," Genet. Epidemiol., vol. 7, pp. 3-16, 1990; N Risch and K Merikangas, "The future of genetic studies of complex human diseases," Science, vol. 273, pp. 1516-1517, 1996), incorporated by reference.
Genotyping data can also be used to identify individuals. For example, in forensic science, DNA evidence can connect a suspect to the scene of a crime. DNA databases can provide a repository of such relational information (C P Kimpton, P Gill, A Walton, A Urquhart, E S Millican, and M Adams, "Automated DNA profiling employing multiplex amplification of short tandem repeat loci," PCR Meth. Appl., vol. 3, pp. 13-22, 1993; J E McEwen, "Forensic DNA data banking by state crime laboratories," Am. J. Hum. Genet., vol. 56, pp. 1487-1492, 1995; K Inman and N Rudin, An Introduction to Forensic DNA Analysis. Boca Raton, Fla.: CRC Press, 1997; C J Fregeau and R M Fourney, "DNA typing with fluorescently tagged short tandem repeats: a sensitive and accurate approach to human identification," Biotechniques, vol. 15, no. 1, pp. 100-119, 1993), incorporated by reference.
Linked genetic markers can help predict the risk of disease. In monitoring cancer, STRs are used to assess microsatellite instability (MI) and loss of heterozygosity (LOH), chromosomal alterations that reflect tumor progression. (I D Young, Introduction to Risk Calculation in Genetic Counselling. Oxford: Oxford University Press, 1991; L Cawkwell, L Ding, F A Lewis, I Martin, M F Dixon, and P Quirke, "Microsatellite instability in colorectal cancer: improved assessment using fluorescent polymerase chain reaction," Gastroenterology, vol. 109, pp. 465-471, 1995; F Canzian, A Salovaara, P Kristo, R B Chadwick, L A Aaltonen, and A de la Chapelle, "Semiautomated assessment of loss of heterozygosity and replication error in tumors," Cancer Research, vol. 56, pp. 3331-3337, 1996; S Thibodeau, G Bren, and D Schaid, "Microsatellite instability in cancer of the proximal colon," Science, vol. 260, no. 5109, pp. 816-819, 1993), incorporated by reference.
For crop and animal improvement, genetic mapping is a very powerful tool. Genotyping can help identify useful traits of nutritional or economic importance. (H J Vilkki, D J de Koning, K Elo, R Velmala, and A Maki-Tanila, "Multiple marker mapping of quantitative trait loci of Finnish dairy cattle by regression," J. Dairy Sci., vol. 80, no. 1, pp. 198-204, 1997; S M Kappes, J W Keele, R T Stone, R A McGraw, T S Sonstegard, T P Smith, N L Lopez-Corrales, and C W Beattie, "A second-generation linkage map of the bovine genome," Genome Res., vol. 7, no. 3, pp. 235-249, 1997; M Georges, D Nielson, M Mackinnon, A Mishra, R Okimoto, A T Pasquino, L S Sargeant, A Sorensen, M R Steele, and X Zhao, "Mapping quantitative trait loci controlling milk production in dairy cattle by exploiting progeny testing," Genetics, vol. 139, no. 2, pp. 907-920, 1995; G A Rohrer, d J Alexander, Z Hu, T P Smith, J W Keele, and C W Beattie, "A comprehensive map of the porcine genome," Genome Res., vol. 6, no. 5, pp. 371-391, 1996; J Hillel, "Map-based quantitative trait locus identification," Poult. Sci., vol. 76, no. 8, pp. 1115-1120, 1997; H H Cheng, "Mapping the chicken genome," Poult. Sci., vol. 76, no. 8, pp. 1101-1107, 1997), incorporated by reference.
Other Sizing Assays
Fragment analysis finds application in other genetic methods. Often fragment sizes are used to multiplex many experiments into one shared readout pathway, where size (or size range) serves as an index into post-readout demultiplexing. For example, multiple genotypes are typically pooled into a single lane for more efficient readout. Quantifying information can help determine the relative amounts of nucleic acid products present in tissues. (G R Taylor, J S Noble, and R F Mueller, "Automated analysis of multiplex microsatellites," J. Med. Genet, vol. 31, pp. 937-943, 1994; L S Schwartz, J Tarleton, B Popovich, W K Seltzer, and E P Hoffman, "Fluorescent multiplex linkage analysis and carrier detection for Duchenne/Becker muscular dystrophy," Am. J. Hum. Genet., vol. 51, pp. 721-729, 1992; C P Kimpton, P Gill, A Walton, A Urquhart, E S Millican, and M Adams, "Automated DNA profiling employing multiplex amplification of short tandem repeat loci," PCR Meth. Appl., vol. 3, pp. 13-22, 1993), incorporated by reference.
Differential display is a gene expression assay. It performs a reverse transcriptase PCR (RT-PCR) to capture the state of expressed mRNA molecules in a more robust DNA form. These DNAs are then size separated, and the size bins provide an index into particular molecules. Variation at a size bin between two tissue assays is interpreted as a concomitant variation in the underlying mRNA gene expression profile. A peak quantification at a bin estimates the underlying mRNA concentration. Comparison of the quantitation of two different samples at the same bin provides a measure of relative up- or down-regulation of gene expression. (S W Jones, D Cai, O S Weislow, and B Esmaeli-Azad, "Generation of multiple mRNA fingerprints using fluorescence-based differential display and an automated DNA sequencer," BioTechniques, vol. 22, no. 3, pp. 536-543, 1997; P Liang and A Pardee, "Differential display of eukaryotic messenger RNA by means of the polymerase chain reactions," Science, vol. 257, pp. 967-971, 1992; K R Luehrsen, L L Marr, E van der Knaap, and S Cumberledge, "Analysis of differential display RT-PCR products using fluorescent primers and Genescan software," BioTechniques, vol. 22, no. 1, pp. 168-174, 1997), incorporated by reference.
Single stranded conformer polymorphism (SSCP) is a method for detecting different mutations in a gene. Single base pair changes can markedly affect fragment mobility of the conformer, and these mobility changes can be detected in a size separation assay. SSCP is of particular use in identifying and diagnosing genetic mutations (M Orita, H Iwahana, H Kanazawa, K Hayashi, and T Sekiya, "Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphisms," Proc Natl Acad Sci USA, vol. 86, pp. 2766-2770, 1989), incorporated by reference.
The AFLP technique provides a very powerful DNA fingerprinting technique for DNAs of any origin or complexity. AFLP is based on the selective PCR amplification of restriction fragments from a total digest of genomic DNA. The technique involves three steps: (i) restriction of the DNA and ligation of oligonucleotide adapters, (ii) selective amplification of sets of restriction fragments, and (iii) gel analysis of the amplified fragments. PCR amplification of restriction fragments is achieved by using the adapter and restriction site sequence as target sites for primer annealing. The selective amplification is achieved by the use of primers that extend into the restriction fragments, amplifying only those fragments in which the primer extensions match the nucleotides flanking the restriction sites. Using this method, sets of restriction fragments may be visualized by PCR without knowledge of nucleotide sequence. The method allows the specific co-amplification of high numbers of restriction fragments. The number of fragments that can be analyzed simultaneously, however, is dependent on the resolution of the detection system. Typically 50-100 restriction fragments are amplified and detected on denaturing polyacrylamide gels. (P Vos, R Hogers, M Bleeker, M Reijans, T van de Lee, M Hornes, A Frijters, J Pot, J Peleman, M Kuiper, and M Zabeau, "AFLP: a new technique for DNA fingerprinting," Nucleic Acids Res, vol. 23, no. 21, pp. 4407-14, 1995), incorporated by reference.
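The bin-by-bin comparison described above can be sketched as a log ratio of peak quantities. This is a minimal illustration that assumes the two samples have already been quantified per size bin and normalized to a common scale; the `floor` parameter, which avoids division by zero for absent peaks, and the dictionary input format are assumptions made for the example.

```python
import math


def relative_expression(sample_a, sample_b, floor=1.0):
    """Return log2(B / A) for every size bin seen in either sample.

    sample_a, sample_b: dicts mapping size bin (bp) -> peak quantity.
    floor: hypothetical minimum quantity substituted for missing or
        very small peaks so the ratio stays finite.
    Positive values suggest up-regulation in B relative to A,
    negative values suggest down-regulation.
    """
    result = {}
    for size_bin in sorted(set(sample_a) | set(sample_b)):
        a = max(sample_a.get(size_bin, 0.0), floor)
        b = max(sample_b.get(size_bin, 0.0), floor)
        result[size_bin] = math.log2(b / a)
    return result


changes = relative_expression({200: 100.0, 250: 400.0},
                              {200: 400.0, 250: 100.0})
```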
Data Scoring
The final step in any fragment assay is scoring the data. This is typically done by having people visually review every experiment. Some systems (e.g., PE Informatics' Genotype program) perform an initial computer review of the data, to make the manual visual review of every genotype easier. More advanced systems (e.g., Cybergenetics' TrueAllele technology) fully automate the data review, and provide data quality scores that can be used to identify data artifacts (for eliminating such data from consideration) and rank the data scores (to focus on just the 2%-25% of suspect data calls). See (B Palsson, F Palsson, M Perlin, H Gubjartsson, K Stefansson, and J Gulcher, "Using quality measures to facilitate allele calling in high-throughput genotyping," Genome Research, vol. 9, no. 10, pp. 1002-1012, 1999; M W Perlin, "Method and system for genotyping," U.S. Pat. No. 5,876,933, Mar. 2, 1999), incorporated by reference.
However, even with such advanced scoring technology, artifacts can obscure the results. More importantly, insufficient data calibration can preclude the achievement of very low (e.g., less than 1%) data error rates, regardless of the scoring methods. For example, in high-throughput STR genotyping, differential migration of a sample's PCR fragments relative to the size standards can produce subtle shifts in detected size. This problem is worse when different instruments are used, or when size separation protocols are not entirely uniform. The result is that fragments can be incorrectly assigned to allele bins in a way that cannot be corrected without recourse to additional information (e.g., pedigree data) completely outside the STR sizing assay.
Whole System
This invention centers on a new way to greatly reduce sizing and quantitation errors in fragment analysis. By designing data generation experiments that include the proper calibration data (e.g., internal lane standards, allelic ladders, uniform run conditions), most of these fragment analysis errors can be eliminated entirely. Moreover, computer software can be devised that fully exploits these data calibrations to automatically identify artifacts and rank the data by quality. The result is a largely error-free system that requires minimal (if any) human intervention.
The present invention pertains to a method for analyzing a nucleic acid sample. The method comprises the steps of forming labeled DNA sample fragments from a nucleic acid sample. Then there is the step of size separating and detecting said sample fragments to form a sample signal. Then there is the step of forming labeled DNA ladder fragments corresponding to molecular lengths. Then there is the step of size separating and detecting said ladder fragments to form a ladder signal. Then there is the step of transforming the sample signal into length coordinates using the ladder signal. Then there is the step of analyzing the nucleic acid sample signal in length coordinates.
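The transforming step can be illustrated with a small sketch. It assumes that both the sample peak and the ladder peaks have already been sized against the same in-lane standards, and that the ladder rungs have known lengths; the piecewise-linear mapping, the function names, and the numbers are illustrative assumptions rather than the specific transformation of the invention.

```python
def to_length_coordinates(observed, ladder_observed, ladder_lengths):
    """Map an observed size onto ladder-defined length coordinates.

    ladder_observed: observed sizes of the ladder peaks (same units as
        the sample's observed size).
    ladder_lengths: known molecular lengths (bp) of those ladder peaks.
    Values outside the ladder are extrapolated from the nearest segment.
    """
    pairs = sorted(zip(ladder_observed, ladder_lengths))
    if len(pairs) < 2:
        raise ValueError("need at least two ladder peaks")
    # Walk the segments until one bracketing the observation is found;
    # if none is found, the last segment is kept for extrapolation.
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if observed <= x1:
            break
    return y0 + (y1 - y0) * (observed - x0) / (x1 - x0)


def nearest_allele(length, allele_lengths):
    """Assign a length to the closest known allele length."""
    return min(allele_lengths, key=lambda a: abs(a - length))


alleles = [100, 104, 108]
ladder_seen = [97.8, 101.7, 105.6]   # ladder peaks migrate about 2 bp "short"
naive_call = nearest_allele(101.8, alleles)
ladder_call = nearest_allele(
    to_length_coordinates(101.8, ladder_seen, alleles), alleles)
```

In this example the uncorrected size of 101.8 falls closer to the 100 bp bin, while the ladder-transformed value falls in the 104 bp bin, because the ladder peaks show the same migration shift as the sample.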
The present invention also pertains to a system for analyzing a nucleic acid sample. The system comprises means for forming labeled DNA sample fragments from a nucleic acid sample. The system further comprises means for size separating and detecting said sample fragments to form a sample signal, said separating and detecting means in communication with the sample fragments. The system further comprises means for forming labeled DNA ladder fragments corresponding to molecular lengths. The system further comprises means for size separating and detecting said ladder fragments to form a ladder signal, said separating and detecting means in communication with the ladder fragments. The system further comprises means for transforming the sample signal into length coordinates using the ladder signal, said transforming means in communication with the signals. The system further comprises means for analyzing the nucleic acid sample signal in length coordinates, said analyzing means in communication with the transforming means.
The present invention also pertains to a method for generating revenue from computer scoring of genetic data. The method comprises the steps of supplying a software program that automatically scores genetic data. Then there is the step of forming genetic data that can be scored by the software program. Then there is the step of scoring the genetic data using the software program to form a quantity of genetic data. Then there is the step of generating a revenue from computer scoring of genetic data that is related to the quantity.
The present invention also pertains to a method for producing a nucleic acid analysis. The method comprises the steps of analyzing a first nucleic acid sample on a first size separation instrument to form a first signal. Then there is the step of analyzing a second nucleic acid sample on a second size separation instrument to form a second signal. Then there is the step of comparing the first signal with the second signal in a computing device with memory to form a comparison. Then there is the step of producing a nucleic acid analysis of the two samples from the comparison that is independent of the size separation instruments used.
The present invention also pertains to a method for resolving DNA mixtures. The method comprises the steps of obtaining DNA profile data that include a mixed sample. Then there is the step of representing the data in a linear equation. Then there is the step of deriving a solution from the linear equation. Then there is the step of resolving the DNA mixture from the solution.
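One way to picture a linear representation of a mixture is a two-contributor model in which the observed peak quantities are a weighted sum of the contributors' genotype vectors, d = w·a + (1 − w)·b. The sketch below solves this single-locus model for the weight w by least squares. The model, the assumption that both contributor genotypes are known, and all names are hypothetical choices made for illustration; the text above does not specify the form of the linear equation.

```python
def mixture_weight(data, genotype_a, genotype_b):
    """Least-squares estimate of contributor A's fraction in a mixture.

    data: observed peak quantities, one entry per allele.
    genotype_a, genotype_b: allele-count vectors for the two assumed
        contributors, in the same allele order as `data`.
    """
    total = sum(data)
    d = [v / total for v in data]                  # normalize to proportions
    a = [v / sum(genotype_a) for v in genotype_a]
    b = [v / sum(genotype_b) for v in genotype_b]
    diff = [x - y for x, y in zip(a, b)]
    denom = sum(v * v for v in diff)
    if denom == 0:
        raise ValueError("genotypes are identical; weight is not identifiable")
    # Minimize |d - (w*a + (1 - w)*b)|^2 over w (closed-form solution).
    w = sum((x - y) * v for x, y, v in zip(d, b, diff)) / denom
    return min(max(w, 0.0), 1.0)                   # clamp to a valid fraction


# Contributor A carries alleles 1 and 2; contributor B carries alleles 3 and 4.
weight = mixture_weight([350, 350, 150, 150], [1, 1, 0, 0], [0, 0, 1, 1])
```

Here the estimated weight is approximately 0.7, matching the 70:30 split used to construct the example data.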