1. Field of the Invention
The present invention concerns automated pattern recognition processes. More particularly, the present invention concerns interpreting data obtained by analysis of nucleic acids by generation of nucleic acid data in a spatial domain, transformation of the data from the spatial domain to a frequency domain, and obtaining sequence data of the nucleic acid data by executing a data mining process on the transformed data.
2. Description of the Related Art
Molecular genetics is one among several disciplines that have accumulated large, complex, information-rich datasets as a result of improved data collection technologies and decreased data storage costs. Consequently, the gap between the ability to collect data and the ability to analyze, summarize, classify, and exploit the data for the advancement of biomedical research and patient care is widening rapidly.
In the last decade, major advances in molecular biology have made computer software that can analyze and interpret molecular data rapidly and accurately a necessity. This need arises primarily from two advances that facilitated the rapid development of thousands of genetic markers. First, in 1985, Dr. Kary Mullis discovered that short segments of DNA could be amplified from templates using an enzyme called DNA polymerase and temperature cycling in a process called the polymerase chain reaction (PCR). PCR can produce over a million copies of a specific DNA sequence in a matter of hours. PCR revolutionized genetic research because it is a fast, inexpensive, and easily automated technique for amplifying minute quantities of DNA for genetic analysis.
Second, in 1989, several laboratories used PCR to demonstrate a high level of polymorphism in a class of tandemly repeated DNA sequences known as microsatellites. The discovery of microsatellites yielded several thousand new highly informative genetic markers and greatly advanced the construction of high-resolution linkage maps.
For a better understanding of how molecular data is obtained for analysis and interpretation, consider the process for human genotyping depicted in FIG. 1. As seen in FIG. 1, a typical genotyping process generally consists of five basic steps: 1) genomic DNA acquisition, 2) multiplexed PCR amplification of microsatellites using fluorescently labeled primers, 3) gel electrophoresis (allele separation by size), 4) laser-induced fluorescence (allele separation by color), and 5) interpretation of results to determine a genotype.
DNA for genotyping is obtained primarily from blood, but can also be obtained from bone, hair, and various other fluids, tissues, and cells.
After a sample of DNA is acquired, the different alleles that exist at specific microsatellite marker locations of interest are amplified by PCR in sufficient quantities for subsequent analytical processing. A pair of PCR primers is designed to amplify the alleles at each marker location. The simultaneous amplification of multiple microsatellites using multiple pairs of primers in a single polymerase chain reaction is called multiplexing. This approach allows hundreds of microsatellites to be amplified in a single experiment.
Multiplexing often generates PCR products that overlap in size, making them difficult to separate. However, multiplexed PCR is greatly enhanced by the use of fluorescent labeling technology. By attaching different fluorescent labels to PCR primers, a scanning laser can be used to distinguish the different alleles by different wavelengths, even when their sizes overlap.
Alleles are typically separated by size in a process called gel electrophoresis. The gel electrophoresis process uses an electric current to force molecules through pores in a thin layer of polyacrylamide gel. The gel is made with pores designed for separating molecules in specific size ranges. The electric current causes the alleles to travel across the gel, with smaller alleles traveling farther across the gel than larger alleles. Fluorescent size standards are also included to calibrate and improve the accuracy of allele size determination.
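By way of illustration only, the role of the size standards can be sketched as a simple calibration: each standard of known length is paired with its observed migration position, and the size of an unknown allele is estimated by interpolating between the two flanking standards. The function name `estimate_size`, the units, and the sample values are hypothetical, and plain linear interpolation is an assumption made for brevity rather than the sizing method of any particular instrument.

```python
from bisect import bisect_left


def estimate_size(migration, standards):
    """Estimate fragment size (bp) from a migration position.

    `standards` is a list of (migration_position, size_bp) pairs for the
    fluorescent size standards run in the same lane. Linear interpolation
    between the two flanking standards is an illustrative assumption.
    """
    pts = sorted(standards)                      # order by migration position
    positions = [p for p, _ in pts]
    if not positions[0] <= migration <= positions[-1]:
        raise ValueError("migration position lies outside the size-standard range")
    i = bisect_left(positions, migration)
    if positions[i] == migration:                # exactly on a standard
        return float(pts[i][1])
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]      # flanking standards
    return y0 + (y1 - y0) * (migration - x0) / (x1 - x0)


# Hypothetical standards: smaller fragments travel farther across the gel.
ladder = [(120.0, 100), (90.0, 150), (60.0, 200)]
allele_size = estimate_size(105.0, ladder)       # midway between 150 bp and 100 bp
```

Positions outside the range of the standards are rejected rather than extrapolated, since calibration accuracy is only supported between known standards.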
When excited by a laser, the fluorescent labels on the PCR primers emit light at specific wavelengths corresponding to different colors in the visible light spectrum. Automated DNA sequencers typically use a scanning laser to detect the fluorescently-labeled alleles on each polyacrylamide gel. A digital detector records the multicolored fluorescence signals and stores them in machine-readable form. In situations where gel electrophoresis aggregates multiple alleles of similar size, they can be distinguished from one another by their fluorescent labels.
Finally, the electrophoretic patterns must be interpreted to establish a particular genotype. It is this final step of the process that has presented difficulty for researchers.
In this regard, the analysis and interpretation of DNA data generally involves various PCR idiosyncrasies that must be analyzed in order to obtain an accurate interpretation of the DNA sequence. When the various PCR problems are combined with each other and with additional sources of background chemical and electrical noise, they result in genotype data that require careful subjective interpretation by an experienced scientist in order to correctly ascertain the true underlying genotypes. However, manual interpretation of genotypes is widely recognized as a fundamental rate-limiting step for high-throughput genotyping and large-scale genome research. While in most cases the analysis and interpretation can be performed with relative ease by experienced human experts, efforts to develop support software for automated genotype interpretation have achieved limited success.
Several approaches have been proposed to simplify the analysis and interpretation of DNA sequences, each of which addresses a subset of the sequencing problems, while other problems are exacerbated or left unresolved. Furthermore, the viability of each approach decreases as the scale of research increases to investigate more complex genetic contributions to disease.
One approach described by M. W. Perlin et al. in “Toward Fully Automated Genotyping: Genotyping Microsatellite Markers by Deconvolution,” American Journal of Human Genetics, vol. 57, pp. 1199-1210, 1995, has been the use of microsatellite markers with fewer repeating units. This approach reduces a phenomenon known as stutter artifact by sharpening the stutter, but also reduces the polymorphism, informativeness and utility of the markers.
A second approach described by M. Litt et al. in “Shadow Bands Seen When Typing Polymorphic Dinucleotide Repeats: Some Causes and Cures,” BioTechniques, vol. 15, pp. 280-284, 1993, and by M. J. Brownstein et al. in “Modulation of Non-Templated Nucleotide Addition by Taq Polymerase: Primer Modifications that Facilitate Genotyping,” BioTechniques, vol. 20, pp. 1004-1010, 1996, has been marker-specific modification/customization of PCR conditions to remove signal artifacts. This approach works to a point, but generally does not completely remove artifacts that are intrinsic to the PCR amplification of repetitive units. Additionally, differences in allele size, enzyme concentration, and other experimental factors can have a significant impact on the results. Further, the application of marker-specific PCR conditions is time- and labor-intensive, and a single set of PCR conditions is generally desirable for consistency and high throughput.
A third approach described by A. Edwards et al. in “DNA Typing and Genetic Mapping with Trimeric and Tetrameric Tandem Repeats,” American Journal of Human Genetics, vol. 49, pp. 746-756, 1991, by A.-K. B. Lindqvist et al. in “Chromosome-Specific Panels of Tri- and Tetranucleotide Microsatellite Markers for Multiplex Fluorescent Detection and Automated Genotyping: Evaluation of Their Utility in Pathology and Forensics,” Genome Research, vol. 6, pp. 1170-1176, 1996, and by T. J. Hudson et al. in “PCR Methods of Genotyping,” Current Protocols in Human Genetics, vol. 1, pp. 2.5.1-2.5.23, 1997, has been substitution of dinucleotide repeat markers with trinucleotide and tetranucleotide repeat markers that are less subject to signal artifacts and easier to interpret. While this approach reduces stutter artifact, it also reduces marker informativeness. Moreover, trinucleotide and tetranucleotide markers are much less prevalent in the human genome. Additionally, in some cases, the prominent dinucleotide repeat stutter pattern can be used to distinguish alleles from noise peaks. Further, larger repeat sizes consume larger size windows (relative to their polymorphism) on the polyacrylamide gel, thereby reducing throughput by reducing the ability to multiplex markers.
A fourth approach described by J. S. Ziegle et al. in “Application of Automated DNA Sizing Technology for Genotyping Microsatellite Loci,” Genomics, vol. 14, pp. 1026-1031, 1992, and by D. C. Mansfield et al. in “Automation of Genetic Linkage Analysis Using Fluorescent Microsatellite Markers,” Genomics, vol. 24, pp. 225-233, 1994, has been analyzing the alleles on the basis of the highest peaks and ignoring the others. This approach succeeds when alleles are widely separated, but fails for closely spaced alleles, complex stutter patterns, and other signal complexities.
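The highest-peak strategy, and the way it breaks down for closely spaced alleles, can be illustrated with a minimal sketch. The function name, the `min_ratio` threshold, and the intensity values below are hypothetical and chosen only for illustration; they are not taken from the cited references.

```python
def call_alleles_by_highest_peaks(sizes, intensities, min_ratio=0.5):
    """Call up to two alleles as the tallest local maxima of a trace.

    A second peak is accepted only if it reaches `min_ratio` of the
    tallest peak (an assumed heuristic); otherwise one allele is called.
    """
    peaks = []
    for i in range(1, len(intensities) - 1):
        if intensities[i] > intensities[i - 1] and intensities[i] >= intensities[i + 1]:
            peaks.append((intensities[i], sizes[i]))
    if not peaks:
        return ()
    peaks.sort(reverse=True)                     # tallest peak first
    called = [peaks[0][1]]
    if len(peaks) > 1 and peaks[1][0] >= min_ratio * peaks[0][0]:
        called.append(peaks[1][1])
    return tuple(sorted(called))


# Widely separated alleles at 104 bp and 112 bp, each with a small stutter shoulder.
wide = call_alleles_by_highest_peaks(
    [100, 102, 104, 106, 108, 110, 112, 114, 116],
    [0, 20, 100, 5, 0, 15, 90, 4, 0],
)

# Adjacent alleles at 104 bp and 106 bp: stutter from the 106 bp allele adds to
# the 104 bp peak, so the 106 bp allele is no longer a local maximum.
close = call_alleles_by_highest_peaks(
    [100, 102, 104, 106, 108, 110],
    [0, 20, 110, 100, 3, 0],
)
```

In the first trace both alleles are recovered; in the second only the 104 bp allele is called, even though the 106 bp signal is nearly as strong.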
Finally, a fifth approach described in U.S. Pat. No. 5,541,067 to Perlin entitled “Method and System for Genotyping,” and by M. W. Perlin et al. in “Toward Fully Automated Genotyping: Allele Assignment, Pedigree Construction, and Recombination Detection in Duchenne Muscular Dystrophy,” American Journal of Human Genetics, vol. 55, pp. 777-787, 1994, has been the use of an explicit mathematical model to remove stutter artifact from genotype data by deconvolution. This approach works well for stutter artifact, but does not adequately address other types of signal artifacts and their covariance with stutter artifacts. Additionally, this approach models the stutter artifact as a reproducible response, which makes it relatively intolerant of noise and the variability of experimental data.
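The general idea of stutter removal by deconvolution can be sketched as follows. This is not the algorithm of the cited patent or papers; it is a minimal illustration that assumes a fixed, known stutter pattern in which each allele leaks a set fraction of its signal into the positions one and two repeat units shorter. The function names and the stutter fractions are hypothetical.

```python
def convolve_stutter(true_signal, stutter):
    """Forward model: each allele at index j adds stutter[k] * amount at index j - k."""
    observed = [0.0] * len(true_signal)
    for j, amount in enumerate(true_signal):
        for k, frac in enumerate(stutter):
            if j - k >= 0:
                observed[j - k] += frac * amount
    return observed


def deconvolve_stutter(observed, stutter):
    """Invert the forward model by back-substitution from the longest allele down."""
    n = len(observed)
    true_signal = [0.0] * n
    for j in range(n - 1, -1, -1):
        # Signal at j that leaked in from longer alleles already resolved.
        leaked = sum(
            stutter[k] * true_signal[j + k]
            for k in range(1, len(stutter))
            if j + k < n
        )
        true_signal[j] = (observed[j] - leaked) / stutter[0]
    return true_signal


# Assumed stutter pattern: full signal at the allele, 30% one repeat shorter,
# 10% two repeats shorter. Alleles at indices 2 and 4 (one index per repeat unit).
stutter = [1.0, 0.3, 0.1]
observed = convolve_stutter([0, 0, 100, 0, 80, 0], stutter)
recovered = deconvolve_stutter(observed, stutter)
```

Because the inversion relies on the stutter fractions being exactly reproducible, any deviation in the measured trace propagates into every shorter position, which is consistent with the noise sensitivity noted above.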
However, as stated above, each of the foregoing idiosyncrasies requires careful subjective interpretation, and to date, support software for automated genotype interpretation has achieved limited success. Although it is now possible for a single technician to generate data for tens of thousands of genotypes per week, the requisite visual inspection and manual interpretation of genotype data is expensive, tedious, time-consuming, and prone to error. Furthermore, the analyses must be performed by skilled experts who are not abundant in the current workforce. Therefore, a significant obstacle to fully automated genotyping is the analysis and interpretation of data.