1. Field
The present invention relates generally to the field of nucleotide investigations, and more particularly to the detection and analysis of emission spectra generated during observation of excited fluorophore-labeled nucleotide polymers undergoing separation by size, such as is done during the sequencing of bases in nucleotide polymers.
2. State of the Art
The genetic material of higher organisms comprises two strands of DNA. Each DNA strand is a polymer of nucleotide monomers, and each monomer consists of a sugar residue (deoxyribose), a phosphate residue, and a purine or pyrimidine base. The monomers are linked in a continuous chain by a phosphoribosyl backbone. The double stranded DNA prefers a helical orientation and exists as a long linear strand in higher organisms (up to several centimeters in length in man) with its phosphoribosyl backbone oriented outwardly of the helix and the sequentially ordered bases oriented inwardly along the axis of the helix, whereby complementary hydrogen bonding between bases holds the two strands together. By complementary it is understood that adenine nearly always forms hydrogen bonds with thymine and cytosine with guanine. The phosphoribosyl backbone has a free hydroxyl group at the 3' position extending from the terminal deoxyribose residue at one end and a free terminal phosphate group attached at the 5' position of its last deoxyribose residue at the other, thus giving a directional orientation to the opposing strands.
It is the sequence of the four bases found on the strands of DNA (denoted A, G, T, and C) that is the genetic code directing the synthesis of all the polypeptides or proteins (enzymes, collagen, muscle, etc., synthesized as a linear sequence of amino acid monomers). These polypeptides perform the metabolic processes essential to life and health and provide structure and mobility to organisms. The code is based on a sequence of three bases; thus 4³, or 64, "code words" exist in the code. One triplet code is a start command, directing the initiation of synthesis of amino acid polymers (polypeptides); most triplets code for a particular amino acid to be added to the linear polypeptide chain; and a few triplet codes are stop commands directing the termination of synthesis of the polypeptide. A gene consists of the series of triplet codes, that is, the DNA sequence which directs the synthesis of a single protein. One gene codes for one protein. The industrial and research communities desire to learn the sequence of DNA in all genes in humans and some other organisms and thereby harness this genetic code for a variety of useful purposes. With over 3×10⁹ bases making up the genes in humans, the enormity of the task of determining their sequence as they occur in the genes is readily appreciated.
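The count of 64 code words can be checked by enumerating every ordered triplet of the four bases. The following is a minimal illustrative sketch only; the names BASES and codons are hypothetical, and the start and stop triplets shown are those of the standard genetic code, which the text above does not itself list.

```python
from itertools import product

# The four DNA bases named in the text.
BASES = "AGTC"

# Every ordered triplet of bases: 4 * 4 * 4 possibilities.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Assumption (standard genetic code, not taken from the text above):
# one start triplet and three stop triplets.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

print(len(codons))                       # number of distinct code words
print(START_CODON in codons)             # the start triplet is one of them
print(STOP_CODONS.issubset(codons))      # so are the stop triplets
```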
While the above discussion has used the term DNA and referred to DNA sequencing, and uses the terms DNA and DNA sequencing hereinbelow, it is understood that the invention has application to sequencing methods of any nucleotide polymer, e.g., amplified microsatellite nucleotide polymers and other methods involving the use of fluorescently tagged nucleotide polymer fragments used to generate a chromatogram.
Two methods of sequencing form the basis for large scale sequencing operations, a so-called chemical method and an enzymatic method. The enzymatic method exploits the process of DNA replication, which always occurs in the 5' to 3' direction by the addition of a new nucleotide to the 3' terminus of the growing DNA polymer, catalyzed by an enzyme, DNA polymerase. The process is known as primer extension, and the method of sequencing upon which it is based is the enzymatic or dideoxy method of DNA sequencing. Sanger et al., Proc. Natl. Acad. Sci., U.S.A. 74, 5463-5467 (1977). The chemical method of DNA sequencing was developed by A. M. Maxam and W. Gilbert, and is described in Proc. Natl. Acad. Sci., Vol. 74, p. 560 (1977). Both methods are well known, are well described in the references cited above, and are equally applicable to the invention. Suffice it to say, they involve a number of steps and result in fragments of DNA of varying sizes that end with a different base (A, T, C, or G). The determination of DNA sequence in these methods depends on separating the DNA fragments produced by order of size and either by what base they contain (when each lane has only one reaction product) or by what fluorophore tag is detected if all four reaction products are in one lane, as in commercially popular sequencing machines. If the shortest fragment ends in A, then the first base in the sequence is A. If the next longest fragment ends in T, then the next base in the DNA sequence is T, and so on. This is the basic algorithm for "base calling", i.e., determining the sequence of purine and pyrimidine bases in a strand of DNA.
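The basic base calling rule just described (order the fragments by size, then read off the terminal base of each) can be sketched as follows. This is an idealized illustration, not the algorithm of any particular sequencer: the function name call_bases and the representation of each fragment as a (length, terminal base) pair are assumptions made for the example, and real data would first have to be reduced from detector signals to such pairs.

```python
def call_bases(fragments):
    """Return the base sequence implied by a set of sized fragments.

    fragments: iterable of (length, terminal_base) pairs, where
    terminal_base is one of "A", "T", "C", "G". Hypothetical input
    format chosen for illustration.
    """
    # Shortest fragment first: its terminal base is the first base called.
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)


# Fragments may be observed in any order; sorting by size recovers the sequence.
example = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(call_bases(example))
```

In this example the shortest fragment ends in A and the next longest in T, so the called sequence begins A, T, matching the rule stated in the text.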
One commercially popular automatic sequencer, the ABI 373A®, available from Applied Biosystems, Inc., Foster City, Calif., performs the following steps after a nucleotide polymer is sampled and reaction products of varying length obtained. The reaction product fragments are tagged with a fluorophore, resolved by size by inducing them to migrate through a polyacrylamide gel via an electrical charge across the gel (gel-electrophoresis), exposed to an electromagnetic wave source to induce the emission of electromagnetic energy (fluorescence by the tag), and the emitted energy detected by a detector to produce an analog signal. The analog signal is sampled and the sampled values transmitted to a data file referred to as a gel file. The gel file data is then "tracked" and processed by ABI Sequencer® analyzer software which generates chromatogram data and stores it in a chromatogram data file. The software then automatically determines the DNA base sequence from the chromatogram data and stores the sequence data as part of the chromatogram data file. Examples of patented automatic sequencing apparatus and methods include U.S. Pat. No. 4,811,218 to Hunkapiller et al. issued Mar. 7, 1989 assigned to Applied Biosystems, Inc. (ABI), and U.S. Pat. No. 5,556,790 to Pettit issued Sep. 17, 1996, the disclosures of which are incorporated herein by reference. These methods and such commercially available instruments as the ABI 373A® as discussed and the Pharmacia A.L.F.®, from Pharmacia, Inc. of Piscataway, N.J., and the Licor® Sequencer from Licor of Lincoln, Nebr. all produce a chromatogram data file from an analog signal in a manner compatible with the initial steps of the present invention. It is understood that should newer methods of creating chromatogram data files be produced, they too would be compatible with the invention.
In addition to the instruments and methods discussed above, other methods employing capillary electrophoresis can be used to produce a data file compatible with the initial steps of the invention. The initial steps of producing reaction products are the same; however, a gel is not used during the fragment separation step and, at least in one commercially popular machine, a CCD camera is used to detect fluorophore emission spectra. Other prior art methods not necessarily directed to gene sequencing, such as microsatellite amplification, employ fluorophore-labeled nucleotides and generate signals that can be converted into chromatogram data files as well. These and yet to be developed methods which produce a signal that can be converted into a digital data file such as a chromatogram data file are compatible with the initial steps of the invention.
Some current commercial automated sequencers utilize a single gel plate which can accommodate up to 64 migration lanes simultaneously, that is, 64 unique DNA samples. The multiple lanes are generally run through the detector and detected simultaneously to increase the throughput of the sequencer. A single run on such a gel can yield between 4000 and 9000 data points for each sample (one every 6 seconds), obtained by intermittently sampling the raw data generated by the detector during the gel plate run and saving the collected data in a gel file. This process requires between 4 and 12 hours depending on the size of the longest DNA fragments under analysis, because migration time lengthens with the length of the fragment. As indicated above, the gel file data is interpreted by software and the interpreted data then stored in a so-called chromatogram data file. A chromatogram could be plotted out on paper or on a computer screen if desired.
The existing "base calling" process in automated sequencers consists of determining the DNA base sequence from the chromatogram data without the necessity of plotted graphs, except when the data is too ambiguous, in which case plotted graphs must be consulted. One of the most labor intensive and highly skilled tasks during DNA sequencing projects is viewing the original trace descriptions of the gels and resolving conflicting readings. J. Bonfield and R. Staden, The application of numerical estimates of base calling accuracy to DNA sequencing projects, Nucleic Acids Research, Vol. 23, No. 8, pp. 1406-1410, 1995.
Ambiguities arise, and limitations are imposed upon the length of DNA strands which can be sequenced, from factors inherent in current methods of tagging DNA, variations inherent in gel electrophoresis, inherent inconsistencies in the make up of the sample such as heterozygosity and other polymorphisms, and current methods of base calling, including especially the available software for base calling. All these variables can produce ambiguous information that interferes with accurate base calling. In general, it is an object of the invention to provide a means for resolving ambiguities due to the above factors. It is a further object to identify, from chromatographic data, mutated genes, homozygous and heterozygous loci within exons, introns, or nucleotide polymers in general, and other polymorphic anomalies.
First, consider more closely one particularly important problem that has had no satisfactory solution to date, i.e., the problem arising during gel electrophoresis. During the process of running the gel, a number of stochastic phenomena occur to change the migration speed of the DNA primer extensions within a lane and from lane to lane and cause the data collected to be nonsynchronous among the lanes. Contributing factors to nonsynchronicity include: microscopic holes in the gel matrix as a result of rate of polymerization and quality of acrylamide, breakdown of the polyacrylamide matrix during running, changes in migration speeds due to electrical idiosyncrasies, temperature variability throughout the gel, and variability in salt concentrations in the running buffer. All these factors combine to speed or slow the migration of each sample, and thereby to stretch or compress the x axis of the chromatogram. The result is that two identical DNA samples, run on the same gel in different lanes, will have different electropherogram data. The effect is even more dramatic when the samples are run on different gels or on different machines. It is an object of the invention to provide means for correcting for nonsynchronicity among the lanes of one run on one instrument, among different runs, and among runs performed on different instruments.
Currently, the most expeditious way to detect differences between experimental samples and a reference is by comparing the text data, e.g., A-C-G-T-T-G-G-, for the two samples using one of the several programs available. All these prior art software programs are based on the comparison of the text strings, e.g., A-C-G-T-T-G-G-. The text strings are generated by the Applied Biosystems, Inc. basecaller software when generating the chromatogram data. Such software makes base determinations by considering peak height and peak spacing. It does not consider peak size, area under the peak, presence of different colored peaks (peaks at different electromagnetic wavelengths occurring at the same migration time), peaks that are out of synchronization, or time factors involved in evolution of the peak, nor does it quantify these variables for consideration in the base calling algorithm. In other words, there is a lack of software for analyzing and quantifying more than two of the many variables and chromatogram characteristics contained within a chromatogram data file. It is an object of the invention to consider various aspects of the chromatogram data for base calling and for detecting differences between experimental samples.
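A text string comparison of the kind described can be illustrated, in greatly simplified form, by a position-by-position comparison of two called sequences. The sketch below is not the method of any of the prior art programs referred to above; the function names and the assumption that the two strings are already aligned and of equal length are made purely for illustration. It also shows the limitation noted in the text: only the called letters are compared, and nothing about peak area, color, or timing is available to the comparison.

```python
def parse_calls(text):
    """Convert hyphenated call text such as "A-C-G-T-T-G-G-" to "ACGTTGG"."""
    return text.replace("-", "")


def text_differences(reference, sample):
    """List (position, reference_base, sample_base) where the calls differ.

    Assumes (for illustration) that both strings are already aligned and
    have the same length; positions are zero-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference = parse_calls("A-C-G-T-T-G-G-")
sample = parse_calls("A-C-G-A-T-G-G-")
print(text_differences(reference, sample))
```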
Problems with the currently available base calling algorithms arise when ambiguous data results as a consequence of so-called "contextual influences" resulting from current sequencing methods and perhaps polymerases employed in such methods. L. T. Parker et al., Peak Height Variations in Automated Sequencing of PCR Products Using Taq Dye-Terminator Chemistry, BioTechniques, Vol. 29, No. 1, pp. 116-121. For example, the signal emitted by the fluorophore attached to G, i.e., the G peak, following an A peak can be weak and is perhaps the most noticeable contextual influence. However, the G peak following C and T is also weaker. The peaks for A or T fluorescence are very strong when following a G. Such "contextual influences" or ambiguities can be resolved by sequencing in the opposite direction across the problematic region, because the same problem will not be observed on the chromatogram of the primer extension reaction product using the reverse complement of the ambiguous template. For example, if the template sequence is 5'-A-G-A-G-T-G-C-T-C-3', the first two G peaks might be ambiguous and difficult to call because they will follow A peaks, but the peak following the T residue may be clear. However, in the reverse complement of the template strand, the sequence is 5'-G-A-G-C-A-C-T-C-T-3', the C-T-C-T portion representing the reverse complement to the A-G-A-G portion of the template. The chromatogram will show an unambiguous C peak followed by a T peak followed by another C peak followed by another T peak, so none of the ambiguities present in the target template are present in the reverse complement of the problematic region.
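The reverse complement in the example above can be verified mechanically: each base is replaced by its complementary base and the result is read in the opposite direction. The following sketch uses hypothetical names (COMPLEMENT, reverse_complement) and writes the sequences without hyphens; it illustrates only the string relationship between the two strands, not any signal behavior.

```python
# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(sequence):
    """Return the 5'-to-3' sequence of the strand complementary to `sequence`."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


template = "AGAGTGCTC"                 # 5'-A-G-A-G-T-G-C-T-C-3'
opposite = reverse_complement(template)

print(opposite)                        # full reverse complement
print(reverse_complement(template[:4]))  # reverse complement of the A-G-A-G portion
print(opposite[-4:])                   # the same portion as it appears in the full strand
```

The problematic A-G-A-G portion at the 5' end of the template corresponds to the last four bases of the reverse complement, which is why reading the opposite strand presents that region as C-T-C-T.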
With a heterozygous polymorphism or other polymorphism, there is a variation in base sequence between a sample and a reference at a particular locus, i.e., one base may appear on one allele, but a random, different base will occur on the other allele, and/or the base on the otherwise complementary strand will not be the usual complementary base pair at the same locus. Such a polymorphism will look exactly like an ambiguous base call in both forward and reverse sequencing reactions. Polymorphisms occur not only within exons, but within introns as well, and, therefore, the problem arising with sequencing polymorphic samples applies equally to both. Heretofore, there has been no commercially practical, simple means for determining whether an ambiguous base is a heterozygous polymorphism or an artifact. Automated methods have compared only text strings, which can point out differences but cannot resolve the ambiguities that underlie the differences. Manual inspection of chromatograms can resolve questions, but is difficult and labor intensive. Hence, there has been no practical method to fully compare chromatogram data from two or more data files. It is an object of the invention to provide a method and apparatus for importing digital data files derived from signals generated during nucleotide sequencing of two or more samples, such as chromatograms from a reference person or group and a chromatogram from a potential carrier of a mutation, and to compare two or more chromatograms and distinguish between ambiguous peaks and true heterozygous polymorphisms and/or other polymorphisms, i.e., where a peak corresponding to one base is found in one chromatogram and the corresponding peak in the comparison chromatogram is for an unambiguously different base.
Much work has been done in the area of automatic speech recognition and computerized speech processing. Various approaches to analyzing speech signals have been developed and are in use to allow computers to analyze and compare digital data representative of analog speech signals. Information on speech processing is contained in the book Neural Networks and Speech Processing, David P. Morgan and Christopher L. Scofield, Kluwer Academic Publishers (1991), and such information is incorporated herein by reference.