This invention relates to a method of processing output signals from an automated electrophoresis detection apparatus, and to an apparatus which employs this method for sequencing nucleic acids.
One of the steps in nucleotide sequence determination of a subject nucleic acid molecule is interpretation of the pattern of nucleic acid fragments which results from electrophoretic separation of fragments, or reaction products, of a DNA sequencing reaction (the "fragment pattern"). The interpretation, colloquially known as "base calling", involves determination from the recorded fragment pattern of the order of four nucleotide bases, A (adenine), C (cytosine), G (guanine) and T (thymine) for DNA or U (uracil) for RNA in the subject nucleic acid molecule.
The chemistry employed for a DNA sequencing reaction using the dideoxy (or chain-termination) sequencing technique is well known, and was first reported by Sanger et al. (Proc. Natl. Acad. Sci. USA 74: 5463-5467 (1977)). Four samples of nucleic acid fragments (terminating in A, C, G, or T(U) respectively in the Sanger et al. method) are loaded at a loading site at one end of an electrophoresis gel. An electric field is applied across the gel, causing the fragments to migrate from the loading site towards the opposite end of the gel. During this electrophoresis, the gel acts as a separation matrix. The fragments, which in each sample are of an extended series of discrete sizes, separate into bands of discrete species in a lane along the length of the gel. Shorter fragments generally move more quickly than larger fragments.
If the DNA fragments are labeled with a fluorescent label, an automated electrophoresis detection apparatus (also called a "DNA sequencer") can be used to detect the passage of migrating bands in real time. Existing automated DNA sequencers are available from Applied Biosystems, Inc. (Foster City, Calif.), Pharmacia Biotech. Inc. (Piscataway, N.J.), Li-Cor, Inc. (Lincoln, Nebr.), Molecular Dynamics, Inc. (Sunnyvale, Calif.) and Visible Genetics Inc. (Toronto). Other methods of detection, based on detection of features inherent to the subject molecule, such as detection of light polarization as disclosed in U.S. Pat. No. 5,543,018 which is incorporated herein by reference, are also possible.
A significant problem in determining a DNA sequence, encountered particularly with high-speed DNA sequencing and in sequencing apparatus which do not combine the four sets of sequencing reaction products in a single lane, is alignment of data signals from the four different output channels of an automated DNA sequencing apparatus. Once the data are aligned properly, base-calling is relatively straightforward. However, this initial alignment step can be very challenging, since the output signal may be erratically shifted and/or stretched as a result of chemistry and gel anomalies. A reliable method of aligning data that takes into account non-linear shifting and stretching of the signal output is highly desirable, particularly for high-speed DNA sequencing.
The prior art in this field is very limited. Existing automated sequencers traditionally operate at voltages low enough that non-linear shifting is avoided. The use of low voltages, however, limits the speed with which separation of sequencing fragments into discrete bands can be accomplished.
Published methods of computer-assisted base calling include the methods disclosed by Tibbetts and Bowling (U.S. Pat. No. 5,365,455) and Dam et al. (U.S. Pat. No. 5,119,316), which patents are incorporated herein by reference. Both patents assume alignment of output signals and address only aspects of base-calling from the aligned signals.
It is an object of the present invention to provide a method of aligning real-time signals from the output channels of an automated electrophoresis apparatus.
It is a further object of the invention to provide an improved method of base-calling a DNA signal sequence aligned according to the invention.
It is still a further object of the invention to provide an apparatus for sequencing nucleic acids which utilizes the improved method in accordance with the invention for aligning real-time signals from the output channels of an automated electrophoresis apparatus.
These and other objects of the invention are achieved using a method for aligning data traces from four channels of an automated electrophoresis detection apparatus, each channel detecting the products of one of four chain-termination DNA sequencing reactions, whereby said four channels together provide information concerning the sequence of all four bases within a nucleic acid polymer being analyzed, comprising the steps of:
(a) identifying peaks in each of the four data traces;
(b) normalizing the height of said peaks in each of said data traces to a common value to generate four normalized data traces if the peaks are not of substantially equal height;
(c) combining the four normalized data traces in an initial alignment;
(d) determining coefficients of shift and stretch for selected data points within each normalized data trace, said coefficients optimizing a cost function which reflects the extent of overlap of peaks in the combined normalized data traces to which the coefficients have been applied, said cost function being optimized when the extent of overlap is at a minimum;
(e) generating warp functions for the normalized data traces from the coefficients of shift and stretch determined for the selected data points;
(f) applying the warp functions to the respective data trace or normalized data trace to produce four warped data traces; and
(g) assembling the four warped data traces to form an aligned data set.
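Steps (a) through (g) can be illustrated with a minimal Python sketch. It is not the implementation of the invention, and it rests on several simplifying assumptions that do not come from the text above: the function names (find_peaks, normalize, warp, overlap, align) are hypothetical; normalization rescales each whole trace by its mean peak height rather than treating each peak individually; the cost function measures overlap as the sum of pairwise products of the traces; a single shift and stretch coefficient pair is found per trace by a bounded grid search, rather than coefficients for selected data points from which a non-linear warp function is built; and the first channel serves as a fixed reference.

```python
import math
from itertools import product

def find_peaks(trace, floor=0.1):
    """Step (a): indices of local maxima that rise above an assumed noise floor."""
    return [i for i in range(1, len(trace) - 1)
            if trace[i] > floor and trace[i - 1] < trace[i] >= trace[i + 1]]

def normalize(trace, peaks, target=1.0):
    """Step (b), simplified: rescale the trace so its mean peak height is `target`."""
    if not peaks:
        return list(trace)
    mean_height = sum(trace[i] for i in peaks) / len(peaks)
    return [v * target / mean_height for v in trace]

def warp(trace, shift, stretch):
    """Steps (e)-(f), simplified to a linear warp: output sample i is read from
    input position (i - shift) / stretch, using linear interpolation."""
    n, out = len(trace), []
    for i in range(n):
        x = (i - shift) / stretch
        lo = math.floor(x)
        frac = x - lo
        if 0 <= lo < n - 1:
            out.append(trace[lo] * (1 - frac) + trace[lo + 1] * frac)
        else:
            out.append(0.0)  # position falls at or beyond the ends of the recording
    return out

def overlap(traces):
    """Assumed cost: sum over samples of all pairwise products of the traces.
    It shrinks as peaks from different channels stop coinciding."""
    total = 0.0
    for column in zip(*traces):
        s = sum(column)
        total += (s * s - sum(v * v for v in column)) / 2
    return total

def align(traces, shifts=range(-6, 7), stretches=(0.98, 1.0, 1.02)):
    """Steps (c)-(g): `traces` maps channel name -> list of samples.
    Returns the aligned traces and the (shift, stretch) chosen per channel."""
    names = list(traces)
    norm = {k: normalize(traces[k], find_peaks(traces[k])) for k in names}
    aligned = dict(norm)                    # step (c): sample-for-sample start
    coeffs = {k: (0, 1.0) for k in names}
    for name in names[1:]:                  # first channel is the reference
        others = [aligned[k] for k in names if k != name]
        best_cost = overlap(others + [norm[name]])
        for shift, stretch in product(shifts, stretches):   # step (d)
            candidate = warp(norm[name], shift, stretch)
            cost = overlap(others + [candidate])
            if cost < best_cost:            # keep the pair with least overlap
                best_cost = cost
                coeffs[name] = (shift, stretch)
                aligned[name] = candidate
    return aligned, coeffs                  # step (g): the aligned data set
```

Calling align with a dictionary such as {"A": [...], "C": [...], "G": [...], "T": [...]} of equal-length traces returns the warped traces and the coefficients selected for each channel. The search ranges are deliberately bounded: a cost based purely on overlap can otherwise be driven down trivially by shifting a trace out of the recorded window or by compressing its peaks.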
The aligned data set may be displayed on a video screen of a sequencing apparatus, or may be used as the data set for a base-calling process.