This application relates to DNA sequencing technology and in particular to a method for alignment of DNA sequences which provides enhanced accuracy and read-length.
DNA sequencing is generally performed today using one of two methodologies: the chemical degradation method or the chain termination method. Of these, the chain termination method originally described by Sanger et al., Proc. Natl. Acad. Sci. USA 74: 5463-5467 (1977), or variations thereof, has been adopted in many cases for development of automated sequencing instruments and protocols.
In the chain termination sequencing method, fragments are generated using chain termination reagents in a template-dependent polymerization reaction. The lengths of the fragments indicate the positions of one species of base in a target polynucleotide. If fragment sets are generated for each of the four species of bases (A, C, G and T), analysis of the fragment sizes permits the explicit determination of the sequence of the target polynucleotide. While the translation of this conceptual methodology into practice is effective for determination of sequences, the application in automated systems has faced numerous challenges. These include the fact that the band shape produced following electrophoresis of real fragments is not consistent from one band to the next and may not be perfectly straight (smiling may occur); variations which can occur in peak spacing from one lane of a gel to the next; variations in peak spacing which can occur as the length of the run increases; and decreases in resolution which occur as the length of the run increases. Furthermore, since much of the cost associated with DNA sequencing is in the set-up time involved, for clinical and diagnostic applications the greater the length of DNA which can be sequenced with accuracy, the smaller the per-patient cost can be. These considerations have led to a variety of proposals for improving the chemistry used in sequencing, or for improving the manner in which data representing the detected sequencing fragments are processed. The present invention relates to the second type of improvement.
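By way of illustration only, the idealized relationship between fragment lengths and base positions in the chain termination method can be sketched in Python. The function name, the dictionary layout and the toy fragment lengths are assumptions made for this example and are not taken from the cited references. The sketch also assumes that fragment lengths are already known exactly and are counted from the first base after the primer, which sidesteps the practical difficulties (band shape, peak spacing, resolution) discussed above.

```python
def sequence_from_fragments(fragment_lengths):
    """Rebuild a sequence from per-base sets of fragment lengths.

    fragment_lengths maps each base ("A", "C", "G", "T") to the lengths of the
    terminated fragments observed for that base (hypothetical input layout).
    A fragment of length n ending in base X places X at position n.
    """
    position_to_base = {}
    for base, lengths in fragment_lengths.items():
        for length in lengths:
            if length in position_to_base:
                raise ValueError(f"position {length} claimed by two bases")
            position_to_base[length] = base
    if not position_to_base:
        return ""
    last = max(position_to_base)
    # Positions with no fragment in any set are reported as "N" (unknown).
    return "".join(position_to_base.get(i, "N") for i in range(1, last + 1))


# Toy data: one list of fragment lengths per termination reaction.
fragments = {"A": [1, 4], "C": [2, 6], "G": [3], "T": [5]}
sequence = sequence_from_fragments(fragments)
print(sequence)
```

In practice the fragment lengths are not observed directly but must be inferred from migration times or distances, which is why alignment of the data traces matters.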
In order to obtain meaningful sequence information from raw data obtained by electrophoresis of labeled sequencing fragments, one of the most important factors is the alignment of the data traces representing each species of base. In non-automated systems, this is frequently done by eye, and the eye of a skilled technician is in fact a remarkable tool for this purpose. Commonly assigned U.S. Pat. No. 5,916,747, which is incorporated herein by reference, discloses a method for aligning data traces from four channels of an automated electrophoresis detection apparatus in which each channel detects the products of one of four chain-termination DNA sequencing reactions, such that the four channels together provide information concerning the sequence of all four bases within a nucleic acid polymer being analyzed. The method places the four data traces in a trial alignment, and then determines coefficients of shift and stretch for selected data points within each normalized data trace to optimize a cost function which reflects the extent of overlap of peaks in the combined normalized data traces to which the coefficients have been applied. Warp functions are then generated for the normalized data traces from the coefficients of shift and stretch determined for the selected data points, and applied to the respective data traces to produce four warped data traces which are assembled to form an aligned data set. This data set is then used for base-calling to complete the sequence determination process.
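The general idea of choosing shift and stretch coefficients against an overlap-based cost function can be sketched as follows. This is a minimal illustration under stated assumptions and is not the procedure of U.S. Pat. No. 5,916,747: it uses only two synthetic traces, a single global shift and stretch rather than coefficients for selected data points, an exhaustive grid search, and a cost equal to the summed product of the two traces, with lower overlap treated as better. All function names and numeric values are hypothetical.

```python
import math


def gaussian_trace(centers, n_samples, width=2.0):
    """Synthetic data trace: one Gaussian peak per center (illustrative only)."""
    return [
        sum(math.exp(-((i - c) / width) ** 2 / 2.0) for c in centers)
        for i in range(n_samples)
    ]


def warp(trace, shift, stretch):
    """Resample so that output sample i reads input position (i - shift) / stretch."""
    last = len(trace) - 1
    out = []
    for i in range(len(trace)):
        x = (i - shift) / stretch
        if x < 0 or x > last:
            out.append(0.0)  # outside the recorded trace
            continue
        lo = int(x)
        hi = min(lo + 1, last)
        frac = x - lo
        out.append(trace[lo] * (1.0 - frac) + trace[hi] * frac)  # linear interpolation
    return out


def overlap_cost(a, b):
    """Assumed cost: peaks of different bases should not coincide."""
    return sum(x * y for x, y in zip(a, b))


def best_shift_and_stretch(fixed, moving, shifts, stretches):
    """Grid search for the (cost, shift, stretch) with the lowest overlap."""
    best = None
    for shift in shifts:
        for stretch in stretches:
            cost = overlap_cost(fixed, warp(moving, shift, stretch))
            if best is None or cost < best[0]:
                best = (cost, shift, stretch)
    return best


fixed = gaussian_trace([20, 40, 60], 80)   # e.g. one base channel
moving = gaussian_trace([27, 47], 80)      # second channel, drifted toward the first
identity_cost = overlap_cost(fixed, moving)
best_cost, best_shift, best_stretch = best_shift_and_stretch(
    fixed, moving, shifts=range(-5, 6), stretches=[0.95, 1.0, 1.05]
)
print(identity_cost, best_cost, best_shift, best_stretch)
```

The search range is deliberately narrow. With an overlap-minimizing cost, an unconstrained search could lower the cost simply by pushing peaks off the end of the trace, so any practical implementation would need bounds or additional cost terms.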
The procedure of the '747 patent is generally suited for the determination of sequences where explicit data for the positions of all four bases are obtained. On the other hand, it is not always necessary to determine the positions of all four species of bases in order to obtain diagnostic information from a given polynucleotide. (See commonly assigned U.S. Pat. No. 5,834,189, which is incorporated herein by reference.) Commonly assigned U.S. Pat. No. 5,853,979, which is incorporated herein by reference, discloses a method for the interpretation of experimental fragment patterns for polynucleotides having putatively known sequences. In this method, at least one raw fragment pattern representing the positions of a selected nucleotide base as a function of migration time or distance is obtained for the experimental sample. The fragment pattern is evaluated to determine one or more "normalization coefficients." These normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching or shrinking of the clean fragment pattern, or segments thereof, which are necessary to obtain a suitably high degree of correlation between the clean fragment pattern and a standard fragment pattern which represents the positions of the selected nucleic acid base within a standard polymer actually having the known sequence as a function of migration time or distance. The normalization coefficients are then applied to the fragment pattern to produce a normalized fragment pattern which is used for base-calling in a conventional manner. As indicated, however, this technique requires prior knowledge of the expected fragment pattern for the polynucleotide being analyzed.
Notwithstanding such techniques, there remains room for improvement in the manner in which automated analysis of sequencing fragment patterns is carried out. In particular, there remains a need for systems which allow enhanced read-length, i.e., the analysis of a greater number of bases in a single lane of a gel, without loss of accuracy or substantial increase in analysis time. It is an object of the present invention to provide a method which answers this need.
The present invention provides a method for aligning sequence data traces. In accordance with the invention, an experimental data trace representing the positions of a first species of base within a target polynucleotide and a reference data trace representing the positions of a second species of base (which may be the same as or different from the first species) within a reference polynucleotide are obtained by separating appropriate sequencing fragments generated from the target and reference polynucleotides in a common lane of an electrophoresis gel. For each reference data trace, a plurality of peaks corresponding to fragments having a size in the range of 40 to 1200 bases are selected. A base number is assigned to each of the selected peaks in the reference data trace, and a numerical "peak file" is created with information about the peak number and migration time (or distance). This peak file is analyzed to determine a set of polynomial coefficients which will allow substantial linearization of a plot of peak number versus separation between adjacent peaks and alignment of the traces with respect to each other. These coefficients are used to create a corrected time scale identifying where peaks should be located on a given experimental gel. This corrected time scale is used to guide the sampling of the experimental data, and for assignment of peaks within the data.
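The use of a peak file to derive polynomial coefficients and a corrected time scale can be sketched in Python as follows. This is a simplified illustration and not a statement of the claimed method: it assumes the peak file is a list of (base number, migration time) pairs, fits migration time directly as a low-order polynomial in base number by ordinary least squares, and uses synthetic, noise-free values. The function names, the polynomial degree and the data are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via normal equations."""
    size = degree + 1
    a = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(size)]
    return solve_linear(a, b)


def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Hypothetical peak file from the reference trace: (base number, migration time).
peak_file = [(n, 100.0 + 2.0 * n + 0.001 * n * n) for n in range(40, 201, 20)]

# Base numbers are scaled to at most 1 to keep the normal equations well conditioned.
scale = float(max(n for n, _ in peak_file))
coeffs = fit_polynomial(
    [n / scale for n, _ in peak_file], [t for _, t in peak_file], degree=2
)

# Corrected time scale: where a peak for each base number is expected to appear.
corrected_time_scale = {n: evaluate(coeffs, n / scale) for n in range(40, 201)}
print(corrected_time_scale[100])
```

The resulting table of expected peak times could then be used to decide where to sample the experimental trace run in the same lane. With real data, the polynomial degree and the handling of noise would need to be chosen with care.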