During routine sequencing of DNA from samples (such as HIV genotyping after RT-PCR conversion from RNA to DNA), normally only one strand (forward or reverse) of the DNA is actually sequenced. In this case, the researcher must decide whether the output signal and the resulting basecall are accurate, based on their experience and skill in reading sequence signals. If the signal and resulting basecall are of questionable reliability, then the researcher must start the sequencing run again in the hope of obtaining a better signal.
In some cases, the forward and reverse strands are both sequenced, such as by using two dyes on a MICROGENE CLIPPER sequencer manufactured by Visible Genetics Inc. Forward and reverse strand sequencing provides the researcher with more information and allows the researcher to evaluate the quality and reliability of the data from both strands. If the bases on both strands complement each other as expected, then this helps to confirm the reliability of the sequence information. However, in some instances, after the signal data from sequencing is assigned a base (e.g. A, C, G or T), the corresponding base on the opposite strand does not match. If the signal and resulting basecall are of questionable reliability, then the researcher must start the sequencing run again in the hope of obtaining a better signal. Alternatively, the researcher might manually review ("eyeball" analysis) the signal data from both the forward and reverse strands and make a decision on which strand's data is more reliable. Unfortunately, any such decision will vary between individual researchers and can lead to inconsistent determination of reliability within the same sequencing run. Furthermore, this kind of eyeball analysis requires special training, which makes it poorly suited for routine diagnostic applications.
It would therefore be desirable to have a method for sequencing nucleic acid polymers in which discrepancies can be resolved using automated procedures, i.e. using computerized data analysis. It is an object of the present invention to provide such a method, and an apparatus for performing the method.
In accordance with the invention, nucleic acid polymers are sequenced in a method comprising the steps of:
(a) obtaining forward and reverse data sets for forward and reverse strands of the sample nucleic acid;
(b) determining the apparent sequence of bases for the forward and reverse data sets;
(c) comparing the apparent forward and reverse sequences of bases for perfect complementarity to identify any deviations from complementarity in the apparent sequence, any such deviation presenting a choice between two bases, only one of which is correct;
(d) applying a confidence algorithm to peaks in the data set associated with a deviation to arrive at a numerical confidence value; and
(e) comparing each numerical confidence value to a predetermined threshold and selecting as the correct base the base represented by the peak which has the better numerical confidence value, provided that the numerical confidence value is better than the threshold.
The confidence algorithm takes into account at least one, and preferably more than one, of several specific characteristics of the peaks in the data sets that were not complementary.
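By way of illustration only, steps (c) through (e) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names Peak and resolve_calls are hypothetical, each peak is assumed to already carry a numerical confidence value from step (d), a higher value is assumed to be "better", and the forward and reverse reads are assumed to be of equal length with no gaps, so that position i of the forward read pairs with position n-1-i of the reverse read.

```python
from dataclasses import dataclass
from typing import List, Optional

# Watson-Crick complements used to map a reverse-strand call onto the forward strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


@dataclass
class Peak:
    base: str          # apparent base called for this peak (step b)
    confidence: float  # hypothetical confidence value (step d); higher assumed better


def resolve_calls(forward: List[Peak], reverse: List[Peak],
                  threshold: float) -> List[Optional[str]]:
    """Return forward-strand base calls; None marks an unresolved deviation.

    `reverse` is given in its own read order, so it is walked backwards and
    complemented to line up with `forward` (assumption: equal length, no gaps).
    """
    if len(forward) != len(reverse):
        raise ValueError("this sketch assumes equal-length, gap-free reads")

    n = len(forward)
    calls: List[Optional[str]] = []
    for i, f in enumerate(forward):
        r = reverse[n - 1 - i]
        r_as_forward = COMPLEMENT[r.base]

        # Step (c): perfect complementarity, so no choice has to be made.
        if f.base == r_as_forward:
            calls.append(f.base)
            continue

        # Step (e): pick the peak with the better confidence value.
        if f.confidence > r.confidence:
            best_base, best_conf = f.base, f.confidence
        elif r.confidence > f.confidence:
            best_base, best_conf = r_as_forward, r.confidence
        else:
            calls.append(None)  # tie: neither peak is better
            continue

        # Accept the choice only if it is also better than the threshold.
        calls.append(best_base if best_conf > threshold else None)
    return calls


if __name__ == "__main__":
    # Forward read A C G A; the reverse read disagrees at forward position 1.
    fwd = [Peak("A", 0.9), Peak("C", 0.9), Peak("G", 0.9), Peak("A", 0.9)]
    rev = [Peak("T", 0.9), Peak("C", 0.9), Peak("A", 0.4), Peak("T", 0.9)]
    print(resolve_calls(fwd, rev, threshold=0.8))
```

In this sketch a deviation that cannot be resolved is returned as None rather than guessed, so that such positions could be flagged for review or for a repeat sequencing run. How the confidence value itself is computed from the peak characteristics is left outside the sketch.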