The present invention relates to the automated determination of the nucleic acid sequence of a polynucleotide. More particularly, the invention relates to an improved method and apparatus for determining the sequence of a nucleic acid, particularly DNA, that utilizes novel informative variables related to the nucleic acid sequence and an associated method and apparatus that improves the resolution of nucleic acid signals in the digitized data stream corresponding to the sequencing ladder.
Current methods of DNA sequencing rely upon electrophoretic separation of incremental oligonucleotides. These stochastic arrays of oligomers are usually produced by one of two methods. The Maxam-Gilbert method (Proc. Natl. Acad. Sci. USA, 74: 560-564 (1977)) is a chemical method used to randomly cleave the DNA strand, while the Sanger et al. method (Proc. Natl. Acad. Sci. USA, 74: 5463-5467 (1977)) uses dideoxy terminators to halt the biosynthesis process of replication.
The prior art determinants of DNA sequence have been the spatial ordering of oligomers and/or the use of differential labels, with basecalling accomplished in a deterministic manner using these indicators at particular points. Thus, each base was identified individually, apart from its neighbors. For instance, one conventional approach is to monitor the signal data stream and flag the point at which the signal value begins to decrease after passing through a maximum. By locating successive maxima, the region of each successive signal can be located. A property related to the differential label, such as the ratio of fluorescence between two or more channels in the region of the maxima, can then be used to identify the particular base located at the end of the oligonucleotide corresponding to the particular signal. However, this method suffers not only from the possibility of failing to locate successive peaks, but also from problems related to inaccurate background subtraction.
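The conventional approach described above can be illustrated with a minimal sketch. It is offered only as an assumption-laden illustration, not as the implementation used by any particular instrument: the two-channel layout, the function names, the window width, and the ratio thresholds in RATIO_BINS are all hypothetical, and the sketch assumes the background has already been subtracted.

```python
# Hypothetical upper bounds on the channel A / channel B fluorescence ratio
# for each base. A real instrument would calibrate these for its own labels.
RATIO_BINS = [(0.5, "A"), (1.0, "C"), (2.0, "G"), (float("inf"), "T")]


def find_maxima(signal, min_height=0.0):
    """Return indices where the signal stops rising and starts to fall."""
    peaks = []
    for i in range(1, len(signal) - 1):
        rising = signal[i - 1] < signal[i]
        falling_next = signal[i] >= signal[i + 1]
        if rising and falling_next and signal[i] > min_height:
            peaks.append(i)
    return peaks


def call_base(chan_a, chan_b, peak, half_width=1):
    """Identify one base from the channel ratio in the region of a maximum."""
    lo = max(0, peak - half_width)
    hi = min(len(chan_a), peak + half_width + 1)
    sum_a = sum(chan_a[lo:hi])
    sum_b = sum(chan_b[lo:hi])
    if sum_b == 0:
        return "N"  # ratio undefined: report an ambiguous base
    ratio = sum_a / sum_b
    for upper, base in RATIO_BINS:
        if ratio < upper:
            return base
    return "N"


def call_sequence(chan_a, chan_b, min_height=0.0):
    """Call each base individually, apart from its neighbors."""
    total = [a + b for a, b in zip(chan_a, chan_b)]
    peaks = find_maxima(total, min_height)
    return "".join(call_base(chan_a, chan_b, p) for p in peaks)
```

Because each call depends only on the data in the region of a single maximum, a peak that is merged with its neighbor is never found, and an error in background subtraction shifts the ratio directly. These are the two weaknesses noted above.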
The instrumental design and operation of DNA sequencers vary from simple to elaborate. Yet a fundamental limitation of each of these systems is imposed by the separation and resolution of oligonucleotides through electrophoresis in DNA sequencing gels, such as denaturing polyacrylamide gels. The system of gel electrophoresis supports determination of DNA sequences from a single sample over a range from one to hundreds of nucleotides.
The manual, autoradiographic approach to the separation of oligonucleotides presents a static view of oligomer ladders after a fixed period of electrophoresis. Recently introduced automated DNA sequencers enable real-time detection of oligomers by recording the signal emanating from each oligomer's fluorescent or radioactive label as each oligomer of a sequencing ladder passes the instrument's detector(s).
Automation of the separation, data collection and analysis promises efficient and rapid operation and elimination of human errors in the transcription of results to DNA sequence files. Under ideal conditions of separation and resolution of the oligomers, the identification of successive terminal nucleotides is a straightforward exercise. Many DNA sequences, however, present local domains of anomalous oligomer yields or separations. Thus, errors in manual or automated DNA sequencing files are much more likely when either the separations of oligomers, the ratio of signal to noise, or both, are sub-optimal for trivial translation of the data. These errors appear as miscalled bases, extra or missing bases, or ambiguous and unidentified bases in the DNA sequence file.
Typical performance with contemporary automated DNA sequencing systems and the scanner/readers for sequencing gel autoradiograms is on the order of 90% to 97% correctly identified bases in any single sequencing run. For any particular DNA sequence, however, this level of performance, with automated data acquisition and translation, is seldom much worse than the results of a manual DNA sequencing analysis. The automated systems reliably generate data of comparable quality in less time, with less labor-intensive effort, and with lower quantities of costly reagents.
Single strand error rates of 1% are often accepted as tolerable DNA sequencing performance because comparison with complementary strand sequence data should reduce the error rate to about 1 per 10,000 base pairs. However, this accuracy is feasible only if each mismatch of sequence and complement is correctly recognized and correctly reconciled. Even then, error rates of 0.10% to 0.01% are in the range of one mistaken base per gene. This error rate approximates the level of variation among alleles of a gene pool, some of which may correlate with severe burdens of inherited pathology.
In practice, comparisons of complementary single strands with error rates of 1% initially present about 1 mismatch per 50 base pairs, necessitating the identification and reconciliation of many sequence mismatches by the professional investigator. Practical experience of many investigators, over several years of manual DNA sequencing, indicates that the reconciliation of complementary strand mismatches is challenging, but feasible. The process is clearly tedious and time consuming. Furthermore, criteria for objective reconciliation of errors are not known.
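The figures quoted above for complementary strand comparison can be reproduced with a short calculation. The sketch below rests on a simplifying assumption that is not stated in the text: errors on the two strands are independent and occur with the same per-base probability. The function name and return fields are hypothetical.

```python
def two_strand_rates(single_strand_error):
    """Estimate per-position rates when a strand is compared with its complement.

    Assumes independent errors with equal probability on each strand.
    """
    p = single_strand_error
    # Exactly one strand is wrong: seen as a mismatch needing reconciliation.
    mismatch_rate = 2 * p * (1 - p)
    # Both strands are wrong at the same position: an upper bound on errors
    # that comparison alone cannot reveal.
    residual_rate = p * p
    return {"mismatch_rate": mismatch_rate, "residual_rate": residual_rate}


rates = two_strand_rates(0.01)
bases_per_mismatch = 1 / rates["mismatch_rate"]
bases_per_residual_error = 1 / rates["residual_rate"]
```

With a 1% single strand error rate, this model gives roughly one mismatch per 50 base pairs and a residual error rate of about 1 per 10,000 base pairs, which agrees with the figures above. The residual figure is reached only if every mismatch is reconciled correctly.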
Because small improvements in accuracy in the translation of raw sequencing data have a substantial impact on the level of undetected errors in the finished DNA sequences and, thus, on the cost of DNA sequencing, there is a need for a method and apparatus to improve the accuracy of the finished DNA sequences. The instant invention solves this problem by providing methods and apparatus for both the enhanced separation and resolution of signals emanating from the labeled oligonucleotides and the improved determination of nucleic acid sequence employing novel informative variables related to the local nucleic acid sequence.