Automated DNA sequencing presents a number of challenges to the data analysis process. The input data can be highly variable and predictive models of data behavior are lacking, yet computer analysis routines are expected to produce highly accurate output data.
Basecalling is the data analysis step of automated DNA sequencing: it takes the time-varying signal of four fluorescence intensities and produces an estimate of the underlying DNA sequence that gave rise to that signal.
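As a point of reference, the input and output of basecalling can be illustrated with a deliberately naive sketch. It is not the method of any particular sequencer or software package. It assumes the four traces are already baseline-corrected, color-separated and mobility-corrected, and it simply calls the strongest channel at each local maximum. The names `gaussian_peak`, `call_bases` and `min_height`, and the synthetic data, are hypothetical.

```python
import math

BASES = "ACGT"


def gaussian_peak(n, center, sigma=2.0):
    """Synthetic, unit-height Gaussian peak sampled at n time points (illustration only)."""
    return [math.exp(-((t - center) ** 2) / (2 * sigma ** 2)) for t in range(n)]


def call_bases(traces, min_height=0.1):
    """Naive caller: traces maps 'A', 'C', 'G', 'T' to equal-length intensity lists."""
    n = len(traces["A"])
    # Envelope: intensity of the strongest channel at each time sample.
    envelope = [max(traces[b][t] for b in BASES) for t in range(n)]
    calls = []
    for t in range(1, n - 1):
        is_peak = envelope[t - 1] < envelope[t] >= envelope[t + 1]
        if is_peak and envelope[t] >= min_height:
            # Call the base whose channel dominates at the peak apex.
            calls.append(max(BASES, key=lambda b: traces[b][t]))
    return "".join(calls)


# Four well-resolved synthetic peaks, one per channel.
n = 50
traces = {
    "G": gaussian_peak(n, 10),
    "A": gaussian_peak(n, 20),
    "T": gaussian_peak(n, 30),
    "C": gaussian_peak(n, 40),
}
print(call_bases(traces))  # GATC
```

A caller this simple works only while peaks are well resolved. It has no notion of expected spacing or peak shape, which is why it would fail on the overlapping peaks seen late in a run.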
Generally, basecalling software works by applying a “model” of various phenomena, such as diffusion and smearing, differences in mobility of the various dyes and sequence-specific variations in migration, to observed spectral data in order to predict the true behavior of the fragments and how they are separating. The more realistic the model, the better an algorithm can deconvolve the raw signal data into a representation of the true fragment separation order and, hence, the sequence (and/or fragment size). The widely used algorithms provide only very simple and approximate models.
Although each of these areas for improvement has been addressed with some degree of success, it is clear that a more integrated approach is needed to achieve a breakthrough over current methods. Particularly useful would be a modification to the sequencing process that gives an improved algorithm the added information it needs to develop a more refined view of the true data signal and a realistic model of the separating fragments.
A major source of basecalling error is an incorrect estimate of peak spacing, i.e., of when fragments associated with a particular base will cross the detector. This error is especially noticeable in homopolymer regions late in the run; e.g., a run of, say, five As can be incorrectly called as six As. Currently, hard-coded spacing curves are produced from a number of separate calibration runs and extensive analysis. However, variations in the applied running conditions, or simply uncontrolled experimental variation, can produce fragment separation profiles that deviate substantially from the hard-coded curves. A dynamic method of determining peak spacing that is robust to different run conditions would substantially improve basecalling accuracy late in the run.
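The effect of a spacing error on a homopolymer call can be made concrete with a small sketch. It is one possible way to estimate spacing from the run itself, not a description of any existing basecaller: the local spacing is taken as the median gap between nearby resolved peaks, and the number of bases in an unresolved stretch is inferred from the distance between its flanking peaks. The function names, the window size and the peak times are hypothetical, and the value 10.3 stands in for a hard-coded calibration curve that no longer matches the run.

```python
from statistics import median


def local_spacing(peak_times, at_time, window=10):
    """Median gap between the `window` resolved peaks nearest to at_time."""
    if len(peak_times) < 2:
        raise ValueError("need at least two resolved peaks")
    nearest = sorted(sorted(peak_times, key=lambda t: abs(t - at_time))[:window])
    gaps = [b - a for a, b in zip(nearest, nearest[1:])]
    # The median ignores the occasional large gap that spans an unresolved region.
    return median(gaps)


def bases_between(left_peak, right_peak, spacing):
    """Number of bases that fit between two flanking resolved peaks."""
    return round((right_peak - left_peak) / spacing) - 1


# Synthetic resolved peaks (true spacing 12) around an unresolved stretch from 100 to 172.
peaks = [40, 52, 64, 76, 88, 100, 172, 184, 196, 208, 220]
spacing = local_spacing(peaks, at_time=136)

print(bases_between(100, 172, spacing))  # 5, using spacing measured from this run
print(bases_between(100, 172, 10.3))     # 6, using a mismatched hard-coded spacing
```

The median is chosen here because the gap that spans the unresolved region is itself an outlier among the measured gaps; a mean would be pulled toward it.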
The mathematical technique of deconvolution provides another opportunity to improve basecalling accuracy. Deconvolution attempts to explain raw data as a series of known peak shapes and has been shown to adequately separate overlapping peaks. However, deconvolution can produce spurious peaks if the true peak shape and size are not known in advance. A method of measuring the shape and width of known, isolated peaks would improve deconvolution methods, which in turn would substantially improve basecalling accuracy, particularly late in the run.
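Measuring the width of an isolated peak can be sketched as follows. The sketch covers only the measurement step, not deconvolution itself. It assumes a baseline-subtracted, positive, roughly Gaussian peak whose apex index is already known, and it estimates the full width at half maximum (FWHM) by linear interpolation before converting it to a Gaussian sigma. The function name `measure_sigma` and the synthetic peak are hypothetical.

```python
import math


def measure_sigma(signal, apex):
    """Estimate the Gaussian sigma of an isolated peak from its FWHM."""
    half = signal[apex] / 2.0

    def crossing(step):
        i = apex
        # Walk away from the apex until the next sample is at or below half height.
        while 0 <= i + step < len(signal) and signal[i + step] > half:
            i += step
        j = i + step
        if not 0 <= j < len(signal):
            raise ValueError("peak does not fall to half height inside the trace")
        # Linear interpolation between the two samples that bracket half height.
        frac = (signal[i] - half) / (signal[i] - signal[j])
        return i + step * frac

    fwhm = crossing(+1) - crossing(-1)
    # For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma.
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


# Synthetic isolated peak with sigma = 3 samples, centered at sample 50.
signal = [math.exp(-((t - 50) ** 2) / (2 * 3.0 ** 2)) for t in range(100)]
sigma = measure_sigma(signal, apex=50)
print(round(sigma, 1))  # 3.0
```

A width measured this way could then be supplied to a deconvolution routine as the assumed peak shape in place of a fixed, precalibrated value. Linear interpolation slightly overestimates the width of a coarsely sampled peak, so a curve fit may be preferable when few samples span the peak.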