Mixtures of molecular compounds are often separated into their various constituents using chromatographic techniques, based upon their differential migration or movement through a sieving medium according to certain properties, such as molecular weight or affinity for a solid adsorbent. The separated constituent compounds may be visualized by a number of different techniques, most of which require that the constituent compounds be labeled with a molecule that emits electromagnetic radiation, such as a fluorescent dye. This radiation can be detected by an optical detector sensitive in the spectral range of emitted radiation and then converted to an electronic or visual signal indicating the identity, amount, and order of the labeled fragments.
Chromatographic methods are commonly used to determine the sequence of a nucleic acid sample. Such methods involve the electrophoretic separation of mixtures of nucleic acid chain-termination fragments representing a size distribution of fragments terminating at each A, G, T, and C of the nucleic acid, with each fragment being labeled with a detectable label specific to the base type (A, T, G, or C) of the last nucleotide base of the fragment (in the case of dye-terminator labeling chemistry). Alternatively, the primer used in the sequencing reaction can be labeled. The chain-termination fragments are electrophoretically separated in a gel medium according to fragment size, resulting in a pattern of bands corresponding to the order of the terminal nucleic acid base type. An optical detector detects the signal emitted by the fragment labels in the order of migration and converts the signal to a visualized pattern of peaks representing the terminal nucleotide base of each fragment. The pattern of peaks can then be analyzed by signal processing technology and/or a computer to determine the order, quantity, and identity of the terminating base type (and hence the sequence) of the individual components of the nucleic acid sample. Data acquired by an electrophoresis-based instrument (such as a slab-gel or capillary system) is known as a chromatogram or data trace, which provides a chronological series of peaks representing the nucleotide sequence.
Because chromatographic methods of nucleic acid sequencing utilize an electrophoretic sieving medium to separate DNA fragments on the basis of size, the accuracy of the sequence results depends on accurate detection of the chronological order in which the fragments migrate through the medium, as indicated by the presence and order of signal peaks representing individual fragments in a chromatogram or sequence data trace. Failure to identify a peak will result in the loss of a base (a deletion error) in the identified sequence where a base actually exists. Identification of a false-positive peak (a peak that does not in fact represent a real nucleotide fragment) will result in a base being inserted (an insertion error) in the identified sequence where no base actually exists.
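The two error types can be illustrated with a minimal sketch. The peak list, migration times, and the helper `call_sequence` are hypothetical and chosen only for illustration; they do not represent any particular base-calling method.

```python
def call_sequence(peaks):
    """Read bases off a list of (migration_time, base) peaks in time order."""
    return "".join(base for _, base in sorted(peaks))


# Hypothetical peaks for a four-base fragment ladder (times are arbitrary).
true_peaks = [(10.0, "A"), (20.1, "C"), (30.3, "G"), (40.2, "T")]

# A peak that the detector fails to identify: the G is lost from the call.
missed = [p for p in true_peaks if p != (30.3, "G")]

# A false-positive peak between C and G: an extra T appears in the call.
spurious = true_peaks + [(25.0, "T")]

true_seq = call_sequence(true_peaks)
deletion_seq = call_sequence(missed)
insertion_seq = call_sequence(spurious)

print(true_seq, deletion_seq, insertion_seq)  # ACGT ACT ACTGT
```

In both cases every base after the error is shifted by one position relative to the true sequence, which is why a single missed or false peak affects downstream alignment as well as the base in question.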
Accurate identification of the order, identity, and quantity of constituent components (e.g., nucleic acid base types) of a chromatographic separation process is critical for many applications. However, the accuracy of current methods is limited by a number of factors. First, the spacing of peaks produced by fragments differing in length by a single nucleotide tends to change with the size of the fragment. Differences in the spacing of bands among multiple lanes also contribute to inaccuracies. Additionally, the electromagnetic radiation emitted by the detectable label is inherently stochastic in nature, resulting in a spread or dispersion of the signal. Background noise is also inherent, and contributes to a low but variable pattern of visual darkening or visual signal over the lane and in the peaks representing the signal. The general intensity of labeling often varies between the four nucleotide types, and there is furthermore a tendency for bands within a given lane to vary in relative intensity in an unpredictable manner. Consequently, signals generated by the detectable labels of the components are not discrete and often result in overlapping peaks. Overlapping peaks tend to occur frequently towards the end of the sequence, especially when there is a run of multiple components having the same identity (e.g., AAAAA, GGG, or CCCC), which become convoluted and appear as a single peak. They generally occur because the resolution provided by a sieving medium decreases as the length of the nucleic acid fragment increases. All of the above factors contribute to difficulties in resolving individual constituent peaks, ordering the peaks, and determining the correct sequence of bases.
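The merging of a homopolymer run into a single apparent peak can be sketched numerically. The example below assumes, purely for illustration, that each fragment contributes a Gaussian-shaped peak of equal height. The function names, peak widths, and the 10% height floor are hypothetical choices, not parameters of any real instrument.

```python
import math


def trace(n_peaks, spacing, sigma, step=0.01, margin=3.0):
    """Sum of n_peaks equal-height Gaussian peaks, sampled on a regular grid."""
    lo = -margin
    hi = (n_peaks - 1) * spacing + margin
    n_samples = int(round((hi - lo) / step)) + 1
    xs = [lo + i * step for i in range(n_samples)]
    centers = [k * spacing for k in range(n_peaks)]
    return [
        sum(math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers)
        for x in xs
    ]


def count_maxima(ys, floor=0.1):
    """Count local maxima that rise above a fraction of the tallest sample."""
    threshold = floor * max(ys)
    return sum(
        1
        for i in range(1, len(ys) - 1)
        if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1] and ys[i] > threshold
    )


# A run of five identical bases (e.g., AAAAA) at unit spacing.
resolved = count_maxima(trace(5, spacing=1.0, sigma=0.25))  # narrow peaks
merged = count_maxima(trace(5, spacing=1.0, sigma=0.8))     # broad peaks

print(resolved, merged)  # 5 1
```

With narrow peaks the five bases appear as five separate maxima. Once the peak width becomes comparable to the spacing, the same five bases produce a single broad maximum, so a detector that relies on local maxima alone would call one base where five exist.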
Various methods have been utilized to circumvent the above problems and improve the accuracy of base-calling, including highly configurable data processing modules, homomorphic deconvolution followed by peak detection, neural networks, grid search assuming regularly spaced Gaussian pulses, expert systems, and others. The various methods generally fall into two categories: deconvolution methods and peak-fitting methods. Peak-fitting methods are based on empirical knowledge of the number, location, and characteristics of peaks of the same or a cognate sequence, and therefore cannot be used where such empirical data are not available. Deconvolution methods, on the other hand, are based on an unbiased interpretation of the peak data generated by the sample sequence, and involve an enhancement of the data by means of computational elimination or reduction of variables contributing to the blurring of the peak, which should theoretically result in an ideal, discrete peak profile. Typical deconvolution base-calling methods use simple Fourier methods to predict base positions and then find peaks in the data as regions about inflexions or concavities in the signal that exceed certain area thresholds. Deconvolution methods, however, have limited utility where such inflexions between peaks are not present. Deconvolution is also highly sensitive to noise.
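As a rough illustration of how a simple Fourier method can predict base positions, the sketch below estimates the average peak spacing of a trace from the dominant component of its discrete Fourier transform. It covers only the spacing-prediction step, uses a naive DFT for self-containment, and operates on a hypothetical, noise-free synthetic trace; the function names and parameters are assumptions and do not reproduce any specific base-calling program.

```python
import cmath
import math


def synthetic_trace(n_samples=256, spacing=16, sigma=3.0):
    """Hypothetical trace: equal Gaussian peaks every `spacing` samples."""
    centers = range(spacing // 2, n_samples, spacing)
    return [
        sum(math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2)) for c in centers)
        for i in range(n_samples)
    ]


def estimate_spacing(signal):
    """Estimate peak spacing (in samples) from the strongest DFT component."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]  # remove the zero-frequency term
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * i / n)
            for i, v in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k  # period of the dominant frequency


spacing_estimate = estimate_spacing(synthetic_trace())
print(spacing_estimate)  # 16.0
```

The estimated spacing gives a grid of expected base positions against which candidate peaks can be checked. On real traces the spacing drifts with fragment size and the signal is noisy, so such an estimate would have to be computed over local windows, which is one reason these methods degrade in low-resolution regions.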
Accordingly, there is a continuing need to develop improved methods of base-calling, particularly methods that are capable of resolving peaks in low-resolution regions of peak data.