Extracting allele peak parameters from raw DNA data, e.g., electropherogram, chromatographic, or spectral data, is a task required in many disciplines. A problem associated with extracting allele peak parameters is the fitting of an analytical curve to a set of peak data to obtain a best fit of the data in order to find the peak model parameters that represent each peak.
Analytical equations are needed to model peaks. The basic shape of an ideal chromatographic peak is similar to that of a Gaussian curve. Many different equations have been used to model peaks in various chromatographic applications [7, 10, 13, 14, 16, 17, 20, 21]. Di Marco and Bombi [17] have reviewed 86 mathematical functions that have been proposed to fit chromatographic peaks.
After an analytical equation such as the Gaussian model has been chosen to model a peak, a method must then be chosen to extract the optimal parameters that would fit the data. An optimization approach is a common method for addressing this issue. It is important to choose an optimization method that will converge given the peak model equation. Many investigators employ commercial software to perform fits as opposed to developing their own system. The Microsoft Excel Solver tool [2], which uses the Levenberg-Marquardt method [15, 18], has been used frequently to perform fits [11, 16, 19, 23]. The in-house software developed in [14] also uses the Levenberg-Marquardt method.
There are two classes of optimization methods: constrained and unconstrained. The Gauss-Newton method is unconstrained, which can be implemented simply and is less computationally intensive than constrained optimization methods, but allows the solution parameter values to be arbitrary. Most of the reviewed publications used commercial optimization software rather than software developed ‘in-house’. Lack of control over the fitting protocols or parameters is a concern when using commercial software. In 1994 Walsh and Diamond [23] discussed the use of Microsoft Excel Solver for fitting curves to non-linear data. Their routine requires the user to select the data to be fitted, choose a model equation for the given data, and manually provide an initial guess of parameters for the model equation. They evaluated Solver's ability to fit chromatography peaks, fluorescence decay processes, and ion-selective electrode characteristics. For chromatography peaks, they provided a comparison of fitting using 3 different models: Gaussian, exponentially modified Gaussian (EMG), and tanks-in-series. Of these models, the EMG was found to best fit the given peak data. They did not elaborate on guidelines for guessing the initial parameters or the selection of the appropriate model to use for fitting a particular peak.
In 2000, Eanes and Marcus [11] extended the use of Microsoft Solver to fit Gaussian-like spectral peaks. Using Visual Basic, they developed a program to process multiple peaks simultaneously (up to 3) for radio frequency glow discharge ion trap mass spectra. The summation of multiple peak models was used to perform the multiple peak fits. The software used Microsoft Solver to perform the optimization and Visual Basic macros to provide a graphical user interface and manage the peak data. A dialog box is used for users to set initial parameters and various fitting settings for each set of peak data fitted. Multiple spectra of the same scan can be performed at the same time with the software, but user input appears to be required for each new peak or group of peaks to be processed. Adjustment to various peak models was accomplished by a generic template used by the software. Both Gaussian and Lorentzian peak models were implemented; however, each model type was implemented separately with a different program.
A comparison of the performance in fitting by a series of peak model equations was made by Li in 2001 [16] who performed fitting using an empirically transformed Gaussian, polynomial modified Gaussian, generalized exponentially modified Gaussian, and hybrid function of the Gaussian and truncated exponential functions. Li also used Microsoft Solver to perform the nonlinear optimization where the user explicitly set the initial parameters. The quality of fit was determined by the sum of squares of the residuals; the empirically transformed Gaussian function was found to give the best fit based on this quality measure. The four models used had varying degrees of complexity, and they did not explicitly give timing or iteration results for the fits. However, the hybrid function of Gaussian and truncated exponential functions performed second best among the functions tested and contained only four parameters compared to the seven parameters used in the empirically transformed Gaussian function.
An automatic curve fitting algorithm was described by Alsymeyer and Marquardt in 2004 [6]. The algorithm is automatic in that it does not require human interaction or a priori knowledge of peak initial parameters or other spectrum properties. The method fits the peaks in the entire spectrum with a single model and iteratively adds additional peak components to the model equation until a stopping condition is fulfilled generally based on a goodness-of-fit criteria such as the F-Test. Initial parameters for the peak are determined by analyzing the data where the largest fitting error was found from the previous model fit. The peak model equation used is the Voigt function, which is a convolution of the Lorentzian and Gaussian distributions. The method starts with a pure baseline model and it should be noted that each iteration of the algorithm adds a peak to the existing model, thus adding additional parameters to the model to be optimized. The authors state that a sequence of 50 peak functions takes overnight to run with the current implementation. The current implementation is written in MATLAB without explicitly attempting to optimize performance, and they suggest that the use of a compiled programming language could help improve performance. Infrared and Raman spectra were modeled and the authors were able to obtain results with a good fit with respect to fitting error; however, they add that the resulting models are not necessarily physically correct. This implies that the model obtained fits closely to the experimental data, but that the peaks the model describes may not represent the actual physical components of the spectra.
Steffen et al. in 2005 presented another automatic method for fitting peaks in complex chromatograms [22]. Their approach involved extracting peak shapes (parameterized models) directly from the chromatogram and forming a standard peak shape based on the average shape of typical peaks found in the given data. The method defines an improved baseline for the data and then automatically detects peaks using maxima of the signal and minima of the second derivative. It fits the peaks with a hyperbolic scaling function that is applied to the standard peak model. They mention that peaks that differ too much from the standard peak may need to be identified and possibly fitted with a differently shaped model, but do not elaborate on specific methods for accomplishing this task. They implemented the software in Scilab code [12] and performed fitting on a large number of chromatograms (around 1,000). Comparison of their fit results to that of traditional interactive analysis proved to be fairly compatible. Verification using synthetic chromatograms yielded good general results, but they had problems with detection of extremely small peaks in difficult synthetic chromatograms. They indicate that the heuristic parameters in the implementation can be utilized for refinement of the fitting to improve results and tune the system.
Also, expert systems are known in other disciplines which facilitate the inclusion of expertise in the decision making process. Consequently, there is a need in the art for an allele peak fitting framework to properly fit raw DNA peak data for subsequent calling of alleles by experts in the field, or by automatic peak processing by an expert system.