Data preprocessing aimed at reducing the amount of data and extracting relevant information from multi- or high-dimensional data is a step in many data analysis techniques. For example, a Liquid Chromatography/Mass Spectrometry (LC-MS) data set may consist of several hundred scans with a broad mass range, e.g., from approximately 50-100 Da to several thousand Da (typically 2000-10000 Da), expressed in mass/charge (m/z) values. A data set of a single measurement consists of millions of data points, a significant proportion of which carry little or no useful information (e.g., electrical and chemical noise, non-relevant ‘real’ signals coming from mobile phase components, ion source contamination, and signals from bleeding of chromatographic material). Due to the number of data points, manual selection of relevant information is not feasible, at least in practical applications; therefore a technological approach using a suitable algorithm is necessary.
For many 2-dimensional, 3-dimensional or even higher-dimensional data sets, such as LC-MS data, run-to-run variation within the dimensions is observed and has a detrimental effect on pattern recognition analysis. A correct allocation of signals of the same substance across a collection of data sets (such as measurements of more than one sample) is an important prerequisite for a proper pattern recognition application. A false assignment of peaks to a chemical species within the pattern reduces the chance of finding the ‘true’ pattern.
In an LC-MS data set, the variability of retention times can have various causes, such as inhomogeneity of gradient formation, fluctuation of the flow rate, overloading of the chromatographic column, and chemical and mechanical changes due to the ageing of the chromatographic materials. The variability of the mass/charge measurement depends on factors such as the accuracy of the mass detection, the mass-to-charge value, the intensity values or the signal/noise ratio, and the generation of centroided spectra from continuous ones.
Many chemometric methods deal with the preprocessing of LC-MS data. The majority of these methods extract the informative part of a data set using an algorithm that analyzes the data in one dimension only. Some of the methods analyze the data in both dimensions simultaneously, resulting in a substantially higher quality of the preprocessed data.
Other Approaches
1) J. Chromatogr. A 771, 1997, 1-7: “Application of sequential paired covariance to liquid chromatography-mass spectrometry data; Enhancements in both the signal-to-noise ratio and resolution of analyte peaks in the chromatogram”, David C. Muddiman et al. The article describes a sequential paired covariance (SPC) method that generates a series of virtual amplified mass spectra. Each data point in a mass spectrum is multiplied by the corresponding data point from the following mass spectrum, resulting in a geometrically amplified spectrum; the number of spectra used in each multiplication operation defines the order of the covariance algorithm. A dramatic enhancement of the S/N ratio and of the resolution in the chromatogram is thus achieved; however, the algorithm can be used for qualitative analysis only, because the absolute quantitative information (both peak area and height) is lost by multiplying the consecutive data points.
2) Analytica Chimica Acta 446, 2001, 467-476: “Fast interpretation of complex LC-MS data using chemometrics”, W. Windig et al.
3) U.S. Pat. No. 5,672,869 entitled “Noise and background reduction method for component detection in chromatography/spectrometry” provides a component detection algorithm (CODA) that extracts a compound's information from LC-MS data by eliminating random noise, spikes and mobile phase peaks. For spike elimination, it assesses the differences between the original chromatogram and its smoothed form using a user-specified similarity index having a value between 0 and 1. To detect a chromatogram representing solvent background, a comparison of the average value of all data points within the selected mass chromatogram is used.
4) J. Chromatogr. A 849, 1999, 71-85: “Windowed mass selection method: A new data processing algorithm for liquid chromatography-mass spectrometry data”, C. M. Fleming et al.
In the reference, a method termed ‘windowed mass selection method’ (WMSM) is shown to eliminate random noise that occurs in the data. The preprocessing method consists of two steps to remove random background noise, and is based on the main assumption that analytes can be distinguished from noise by means of differences in peak width. The disclosed system makes a number of assumptions including:
    1. Any peak has a non-zero signal over the length of the window.
    2. A characteristic of random noise is that it does not have a constant signal over a number of scans defined by a window, but intermittently displays zero-amplitude intensities. Multiplication of intensities over a window range will result in zero signal.
    3. A low consistent background is removed by subtraction of a mean value of each chromatogram from this chromatogram.
    4. Mobile phase peaks are removed by selection criteria which set the maximum length of a theoretical peak. If the peak is longer than the maximum allowed value, it will be removed from the data set.
The assumptions of this method do not cover many eventualities occurring in an LC-MS data set (e.g. overlapping peaks, long noisy regions with fluctuating intensity values). A benefit over the SPC method could be, in principle, the preservation of absolute intensity values. However, those intensity values would require correction after background subtraction.
5) Singular value decomposition method: The singular value decomposition (SVD) method is a method for data compression and noise reduction by eigenvalue-like decomposition for rectangular matrices. Characteristics of this method are provided in Fleming et al., J. Chromatogr. A 849, 1999, 71-85 and in references cited therein.
6) WO 02/13228 A2 (Method and system for identifying and quantifying chemical components of a mixture, Vogels et al.) discloses a method of data processing and evaluation consisting of the steps of smoothing the data points of a chromatogram and determining an entropy value for the smoothed chromatogram (the chromatogram may be either a selected mass or a total ion chromatogram). After evaluation of a quality factor (based on an entropy value) for each smoothed mass chromatogram in the data set, the algorithm generates a reconstructed total ion chromatogram from selected mass chromatograms with IQ values above a defined threshold value.
7) U.S. Pat. No. 5,995,989 A1 (Method and apparatus for compression and filtering of data associated with spectrometry, Gedcke et al.) discloses a method and apparatus for compression and filtering of data associated with spectrometry. The method monitors the value of each data point and compares it to the previous data point to determine whether it is on or very near a peak. The intensity values for a designated number of data points are summed and averaged to determine the average of a noisy background. A threshold is determined by multiplying the deviation by an empirically defined constant k, and each data point is compared to this threshold value.
8) US 2002/0193950 A1 (Method for analyzing mass spectra, Gavin et al.) discloses a method that analyzes mass spectra. The analysis consists of detecting signals above an S/N cutoff, clustering of signals, pre-selection of features, identification of mass values for selected clusters, creation of a classification model and assignment of unknown samples. This method is intended for 1-dimensional signals, like MALDI, SELDI or ESI-MS spectra without a time-dependent separation prior to the chromatographic detection.
The document focuses on a classification model having classes characterized by different biological status. In this context, a feature pre-selection using a cluster analysis is described. Signal clusters having a predetermined number of signals (here: biological samples in which the signal is present) are selected for the classification model; clusters having fewer signals are discarded.
The possibility of preprocessing raw data is considered only briefly in the document. To this end, it is mentioned that the data analysis could include the steps of determining the signal strength (e.g. height of signals) of a detected marker and removing “outliers” (data deviating from a predetermined statistical distribution).
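The cluster-based feature pre-selection described for US 2002/0193950 A1 can be illustrated with a minimal sketch. The document summarized above does not specify the clustering procedure here, so the gap-based 1-dimensional clustering, the function names `cluster_signals` and `preselect`, and the parameters `tol` and `min_samples` are all assumptions made for illustration only.

```python
def cluster_signals(signals, tol=0.5):
    """Group (sample_id, mz) signals into clusters along the m/z axis.

    Assumption: a new cluster starts whenever the gap to the previous
    signal (in sorted m/z order) exceeds `tol`. This is a simple stand-in
    for whatever cluster analysis the referenced method uses.
    """
    clusters = []
    for sample_id, mz in sorted(signals, key=lambda s: s[1]):
        if clusters and mz - clusters[-1][-1][1] <= tol:
            clusters[-1].append((sample_id, mz))
        else:
            clusters.append([(sample_id, mz)])
    return clusters


def preselect(clusters, min_samples):
    """Keep only clusters whose signals occur in at least `min_samples`
    distinct samples; clusters with fewer samples are discarded."""
    return [
        c for c in clusters
        if len({sample_id for sample_id, _ in c}) >= min_samples
    ]


# Hypothetical signals from three samples "a", "b" and "c".
signals = [
    ("a", 100.1), ("b", 100.2), ("c", 100.0),  # present in all samples
    ("a", 250.0),                              # present in one sample only
    ("b", 400.3), ("c", 400.1),                # present in two samples
]
selected = preselect(cluster_signals(signals), min_samples=2)
```

With `min_samples=2`, the isolated signal at m/z 250.0 is dropped, while the clusters near m/z 100 and m/z 400 are retained as candidate features.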
9) US 2003/0040123 A1 (Peak selection in multidimensional data, Hastings) discloses a method of computing local noise thresholds for each one-dimensional component of the data. Each point has a local noise threshold applied to it for each dimension of the data set, and a point is selected as a peak candidate only if its value exceeds all of the applied local noise thresholds. Contiguous candidate peaks are clustered into actual peaks (i.e., detected real chromatographic peaks).
A noise threshold can be computed from a window of points surrounding the particular point. After peak picking, additional criteria can be applied to the peaks before they are accepted into a peak database. With respect to the selection of actual peaks, it is considered that additional peak recognition algorithms, such as line shape analysis or Bayesian/maximum likelihood analysis for mass chromatograms, or isotope distribution analysis for mass spectra, may also be applied. Details are not given. With respect to the peak picking, it is also considered that the noise could be reduced by using a suitable filter on the basis of a known noise distribution, so that peaks can be detected. The method disclosed in US 2003/0040123 A1 addresses the noise issue, in particular the specific characteristics of noise in LC-MS data, by applying different noise thresholds to different dimensions of the data.
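The per-dimension thresholding summarized for US 2003/0040123 A1 can be sketched as follows. This is an illustration under stated assumptions, not the method of that document: the threshold is taken here as `k` times the median of a window around the point, and contiguous candidates are grouped by 4-connectivity. The function names and the parameters `half_window` and `k` are hypothetical.

```python
from statistics import median


def local_threshold(values, i, half_window, k):
    """Assumed noise threshold at index i: k times the median of the
    window of points surrounding i (clipped at the array borders)."""
    lo = max(0, i - half_window)
    hi = min(len(values), i + half_window + 1)
    return k * median(values[lo:hi])


def peak_candidates(data, half_window=2, k=3.0):
    """Return the (scan, mz) indices whose intensity exceeds the local
    threshold in BOTH dimensions. data[scan][mz] holds intensities."""
    n_scans, n_mz = len(data), len(data[0])
    candidates = set()
    for m in range(n_mz):
        column = [data[s][m] for s in range(n_scans)]  # mass chromatogram
        for s in range(n_scans):
            value = data[s][m]
            row = data[s]                              # mass spectrum
            if (value > local_threshold(row, m, half_window, k)
                    and value > local_threshold(column, s, half_window, k)):
                candidates.add((s, m))
    return candidates


def cluster_candidates(candidates):
    """Group contiguous (4-connected) candidates into peaks."""
    remaining = set(candidates)
    peaks = []
    while remaining:
        stack = [remaining.pop()]
        peak = []
        while stack:
            s, m = stack.pop()
            peak.append((s, m))
            for nb in ((s + 1, m), (s - 1, m), (s, m + 1), (s, m - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    stack.append(nb)
        peaks.append(sorted(peak))
    return peaks
```

In this sketch, a constant background in one mass chromatogram may exceed the threshold along the m/z dimension but not along the time dimension, so it is not selected. This is the point of requiring a candidate to pass the thresholds of all dimensions.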
Review Articles of General Interest
A review of so-called data mining techniques, which can be used for example with respect to mass spectrometry data, can be found in Current Opinion in Drug Discovery & Development 2001 4(3), 325-331, “Data mining of spectroscopic data for biomarker discovery”, S. M. Norton et al. Also of general interest is the review article IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 2000, 4-37: “Statistical Pattern Recognition: A Review”, A. K. Jain et al., which considers issues such as feature extraction and selection, cluster analysis and, more generally, so-called data mining on the basis of statistical methods including Bayesian statistics.