Liquid chromatography-mass spectrometry (LC-MS) is a well-known combined analytical technique for separation and identification of chemical mixtures. Chromatography separates the mixture into its constituent components, and mass spectrometry further analyzes the separated components for identification purposes.
In its basic form, chromatography involves passing a mixture dissolved in a mobile phase over a stationary phase that interacts differently with different mixture constituents. Components that interact more strongly with the stationary phase move more slowly and therefore exit the stationary phase at a later time than components that interact more strongly with the mobile phase, providing for component separation. A detector records a property of the exiting species to yield a time-dependent plot of the property, e.g., mass or concentration, allowing for quantification and, in some cases, identification of the species. For example, an ultraviolet (UV) detector measures the UV absorbance of the exiting analytes over time. When liquid chromatography is coupled to mass spectrometry, mass spectra of the eluting components are obtained at regular time intervals for use in identifying the mixture components. Mass spectra plot the abundance of ions of varying mass-to-charge ratio produced by ionizing and/or fragmenting the eluted components. The spectra can be compared with existing spectral libraries or otherwise analyzed to determine the chemical structure of the component or components. Note that LC-MS data are two-dimensional; that is, a discrete data point (intensity) is obtained for varying values of two independent variables, retention time and mass-to-charge ratio (m/z).
LC-MS data are typically reported by the instrument as a total ion current (TIC) chromatogram, the sum of all detected ions at each scan time. Peaks in the chromatogram represent separated components of the mixture eluting at different retention times. A noise-free chromatogram 10, shown in FIG. 1A, appears as a series of smooth peaks 12a-12c, each extending over multiple scan times. As shown in the TIC chromatogram of FIG. 1B, however, LC-MS data often have high-intensity noise spikes 14a-14d superimposed on the peaks. Although components elute over multiple scans, noise spikes typically do not extend beyond one scan time. If the TIC chromatogram has little noise, an operator can determine the total number of peaks and then examine each peak's corresponding mass spectrum to identify the eluted species. However, as the amount of noise increases, it becomes more difficult for the operator to distinguish the chromatographic peaks, particularly if the noise level is higher than the signal level. In such cases, the operator is left to examine each individual mass spectrum manually, select the mass-to-charge ratios corresponding to known or likely mixture components, and then assemble a reduced total ion current chromatogram from the selected masses only. Such a procedure is clearly very time-consuming. Furthermore, when the mixture contains unknown analytes, the operator cannot confidently determine which mass spectral peaks are noise and which are actual peaks. Thus, the only recourse the operator has is to adjust various instrument parameters and repeat the experiment with a different sample, hoping for less noise in the resulting chromatogram.
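The relationship between the two-dimensional data and the TIC chromatogram can be illustrated with a minimal Python sketch. The array layout (rows as scan times, columns as m/z bins), the function name `total_ion_current`, and the sample values are assumptions made for illustration only and are not taken from any instrument's data format.

```python
import numpy as np

def total_ion_current(intensity):
    """Sum the intensities of all m/z bins at each scan time.

    `intensity` is assumed to be a 2-D array with one row per scan
    and one column per m/z bin (an illustrative layout).
    """
    intensity = np.asarray(intensity, dtype=float)
    return intensity.sum(axis=1)

# Hypothetical data: 5 scans x 3 m/z bins.
# Bin 1 holds a smooth peak spanning several scans;
# bin 0 holds a single-scan noise spike at scan 2.
data = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 4.0, 0.0],
    [9.0, 5.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 1.0, 0.0],
])
tic = total_ion_current(data)
```

In this hypothetical data set the spike in one m/z bin inflates the TIC value at a single scan, which is how a one-scan noise event becomes superimposed on an otherwise smooth chromatographic peak.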
Because it enables the identification and quantification of hundreds to thousands of analytes in a single injection, LC-MS is currently being used to analyze complex biological mixtures (see, e.g., D. H. Chace et al., “Mass Spectrometry in the Clinical Laboratory,” Chem. Rev. 101 (2001): 445-477). Proteomics is a relatively new field that aims to detect, identify, and quantify proteins to obtain biologically relevant information. Both proteomics and metabolomics (the detection, identification, and quantification of metabolites and other small molecules such as lipids and carbohydrates) may facilitate disease mechanism elucidation, early detection of disease, and evaluation of treatment. Recent advances in mass spectrometry have made it an excellent tool for structural determination of proteins, peptides, and other biological molecules. However, proteomics and small molecule studies typically have a set of requirements that cannot be met by manual interpretation of the LC-MS data.
First, these studies require high-throughput analysis of small volumes of biological fluid. Manual data interpretation creates a bottleneck in sample processing that severely limits the number of samples that can be analyzed in a given time period. Furthermore, while large available sample volumes allow an operator to adjust parameters by trial and error to obtain adequate chromatograms and spectra, biological samples are available in such small volumes that it is imperative to extract useful information from all of the available sample. Second, unlike traditional research applications, in which a relatively small amount of data is required, the paradigm of these studies is to acquire enormous amounts of data and then mine the data for new correlations and patterns. Manual data analysis is therefore infeasible. In addition, biological samples are generally complex mixtures of unknown compounds, and so it is not desirable to extract only known spectra and discard the remaining data, an approach that has been used for studies involving quantification of known compounds in a mixture. Finally, LC-MS instruments produce an enormous amount of data: a single one-hour chromatographic run can produce up to 80 MB of binary data. For storage and subsequent data mining purposes, it is highly desirable to reduce the amount of data by retaining information while discarding noise. To satisfy these requirements, a data analysis method is needed that can acquire a large amount of data from low-volume biological mixtures, extract useful information from the resulting noisy data set, and identify unknown compounds from the extracted information. An essential component of such a method is the ability to filter noise accurately so that peaks can be distinguished automatically.
The problem of filtering chromatographic noise has been addressed to various degrees in the prior art. The component detection algorithm (CODA) is an automated method for selecting mass chromatograms with low noise and low background. CODA is described in W. Windig et al., “A Noise and Background Reduction Method for Component Detection in Liquid Chromatography/Mass Spectrometry,” Anal. Chem., 68 (1996): 3602-3606. The method computes a smoothed and mean-subtracted version of each mass chromatogram, compares it with the original chromatogram, and calculates a similarity index between the two. Chromatograms whose similarity index exceeds a threshold value are retained and combined to form a reduced total ion chromatogram, while other chromatograms are rejected. CODA has proven very effective at selecting high-quality mass chromatograms. However, it can only accept or reject entire chromatograms based on their noise level; it cannot filter noise from an individual chromatogram. As a result, noisy chromatograms that contain useful information are eliminated, and important peaks may not be detected.
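The accept-or-reject behavior described above can be sketched in Python as follows. This is a simplified illustration loosely modeled on the description in this paragraph, not the formula published by Windig et al.: the moving-average smoothing, the normalized dot product used as the similarity index, and the names `similarity_index` and `reduced_tic` with their `window` and `threshold` parameters are all assumptions.

```python
import numpy as np

def similarity_index(chrom, window=3):
    """Compare a mass chromatogram with its smoothed, mean-subtracted version.

    Returns a normalized dot product (an assumed similarity measure):
    values near 1 suggest a smooth, low-background chromatogram.
    """
    x = np.asarray(chrom, dtype=float)
    # Moving-average smoothing; 'valid' keeps only fully covered points.
    smooth = np.convolve(x, np.ones(window) / window, mode="valid")
    # Trim the original so it aligns with the smoothed trace.
    half = (window - 1) // 2
    x_trim = x[half:half + smooth.size]
    s = smooth - smooth.mean()          # mean subtraction
    x_norm = np.linalg.norm(x_trim)
    s_norm = np.linalg.norm(s)
    if x_norm == 0 or s_norm == 0:
        return 0.0
    return float(np.dot(x_trim / x_norm, s / s_norm))

def reduced_tic(intensity, threshold=0.5, window=3):
    """Sum only the mass chromatograms (columns) that pass the threshold."""
    intensity = np.asarray(intensity, dtype=float)
    keep = [j for j in range(intensity.shape[1])
            if similarity_index(intensity[:, j], window) >= threshold]
    return intensity[:, keep].sum(axis=1), keep

# Hypothetical mass chromatograms: one smooth peak, one made of spikes.
smooth_peak = [0, 0, 1, 4, 9, 4, 1, 0, 0]
spiky = [0, 9, 0, 0, 9, 0, 9, 0, 0]
tic, kept = reduced_tic(np.column_stack([smooth_peak, spiky]))
```

Note that the selection operates on whole columns: a chromatogram that fails the threshold is dropped in its entirety, including any genuine peak it may contain.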
Techniques exist for filtering noise and background from spectrometric data. For example, U.S. Pat. No. 5,995,989, issued to Gedcke et al., describes a filtering method in which an average background level and an average deviation from the background are computed and used to define a local threshold for each data point. Points exceeding the threshold are retained, while points below the threshold are considered to be noise and discarded. However, the technique described in Gedcke et al. is only effective for noise levels that are substantially below the level of the peaks. For data such as those illustrated in FIG. 1B, high-intensity noise spikes cannot be removed using the disclosed method.
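A local-threshold filter of this general kind can be sketched in Python as follows. The sketch is an assumption-laden illustration and does not reproduce the method of Gedcke et al.: the neighborhood size `half_width`, the multiplier `k`, the use of the mean absolute deviation, and the function name `local_threshold_filter` are all hypothetical choices.

```python
import numpy as np

def local_threshold_filter(signal, half_width=5, k=3.0):
    """Keep points that exceed a local background-based threshold.

    For each point, the background is the mean of its neighbors and the
    deviation is their mean absolute deviation (both assumed estimators).
    Points at or below background + k * deviation are set to zero.
    Requires at least two data points.
    """
    x = np.asarray(signal, dtype=float)
    out = np.zeros_like(x)
    for i in range(x.size):
        lo = max(0, i - half_width)
        hi = min(x.size, i + half_width + 1)
        # Exclude the point itself from its own background estimate.
        neighbors = np.delete(x[lo:hi], i - lo)
        background = neighbors.mean()
        deviation = np.abs(neighbors - background).mean()
        if x[i] > background + k * deviation:
            out[i] = x[i]
    return out

# Hypothetical trace: low-level background with one high-intensity spike.
trace = [1, 2, 1, 2, 1, 50, 1, 2, 1, 2, 1]
filtered = local_threshold_filter(trace)
```

In this example the only point retained is the spike itself, which illustrates the limitation noted above: an intensity threshold cannot distinguish a high-intensity noise spike from a genuine peak.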
In U.S. Pat. No. 6,112,161, issued to Dryden et al., a method for enhanced integration of chromatography or spectrometry signals is described. A baseline signal is computed from a moving average of the actual signal. The difference between the baseline and actual signal is a baseline-adjusted signal containing peaks and high-frequency noise. An intensity range of the noise is determined, and all signal outside of this range is considered to be peaks, while signal inside this range is considered to be noise. As with the method of Gedcke et al., the method of Dryden et al. can only be used when the noise intensity is substantially lower than the signal intensity. Because LC-MS data often have noise values exceeding the signal values, the method of Dryden et al. is not effective at removing noise from LC-MS data.
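The baseline-adjustment approach can be sketched in Python under stated assumptions. The paragraph above does not say how the noise range is determined, so this sketch assumes it is taken as the minimum and maximum of the adjusted signal over a user-chosen peak-free region; the names `baseline_adjusted` and `classify_peaks`, the odd `window` length, and the edge padding are likewise illustrative and are not drawn from Dryden et al.

```python
import numpy as np

def baseline_adjusted(signal, window=5):
    """Subtract a moving-average baseline from the signal.

    `window` is assumed to be odd so the output has the input's length.
    """
    x = np.asarray(signal, dtype=float)
    padded = np.pad(x, window // 2, mode="edge")
    baseline = np.convolve(padded, np.ones(window) / window, mode="valid")
    return x - baseline

def classify_peaks(signal, noise_region, window=5):
    """Flag points whose adjusted value lies outside the noise range.

    `noise_region` is a slice over a stretch assumed to contain no peaks.
    """
    adjusted = baseline_adjusted(signal, window)
    noise = adjusted[noise_region]
    lo, hi = noise.min(), noise.max()
    return (adjusted < lo) | (adjusted > hi)

# Hypothetical trace: low-level noise with a single-scan spike at index 9.
trace = [0, 1, 0, 1, 0, 1, 0, 1, 0, 20, 0, 1, 0]
is_peak = classify_peaks(trace, noise_region=slice(0, 6))
```

Here the single-scan spike at index 9 is classified as a peak because it falls outside the estimated noise range, consistent with the observation that such a method is ineffective when noise intensity exceeds signal intensity.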
A moving median digital filter has been used to remove noise from mass spectrometry and potentiometric titration data, as described in C. L. do Lago et al., “Applying moving median digital filter to mass spectrometry and potentiometric titration,” Anal. Chim. Acta, 310 (1995): 281-288. Each data point is replaced by the median of the values in a window surrounding the point. With respect to the mass spectrometry data, the filter is applied both to the electron multiplier output, i.e., the ion abundance values, and to the magnetic field sensor, i.e., the mass-to-charge ratio. The method is not, however, applied to two-dimensional data such as LC-MS data. In most cases, state-of-the-art LC-MS instruments do not report the mass spectra as continuous smooth peaks, but rather as centroided data, i.e., single-mass peaks at the average mass value of the true peak. Without centroiding, an unmanageable amount of data would be generated for each spectrum. A moving median filter applied to centroided mass spectral data would remove peaks and noise equally. Because the peak shape is removed in the reported data, filtering or analytical methods that rely on peak shape cannot be applied to the mass spectra. Moreover, in some cases, one major source of noise, detector noise, can corrupt an entire mass spectrum. If a high fraction of the points in the filter window are corrupted, then a median filter applied to the spectrum cannot remove this noise.
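A one-dimensional moving median filter, and its effect on centroided data, can be sketched in Python as follows. The function name `moving_median`, the odd `window` length, the edge padding, and the sample values are assumptions for illustration and do not reproduce the implementation of do Lago et al.

```python
import numpy as np

def moving_median(signal, window=3):
    """Replace each point with the median of the window centered on it.

    `window` is assumed to be odd; the ends are padded by repeating the
    edge values so the output has the same length as the input.
    """
    x = np.asarray(signal, dtype=float)
    padded = np.pad(x, window // 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)

# Hypothetical profile data: a peak spanning several points, then a spike.
profile = [0, 1, 4, 9, 4, 1, 0, 30, 0, 0]
profile_filtered = moving_median(profile)

# Hypothetical centroided spectrum: every peak is one point wide,
# so a real peak (9) looks the same to the filter as a spike (30).
centroided = [0, 0, 9, 0, 0, 30, 0]
centroided_filtered = moving_median(centroided)
```

With a three-point window, the spike in the profile data is removed while the multi-point peak largely survives, although its apex is flattened. In the centroided example both the peak and the spike are removed, which is the behavior described above for centroided mass spectral data.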
There is still a need, therefore, for a method for removing noise, particularly high-intensity spikes, from chromatographic and spectrometric data such as LC-MS data.