The invention relates to the field of data processing and evaluation. In particular, the invention relates to the processing and evaluation of mass chromatographic and mass spectrometric data.
Developments both in mass spectrometric technology and in the combination of mass spectrometers ("MS") with a broad variety of separation and micro-scale separation techniques are quickly increasing the capacity of MS in terms of data production. Using modern instrumentation, the time required to obtain data such as chromatograms and mass spectra is no longer the critical factor; rather, it is the time necessary for analyzing the data. In particular, a data set often comprises thousands of mass spectra measured over a mass-to-charge ("m/z") range of two to three orders of magnitude. An extended study using such a data set can occupy days if a complete analysis is required. In a research environment in particular, this analysis typically must be carried out by highly qualified, and consequently expensive, personnel.
In this context, the use of efficient data processing and evaluation to improve speed in data handling is highly desirable. Depending on the application, information extraction can be approached from different points of view. In impurity studies by capillary electrophoresis/mass spectrometry (CE/MS) or liquid chromatography/mass spectrometry (LC/MS), for example, data processing and evaluation tools must be able to perform efficient peak detection of compounds present at very low levels. On the other hand, if screening and comparison of very similar complex mixtures is to be performed, such as in the rapidly expanding field of proteomics, data processing and evaluation tools must be able to correlate data on multiple complex mixtures.
One prior approach to processing data produced by a combination of mass spectrometry and chromatography is U.S. Pat. No. 5,672,869 to Windig et al. ("the '869 patent"). The '869 patent describes a data processing approach which separates spurious peaks and noise by smoothing the raw data. This approach then compares processed and raw data. If a mass trace contains only background noise, the difference between raw and processed data is emphasized, and the algorithm assigns a low mass chromatographic quality ("MCQ") value to that particular mass trace. On the other hand, mass traces containing a peak are assigned high MCQ values. The '869 patent then teaches selecting only mass traces that possess an MCQ above an appropriate threshold value.
However, it is not necessarily clear what is an appropriate threshold value, especially for complex and/or noisy data. For example, by selecting a threshold which is too high, some relevant information on low intensity signals may be lost, while setting too low a threshold may select many "signals" that are actually just background noise. As a result, extensive visual examination of raw and processed data by trained personnel may be required to address this problem, and thereby lower data processing efficiency and speed.
A need therefore exists for a data processing technique that provides more efficient and clear data processing.
The present invention adapts an information content theory and combines it with data smoothing to provide a measure of data quality that better facilitates the efficient and clear evaluation of data. In particular, the present invention provides a measure of data quality based on what is referred to herein as an entropy value. The entropy value approach of the present invention improves data processing by providing less ambiguous thresholds for data selection. As a result, for example, the present invention speeds data processing by decreasing the amount of time trained personnel may be required to personally inspect and select data.
The present invention provides a method of data processing in which the separation between spurious peaks and noise on the one hand, and relevant data on the other hand, takes place more accurately and clearly, thereby shortening the data analysis time. Consequently, trained personnel can use their time interpreting the data. At the same time, the present invention provides the option of generating fingerprints of complex mixtures, which are increasingly being used in various fields (chemistry, pharmacy, medicine, biology, biotechnology, and the like), but particularly in the life sciences, for example, from the analysis of biological materials oriented towards DNA fragments, proteins and metabolic components.
In one aspect, the present invention provides a method of data processing and evaluation comprising the steps of smoothing the data points of a chromatogram and determining an entropy value for the smoothed chromatogram. In one embodiment, the method also comprises the step of correcting the data points of a chromatogram for baseline prior to determination of an entropy value for the smoothed-corrected chromatogram. The chromatogram may be either a mass chromatogram or a total ion current ("TIC") chromatogram. It should be realized that the order of the smoothing and baseline correcting steps is unimportant to the present invention. That is, a chromatogram may be smoothed and then baseline corrected, or baseline corrected and then smoothed. Accordingly, it is to be understood that the term "smoothed-corrected chromatogram" does not imply a specific order of practice.
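By way of illustration only, the smoothing, baseline correction, and entropy determination steps may be sketched as follows. This passage does not define how the entropy value is computed, so the sketch assumes a Shannon entropy over the normalized intensities of the trace; the moving-average smoother, the minimum-subtraction baseline correction, and all function names are likewise hypothetical choices made for the example and are not taken from the invention.

```python
import math


def moving_average(values, window=3):
    """Smooth a trace with a centered moving average (assumed smoother).

    Points near the edges are averaged over the part of the window
    that lies inside the trace.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def subtract_baseline(values):
    """Crude baseline correction (assumption): subtract the minimum value."""
    floor = min(values)
    return [v - floor for v in values]


def shannon_entropy(values):
    """Shannon entropy in bits of a non-negative trace normalized to sum to 1.

    This definition is an assumption; the text above does not give one.
    """
    total = sum(values)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_value(chromatogram, window=3):
    """Smooth, baseline correct, then compute the assumed entropy value."""
    return shannon_entropy(subtract_baseline(moving_average(chromatogram, window)))


# Hypothetical traces: one dominated by a single peak, one noise-like.
peak_trace = [0, 0, 1, 8, 20, 8, 1, 0, 0, 0]
noise_trace = [3, 4, 3, 5, 4, 3, 5, 4, 3, 4]

peak_entropy = entropy_value(peak_trace)
noise_entropy = entropy_value(noise_trace)
print(peak_entropy, noise_entropy)
```

Under these assumptions, the trace whose intensity is concentrated in one peak yields a lower value than the noise-like trace, whose intensity is spread more evenly over the scans. With these particular helper functions, swapping the order of smoothing and baseline correction can change the numerical result slightly, so the sketch fixes one order for simplicity.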
In another embodiment, the method of the invention further determines a quality factor (i.e., an "IQ value") for a chromatogram based on the evaluation of entropy values for a plurality of chromatograms of a data set. In a preferred embodiment, the method selects individual chromatograms (corrected chromatograms and/or smoothed-corrected chromatograms) based on their IQ values. The method then uses these selected chromatograms to generate a reconstructed total ion current ("RIC") chromatogram. The method may further exclude from the RIC chromatogram one or more mass signals. In one embodiment, the one or more mass signals are selected for exclusion based on a mass signal quality value for the individual mass signals. In another embodiment, the method uses these selected chromatograms to generate a reconstructed mass chromatogram for one or more mass values. Further, in various embodiments, the RIC chromatograms are used as a fingerprint for comparison to other chromatograms of the same or other data sets.
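The selection and reconstruction steps may be sketched as follows, purely as an example. The passage above does not state how an IQ value is derived from the entropy values, so the sketch takes precomputed quality scores as input and assumes that a higher score indicates a better chromatogram; the function name, the threshold, and the sample data are hypothetical.

```python
def reconstruct_ric(chromatograms, quality, threshold):
    """Build a reconstructed ion current trace from selected chromatograms.

    chromatograms: dict mapping an m/z value to its intensity trace
    quality:       dict mapping the same m/z values to a quality score
                   (assumed: higher means better)
    threshold:     minimum score a chromatogram needs to be selected

    Returns the sorted list of selected m/z values and the scan-by-scan
    sum of the selected traces.
    """
    selected = sorted(mz for mz in chromatograms if quality[mz] >= threshold)
    if not selected:
        return [], []
    n_scans = len(chromatograms[selected[0]])
    ric = [sum(chromatograms[mz][i] for mz in selected) for i in range(n_scans)]
    return selected, ric


# Hypothetical data set: three mass chromatograms over five scans.
chromatograms = {
    100.0: [0, 5, 20, 5, 0],
    150.0: [3, 4, 3, 4, 3],
    200.0: [0, 0, 2, 9, 2],
}
quality = {100.0: 0.9, 150.0: 0.1, 200.0: 0.8}

selected, ric = reconstruct_ric(chromatograms, quality, threshold=0.5)
print(selected, ric)
```

In this sketch the noise-like trace at m/z 150.0 falls below the threshold and contributes nothing to the reconstructed trace, while the two peak-bearing traces are summed scan by scan.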
In another aspect, the present invention provides a method of data processing and evaluation that correlates either a smoothed chromatogram or a smoothed-corrected chromatogram with a plurality of chromatograms of a data set. The chromatogram underlying the smoothed-corrected chromatogram (or smoothed chromatogram) and the data set chromatograms may each be, for example, a mass chromatogram, a total ion current chromatogram, or an RIC chromatogram. In a preferred embodiment, the step of determining a correlation comprises using a multivariate analysis. Suitable forms of multivariate analysis include, but are not limited to, principal component analysis ("PCA"), discriminant analysis ("DA"), partial least squares ("PLS"), predictive linear discriminant analysis ("PLDA"), neural networks, and pattern recognition techniques.
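As a simplified example of correlating one chromatogram with the chromatograms of a data set, the sketch below ranks data set traces by their Pearson correlation coefficient with a reference trace. This pairwise coefficient is a deliberately simple stand-in chosen for illustration; it is not one of the multivariate analyses listed above, and the function names and sample traces are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length traces.

    Returns 0.0 when either trace is constant, since the coefficient
    is undefined in that case.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def rank_by_correlation(reference, data_set):
    """Return (name, coefficient) pairs sorted from most to least correlated."""
    scores = {name: pearson(reference, trace) for name, trace in data_set.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical smoothed reference trace and data set.
reference = [0, 1, 5, 1, 0]
data_set = {
    "same_shape": [0, 2, 10, 2, 0],
    "shifted": [0, 0, 1, 5, 1],
    "flat": [2, 2, 2, 2, 2],
}

ranked = rank_by_correlation(reference, data_set)
print(ranked)
```

A trace with the same peak shape at the same position ranks first regardless of its absolute intensity, whereas a peak displaced by one scan correlates only weakly with the reference.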
In another embodiment of the present invention, the entropy values of a plurality of smoothed mass chromatograms are each calculated and stored, followed, if desired, by processing of these entropy values or, as the case may be, of components selected according to these entropy values, by means of chemometric and biometric methods. Preferred forms of component selection include multivariate analysis techniques (PCA, DA, PLS, PLDA, neural networks), pattern recognition techniques and Fourier transform techniques. In another embodiment, the selected components are further used to generate a fingerprint that is used, in conjunction with chemometric and biometric techniques, as a characterization method for complex mixtures of various origins.
In another aspect, the present invention provides a system for data processing and evaluation. The system is characterized in that it comprises a smoothing device for smoothing the data points of a mass chromatogram and an entropy calculation device for determining the entropy value of a mass chromatogram. In one embodiment, the system further comprises a baseline correction device for correcting the baseline of a chromatogram. Preferably, the system comprises a chromatograph for separating the components of a mixture and a spectrometer to which the separated components are delivered. In one embodiment, the system further comprises a storage device for storing the entropy values.
In another aspect, the method and system of the present invention relate to methods for identifying and quantifying chemical components of a mixture of materials. The method generally comprises the steps of: (1) subjecting the mixture to a separation method to separate the components of the mixture into separate materials; (2) subjecting the separated materials to mass spectrometry to detect and to identify the components, and to obtain a total ion current ("TIC") chromatogram (or ion electropherogram) and mass spectra; (3) selecting masses from the mass spectra; and (4) obtaining mass chromatograms for each mass.
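Steps (2) through (4) above may be sketched on already acquired data as follows. The sketch assumes that each scan is available as a mapping from m/z value to intensity and that a mass chromatogram is formed by summing the intensities within a fixed m/z tolerance of the selected mass; the data layout, the tolerance, and the function names are hypothetical and serve only to illustrate the relationship between mass spectra, the TIC chromatogram, and mass chromatograms.

```python
def total_ion_current(scans):
    """TIC chromatogram: the summed intensity of every scan."""
    return [sum(spectrum.values()) for spectrum in scans]


def select_masses(scans):
    """Collect every m/z value observed in any scan, in ascending order."""
    return sorted({mz for spectrum in scans for mz in spectrum})


def mass_chromatogram(scans, mz, tolerance=0.5):
    """Intensity near one m/z value, scan by scan (assumed tolerance window)."""
    return [
        sum(i for m, i in spectrum.items() if abs(m - mz) <= tolerance)
        for spectrum in scans
    ]


# Hypothetical acquisition: three scans, each a mass spectrum {m/z: intensity}.
scans = [
    {100.0: 2, 150.0: 3},
    {100.0: 10, 150.0: 4, 200.0: 1},
    {100.0: 3, 150.0: 3, 200.0: 7},
]

tic = total_ion_current(scans)
masses = select_masses(scans)
mass_chromatograms = {mz: mass_chromatogram(scans, mz) for mz in masses}
print(tic, masses, mass_chromatograms)
```

A mass that is absent from a scan simply contributes zero intensity for that scan, so every mass chromatogram has the same number of points as the TIC chromatogram.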
The foregoing and other features and advantages of the invention, as well as the invention itself, will be more fully understood from the description, drawings, and claims which follow.