Metabolic profiling, or metabolomics, is emerging as an important new methodology focused on the quantitative analysis of low molecular weight endogenous biochemicals in cells, biofluids, or tissues. Nicholson et al., 1 Nat. Rev. Drug Discov. 153-61 (2002); Watkins & German, 13 Curr. Opin. Biotechnol. 512-6 (2002). In contrast to traditional approaches where a particular compound, or group of compounds, is targeted in a sample, metabolic profiling (or biochemical profiling) is a comprehensive measurement of the biochemical makeup of a sample. Metabolic profiling is often referred to as a “global” or “non-targeted” approach. Biological samples are generally complex mixtures of unknown compounds and, therefore, metabolic profiling generally involves collection of a large amount of data and subsequent mining of the data for new correlations and patterns.
Metabolic profiling is generally performed using a chromatography step followed by a spectroscopic step, and use of such methodology results in the generation of complex chromatographic data sets. Any one of gas or liquid chromatography-mass spectrometry (GC-MS; LC-MS), liquid chromatography-NMR (LC-NMR), and liquid chromatography-ultraviolet spectroscopy (LC-UV) enables simultaneous detection and quantification of a broad range of biochemicals in biological samples. MS methods generally offer the greatest sensitivity and, thus, are generally best suited for metabolic profiling.
Due to the complexity of data collected for a broad range of compounds in a biological sample, extracting meaningful information is difficult, even with recent advances in instrument hardware and computer systems resulting in increased sensitivity and resolution. For example, the high background and noise levels generally associated with electrospray LC-MS data make it difficult to identify the components present by visual analysis, because often few, if any, distinct peaks are observable. Manual examination is frequently employed to extract a list of masses of components that appear to be “real,” a method that is not only time-consuming and tedious, but also one that may result in failure to identify highly overlapping and/or minor components. Similarly, use of available processing algorithms for non-targeted extraction of information from such data generally results in a loss of information and an introduction of error.
Algorithms for extracting information from chromatographic data include the Biller-Biemann algorithm for resolution enhancement to separate overlapping peaks. Biller & Biemann, 7 Anal. Letters 515-28 (1974); Dromey et al., 48 Anal. Chem. 1368-75 (1976). Although the method works well for high quality data, i.e., where the peaks can clearly be discriminated from the background signal, the algorithm does not perform well for data having a high amount of noise, such as LC-MS data. Similarly, background subtraction can be performed as described by Goodley & Imitani, 25 Am. Lab 36B-36D (1993), but is of limited use for complex data in which the background is not constant over the duration of the chromatographic analysis.
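The limitation of background subtraction under a changing background can be illustrated with a minimal sketch. The sketch assumes the simplest possible scheme, in which a single constant background level is estimated from the first few scans; this is an illustrative assumption and not the procedure of the cited reference. The function name, the parameter `n_background_scans`, and the intensity values are hypothetical.

```python
def subtract_constant_background(trace, n_background_scans=3):
    """Subtract the mean of the first few scans from every scan, clipping at zero.

    Assumption for illustration: the background is one constant level,
    estimated from the leading scans of the trace.
    """
    background = sum(trace[:n_background_scans]) / n_background_scans
    return [max(x - background, 0.0) for x in trace]


# Flat background of 2 with a peak of height 10 at scan 4.
flat = [2, 2, 2, 2, 12, 2, 2, 2]
# Background rising by 1 per scan (2, 3, 4, ...) with the same peak at scan 4.
drifting = [2, 3, 4, 5, 16, 7, 8, 9]

flat_corrected = subtract_constant_background(flat)
drifting_corrected = subtract_constant_background(drifting)
```

With the flat background the peak is recovered cleanly. With the rising background the corrected trace retains a residual signal in the later scans and the apparent peak height is inflated, which is the behavior described above for data whose background is not constant.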
The majority of recent methods for extraction of information from chromatographic data are in the field of curve resolution, such as that described by Hamilton & Gemperline, 4 J. Chemometrics 1-13 (1990). While curve resolution techniques are generally able to resolve overlapping peaks in chromatographic data with low background and noise levels, the techniques have limitations when applied to chromatographic data in which chromatograms of a single dimension (e.g., mass chromatograms) contain multiple peaks. Mass chromatograms with more than one peak are not uncommon in GC-MS and LC-MS, due to the presence of isomers and components with common fragments. As curve resolution techniques fail to resolve multiple peaks having a single chromatographic dimension, the techniques are generally of limited value for use in analyzing metabolic profiling data. Abbassi et al. have described an automated approach for the extraction of peaks from GC-MS data with high noise and high background. Abbassi et al., 141 Int. J. Mass Spectrom. Ion Proc. 171-86 (1995). One disadvantage of the Abbassi et al. technique is that transformation of the original data is required in order to enhance the quality of the signal.
Other automated methods that are commonly used to distinguish noise and background contributions in complex chromatographic data, and that do not require transformation of the original data, include COMPONENT DETECTION ALGORITHM (CODA), U.S. Pat. No. 5,672,869 (Advanced Chemistry Development, Inc., Toronto ON, Canada); TARGETDB (Thru-Put Systems, Inc., Orlando, Fla.); XCALIBUR (Thermo Electron Corporation, San Jose, Calif.); DATAEXPLORER (Applied Biosystems, Foster City, Calif.); and TURBOQUAN (Perkin Elmer Biosystems, Wellesley, Mass.). Each of the non-transforming methods is effective with a targeted approach, but each method breaks down when a non-targeted comprehensive list of compounds present in a chromatogram is desired. In a targeted approach, the methods function by identifying mass ions corresponding to a particular targeted compound as those masses occurring within a predetermined mass cut-off window on either side of the expected mass of the compound. The size of the mass cut-off window is a function of the resolution of the MS instrument. In the absence of prior knowledge of expected masses (a non-targeted approach), the methods function by assigning mass ions to predetermined groups, each an equal-size resolution window, at evenly spaced intervals. As a result, mass ions that correspond to one particular compound whose chromatographic resolution spans multiple time scans may be incorrectly identified as separate compounds or, vice versa, ions corresponding to separate compounds within a particular time region may be assigned to a single component.
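The two assignment errors described above can be shown with a minimal sketch of fixed-window grouping. The sketch is not taken from any of the named software packages; the function names, the window width of 1.0 mass unit, the window origin, and the m/z values are all hypothetical and chosen only to make the boundary effect visible.

```python
import math


def bin_index(mz, width=1.0, origin=0.0):
    """Index of the fixed-width window, on an evenly spaced grid, that contains mz."""
    return math.floor((mz - origin) / width)


def group_by_bin(mz_values, width=1.0, origin=0.0):
    """Assign each measured m/z value to its predetermined window."""
    groups = {}
    for mz in mz_values:
        groups.setdefault(bin_index(mz, width, origin), []).append(mz)
    return groups


# Hypothetical measurements:
#   180.98 and 181.02 -> the same compound observed in two successive scans
#   181.10 and 181.85 -> two different compounds
groups = group_by_bin([180.98, 181.02, 181.10, 181.85])
```

Under these assumptions the two observations of the same compound fall on opposite sides of the window boundary at 181.0 and are split into separate groups, while the two different compounds share window 181 and are merged with one of those observations. Shifting the origin or the width moves the boundaries but does not remove the effect, because the windows are fixed in advance rather than derived from the data.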
The weaknesses of the non-transforming methods described above when used for non-targeted analysis are not limited to the introduction of errors as described above, but also include substantial loss of information from data having a high degree of noise and/or a low signal-to-noise ratio (both of which are typical of metabolic profiling data). For example, the CODA method computes a smoothed and mean-subtracted version of each mass chromatogram, compares it with the original chromatogram, and calculates a similarity index between the two. Chromatograms having similarity indices exceeding a threshold value are retained and combined to form a reduced total ion chromatogram, while other chromatograms are rejected. CODA has proven very effective at selecting high-quality mass chromatograms. However, the algorithm is limited to accepting or rejecting entire chromatograms based on noise level, and cannot filter noise from an individual chromatogram. As a result, noisy chromatograms that contain useful information are eliminated, and important peaks may not be detected.
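The accept-or-reject behavior described above can be sketched in simplified form. The sketch is not the CODA algorithm as patented or published: it assumes a moving-average smoother and uses cosine similarity as a stand-in for the similarity index, and the window of 3 scans, the threshold of 0.5, the function names, and the intensity values are hypothetical.

```python
import math


def moving_average(signal, window):
    """Moving-average smoothing; returns len(signal) - window + 1 points."""
    return [sum(signal[i:i + window]) / window
            for i in range(len(signal) - window + 1)]


def similarity_index(chromatogram, window=3):
    """Cosine similarity between a trace and its smoothed, mean-subtracted version."""
    smoothed = moving_average(chromatogram, window)
    mean = sum(smoothed) / len(smoothed)
    centered = [s - mean for s in smoothed]
    # Align the original trace with the centre of each smoothing window.
    offset = window // 2
    original = chromatogram[offset:offset + len(centered)]
    dot = sum(o * c for o, c in zip(original, centered))
    norm = (math.sqrt(sum(o * o for o in original))
            * math.sqrt(sum(c * c for c in centered)))
    return dot / norm if norm else 0.0


def select_chromatograms(chromatograms, threshold=0.5, window=3):
    """Keep or reject each whole mass chromatogram (keyed by m/z)."""
    return {mz: trace for mz, trace in chromatograms.items()
            if similarity_index(trace, window) >= threshold}


clean = [0, 0, 1, 5, 10, 5, 1, 0, 0, 0]           # one smooth peak
noisy_with_peak = [5, 0, 5, 0, 8, 0, 5, 0, 5, 0]  # spiky noise plus a signal at scan 4
kept = select_chromatograms({100: clean, 200: noisy_with_peak})
```

In this sketch the smooth trace scores well above the threshold and is retained, while the spiky trace scores below it and is discarded as a whole, together with the signal at scan 4. Because selection operates on entire chromatograms, no choice of threshold can recover that signal without also admitting the noise around it.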
Accordingly, there is a need for improved methods of processing complex chromatographic data in a global or non-targeted manner. The present invention provides such improved methods, which are efficient and minimize loss of information and introduction of error.