There are many well-established chemometric techniques used to facilitate the handling of chemical data, with principal components analysis (PCA), parallel factor analysis (PARAFAC), and partial least-squares (PLS) among the most common. This proliferation of chemometric techniques can be attributed to several factors, including improvements in computing technology and more user-friendly software coupled with advancements in analytical instrumentation. Modern instrumental techniques can provide an abundance of detailed data pertaining to the nature of a complex sample in a relatively short period of time. This in turn has permitted the analyst to probe increasingly complex samples and pose increasingly challenging questions, the answers to which can only be revealed through the use of chemometric tools.
Some of the most impressively data-rich analytical techniques to which chemometric techniques have been applied include the family of hyphenated separations techniques such as GC-MS and LC-MS, as well as comprehensive multidimensional separation techniques such as GC×GC. For example, grades of gasoline were classified based on their GC-MS profiles by Doble et al. using PCA. Sandercock and Du Pasquier have used GC-MS coupled with PCA to fingerprint a series of gasoline samples and identify the origin of the samples. Another field where chemometric techniques are widely applied is metabolomics (as well as general metabolite profiling and metabonomics). Wilson et al. have recently reviewed the application of LC-MS to this field, highlighting some uses of chemometrics. Other examples of the use of chemometrics in this area include the work of Bruce and co-workers, who recently evaluated metabolite profiling techniques and used PCA and orthogonal projections to latent structures discriminant analysis (OPLS-DA) on HPLC-MS data, and the work of Lutz et al., who used partial least-squares discriminant analysis (PLS-DA) and PCA in LC-MS/MS-based metabolic profiling to predict gender. Kind et al. applied PLS, PCA and ANOVA to data from a suite of chromatographic-mass spectrometric analyses of urinary metabolites for the early detection of kidney cancer. Now that comprehensive multidimensional gas chromatographic (GC×GC) instrumentation has become commercially available, it (in conjunction with chemometric techniques) is also gaining much interest in this area. For example, Vial et al. used PCA to classify tobacco extracts based on their GC×GC profiles, and Mohler et al. have used GC×GC-TOFMS data and chemometric techniques to analyze yeast metabolites.
When applying chemometric techniques to chromatographic or chromatographic-mass spectrometric data, there are several possible approaches to preparing the data for analysis. Many users employ integrated peak tables, as these provide a matrix that is relatively small and straightforward: analyte abundances vs. sample numbers. Other users choose to use a non-integrated chromatographic signal for the construction of a chemometric model. With this approach, each variable in the data matrix is the signal intensity at a given time. This route has its own challenges, including increased data size and the need for data alignment; however, these can be overcome relatively easily and this approach is in many cases superior to the use of integrated peak tables. The advantage of using the entire raw data set is more evident when one utilizes the entire GC-MS chromatogram, either as a three-way array (scan number × m/z ratio × sample number) or as a two-dimensional array of samples vs. GC-MS chromatograms unfolded along their time axis. Synovec et al. demonstrated that significant advantages can be achieved by using the entire GC-MS chromatogram rather than extracted ion chromatograms or other univariate signals. The reason is that the chemometric model can extract underlying patterns in the data that are not evident in univariate signals.
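The unfolding of a three-way GC-MS array into a two-dimensional samples vs. variables matrix can be sketched as follows. This is a minimal illustration using NumPy; the function name `unfold_chromatograms`, the axis order, and the toy array are assumptions made for the example and are not taken from any of the cited works.

```python
import numpy as np

def unfold_chromatograms(data):
    """Unfold a (scans, mz_channels, samples) array into a
    (samples, scans * mz_channels) matrix, one row per sample."""
    n_scans, n_mz, n_samples = data.shape
    # Bring the sample axis to the front, then flatten each sample's
    # scan x m/z slice scan by scan (all m/z channels of scan 0 first).
    return np.moveaxis(data, -1, 0).reshape(n_samples, n_scans * n_mz)

# Hypothetical toy data: 4 scans, 3 m/z channels, 2 samples
cube = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
X = unfold_chromatograms(cube)
print(X.shape)  # (2, 12)
```

In this layout, column `i * n_mz + j` of `X` holds the intensity at scan `i` and m/z channel `j`, so a variable that survives selection can be mapped back to a retention time and an ion.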
One challenge that remains for all types of chemometric analyses is that of feature selection: choosing which of the variables that have been collected will be included in the chemometric model. In cases where one is utilizing raw chromatographic data or chromatographic-mass spectrometric data, feature selection becomes at the same time more challenging and more important, as millions of variables can easily be collected for each sample. A data set comprising even a relatively small number of such samples will place inordinate demands on a computer system. Apart from the technological challenge, the most important reason for careful variable selection is that not all variables will be relevant. This is especially true when the entire chromatogram is considered: only a small portion of the chromatographic space actually contains relevant signal intensities. If irrelevant variables are included, the model must account for irrelevant variations and this will degrade its overall performance. Consequently, careful variable selection is necessary, especially if raw chromatographic signals are being used in the construction of the model.
There are multiple variable selection techniques available, all with the goal of simplifying data sets and removing extraneous variables. Selection techniques such as using integrated peak tables, extracted ion chromatograms (EIC), or single ion monitoring (SIM) rely heavily on a priori knowledge and select variables by only permitting a small, user-selected portion of the data to be used. These are not inappropriate approaches, but they are potentially dangerous, especially if the system is not well understood. The reason for this is that, in these cases, there are numerous opportunities for either the inclusion of significant quantities of irrelevant data or the inadvertent exclusion of relevant portions of the data. Within the scope of chromatography-MS data, total ion chromatograms (TICs) may also be used for modeling, but this sacrifices essentially all of the additional mass spectral information and potentially useful variables in the process.
Objective variable ranking techniques are another option for guiding feature or variable selection. These methods use a calculated metric to evaluate the potential value of each variable. When constructing a model, only those variables with scores above a certain threshold will be used. Examples of the use of objective variable ranking applied to chromatographic data include the work of Rajalahti et al., where the discriminating variable (DIVA) test was used to rank variables for both PCA and PLS-DA of chromatographic profiles. Another popular metric for variable ranking is analysis of variance (ANOVA), which has been used to guide feature selection for PCA of GC-MS and GC×GC chromatograms. Teófilo et al. have also used informative vectors as the ranking metric prior to PLS analysis of spectroscopic data. Apart from the inherent advantage of objectivity, objective ranking strategies allow the user to evaluate considerably more candidate variables with no a priori information and can be readily incorporated into automated routines.
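As an illustration of ranking followed by thresholding, the sketch below computes a one-way ANOVA F-ratio for every variable and keeps those above a cutoff. It is a minimal example under stated assumptions: the function name `f_ratios`, the toy data, and the threshold of 10 are hypothetical, and variables with zero within-class variance are not handled.

```python
import numpy as np

def f_ratios(X, labels):
    """One-way ANOVA F-ratio for each column of X (samples x variables)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        # Between-class: spread of class means around the grand mean
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        # Within-class: spread of samples around their own class mean
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = X.shape[0] - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical data: variable 0 separates the classes, variable 1 does not
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
              [7.0, 5.0], [8.0, 6.0], [9.0, 4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

f = f_ratios(X, labels)
threshold = 10.0                        # assumed cutoff, chosen for the example
selected = np.flatnonzero(f > threshold)
print(f, selected)                      # F-ratios 54 and 0; only variable 0 is kept
```

The choice of threshold is the open question here: the ranking orders the variables, but it does not by itself say how many of them to keep.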
A final approach to variable selection is the application of a genetic algorithm (GA). GAs strive to find a suitable set of variables by randomly choosing multiple subsets of variables, evaluating each subset, and then “breeding” a new generation of models by randomly mixing the variables that were included in more successful models. The process then repeats through many generations and a suitable subset of variables is allowed to evolve by chance. The main advantage of GAs is that they can proceed without much user intervention. However, much computational time is required, and GAs typically exhibit severe overfitting of the data and/or converge to non-optimal solutions, especially with data sets that comprise a large number of variables. Strategies to overcome these limitations have recently been presented by Ballabio et al. Nevertheless, GAs remain comparatively computationally inefficient.
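The generational loop described above can be sketched in a few lines. This is an illustrative toy, not the procedure of any cited work: `ga_select`, its parameters, and `toy_fitness` are hypothetical names, and the fitness function simply rewards a known set of "relevant" indices, whereas a real application would score each subset by building and validating a model.

```python
import random

def ga_select(n_vars, fitness, pop_size=20, n_generations=30,
              mutation_rate=0.05, seed=0):
    """Evolve boolean inclusion masks; returns (best_mask, best_fitness_history)."""
    rng = random.Random(seed)
    # Each individual is a tuple of booleans: True means the variable is included
    population = [tuple(rng.random() < 0.5 for _ in range(n_vars))
                  for _ in range(pop_size)]
    history = []
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]   # keep the more successful half
        children = [ranked[0]]              # elitism: best subset survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: inherit each variable from either parent at random
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Mutation: occasionally flip a variable in or out of the subset
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(tuple(child))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

RELEVANT = {0, 3, 7}  # hypothetical indices of the informative variables

def toy_fitness(mask):
    hits = sum(1 for i, used in enumerate(mask) if used and i in RELEVANT)
    extras = sum(1 for i, used in enumerate(mask) if used and i not in RELEVANT)
    return hits - 0.2 * extras

best, history = ga_select(10, toy_fitness)
print([i for i, used in enumerate(best) if used], history[-1])
```

Every generation calls `fitness` once per candidate subset or more, which is where the computational cost noted above comes from when each call involves fitting a model to a large data matrix.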
Regardless of the variable selection technique that is applied, the goals are to remove noise and irrelevant variables while preserving variables that are of value. For example, when techniques such as ANOVA and DIVA are used to select variables to be included in a PCA model, variables are ranked based on their relative ability to discriminate between the classes of samples being considered. Variables with a high ranking are likely to improve class separation, and those with a low ranking are deemed to be irrelevant. As more variables are included, it is more likely that information useful for class discrimination will be included in the model, though each additional variable is likely to be less useful than the previous ones. However, with each new variable more noise is added to the model, possibly reducing the model's ability to discriminate between classes. At some point, the addition of new variables will result in an overall loss of model quality.
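The trade-off described above can be explored by adding variables in ranked order and scoring each resulting PCA model. The sketch below is a minimal illustration with synthetic data. The score used, `centroid_separation`, is a crude stand-in assumed for this example only: it divides the distance between two class centroids by the summed within-class spread, and it ignores the shape and orientation of the clusters. All function names and the data are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores of mean-centered X on its first principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def centroid_separation(scores, labels):
    """Crude two-class score: centroid distance over summed RMS within-class spread."""
    a, b = (scores[labels == c] for c in np.unique(labels))
    distance = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread = sum(np.sqrt(((g - g.mean(axis=0)) ** 2).sum(axis=1).mean())
                 for g in (a, b))
    return distance / spread

def sweep_ranked_variables(X, labels, ranking):
    """Score PCA models built from the top 2, 3, ..., all ranked variables."""
    return [centroid_separation(pca_scores(X[:, ranking[:k]]), labels)
            for k in range(2, len(ranking) + 1)]

# Hypothetical data: 20 samples, 3 informative variables followed by 27 noise variables
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 30))
X[labels == 1, :3] += 3.0
ranking = np.arange(30)   # assumed to come from a ranking metric such as the F-ratio

scores = sweep_ranked_variables(X, labels, ranking)
best_k = 2 + int(np.argmax(scores))   # number of variables giving the highest score
print(best_k)
```

The position of any maximum in such a sweep depends entirely on the score that is used to judge the model.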
This highlights the central problem that must be addressed. In cases where one is attempting to construct a chemometric model of a large data set, how does one objectively choose the optimal combination of features to model the data? Further, how can one quantify and thereby objectively compare the separation and clustering of data points belonging to multiple classes in, for example, a PCA model? This can be judged through visual inspection of various diagnostic plots of the model. However, in order to achieve a fully automated and objective process for feature selection, an objective metric is required.
While metrics for the degree of class separation have been used previously, prior metrics do not account simultaneously for the shapes, sizes and relative orientations of clusters of points on, for example, a PCA scores plot. Such an objective metric should consider these three parameters.