The present invention relates to multivariate models. In particular, the present invention relates to calibrating multivariate models. More particularly, the present invention relates to optimizing the calibration of multivariate models.
Multivariate models are used to relate multivariate analytical measurements such as infrared spectra (independent X-Block variables) to component concentrations and physical properties (dependent Y-Block variables). During the calibration of these models, data (spectra and concentrations/properties) are measured for a set of calibration samples, and a regression model is built to relate the dependent Y-Block variables to the independent X-Block variables. One means of performing such a calibration is through the use of Constrained Principal Spectra Analysis (J. M. Brown, U.S. Pat. No. 5,121,337, Jun. 9, 1992). Alternatively, Principal Component Regression (PCR), Partial Least Squares Regression (PLS), or Multilinear Regression (MLR) could also be used. PCR, PLS and MLR are described in ASTM Practice E1655. Once the multivariate model is calibrated, it may be applied to new sample X-Block data to estimate the corresponding concentration/property Y-Block data for the unknown.
Multivariate models are the basis by which on-line infrared analyzers estimate, from infrared spectra, component concentrations such as benzene content, saturates content, aromatics content and olefin content for motor gasolines, diesel fuels, jet fuels and process streams, and properties such as research and motor octane number of gasolines and cetane number for diesel fuels. For example, Maggard describes the use of MLR and PLS models for measuring paraffin, isoparaffin, aromatics, naphthene and olefin (PIANO) contents of motor gasolines and gasoline components (U.S. Pat. No. 5,349,189, Sep. 20, 1994). Maggard also describes the use of MLR for measuring octane and cetane numbers (U.S. Pat. No. 4,963,745, Oct. 16, 1990 and U.S. Pat. No. 5,349,188, Sep. 20, 1994). Perry and Brown (U.S. Pat. No. 5,817,517, Oct. 6, 1998) describe the use of FT-IR for determining the composition of feeds to hydrocarbon conversion, separation and blending processes.
The use of multivariate models is not limited to infrared analyzers. Jaffe describes the use of Gas Chromatography and MLR to estimate octane numbers for gasolines (U.S. Pat. No. 4,251,870, Feb. 17, 1981). Ashe, Roussis, Fedora, Felsky and Fitzgerald describe the use of Gas Chromatography/Mass Spectrometry (GC/MS) and PCR or PLS multivariate modeling for predicting chemical or physical properties of crude oils (U.S. Pat. No. 5,699,269, Dec. 16, 1997). Cooper, Bledsoe, Wise, Sumner and Welch describe the use of Raman spectroscopy and PLS multivariate modeling to estimate octane numbers and Reid vapor pressures of gasolines (U.S. Pat. No. 5,892,228, Apr. 6, 1999).
The accuracy of a multivariate model is highly dependent on the samples that are used in its calibration. If the samples do not span a sufficient range of the potential variation in the X-Block data, then many of the unknowns that are analyzed will be outliers relative to the model. Since analysis of outliers is via extrapolation of the model, the accuracy of the estimates may be diminished. In addition, if the calibration samples do not adequately represent the structure of the X- and Y-Blocks, the resultant models may produce biased estimates of the component concentration and property values. The present invention is aimed at minimizing this potential bias while simultaneously ensuring that the X-Block range is adequately spanned by the calibration set.
In developing applications that use multivariate models, it is typical to first conduct a feasibility study to demonstrate that the component concentrations and/or properties can be related to the multivariate analytical measurement in question (e.g. infrared spectrum). Since only a limited amount of data is collected for such feasibility studies, initial models will typically be generated using all available data, with cross-validation as a means of estimating model performance. As additional materials are analyzed, they can be added to the calibration to extend the scope of the multivariate model. Gethner, Todd and Brown (U.S. Pat. No. 5,446,681, Aug. 29, 1995) describe how samples which extend the range of the calibration or fill voids in the calibration might be automatically identified and captured.
As more samples become available, it is typical to divide the available samples into a calibration set which is used to develop the multivariate model, and a validation set which is used to validate the performance of said model. ASTM Standard Practice E1655 describes the use of calibration and validation sets. If samples are taken from a process, it is typical that samples near the production average may become over-represented in the data set relative to samples representing more atypical production. If the division between calibration and validation sets is made randomly, extreme samples (outliers) may end up in the validation set where they are estimated via extrapolation of the resultant model, and the range of the model may be limited. In addition, the over-representation of the more average production may lead to biased estimates for samples away from this average.
Several methods have been proposed to make the subdivision of samples into calibration and validation sets based on the independent variable X-Block data, which in the case of FT-IR analyzers are the infrared spectra.
Honigs, D. E., Hieftje, G. M., Mark, H. L. and Hirschfeld, T. B. (Analytical Chemistry, 1985, 57, 2299-2303) proposed a method for selecting calibration samples based on the use of spectral subtraction. The spectrum with the largest absorption is initially selected, and subtracted from all other spectra to cancel absorptions at the frequency of the largest absorption. The spectrum with the largest absolute value signal remaining is selected next, and again subtracted from all other spectra to cancel the signals at the frequency of the largest absolute value signal. The process is repeated until the remaining signal is judged to be at the spectral noise level. For each independent signal in the X-Block data, the selection of one calibration sample cancels the signal. Thus this selection process can only select a very limited number of samples before reaching the noise level. The resultant calibration set would contain too few samples relative to the rank of the data matrix to be useful for modeling. Further, since the selection process does not make use of the dependent (Y-Block) variables, it may not produce unbiased models.
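The subtraction-based selection described above can be illustrated with a minimal Python sketch. It is not the published implementation: the function name `select_by_subtraction`, the `noise_level` threshold, and the scaling of the selected spectrum so that it exactly cancels the signal at the chosen frequency are assumptions made for illustration.

```python
def select_by_subtraction(spectra, noise_level):
    """Illustrative sketch of calibration sample selection by spectral subtraction.

    spectra: list of equal-length lists of absorbances (one list per sample).
    noise_level: assumed threshold below which residual signal counts as noise.
    Returns the indices of the selected samples, in order of selection.
    """
    residuals = [list(s) for s in spectra]  # work on a copy
    selected = []
    while len(selected) < len(residuals):
        # Find the largest absolute signal among the unselected spectra.
        best_val, best_i, best_j = noise_level, None, None
        for i, row in enumerate(residuals):
            if i in selected:
                continue
            for j, v in enumerate(row):
                if abs(v) > best_val:
                    best_val, best_i, best_j = abs(v), i, j
        if best_i is None:
            break  # everything left is at or below the noise level
        selected.append(best_i)
        pivot = residuals[best_i]
        # Subtract a scaled copy of the selected spectrum from each remaining
        # spectrum so that the signal at frequency best_j is cancelled.
        for i, row in enumerate(residuals):
            if i in selected:
                continue
            factor = row[best_j] / pivot[best_j]
            residuals[i] = [v - factor * p for v, p in zip(row, pivot)]
    return selected


# Four spectra built from only two independent signals.
demo = [[1.0, 0.0], [0.0, 0.5], [2.0, 1.0], [1.0, 0.5]]
print(select_by_subtraction(demo, noise_level=1e-9))
```

In this toy data set only two samples are selected before the residuals vanish, which mirrors the limitation noted above: the number of selected samples cannot exceed the number of independent signals in the X-Block data.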
Kennard, R. W. and Stone, L. A. (Technometrics 1969, 11, 137-149) proposed a subset selection method which was applied to the problem of calibration set selection by Bouveresse, E., Hartmann, C., Massart, D. L., Last, I. R., and Prebble, K. A. (Analytical Chemistry 1996, 68, 982-990). Distances are calculated between spectra based on the raw spectral data. The two samples that are farthest apart are selected as calibration samples. For each remaining sample, the minimum distance to a calibration sample is calculated. The sample with the largest nearest neighbor distance is added to the calibration set, and the process is repeated until the desired number of calibration samples is obtained. Isaksson, Tomas & Naes, Tormod (Applied Spectroscopy 1990, 44, 1152-1158) used a similar sample selection procedure based on cluster analysis of sample spectra. A Principal Component Analysis of the sample spectra is conducted, and the furthest neighbors are calculated in the variable space defined by the scores for the Principal Components with the largest eigenvalues. Neither selection process makes use of the dependent (Y-Block) variables, and neither is guaranteed to produce unbiased models.
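The distance-based selection attributed above to Kennard and Stone can be sketched in a few lines of Python. The function name `kennard_stone`, the use of Euclidean distance on the raw data, and the tie-breaking order are assumptions of this sketch rather than details taken from the cited papers.

```python
import math


def kennard_stone(X, n_select):
    """Illustrative max-min distance selection of calibration samples.

    X: list of equal-length numeric rows (e.g. raw spectra).
    n_select: desired number of calibration samples (at least 2).
    Returns the indices of the selected rows, in order of selection.
    """
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between samples (assumed metric).
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Start with the two samples that are farthest apart.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i0, j0]
    # min_dist[k] is the distance from sample k to its nearest selected sample.
    min_dist = [min(dist[k][i0], dist[k][j0]) for k in range(n)]
    while len(selected) < n_select:
        remaining = [k for k in range(n) if k not in selected]
        # Add the sample with the largest nearest-neighbor distance.
        nxt = max(remaining, key=lambda k: min_dist[k])
        selected.append(nxt)
        min_dist = [min(min_dist[k], dist[k][nxt]) for k in range(n)]
    return selected


demo = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0], [4.9, 0.0]]
print(kennard_stone(demo, 4))
```

Note that only the X-Block rows enter the calculation; no Y-Block values are consulted, which is the limitation identified above.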
To include Y-Block information in the sample selection process, the following methodology has been used. A list of the samples is sorted based on one of the properties/component concentrations to be modeled. Every m-th sample in the list is marked as a calibration sample. The samples are re-sorted on each successive property/component concentration, and the marking procedure is repeated. The marked samples are designated as calibration samples. The value of m is chosen such that the desired number of calibration samples is selected. The procedure ensures that the samples span the range of the Y-Block. As an alternative, the scores from a Principal Components Analysis (or Constrained Principal Spectra Analysis) of the X-Block data can be included in this procedure to ensure that the calibration samples span both the X- and Y-Blocks. This methodology tends to minimize outliers in the validation set, but it has not been found to produce optimum, unbiased models.
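The sort-and-mark procedure described above can be illustrated by the following Python sketch. The function name `mark_every_mth` is hypothetical, and the choice to begin marking at the first sample of each sorted list is an assumption, since the starting position is not specified above.

```python
def mark_every_mth(Y, m):
    """Illustrative sort-and-mark selection of calibration samples.

    Y: list of rows, one per sample; each row holds the property/component
       concentration values (PCA scores could be appended as extra columns).
    m: marking interval.
    Returns the sorted indices of all samples marked for calibration.
    """
    marked = set()
    n_columns = len(Y[0])
    for col in range(n_columns):
        # Sort the sample indices on the current property.
        order = sorted(range(len(Y)), key=lambda i: Y[i][col])
        # Mark every m-th sample, starting with the first (assumed offset).
        marked.update(order[::m])
    return sorted(marked)


# Six samples, two properties each.
demo = [[5, 1], [1, 9], [3, 4], [2, 7], [4, 2], [6, 3]]
print(mark_every_mth(demo, m=3))
```

Because a sample may be marked on more than one property, the number of calibration samples obtained depends on the overlap between the sorted lists, so m would in practice be adjusted until the desired number is reached.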