This invention relates, in its broadest aspect, to correcting measured spectral data of a number of samples for the effects of data arising from the measurement process itself (rather than from the sample components). However, it finds particular application to a method of estimating unknown property and/or composition data of a sample, incorporating steps to provide correction for such measurement process spectral data. Examples of property and composition data are chemical composition measurements (such as the concentrations of individual chemical components, for example benzene, toluene, or xylene, or the concentration of a class of compounds, for example paraffins), physical property measurements (such as density, index of refraction, hardness, viscosity, flash point, pour point, vapor pressure), performance property measurements (such as octane number, cetane number, combustibility), and perception (smell/odor, color).
The infrared (12500-400 cm⁻¹) spectrum of a substance contains absorption features due to the molecular vibrations of the constituent molecules. The absorptions arise from both fundamentals (single quantum transitions occurring in the mid-infrared region from 4000-400 cm⁻¹) and combination bands and overtones (multiple quanta transitions occurring in the mid- and the near-infrared region from 12500-4000 cm⁻¹). The position (frequency or wavelength) of these absorptions contains information as to the types of molecular structures that are present in the material, and the intensity of the absorptions contains information about the amounts of the molecular types that are present. Using the information in the spectra to identify and quantify either components or properties requires that a calibration be performed to establish the relationship between the absorbances and the component or property that is to be estimated. For complex mixtures, where considerable overlap between the absorptions of individual constituents occurs, such calibrations must be accomplished using multivariate data analysis methods.
In complex mixtures, each constituent generally gives rise to multiple absorption features corresponding to different vibrational motions. The intensities of these absorptions will all vary together in a linear fashion as the concentration of the constituent varies. Such features are said to have intensities which are correlated in the frequency (or wavelength) domain. This correlation allows these absorptions to be mathematically distinguished from random spectral measurement noise, which shows no such correlation. The linear algebra computations which separate the correlated absorbance signals from the spectral noise form the basis for techniques such as Principal Components Regression (PCR) and Partial Least Squares (PLS). As is well known, PCR is essentially the analytical mathematical procedure of Principal Components Analysis (PCA), followed by regression analysis. Reference is directed to "An Introduction to Multivariate Calibration and Analysis", Analytical Chemistry Vol. 59, No. 17, September, 1987, pages 1007 to 1017, for an introduction to Multiple Linear Regression (MLR), PCR, and PLS.
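The separation of correlated absorbance signals from uncorrelated noise can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic Gaussian bands, the three hypothetical constituents, the noise level, and the use of NumPy's singular value decomposition as the PCA step. It is not taken from the reference cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
freq = np.linspace(0.0, 1.0, 200)          # arbitrary frequency axis (assumed)

def band(center, width=0.03):
    """Gaussian absorption band on the assumed frequency axis."""
    return np.exp(-((freq - center) ** 2) / (2 * width ** 2))

# Each hypothetical constituent has two bands whose intensities vary together
pure = np.vstack([
    band(0.15) + 0.5 * band(0.60),
    band(0.35) + 0.7 * band(0.80),
    band(0.50) + 0.4 * band(0.95),
])

conc = rng.uniform(0.1, 1.0, size=(30, 3))      # 30 synthetic samples
noise = 1e-3 * rng.standard_normal((30, 200))   # uncorrelated measurement noise
spectra = conc @ pure + noise                   # rows are mixture spectra

# Singular values: the first three carry the correlated signal,
# the remainder sit at the noise level
singular_values = np.linalg.svd(spectra, compute_uv=False)
print(singular_values[:6])
```

In this synthetic case the first three singular values stand well above the rest, which is the gap that PCA-based methods rely on when choosing how many components to keep.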
PCR and PLS have been used to estimate elemental and chemical compositions and to a lesser extent physical or thermodynamic properties of solids and liquids based on their mid- or near-infrared spectra. These methods involve: [1] the collection of mid- or near-infrared spectra of a set of representative samples; [2] mathematical treatment of the spectral data to extract the Principal Components or latent variables (i.e. the correlated absorbance signals described above); and [3] regression of these spectral variables against composition and/or property data to build a multivariate model. The analysis of new samples then involves the collection of their spectra, the decomposition of the spectra in terms of the spectral variables, and the application of the regression equation to calculate the composition/properties.
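Steps [1] to [3] and the analysis of new samples can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any particular publication: the spectra are synthetic, the function names `pcr_fit` and `pcr_predict` are hypothetical, the data are not mean-centered, and the number of components is fixed by hand at three.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Step [2]: extract k principal components; step [3]: regress y on the scores."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = Vt[:k]                      # k x n_frequencies spectral variables
    scores = X @ loadings.T                # n_samples x k
    coef, *_ = np.linalg.lstsq(scores, y, rcond=None)
    return loadings, coef

def pcr_predict(X_new, loadings, coef):
    """Decompose new spectra on the loadings, then apply the regression."""
    return (X_new @ loadings.T) @ coef

rng = np.random.default_rng(0)
freq = np.linspace(0.0, 1.0, 200)          # arbitrary frequency axis (assumed)
centers = [0.3, 0.5, 0.7]                  # hypothetical band positions
pure = np.array([np.exp(-((freq - c) ** 2) / (2 * 0.08 ** 2)) for c in centers])

def measure(conc):
    """Step [1]: simulate collecting spectra for the given concentrations."""
    return conc @ pure + 1e-3 * rng.standard_normal((conc.shape[0], freq.size))

conc_cal = rng.uniform(0.1, 1.0, size=(30, 3))
loadings, coef = pcr_fit(measure(conc_cal), conc_cal[:, 0], k=3)

# Analysis of new samples: estimate the first constituent's concentration
conc_new = rng.uniform(0.1, 1.0, size=(5, 3))
estimate = pcr_predict(measure(conc_new), loadings, coef)
print(np.abs(estimate - conc_new[:, 0]).max())
```

PLS differs in how the latent variables are extracted, but the calibrate-then-predict structure is the same.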
The mathematical/statistical treatment of spectral data using PCR or PLS does not differentiate among possible sources of signals which are correlated in the frequency domain. In particular, PCR and PLS do not differentiate between signals arising from variations in sample components and signals arising from variations in the spectral measurement process. For mid- and near-infrared spectra, common measurement process signals include, but are not limited to, variations in the spectral baseline due to changes in instrument performance or changes in cell window transmittance, and signals due to water vapor and/or carbon dioxide in the spectrometer light path. These measurement process signals can contribute to the Principal Components or latent variables obtained by PCR or PLS, and may be correlated to the composition/property data during the regression. The resultant regression model will then be sensitive to variations in these measurement process variables, and measured compositions or properties can be in error.
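A deliberately contrived sketch shows how a measurement process signal can end up in the regression model. It assumes a single hypothetical analyte band and a baseline offset that, in the calibration set only, happens to track concentration exactly (for example because the instrument drifted while samples were run in order of concentration). The data, the noise level, and the one-component PCR model are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
freq = np.linspace(0.0, 1.0, 100)                       # arbitrary axis (assumed)
band = np.exp(-((freq - 0.5) ** 2) / (2 * 0.05 ** 2))   # hypothetical analyte band

conc = np.linspace(0.1, 1.0, 20)
baseline = 0.2 * conc            # baseline drift that tracks concentration (assumed)
X = (np.outer(conc, band) + baseline[:, None]
     + 1e-4 * rng.standard_normal((20, 100)))

# One-component PCR: the single loading mixes the band and the baseline
_, _, Vt = np.linalg.svd(X, full_matrices=False)
loading = Vt[:1]
coef, *_ = np.linalg.lstsq(X @ loading.T, conc, rcond=None)

def predict(spectrum):
    return float((spectrum @ loading.T) @ coef)

# Same sample (concentration 0.5) measured with two different baselines
as_calibrated = predict(0.5 * band + 0.1)   # baseline follows the calibration trend
drifted = predict(0.5 * band + 0.3)         # baseline 0.2 absorbance units higher
print(as_calibrated, drifted)
```

The model recovers the concentration when the baseline behaves as it did during calibration, but the estimate shifts when the baseline changes independently of the sample. The model has no way to tell the two signal sources apart.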
In addition to sensitivity to measurement process signals, methods based on PCR or PLS do not correct for variations in the overall scaling of the spectral data. Such scaling variations can result from a variety of factors including variations in cell pathlength due to positioning of the cell in the spectrometer, and expansion or contraction of the cell during use. For situations where the sample flows through the cell during the measurement, variations in flow can also cause variations in the scaling of the spectral data which are equivalent in effect to variations in pathlength. PCR and PLS models require that spectral data be scaled to a specified pathlength prior to analysis, thus requiring that the pathlength be separately measured. The separate measurement of the cell pathlength prior to the use of the cell in collection of the sample spectrum is inconvenient or, in some cases (e.g. for an on-line flow cell), impossible, nor does such separate measurement necessarily account for the sources of variation mentioned above. Errors in the measured pathlength produce proportional errors in the composition/property data estimated by PCR and PLS models.
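The proportionality between pathlength error and estimation error follows from the linearity of the model, as a short sketch can show. It assumes noiseless synthetic spectra of two hypothetical constituents, a two-component PCR model without an intercept term, and a 5% pathlength error. All of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
freq = np.linspace(0.0, 1.0, 100)                 # arbitrary axis (assumed)
pure = np.array([np.exp(-((freq - c) ** 2) / (2 * 0.05 ** 2)) for c in (0.3, 0.6)])

conc = rng.uniform(0.1, 1.0, size=(20, 2))
X = conc @ pure                                   # calibration at nominal pathlength

# Two-component PCR for the first constituent, no intercept
_, _, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:2]
coef, *_ = np.linalg.lstsq(X @ loadings.T, conc[:, 0], rcond=None)

def predict(spectrum):
    return float(spectrum @ loadings.T @ coef)

sample = np.array([0.4, 0.7]) @ pure              # true concentration is 0.4
nominal = predict(sample)
long_path = predict(1.05 * sample)                # pathlength 5% longer than assumed
print(nominal, long_path / nominal)
```

Because the estimate is a linear function of the spectrum, scaling the spectrum by 1.05 scales the estimate by the same factor.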