1. Field of Invention
This invention relates generally to a method for estimation of properties of materials by near-infrared analysis through improved spectral-model stability and accuracy and through improved ability to ascertain the applicability of a spectral model to an unknown. A particular example of this method is improved estimation of the octane numbers of gasolines by near-infrared (NIR) analysis.
2. Description of the Related Art
Because materials of different chemical compositions exhibit measurable differences in their absorption of infrared radiation, near-infrared (NIR) analysis or mid-infrared analysis can be used to estimate both the chemical composition of material and the physical properties of materials which result from their chemical composition.
NIR analysis is currently a common method for analyzing agricultural products. For example, NIR is used to analyze the protein content of wheat and other grains. In recent years, NIR also has been applied in the petrochemical industry for analysis of both chemical composition (e.g. aromatic and saturates content) and physical properties (e.g. octane number, density, vapor pressure) of hydrocarbons including gasoline.
NIR analysis has been of particular interest in the field of octane measurement for gasoline. Octane number ratings are a measure of the resistance of a gasoline to engine knock. There are two basis types of octane number ratings corresponding to two different sets of conditions under which the engine test is performed. For the same gasoline, the less severe test (Research Octane Number or RON) results in a higher octane number rating than does the more severe test (Motor Octane Number or MON). The average of RON and MON is often called Pump Octane Number (PON) because it is the number which is posted on gasoline station pumps. Pump Octane Number is alternatively called Road Octane Number because it represents the average performance of a gasoline under conditions of varying severity as one would likely encounter when actually driving down a road.
Historically, octane has been directly measured in test engines, where one listens for a knock in the engine, where the knock is the uncontrolled explosion of the last portion of the fuel-air mixture in the cylinder. An empirical scale was determined in the 1930's for octane wherein pure iso-octane was defined as 100, normal heptane as 0 and mixtures of the two were used to define intermediate octane numbers. Large one-cylinder engines are used for the direct measurement of the octane rating of a gasoline by comparing the intensity of the knock of the gasoline to that of a standard mixture of iso-octane and normal heptane and adjusting the compression ratio until the knock intensity of the gasoline is the same as for the standard prior to the adjustment.
Octane measurement using a knock engine is considered the primary analytical technique for determining the octane rating of a gasoline. NIR analysis on the other hand is a secondary analytical technique that is calibrated against the primary analytical technique. Typically, the primary analytical technique for any type of measurement is somewhat slow and cumbersome. As a secondary technique, NIR analysis is a fast and convenient method of inferring the chemical and physical properties of a substance. Regardless of the material to be analyzed, NIR analysis requires a training set of samples of the material for which one obtains both the near-infrared spectra and the primary lab measurements of the properties of interest.
Using regression mathematics, one then correlates the NIR spectra of the calibration set to the primary reference method measurements of the properties of these samples. The resulting regression equations allow one to estimate the properties of unknown samples of material (ones for which primary lab measurements have not been made) directly from their NIR spectra.
Chemometric techniques are often used to relate one set of properties of a sample (which is difficult or time-consuming to measure directly) to another set of properties (which is easier and faster to measure directly). For example, within seconds, one can estimate the protein content of wheat from the near-infrared reflection spectra of the flour. Similarly, a complicated physical property such a octane number can be related to the near-infrared transmission spectra of the gasolines.
Infrared spectra contain information about the number and type of functional groups that are associated with a sample's chemical composition. Infrared spectra are simply measurements of absorbance (sample darkness) versus wavelength or wavenumber (infrared "color"). Absorbance (rather than percent transmittance) is used for quantitative spectroscopy because, according to Beer's law, it is the absorbance which is proportional to the concentration of the absorbing species.
For transmission spectroscopy (light going completely through the sample), the absorbance, A, at a wavelength, .lambda., is defined as the base-ten logarithm of the ratio of intensity of light, I.sub..lambda.o, at wavelength .lambda., which enters the sample to the intensity of light I.sub..lambda., at wavelength .lambda., which exits the sample as shown in Equation 1: EQU A.sub..lambda. =log.sub.10 (I.sub..lambda.o /I.sub..lambda.) (1)
Therefore, 100% transmittance corresponds to zero absorbance, 10% transmittance corresponds to one unit of absorbance and 1% transmittance corresponds to two absorbance units. Notice that there is a nonlinear logarithmic relationship between transmittance, T.sub..lambda. =I.sub..lambda. /I.sub..lambda.o and the absorbance, EQU A.sub..lambda. =log.sub.10 (1/T.sub..lambda.).
To perform near-infrared calibration, one first obtains the near-infrared spectra of a training set of samples for which the properties of interest have been measured by traditional analytical techniques. Next, one tries to develop a regression equation (model) that relates the training set spectra to the properties. It is possible to correlate NIR spectra to physical properties such as octane number because these physical properties ultimately depend on chemical composition--albeit in complex, and often unknown, ways.
Equation 2 shows the final form of linear models regardless of whether they were obtained by multiple linear regression, principal components regression, or partial least squares regression. The model consists of those constants C.sub.0, C.sub.1, C.sub.2 . . . C.sub.N that provide the best fit between the property of interest and the training set spectra. Here, A.sub.1, A.sub.2, . . . A.sub.N are the absorbances at wavelengths .lambda..sub.1, .lambda..sub.2, . . . .lambda..sub.N which, taken collectively, constitute the spectrum. EQU Property=C.sub.0 +C.sub.1 A.sub.1 +C.sub.2 A.sub.2 + . . . +C.sub.N A.sub.N (2)
Any model which is designed to fit a particular set of points is at risk for interpolation (inlier) errors and/or extrapolation (outlier) errors in those regions where there are no data points. This can easily be illustrated with a simple one-dimensional example.
Consider the function F(x) given in Equation 3. It consists of a linear function of x, plus a quadratic function of x (that rapidly fades away when x&lt;2), plus a cubic function of x (that rapidly fades away with distance from x=1). EQU F(x)=x+(x-2).sup.2 [tanh(5(x-2))]+(x-1).sup.3 [300e.sup.-20(x-1)(x-1) ](3)
The solid curve in FIG. 1 is a plot of F(x) from x=0 to x=4. Table 1 lists 25 values of F(x) that were calculated from Equation 1 for selected values of x. These (x,y) data points are displayed as X's in FIG. 1. Notice that the values of x were carefully chosen so as to skip the valley and hill around x=1 and to stop before x=2 where the quadratic component of F(x) begins to dominate over the linear component.
For the data points listed in Table 1, one can build a simple linear model, Y.apprxeq.X, as shown by the dashed line in FIG. 1. The best linear fit to these data is the line Y=1.004X-0.002 which has an excellent correlation coefficient (R=0.999) and a superbly low standard error of calibration (0.027). Despite these seemingly good statistics, the model fails conspicuously around x=0.7, around x=1.3, and beyond x=2.
The point of this example is that no matter how good the correlation coefficient nor how low the standard error of calibration (SEC), a model is at risk for significant error in any region where there are no data points. Those regions that lie beyond the bounds of the data used are referred to as outlier risk zones. They are the places where extrapolation errors may occur. Inlier risk zones, on the other hand, are the gaps within the bounds of the data used in the model. They are places where model interpolation errors may occur. A good test exists for outliers (the Mahalanobis distance from the centroid of the data), but not for inliers.
To extend the discussion of inliers and outliers to functions of more than one variable, it is useful to define the Mahalanobis distance. The Mahalanobis distance, D, is a generalization of the familiar Euclidean distance, d, in which the measurement of distance is distorted along the principal directions of variation of the training set spectra. EQU Euclidean distance: d.sup.2 =x.sup.t Ix=.SIGMA.x.sub.i x.sub.i =.SIGMA.x.sub.i.sup.2 (4) EQU Mahalanobis distance: D.sup.2 =x.sup.t (XX.sup.t).sup.-1 x=.SIGMA.x.sub.i M.sub.ij x.sub.j (5)
where X=matrix of mean-centered calibration spectra
(or the scores of the principal components or the scores of the latent variables of these spectra)
XX.sup.t =N(XX.sup.t /N)=number of samples times covariance matrix
(XX.sup.t).sup.-1 =inverse of number of samples times covariance matrix
x=a spectrum whose distance from the center is sought or the difference between two spectra whose relative distance is sought.
One theorem associated with the Mahalanobis distance is that the root-mean-square (RMS) Mahalanobis distance is equal to the square root of the number of fitting parameters, k, used by the model divided by the number, N, of calibration samples that the model is fitting with these parameters. That is, EQU D.sub.RMS =(k/N).sup.1/2 (6)
As shown in FIG. 2, one can think of the Mahalanobis distance (M-distance) as something which is measured by a special ruler that stretches and compresses with orientation. Often, the M-distance is expressed in units of root-mean-square M-distances. This "standardized" M-distance can be thought of as a generalization of the concept of standard deviation to multiple (and possibly correlated) dimensions. Samples are described as being 1, 2, or 3 M-distances away from the center of the cloud of spectral data points in much the same way as measurements might be described as being 1, 2, or 3 standard deviations away from the average of the measurements.
As shown in FIG. 3, an outlier is a sample that lies outside the cloud of calibration set points. It is at risk for NIR model extrapolation error. The M-distance can be used to test for outliers. Typically, the outlier cutoff is taken to be somewhere between the square-root-of-2 and 2 times the root-mean-square Mahalanobis distance from the center of the cloud of data points.
As shown in FIG. 4, an inlier is a sample that resides in a gap in the cloud of calibration set samples. An inlier is at risk for NIR model interpolation error. As described in their article entitled "Population Definition, Sample Selection, and Calibration Procedures for Near-Infrared Reflectance Spectroscopy," Crop Science, 31, 469 (1991), J. S. Shenk and M. O. Westerhaus (Shenk and Westerhaus) used a "no-nearest-neighbor" cutoff of 0.6 Mahalanobis distance to define an inlier. One problem with the Shenk and Westerhaus definition is that it is too strict in regions where the response surface of the property of interest is flat and too lenient in regions where the response surface of the property of interest is "hilly."
It is analogous to a surveyor saying that he can generate accurate elevation maps if he has an elevation point every 10 miles. In flat deserts, 10 miles would be an unnecessarily frequent sampling rate, while in rugged, mountainous regions, 10 miles would be an inadequately sparse sampling rate because an entire hill or valley could be missed.
This issue is of considerable practical importance, especially when using NIR to estimate physical properties such as octane number that have very complicated response surfaces. FIG. 5 shows that the pump octane number response surface is very "hilly." Here it is seen plotted against those two principal component scores that, together, account for 81% of the variance of the pump octane number.
Thus, the Shenk and Westerhaus method for determining inliers in a set of data has several deficiencies. It is unable to automatically take into account how rapidly the local response surface is changing relative to the local spectral spacing of the calibration samples. It would be desirable to automatically "adjust" to the ruggedness of the response surface terrain. Other practical problems with using the Shenk and Westerhaus definition of an inlier are that it requires one to keep the calibration set handy at the application location and that it takes a very long time (despite using a computer) because one must compute the M-distance from an unknown to each and every calibration set sample. On information and belief, the Shenk and Westerhaus method for determining inliers is the only prior art test for inliers, and it is quite slow and, for many response surfaces, unreliable. Thus, it is desirable to have a fast and reliable test for inliers and a method to estimate the degree of risk for significant model interpolation error (the inlier probability) at any point in the inlier risk zones.