Throughout this application, various patents and publications are referred to. The disclosures of these publications and patents are hereby incorporated by reference in their entirety into this application to more fully describe the state of the art to which this invention pertains.
The present invention relates to the field of qualitative and quantitative spectroscopic analysis.
Infrared spectroscopy is a technique which is based upon the vibrational changes of the atoms of a molecule. In accordance with infrared spectroscopy, an infrared spectrum is generated by transmitting infrared radiation through a sample of an organic compound and determining what portion of the incident radiation is absorbed by the sample. An infrared spectrum is a plot of absorbance (or transmittance) against wavenumber, wavelength, or frequency. Infrared radiation is radiation having a wavelength between about 750 nm and about 1000 μm. Near-infrared radiation is radiation having a wavelength between about 750 nm and about 2500 nm.
In order to identify the presence and/or concentration of an analyte in a sample, the near-infrared reflectance or transmittance of a sample is measured at several discrete wavelengths, converted to absorbance or its equivalent reflectance term and then multiplied by a series of regression or weighting coefficients calculated through multiple-linear-regression mathematics.
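The calculation described above can be sketched as follows. This is a minimal illustration, not the method of any cited reference: the function names, the three reflectance readings, the coefficients, and the intercept are all hypothetical, and log10(1/R) is assumed as the absorbance-equivalent reflectance term.

```python
import math

def reflectance_to_absorbance(reflectance):
    """Convert a reflectance R (0 < R <= 1) to the absorbance-equivalent term log10(1/R)."""
    return math.log10(1.0 / reflectance)

def predict_concentration(reflectances, coefficients, intercept):
    """Apply a regression equation: intercept + sum(coefficient_i * absorbance_i)."""
    absorbances = [reflectance_to_absorbance(r) for r in reflectances]
    return intercept + sum(k * a for k, a in zip(coefficients, absorbances))

# Hypothetical reflectances measured at three discrete wavelengths,
# with made-up regression (weighting) coefficients.
reflectances = [0.10, 0.50, 1.00]
coefficients = [2.0, -1.0, 0.5]
predicted = predict_concentration(reflectances, coefficients, intercept=0.25)
print(round(predicted, 4))
```

In practice the coefficients would come from a multiple-linear-regression fit on calibration samples rather than being chosen by hand.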
In the past, analysis was done only via transmission measurements from clear solutions, using solvents having little or no absorbance at the wavelength of the analyte. The absorbance (A) of an analyte in a non-absorbing solution at a specified wavelength is represented by the equation A = abc, wherein a is the absorptivity constant, b is the pathlength of light through the sample, and c is the concentration of the analyte. In this prior art system, the calibration samples consisted of a predetermined set of standards (i.e., samples of known composition) which were run under the same conditions as the unknown samples, thereby allowing for the determination of the concentration of the unknowns.
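As a worked illustration of the relationship between absorbance, absorptivity, pathlength, and concentration, the following sketch computes an absorbance and then recovers the concentration from it. The numerical values and function names are hypothetical and chosen only for the arithmetic.

```python
def absorbance(absorptivity, pathlength, concentration):
    """Absorbance of an analyte in a non-absorbing solution: A = a * b * c."""
    return absorptivity * pathlength * concentration

def concentration_from_absorbance(measured_absorbance, absorptivity, pathlength):
    """Invert A = a * b * c to recover the concentration c."""
    return measured_absorbance / (absorptivity * pathlength)

# Hypothetical values: a = 0.5, b = 1.0, c = 0.8 (units must be mutually consistent).
a_value = absorbance(absorptivity=0.5, pathlength=1.0, concentration=0.8)
c_value = concentration_from_absorbance(a_value, absorptivity=0.5, pathlength=1.0)
print(a_value, c_value)
```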
In early infrared radiation analysis, deviations from Beer's law caused, for example, by instrument noise or a nonlinear relationship between absorbance and concentration were common. Calibration curves, determined empirically, were required for quantitative work. The analytical errors associated with quantitative infrared analysis needed to be reduced to the level associated with ultraviolet and visible methods. Least-squares analysis allowed the chemist to determine a calibration equation. The spectroscopic data (Y) was the dependent variable and the standard concentrations were the independent variable (X).
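A least-squares calibration of this kind can be sketched as below, with the standard concentrations as X and the measured absorbances as Y. The data points are invented for illustration, and `fit_line` is a hypothetical helper implementing ordinary least squares for a straight line.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical standards (X, known concentrations) and their absorbances (Y).
standards = [1.0, 2.0, 3.0, 4.0]
absorbances = [0.21, 0.39, 0.61, 0.79]
slope, intercept = fit_line(standards, absorbances)

# Invert the calibration equation to estimate an unknown from its absorbance.
unknown_absorbance = 0.50
unknown_concentration = (unknown_absorbance - intercept) / slope
print(slope, intercept, unknown_concentration)
```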
Various methods have been developed to improve and expedite the interpretation of NIRA spectra. Examples of methods of processing NIRA spectral data to generate a comparison factor to be used in determining the similarity between the composition of a test sample and a standard material are found in U.S. Pat. No. 5,023,804 issued Aug. 23, 1988 to Hoult; U.S. Pat. No. 4,766,551 issued Aug. 23, 1988 to Begley; U.S. Pat. No. 5,900,634 issued May 4, 1999 to Soloman; U.S. Pat. No. 5,610,836 issued Mar. 11, 1997 to Alsmeyer et al.; U.S. Pat. No. 5,481,476 issued Jan. 2, 1996 to Windig; U.S. Pat. No. 5,822,219 issued Oct. 13, 1998 to Chen et al.
Instruments have improved enormously. The noise and drift associated with earlier instruments have been reduced with the changeover of electronic circuitry from tubes to semiconductor circuits. Modern applications of spectroscopy, and particularly near-infrared spectroscopy, have moved away from simple two-component mixtures to the analysis of multi-component mixtures of an unknown nature (e.g., natural products). However, because of the large amount of sample variance in multi-component mixtures, use of the standard set is no longer possible.
The net result of the evolution of NIRA is to interchange the roles of the spectroscopic and standard values for the calibration samples. Previously, the standard values (i.e., the composition of the known samples) were considered to be more accurate than the spectral data. However, now that the calibration samples are multi-component mixtures of unknown nature, it is the spectroscopic values that are known with better precision and accuracy.
In accordance with the present invention, an automated method for modeling spectral data is provided. The samples are analyzed and spectral data is collected by the method of diffuse reflectance, clear transmission, or diffuse transmission. In addition, for each sample, one or more constituent values are measured. In this regard, a constituent value is a reference value for the target substance in the sample which is measured by an independent measurement technique. As an example, a constituent value used in conjunction with identifying a target substance in a pharmaceutical tablet sample might be the concentration of that substance in the tablet sample as measured by high pressure liquid chromatography (HPLC) analysis. In this manner, the spectral data for each sample has associated therewith at least one constituent value for that sample.
The set of spectral data (with its associated constituent values) is divided into a calibration sub-set and a validation sub-set. The calibration sub-set is selected to represent the variability likely to be encountered in the validation sub-set.
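One simple way to obtain such a division is sketched below. The selection rule is an assumption made for illustration and is not specified in this description: samples are ordered by constituent value and one interior sample out of every three is assigned to the validation sub-set, so that the calibration sub-set retains the extremes and spans the range of the validation sub-set.

```python
def split_calibration_validation(samples, every=3):
    """Split (spectrum, constituent_value) pairs into calibration and validation sub-sets.

    Samples are sorted by constituent value; within each group of `every`
    consecutive samples, the second one goes to validation (every >= 2).
    """
    ordered = sorted(samples, key=lambda sample: sample[1])
    calibration, validation = [], []
    for index, sample in enumerate(ordered):
        if index % every == 1:
            validation.append(sample)
        else:
            calibration.append(sample)
    return calibration, validation

# Nine hypothetical samples: a two-point "spectrum" and a constituent value.
samples = [([0.1 * i, 0.2 * i], float(i)) for i in range(1, 10)]
calibration, validation = split_calibration_validation(samples)
print([value for _, value in validation])
```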
In accordance with a first embodiment of the present invention, a plurality of data transforms is then applied to the set of spectral data. Preferably, the transforms are applied singly and two-at-a-time. The particular transforms used, and the particular combination pairs used, are selected based upon the particular method used to analyze the spectral data (e.g., diffuse reflectance, clear transmission, or diffuse transmission as discussed in the detailed description). Preferably, the entries are contained in an external data file, so that the user may change the list to conform to his own needs and judgement as to what constitutes sensible transform pairs. Preferably, the plurality of transforms applied to the spectral data includes at least a second derivative and a baseline correction. In accordance with a further embodiment of the present invention, transforms include, but are not limited to, the following: performing a normalization of the spectral data, performing a first derivative on the spectral data, performing a second derivative on the spectral data, performing a multiplicative scatter correction on the spectral data, and performing smoothing transforms on the spectral data. In this regard, it should be noted that both the normalization transform and the multiplicative scatter correction transform inherently also perform baseline corrections.
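Applying transforms one at a time and in pairs can be sketched as follows. The transform definitions here are simplifying assumptions for illustration (finite-difference derivatives, a three-point moving average, and a normalization to zero mean and unit standard deviation); the names `TRANSFORMS`, `PAIRS`, and `build_transformed_sets` are hypothetical, and `PAIRS` stands in for the list that would be read from the external data file.

```python
def normalize(spectrum):
    """Scale to zero mean and unit standard deviation (also removes a constant baseline)."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = (sum((v - mean) ** 2 for v in spectrum) / n) ** 0.5
    return [(v - mean) / std for v in spectrum]

def first_derivative(spectrum):
    """Finite-difference first derivative (one point shorter than the input)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

def second_derivative(spectrum):
    """Finite-difference second derivative (two points shorter than the input)."""
    return first_derivative(first_derivative(spectrum))

def smooth(spectrum):
    """Three-point moving average (two points shorter than the input)."""
    return [(a + b + c) / 3 for a, b, c in zip(spectrum, spectrum[1:], spectrum[2:])]

TRANSFORMS = {
    "NULL": list,  # untransformed copy
    "NORM": normalize,
    "D1": first_derivative,
    "D2": second_derivative,
    "SMOOTH": smooth,
}

# Hypothetical list of sensible pairs, applied in the order (first, second).
PAIRS = [("NORM", "D2"), ("SMOOTH", "D1"), ("NORM", "SMOOTH")]

def build_transformed_sets(spectra):
    """Return {transform-name tuple: list of transformed spectra}."""
    results = {}
    for name, func in TRANSFORMS.items():
        results[(name,)] = [func(s) for s in spectra]
    for first, second in PAIRS:
        results[(first, second)] = [TRANSFORMS[second](TRANSFORMS[first](s)) for s in spectra]
    return results

spectra = [[1.0, 4.0, 9.0, 16.0, 25.0]]
results = build_transformed_sets(spectra)
print(sorted(results))
```

The same function would be called on both the calibration and the validation sub-sets so that each modeling equation is later evaluated on identically transformed data.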
Preferably, the normalization transform is combined with each of the first derivative, second derivative, and smoothing transforms; the first derivative transform is combined with the normalization and smoothing transforms; the second derivative transform is combined with the normalization and smoothing transforms; the multiplicative scatter correction transform is combined with the absorption-to-reflection, first derivative, second derivative, Kubelka-Munk, and smoothing transforms; the Kubelka-Munk transform is combined with the normalization, first derivative, second derivative, multiplicative scatter correction, and smoothing transforms; the smoothing transform is combined with the absorption-to-reflection, normalization, first derivative, second derivative, multiplicative scatter correction, and Kubelka-Munk transforms; and the absorption-to-reflection transform is combined with the normalization, first derivative, second derivative, multiplicative scatter correction, and smoothing transforms. In this manner, a set of transformed and untransformed calibration and validation data sets is created.
In a further preferred embodiment, the plurality of transforms applied to the spectral data may further include performing a Kubelka-Munk function, performing a Savitzky-Golay first derivative, performing a Savitzky-Golay second derivative, performing a mean-centering, or performing a conversion from reflectance/transmittance to absorbance.
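Two of these transforms can be illustrated with a short sketch. The Kubelka-Munk function is taken in its usual form, f(R) = (1 - R)^2 / (2R), and mean-centering is assumed here to mean subtracting the mean spectrum of the data set at each wavelength; the function names and sample values are hypothetical.

```python
def kubelka_munk(reflectances):
    """Kubelka-Munk function f(R) = (1 - R)^2 / (2R) applied at each wavelength."""
    return [(1.0 - r) ** 2 / (2.0 * r) for r in reflectances]

def mean_center(spectra):
    """Subtract the mean spectrum (computed across samples) from every spectrum."""
    n = len(spectra)
    means = [sum(column) / n for column in zip(*spectra)]
    return [[v - m for v, m in zip(spectrum, means)] for spectrum in spectra]

km_values = kubelka_munk([0.5, 1.0])
centered = mean_center([[1.0, 2.0], [3.0, 6.0]])
print(km_values, centered)
```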
In one preferred embodiment, the data transforms include performing a second derivative on the spectral data; and performing a normalization, a multiplicative scatter correction, or a smoothing transform of the spectral data. In another preferred embodiment, the data transforms include performing a normalization of the spectral data; and a smoothing transform, a Savitzky-Golay first derivative, or a Savitzky-Golay second derivative of the spectral data. In another embodiment, the data transforms include performing a first derivative of the spectral data; and a normalization, a multiplicative scatter correction, or a smoothing transform on the spectral data.
The plurality of data transforms in the embodiments described above may also include a ratio transform, wherein the ratio transform includes a numerator and a denominator and wherein at least one of the numerator and the denominator is another transform. Most preferably: the numerator comprises one of a baseline correction, a normalization, a multiplicative scatter correction, a smoothing transform, a Kubelka-Munk function, or conversion from reflectance/transmittance to absorbance when the denominator comprises a baseline correction; the numerator comprises a normalization when the denominator comprises a normalization; the numerator comprises a first derivative when the denominator comprises a first derivative; the numerator comprises a second derivative when the denominator comprises a second derivative; the numerator comprises a multiplicative scatter correction when the denominator comprises a multiplicative scatter correction; the numerator comprises a Kubelka-Munk function when the denominator comprises a Kubelka-Munk function; the numerator comprises a smoothing transform when the denominator comprises a smoothing transform; the numerator comprises a Savitzky-Golay first derivative when the denominator comprises a Savitzky-Golay first derivative; and/or the numerator comprises a Savitzky-Golay second derivative when the denominator comprises a Savitzky-Golay second derivative.
One or more of a partial least squares, a principal component regression, a neural net, a classical least squares (often abbreviated CLS, and sometimes called The K-matrix Algorithm) or a multiple linear regression analysis (MLR calculations may, for example, be performed using software from The Near Infrared Research Corporation, 21 Terrace Avenue, Suffern, N.Y. 10901) are then performed on the transformed and untransformed (i.e. NULL transform) calibration data sub-sets to obtain corresponding modeling equations for predicting the amount of the target substance in a sample. Preferably, the partial least squares, principal component regression and multiple linear regression are performed on the transformed and untransformed calibration and validation data sets.
The modeling equations are ranked to select a best model for analyzing the spectral data. In this regard, for each sample in the validation sub-set, the system determines, for each modeling equation, how close the value returned by the modeling equation is to the constituent value(s) for the sample. The best modeling equation is the modeling equation which, across all of the samples in the validation sub-set, returned the closest values to the constituent values, i.e., the modeling equation which provided the best correlation to the constituent values. Preferably, the values are ranked according to a Figure of Merit (described in equations 1 and 2 below).
In accordance with a second embodiment of the present invention, a method for generating a modeling equation is provided comprising the steps of (a) operating an instrument so as to generate and store a spectral data set of diffuse reflectance, clear transmission, or diffuse transmission spectrum data points over a selected wavelength range, the spectral data set including spectral data for a plurality of samples; (b) generating and storing a constituent value for each of the plurality of samples, the constituent value being indicative of an amount of a target substance in its corresponding sample; (c) dividing the spectral data set into a calibration sub-set and a validation sub-set; (d) transforming the spectral data in the calibration sub-set and the validation sub-set by applying a plurality of first mathematical functions to the calibration sub-set and the validation sub-set to obtain a plurality of transformed validation data sub-sets and a plurality of transformed calibration data sub-sets; (e) resolving each transformed calibration data sub-set in step (d) by at least one of a second mathematical function to generate a plurality of modeling equations; (f) generating a Figure of Merit ("FOM") for each modeling equation using the transformed validation data set of step (d); and (g) ranking the modeling equations according to the respective FOMs, wherein the FOM is defined as:
FOM (without bias):
$$\mathrm{FOM} = \sqrt{\frac{\mathrm{SEE}^2 + 2\,\mathrm{SEP}^2}{3}} \qquad (1)$$
FOM (with bias):
$$\mathrm{FOM} = \sqrt{\frac{\mathrm{SEE}^2 + 2\,\mathrm{SEP}^2 + W\,b^2}{3 + W}} \qquad (2)$$
where SEE is the Standard Error of Estimate from the calculations on the calibration data, SEP is the Standard Error of Estimate from the calculations on the validation data, b is the bias of the validation data (bias being the mean difference between the predicted values and corresponding constituent values for the constituent) and W is a weighting factor for the bias.
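Equations (1) and (2) and the subsequent ranking can be sketched as follows. The model labels and the SEE and SEP figures are invented for illustration, the helper names are hypothetical, and the ranking treats a lower FOM as better.

```python
import math

def validation_bias(predicted, reference):
    """Mean difference between predicted values and the corresponding constituent values."""
    return sum(p - r for p, r in zip(predicted, reference)) / len(predicted)

def figure_of_merit(see, sep, bias=None, weight=1.0):
    """Equation (1) when bias is None, otherwise equation (2) with weighting factor W = weight."""
    if bias is None:
        return math.sqrt((see ** 2 + 2 * sep ** 2) / 3)
    return math.sqrt((see ** 2 + 2 * sep ** 2 + weight * bias ** 2) / (3 + weight))

# Hypothetical (SEE, SEP) pairs for three candidate modeling equations.
models = {
    "model A": (0.30, 0.40),
    "model B": (0.20, 0.60),
    "model C": (0.50, 0.35),
}

# Rank the modeling equations from lowest FOM (best) to highest.
ranked = sorted(models, key=lambda name: figure_of_merit(*models[name]))
print(ranked)
```

Because SEP carries twice the weight of SEE, a model that fits the calibration data closely but predicts the validation data poorly (model B above) ranks below one with a more balanced pair of errors.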
In accordance with a third embodiment of the present invention, a computer executable process, operative to control a computer, stored on a computer readable medium, is provided for determining qualitative spectroscopic analysis of a set of data on a computer readable medium, the set of data including, for each of a plurality of samples, corresponding spectral data and a corresponding constituent value, the process comprising the steps of: dividing the spectral data into a calibration sub-set of spectral data and a validation sub-set of spectral data; applying a plurality of data transforms to the spectral data in the validation sub-set and the calibration sub-set, preferably singly and two-at-a-time as described above; applying one or more of a partial least squares, a principal component regression, a neural net, or a multiple linear regression analysis on the transformed and untransformed data sets of the spectral data in the calibration sub-set to obtain a plurality of modeling equations; applying the spectral data in the validation sub-set to each of the plurality of modeling equations to obtain corresponding values; and processing the values in order to select a best modeling equation for analyzing the spectral data. Preferably, the values are processed according to a Figure of Merit (described in equations 1 and 2 above), and the modeling equations are ranked according to the calculated FOM value, with the modeling equation with the lowest FOM value being designated as the best modeling equation.
In other aspects of the above embodiments of the present invention, the modeling equations are ranked: (1) as a function of the standard error of estimate (SEP) of the validation data; (2) as a function of the standard error of estimate (SEE) of the calibration data and the standard error of estimate (SEP) of the validation data; or (3) as a function of a weighted average of the standard error of estimate (SEE) of the calibration data and the standard error of estimate (SEP) of the validation data.
In accordance with another aspect of the embodiments described above, the instrument which generates the spectral data is one of a spectrophotometer, a spectral detector receptive of spectra from the spectrophotometer, a data station receptive of transmittance spectra from the detector, or, most preferably, a near infrared diffuse reflectance detector.
In accordance with another aspect of the embodiments described above, the wavelengths of the spectral data in the calibration and validation sets are between 0.7 and 2.5 μm.
In accordance with yet another aspect of the embodiments described above, the spectral data set is generated from a natural product sample, a process development sample, a raw material sample, or samples generated by biological processes, including, for example, blood samples used in predicting clinical chemistry parameters such as blood glucose.