1. Field of the Invention
The present invention relates to the development of optimized multivariate models through a calibration set of empirical data. More particularly, the present invention relates to the automatic selection of a data sub-set from a larger set of potential calibration data that provides improved performance (accuracy) and robustness.
2. Description of Related Technology
In general, near-infrared (NIR) diffuse reflectance spectroscopy involves the illumination of a spot on the body with low-energy near-infrared light (700-2500 nm). The light is partially absorbed and scattered, according to its interaction with chemical components within the tissue, prior to being reflected back to a detector. The absorbance of light at each wavelength is a function of the structural properties and chemical composition of the tissue. Tissue layers, each containing a unique heterogeneous particulate distribution, affect light absorbance through scattering. Chemical components such as water, protein, fat, and analytes absorb light proportionally to their concentration through unique absorption profiles or signatures. The measurement of glucose is based on detecting the magnitude of light scatter and attenuation related to its concentration as spectrally manifested through the use of a calibration.
A calibration is a mathematical model, $g(\cdot)$, that relates a set of M independent variables, $x \in \mathbb{R}^{M \times 1}$, to a dependent variable, y, through $\hat{y} = g(x)$, where $\hat{y}$ is an estimate of the dependent variable. In the linear case, $\hat{y} = xG + b$, where $G \in \mathbb{R}^{M \times 1}$ is a regression vector and b is an offset. The process of calibration involves the determination of $g(\cdot)$ on the basis of an exemplary set of N paired data points or samples, called the “calibration set”. Each sample consists of a measurement of the independent variable, x, and an associated measurement of a dependent variable, y. The structure of $g(\cdot)$ is designed through the process of system identification [L. Ljung, System Identification: Theory for the User, 2d.ed., Prentice Hall (1999)]. The model parameters are calculated using known methods including multivariate regression or weighted multivariate regression [N. Draper, H. Smith, Applied Regression Analysis, 2d.ed., John Wiley and Sons, New York (1981)], principal component regression [H. Martens, T. Naes, Multivariate Calibration, John Wiley and Sons, New York (1989)], partial least squares regression [P. Geladi, B. Kowalski, Partial least-squares regression: a tutorial, Analytica Chimica Acta, 185, pp.1-17, (1986)], or artificial neural networks [S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River N.J. (1994)].
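By way of illustration only, the linear case described above can be sketched with ordinary least squares, one of the multivariate regression methods mentioned. The function names, the synthetic noise-free data, and the choice to store each sample x as a row of a matrix X (so that the product xG is conformable) are assumptions of this sketch and are not taken from the cited references.

```python
import numpy as np

def fit_linear_calibration(X, y):
    """Estimate regression vector G and offset b so that X @ G + b approximates y.

    X: (N, M) array, one sample per row (assumed layout).
    y: (N,) array of reference values.
    """
    n_samples = X.shape[0]
    # Append a column of ones so the offset b is estimated with G.
    A = np.hstack([X, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def predict(X, G, b):
    """Return the estimate y_hat = X @ G + b for each row of X."""
    return X @ G + b

# Hypothetical calibration set: N = 50 samples, M = 3 independent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
G_true = np.array([2.0, -1.0, 0.5])   # illustrative values only
b_true = 4.0
y = X @ G_true + b_true               # noise-free for clarity

G, b = fit_linear_calibration(X, y)
y_hat = predict(X, G, b)
```

With real spectra the independent variables are typically far more numerous and collinear than in this toy set, which is why rank-reducing methods such as principal component regression or partial least squares are listed alongside plain regression.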
As indicated above, a primary use of a calibration is for the estimation of a dependent variable on the basis of an independent measurement. In the case of the non-invasive measurement of glucose through near-infrared spectroscopy, the dependent variable is the subject's glucose concentration and the independent variable is a near-infrared spectrum, after suitable processing. However, the use of calibrations is not limited to non-invasive measurement of glucose but, rather, applies to any application in which an indirect measurement of a property value (dependent variable) is required on the basis of more than one independent variable.
The design and collection of the calibration set is of great importance because the performance of the resulting model is intimately linked to the quality of the calibration data [see, for example, T. Isaksson, T. Naes, Selection of samples for calibration in near-infrared spectroscopy. Part I: general principles illustrated by example, Applied Spectroscopy, Vol. 43, No. 2, pp. 328-335, 1989 and T. Isaksson and T. Naes, Selection of samples for calibration in near-infrared spectroscopy. Part II: selection based on spectral measurements, Applied Spectroscopy, Vol. 44, No. 7, pp. 1152-1158, 1990]. A minimal requirement is that the data in the calibration set must comprehensively represent the potential variation in x and y. However, this criterion does not guarantee a calibration set will be sufficient. In particular, two significant problems related to the calibration set can adversely affect the determination of $g(\cdot)$. First, individual paired data points can contain errors in x or y as a result of measurement error, poor instrument performance and other anomalies. Such data points, often referred to as “outliers”, should be removed to avoid a poor estimate of $g(\cdot)$.
Second, interfering variables or constituents in x that are present at the time of data collection introduce the potential for unintended or ancillary correlations between the dependent variable and other unrelated variables. If this correlation is manifested in x and is consistent throughout the calibration set, a false calibration will result that fails when this correlation is absent. In the case of noninvasive glucose measurement, the potential for false correlations is a consequence of the complexity of the sample and the measurement process [see M. Arnold, J. Burmeister, G. Small, Phantom glucose calibration models from simulated noninvasive human near-infrared spectra, Analytical Chemistry, vol. 70:9, pp. 1773-1781 (May 1, 1998)]. The multifaceted matrix of blood and tissue constituents introduces the potential for unintended correlations between glucose and other analytes.
In addition, the glucose levels of subjects may move relatively slowly throughout the course of a data collection period and may correspond consistently with other variables such as time, sample order, instrument drift, room temperature, patient skin temperature and skin hydration. Therefore, experimental conditions can lead to spectral aberrations that fortuitously vary consistently with glucose. Models based on data containing fortuitous and spurious correlations between glucose and other variables are erroneous and therefore not suitable for directing insulin therapy in diabetics.
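The risk described above can be made concrete with a simple screen that computes the Pearson correlation between the reference glucose values and each recorded ancillary variable. The following is a minimal sketch under stated assumptions: the variable names, the synthetic data, and the 0.8 flagging threshold are hypothetical and chosen only for illustration.

```python
import numpy as np

def ancillary_correlations(y, ancillary):
    """Pearson correlation of y with each named ancillary variable."""
    return {name: float(np.corrcoef(y, values)[0, 1])
            for name, values in ancillary.items()}

# Hypothetical collection period: glucose rises steadily with sample order.
sample_order = np.arange(40, dtype=float)
y = 90.0 + 2.0 * sample_order

ancillary = {
    "sample_order": sample_order,
    # A factor that alternates between two levels from sample to sample.
    "alternating_factor": np.tile([1.0, -1.0], 20),
}

r = ancillary_correlations(y, ancillary)

# Flag variables whose correlation with glucose exceeds an arbitrary threshold.
flagged = [name for name, value in r.items() if abs(value) > 0.8]
```

Here glucose is an exact linear function of sample order, so any spectral effect that tracks sample order (instrument drift, for example) could be mistaken for a glucose signal, whereas the alternating factor is nearly uncorrelated with glucose and poses little such risk.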
Therefore, the creation of a suitable calibration set is generally performed on the basis of an experimental design and subsequent execution of the experiment [see H. Martens, T. Naes, Multivariate Calibration, John Wiley and Sons, New York (1989)]. However, there are often circumstances that prohibit a comprehensive experimental design and/or involve uncontrollable samples. For example, when the target apparatus involves the measurement of an attribute of a biological system, such as near-infrared measurement of glucose in humans, absolute control of the diversity of factors affecting calibration is difficult. As reported by S. Malin, T. Ruchti, An Intelligent System for Noninvasive Blood Analyte Prediction, U.S. Pat. No. 6,280,381 (Aug. 28, 2001), commonly-owned with the current application, uncontrollable chemical, structural, and physiological variations occur in tissue that produce dramatic and nonlinear changes in the optical properties of the tissue sample.
In such circumstances, an additional step of selecting a suitable subset of calibration data from a larger data set is desirable. Several methods have been reported that select a calibration subset on the basis of the independent variable [see, for example, D. E. Honigs, G. M. Hieftje, H. L. Mark and T. B. Hirschfeld, Unique-Sample Selection via Near-Infrared Spectral Subtraction, Analytical Chemistry, Vol. 57, No. 12, pp. 2299-2303, 1985; E. Bouveresse, C. Hartmann, D. L. Massart, I. R. Last, and K. A. Prebble, Analytical Chemistry, Vol. 68, pp. 982-990, 1996; and Isaksson, et al., supra (1)]. Such methods fail to make use of the dependent variables and are not guaranteed to produce unbiased models. In addition, the problem of fortuitous correlations remains unaddressed in these reports.
J. M. Brown, Method for optimizing multivariate calibrations, U.S. Pat. No. 6,233,133 (Apr. 24, 2001) describes a method for selecting a subset of samples for calibration on the basis of a larger set to minimize the bias in the y-block while ensuring that the x-block range is adequately spanned by the calibration set. However, the method fails to address the problem of ancillary correlations that, in certain applications, are pervasive within the larger data set. In addition, the method of selection is undirected and based upon a fitness function that depends upon the results from a calibration model that is calculated for each potential subset. Consequently, the results may vary significantly on the basis of the method of calibration and the determination of the suitable rank of the calibration model.
Fundamentally, no method has been reported that automatically selects calibration samples so as to minimize the potential for a calibration model that includes spurious correlations. In addition, no automated process has been designed to enhance the accessibility of the target signal within the calibration set while minimizing the correlation to interfering variables. Finally, no method has been reported that automatically identifies and removes invalid samples from a calibration set.
In view of the problems left unsolved by the prior art, there exists a need for a method to optimize the calibration set in a manner that reduces the likelihood of spurious correlations. Further, it would be beneficial to provide a method of selecting calibration samples that enables the efficient extraction of the target signal. Finally, it would be a significant advancement if the method were automatic and, as part of its operation, removed invalid samples.