1. Field of the Invention
This invention relates in general to making calibrated measurements of analytes in samples, which are illuminated with electromagnetic radiation, so as to produce a scattered spectrum. Specifically, the invention permits determination of the concentration of an analyte in complex mixtures, wherein there is significant spectral overlap between the analyte and other compounds present in the sample.
2. Background and Relevant Art
The field of chemometrics is primarily devoted to mathematical techniques whereby the concentration or presence of a target analyte can be ascertained from data which contains signals from other compounds. A typical problem occurs when the data consist of spectra, and the sample consists of a mixture of compounds, one or more of the spectra of such compounds overlapping with that of the analyte of interest. In general, there are several classes of additional information which can be helpful in isolating the signal of the analyte. The spectrum of the analyte may be measured in advance in a sample which does not contain other compounds. Alternatively, if it is not convenient to measure the analyte by itself, a preparation can be made without the analyte, and the spectrum of this preparation ascertained. Then the analyte can be added to the preparation, and a second spectrum taken. When the first spectrum is subtracted from the second, the resulting spectrum should be that of the analyte. Similarly, the spectra of other compounds which are thought to be present in the sample can sometimes be measured beforehand.
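By way of illustration only, the subtraction procedure described above may be sketched as follows. The sketch assumes that both spectra are sampled on the same set of spectral channels and that the contribution of the analyte is purely additive; the function name and the numerical values are hypothetical and are not taken from any measurement.

```python
def difference_spectrum(with_analyte, without_analyte):
    """Subtract the spectrum of the preparation without the analyte
    from the spectrum taken after the analyte has been added.

    Both inputs are lists of intensities on the same spectral channels
    (an assumption of this sketch).
    """
    if len(with_analyte) != len(without_analyte):
        raise ValueError("spectra must share the same spectral channels")
    return [a - b for a, b in zip(with_analyte, without_analyte)]


# Hypothetical intensities on four spectral channels.
preparation = [0.10, 0.40, 0.30, 0.20]   # first spectrum, no analyte
spiked = [0.10, 0.90, 0.80, 0.20]        # second spectrum, analyte added

# Under the additivity assumption, this estimates the analyte spectrum.
analyte = difference_spectrum(spiked, preparation)
```

In practice the subtraction is exact only to the extent that the addition of the analyte does not otherwise alter the spectrum of the preparation.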
In addition, it is sometimes possible to make an independent measurement of the concentration of an analyte in the sample by a second method. If multiple samples are available wherein independent measurements of the analyte have been made, and the analyte is present in differing concentrations in these samples, it may be possible to calibrate the spectroscopic data on the basis of these independent measurements. Then, when an additional sample is presented to the spectroscopic apparatus, wherein the concentration of the analyte is not known, the calibration obtained from the prior set of samples can be used to ascertain the concentration of the analyte in the new sample.
The set of samples wherein the concentration of the analyte has been ascertained by an independent method and which are used to create the calibration of the spectroscopic apparatus is called the “training set.” The new sample or samples wherein the concentration of the analyte is unknown is called the “test set.” The spectra of the analyte or of other substances present in the samples may or may not be known.
A common means of ascertaining whether the calibration will properly predict the concentration of an analyte in a new sample is called “cross-validation.” In this process, the concentration of the analyte in all the samples is measured by independent means. The set is then segmented into two subsets: one serves as the training set and the other as the test set. The concentration of the analyte in the test set is predicted by the calibration obtained from the training set. These predictions can then be compared with the actual, independently measured concentrations of the analyte in the test set. The assignment of samples to the training and test sets can be permuted in many patterns, so that the concentration of the analyte in every sample or subset of samples can be predicted from the remaining samples.
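One permutation pattern, in which each sample in turn forms a test set of one, may be sketched as follows. This is a minimal illustration and not a prescribed procedure: it assumes that each spectrum has been reduced to a single signal value and that the calibration is a straight line fitted by least squares. The function names and the data are hypothetical.

```python
import math


def fit_line(signal, conc):
    """Least-squares calibration line: conc ~ slope * signal + intercept."""
    n = len(signal)
    mean_s = sum(signal) / n
    mean_c = sum(conc) / n
    sxx = sum((s - mean_s) ** 2 for s in signal)
    sxy = sum((s - mean_s) * (c - mean_c) for s, c in zip(signal, conc))
    slope = sxy / sxx
    return slope, mean_c - slope * mean_s


def leave_one_out(signal, conc):
    """Predict each sample from a calibration trained on all the others."""
    predictions = []
    for k in range(len(signal)):
        train_signal = signal[:k] + signal[k + 1:]   # training set
        train_conc = conc[:k] + conc[k + 1:]
        slope, intercept = fit_line(train_signal, train_conc)
        predictions.append(slope * signal[k] + intercept)  # test set of one
    return predictions


def rms_error(predictions, conc):
    """Root-mean-square difference between predicted and reference values."""
    n = len(conc)
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(predictions, conc)) / n)


# Hypothetical signal values and independently measured concentrations.
signal = [1.0, 2.1, 2.9, 4.2, 5.0]
conc = [10.0, 20.0, 30.0, 40.0, 50.0]

predictions = leave_one_out(signal, conc)
error = rms_error(predictions, conc)
```

The root-mean-square error over the held-out predictions gives one figure by which the predictions may be compared with the independently measured concentrations.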
A number of basic algorithms have been developed to create a calibration based on data from a training set. Let the matrix X consist of spectra from the multiple samples. Let Y be the independently measured concentrations of the analyte or analytes in each sample. If there is one analyte which has been measured independently, Y is a vector consisting of concentration versus sample. If multiple analytes have been independently measured, then Y is a matrix. The purpose of the algorithm is to calculate a model which, when applied to a new spectrum from a new sample, will correctly predict the concentration of the analyte or analytes of interest. An adequate treatment of the basic approaches can be found in “Multi-Way Analysis,” by A. Smilde, R. Bro, and P. Geladi, Pub. John Wiley and Sons, 2004, ISBN 0-471-98691-7.
If the spectra of all the compounds in the mixture are known, then it is sometimes possible to use ordinary regression to extract the concentration of any one of the compounds from a spectrum of the mixture. Even in this case, the regression may be ill-conditioned because the spectra of the substances will not, in general, be an orthogonal set. In regression, when the independent variables are partially collinear, the error in the weights assigned to the independent variables may be excessive. Also, in many important cases, the spectrum of every compound in the mixture may not be known. Frequently, some of the significant compounds in the mixture have not been identified.
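The ill-conditioning referred to above may be illustrated, for a mixture of only two compounds with known spectra, by the following sketch. It solves the least-squares normal equations directly and computes the condition number of the 2 x 2 matrix of spectral inner products. The function names and the spectra are hypothetical, and the two spectra are assumed not to be identical.

```python
import math


def dot(a, b):
    """Inner product of two spectra sampled on the same channels."""
    return sum(x * y for x, y in zip(a, b))


def gram(s1, s2):
    """Entries (a, b, d) of the symmetric matrix [[a, b], [b, d]]."""
    return dot(s1, s1), dot(s1, s2), dot(s2, s2)


def solve_two_component(mixture, s1, s2):
    """Least-squares concentrations c1, c2 for mixture ~ c1*s1 + c2*s2."""
    a, b, d = gram(s1, s2)
    r1, r2 = dot(s1, mixture), dot(s2, mixture)
    det = a * d - b * b          # approaches zero as the spectra overlap
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det


def condition_number(s1, s2):
    """Ratio of the largest to the smallest eigenvalue of the Gram matrix."""
    a, b, d = gram(s1, s2)
    mean = (a + d) / 2.0
    half_gap = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return (mean + half_gap) / (mean - half_gap)


# Hypothetical spectra on three channels.
distinct = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # orthogonal spectra
overlapping = ([1.0, 1.0, 0.0], [1.0, 1.0, 0.1])   # nearly collinear spectra

cond_distinct = condition_number(*distinct)
cond_overlapping = condition_number(*overlapping)

# Noise-free mixture of the overlapping pair with concentrations 2 and 3.
mixture = [2.0 * u + 3.0 * v for u, v in zip(*overlapping)]
c1, c2 = solve_two_component(mixture, *overlapping)
```

With noise-free data the concentrations are recovered in both cases; the much larger condition number of the overlapping pair indicates how strongly any measurement noise would be amplified in the estimated weights.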
To resolve the problems associated with ordinary regression, a variety of techniques have been developed that do not rely on a priori knowledge of the spectra of the compounds in the mixture. In principal component analysis, the following decomposition of the data is made:
$$X_{ij} = \sum_{p=1}^{n_{\mathrm{comp}}} P_{ip} T_{jp} + \varepsilon_{ij} \qquad (1)$$
where the $P_{ip}$ are called the “loadings,” the $T_{jp}$ are called the “scores,” and $\varepsilon_{ij}$ is the residue after the specified number of components are derived. In our convention, the loadings will be spectra and the scores will be sample sequences, which, for example, may be time series. The $P_{ip}$ are an orthonormal set, and the $T_{jp}$ are an orthogonal set. The number of components for the decomposition is $n_{\mathrm{comp}}$. This method is standard principal component decomposition, a well-known method, which is described, for example, in Principal Component Analysis, I. T. Jolliffe, 2004, ISBN 0-387-95442-2.
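The decomposition may be illustrated, for a single component, by the following sketch, which extracts the leading loading and its scores by power iteration. This is one simple numerical route chosen for illustration, not necessarily the method used in the cited reference. The data are indexed as [spectral channel i][sample j], following the convention above; no mean-centering is applied, which is an assumption of the sketch. The function name and the data are hypothetical.

```python
import math


def first_component(X, iterations=200):
    """Leading loading P (unit norm) and scores T with X[i][j] ~ P[i] * T[j]."""
    n_chan, n_samp = len(X), len(X[0])
    P = [1.0 / math.sqrt(n_chan)] * n_chan   # arbitrary unit-norm start
    for _ in range(iterations):
        # Scores: project every sample onto the current loading.
        T = [sum(P[i] * X[i][j] for i in range(n_chan)) for j in range(n_samp)]
        # Loading update, then renormalize to unit length.
        P = [sum(X[i][j] * T[j] for j in range(n_samp)) for i in range(n_chan)]
        norm = math.sqrt(sum(p * p for p in P))
        P = [p / norm for p in P]
    T = [sum(P[i] * X[i][j] for i in range(n_chan)) for j in range(n_samp)]
    return P, T


# Hypothetical data: one spectrum on three channels, present in four samples.
spectrum = [1.0, 2.0, 2.0]
amounts = [1.0, 2.0, 3.0, 4.0]
X = [[s * c for c in amounts] for s in spectrum]

P, T = first_component(X)

# Largest residue after one component; near zero for these rank-one data.
residue = max(abs(X[i][j] - P[i] * T[j])
              for i in range(len(X)) for j in range(len(X[0])))
```

Because the hypothetical data contain a single compound, one component reproduces them to within rounding; for real mixtures further components would be extracted from the residue in the same manner.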
In general, the loadings $P_{ip}$ are not the spectra of compounds in the mixture. The principal components are not unique in providing a basis for the plane of closest fit to the data, X. The loadings, P, can be rotated in the plane of closest fit, so while the plane is unique, the basis vectors that describe it are not. Therefore, since the P's can be rotated to make a decomposition of equally low error, these spectra can have no inherent relation to the actual spectra of the compounds. Furthermore, it has already been noted that the spectra of the compounds are not, in general, orthogonal, whereas the P's are orthonormal by construction.
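The rotational freedom described above may be written out explicitly. In the following, R denotes an arbitrary orthogonal matrix of dimension equal to the number of components, and the primed quantities denote the rotated loadings and scores; these symbols are introduced here for illustration only.

```latex
\sum_{p=1}^{n_{\mathrm{comp}}} P_{ip} T_{jp}
  = \sum_{q=1}^{n_{\mathrm{comp}}} P'_{iq} T'_{jq},
\qquad
P'_{iq} = \sum_{p} P_{ip} R_{pq},
\quad
T'_{jq} = \sum_{p} T_{jp} R_{pq},
\quad
\sum_{q} R_{pq} R_{p'q} = \delta_{pp'} .
```

The fitted part of the data, and hence the residue, is unchanged by the rotation. The rotated loadings remain orthonormal, whereas the rotated scores are, in general, no longer mutually orthogonal.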
It is possible to use the scores, T, in regression, a technique known as principal components regression. The advantage is that the T's are orthogonal so that the regression will not be ill-conditioned.
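The benefit of orthogonal scores may be seen from the regression itself. In the following, $y_j$ denotes the independently measured concentration in sample j and $b_p$ the regression weight assigned to component p; these symbols are introduced here for illustration, and no intercept term is included, which assumes that the concentrations have been suitably centered.

```latex
y_j \approx \sum_{p=1}^{n_{\mathrm{comp}}} b_p T_{jp},
\qquad
\sum_{j} T_{jp} T_{jq} = 0 \ \ (p \neq q)
\;\Longrightarrow\;
b_p = \frac{\sum_{j} T_{jp}\, y_j}{\sum_{j} T_{jp}^{2}} .
```

Each weight is thus obtained independently of the others, so that no near-singular matrix need be inverted.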
Another example of a technique requiring no a priori knowledge of the spectra of the compounds is that of partial least squares regression, a rigorous discussion of which is found in “The Geometry of Partial Least Squares,” A. Phatak, S. De Jong, Journal of Chemometrics, Vol. 11, pgs. 311-338 (1997). The algorithm minimizes the error of estimation of the spectral data, and of the independently determined concentrations iteratively.
One serious difficulty with the above mentioned techniques is the danger of over-fitting the data. Over-fitting in this context is defined as either training the method based on spurious correlations in the data, or else training the method based on other analytes which are correlated with the target analyte. In either case, the problem arises because the data contains limited information about the true analyte while containing a rich enough set of interfering substances or signals. In practice, the set of data Y which is being fit is not very complex, in the sense that it could be adequately represented mathematically by relatively few parameters. The probability of obtaining spurious correlations with one or more of the components extracted in the decompositions of the data is then not negligible. A calibration constructed from over-fitting will have impaired predictive accuracy for new data. In particular, in the latter case mentioned above, calibration to correlated analytes, the correlation may not be guaranteed to hold under all conditions of interest, while in the former case there is not even the possibility of a correct determination.
The difficulty may be viewed in terms of the algorithm being under-constrained, permitting spurious solutions which nevertheless model the data in the training set. Alternatively, we may say that in such circumstances it may be necessary to make more measurements so that the data being fit becomes increasingly complex. The required number of such measurements may, however, be impractically large.
It is possible to impose constraints on the decompositions of the data, on the basis that the spectra must obey certain rules. A simple case would be to require that the loadings be non-negative. See, for more such examples, “Inclusion of Chemical Constraints in Factor Analysis to Extract a Unique Set of Solutions from Spectroscopic and Environmental Data,” T. Ozeki, and N. Ogawa, Chemometrics and Intelligent Laboratory Systems, Vol. 71, pgs. 61-72 (2004). For mixtures of sufficient complexity, however, these constraints are generally too weak to be sufficient.
It is notable that the techniques that are described above were developed for the case where none of the spectra were known in advance, yet sometimes one or more of the spectra of the compounds in the mixture are in fact known. Such a priori knowledge is in principle very useful in avoiding over-fitting; however, those algorithms which were designed for data sets where no such knowledge was available do not in general have an appropriate way of incorporating specific spectra.
In some cases, it is observed that there are sources of variance in the measured spectra of a mixture which are not related to the variances of the analyte concentration in the mixture. An example of such a case is the Raman spectra of human skin, when it is desired to measure the concentration of one or more analytes such as glucose. It is found that different sites on the skin contain significantly different concentrations of other compounds besides glucose, whose spectra overlap, in part, with that of glucose. Further, there are compositional differences between the skins of individual human beings. Existing algorithms do not provide a means of extracting these irrelevant sources of variance which can confound the measurement of the target analyte.