Spectral library searching is a commonly used method to identify chemical species in a sample. Traditionally, this is achieved by supplying a collection of spectra of known materials (the spectral library), a spectrum S of an unknown, a searching algorithm, and a matching criterion C. The searching function ƒ compares the unknown spectrum with each known candidate Lᵢ in the library to calculate a matching index Pᵢ,

$$P_i = f(S, L_i) \qquad \text{(Equation 1)}$$

and the candidates with matching indices above the criterion C are deemed the likely identity of the unknown.
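The search loop described above can be sketched as follows. This is a minimal illustration only: spectral correlation is assumed as the searching function ƒ, and the library contents, the spectra, and the criterion value 0.95 are hypothetical.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length spectra (one choice of f)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den

def search_library(unknown, library, f=correlation, criterion=0.95):
    """Return (name, P_i) pairs with P_i = f(S, L_i) above C, best match first."""
    scored = [(name, f(unknown, spectrum)) for name, spectrum in library.items()]
    hits = [(name, p) for name, p in scored if p > criterion]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical 5-point spectra, for illustration only.
library = {
    "material_A": [0.0, 1.0, 5.0, 1.0, 0.0],
    "material_B": [3.0, 1.0, 0.0, 1.0, 3.0],
}
unknown = [0.1, 1.1, 4.8, 0.9, 0.0]

matches = search_library(unknown, library)
for name, index in matches:
    print(name, round(index, 4))
```

Any of the other matching functions (Euclidean distance, dot product, and so on) could be passed as `f`, provided the comparison against C is oriented so that a better match gives a larger index.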
There are a multitude of well-known search algorithms, for example, methods based on spectral correlation, Euclidean distance, least squares (see S. R. Lowry, “Automated Spectral Searching in Infrared, Raman and Near-Infrared Spectroscopy”, J. Wiley & Sons, pp 1948-1961), sum of absolute differences, and vector dot product (see J. B. Loudermilk et al, “Novel Search Algorithms for a Mid-Infrared Spectral Library of Cotton Contaminants”, Applied Spectroscopy, Volume 62, Number 6, 2008).
These correlation-based methods share a common shortcoming: they do not consider the probability distribution of the spectral variables in the target materials or the sample, and therefore fail to answer the question “what is the probability that the sample has the same composition as a target material”.
Questions such as this fall into the domain of statistical inference and can be addressed by statistical analysis of data representing the target materials and the sample. Specifically, given an observation y of a sample, n possible candidates (targets, or target materials) with corresponding means μᵢ (i = 1 to n) and statistical distributions Σᵢ, and the hypothesis Hᵢ: y = μᵢ, the quantity being sought is P(Hᵢ|y), i.e. the exclusive likelihood that the sample is none other than candidate i given evidence y. This is different from P(y|Hᵢ), which represents the probability of observing a result equal to or more extreme than y under the hypothesis Hᵢ: y = μᵢ. P(y|Hᵢ) is the so-called p-Value for candidate i. Bayes' theorem gives the relationship between the two:
$$P(H_i \mid y) = \frac{P(y \mid H_i) \cdot P(H_i)}{\sum_{j=1}^{n} P(y \mid H_j) \cdot P(H_j)} \qquad \text{(Equation 2)}$$

where P(Hᵢ) is the prior probability of the sample being candidate i, that is, the probability without evidence y. In contrast, P(Hᵢ|y), the probability of the sample being candidate i and not anything else after considering the evidence y, is called the posterior probability.
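As a worked illustration of Equation 2, the sketch below turns a list of per-candidate p-Values P(y|Hᵢ) into posterior probabilities. The function name, the default of equal priors, and the three p-Values are assumptions made for the example.

```python
def posterior(p_values, priors=None):
    """Apply Bayes' theorem (Equation 2) to per-candidate p-Values P(y|H_i)."""
    n = len(p_values)
    if priors is None:
        # Assumption: with no other information, every candidate is equally likely.
        priors = [1.0 / n] * n
    joint = [p * prior for p, prior in zip(p_values, priors)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("all candidates have zero probability; posterior undefined")
    return [j / total for j in joint]

# Hypothetical p-Values for three library candidates.
p_values = [0.60, 0.05, 0.001]
post = posterior(p_values)
for i, value in enumerate(post, start=1):
    print(f"P(H_{i}|y) = {value:.4f}")
```

Because the denominator sums over the library only, the posterior is relative to the candidates supplied: it says which candidate is favoured among them, not whether any of them is correct.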
The prior probability is assigned based on prior beliefs, and can be evenly divided, P(Hᵢ) = 1/n, or weighted by other properties such as material state, color, etc. Thus the key to the problem is solving for the p-Value. The data y may contain a single variable or multiple variables, and the corresponding statistical methodology falls into the category of univariate or multivariate analysis, respectively. Univariate analysis is simple, but is based on very limited information. Spectroscopic methods are multivariate techniques that measure a large number of variables and can therefore provide more reliable answers. Theoretically, if the intensity distribution of the spectrum representing a target material is known, the p-Value of an observed spectrum y can be calculated. In reality, however, typical spectra contain hundreds to tens of thousands of wavelength elements, and ascertaining a distribution of such high dimension would require an impractically large number of spectra (the so-called “curse of dimensionality”). Often, it is assumed that all of these variables follow a normal distribution, so that the spectral vector follows a multivariate normal distribution. Then, with a known mean spectral vector μ and a population covariance matrix Σ, the probability density function for a measured spectral vector y of dimension q is given by
$$g(t) = \frac{1}{\sqrt{(2\pi)^{q}}\,|\Sigma|^{1/2}}\, e^{-t/2} \qquad \text{(Equation 3)}$$

$$t \equiv Z^2 = (y-\mu)^{T}\, \Sigma^{-1}\, (y-\mu) \qquad \text{(Equation 4)}$$
where Σ⁻¹ and |Σ| are the inverse matrix and determinant of Σ, respectively.
Z² is the so-called Mahalanobis distance and follows χ²(q), a chi-squared distribution with q degrees of freedom (DoF). Therefore, under the hypothesis that y is a representation of the target material, the probability of obtaining a measured spectrum equal to or more extreme than y, i.e. the p-Value, can be calculated as the cumulative probability from Z² to ∞:

$$p\text{-Value} = \int_{Z^2}^{\infty} g(t)\,dt \qquad \text{(Equation 5)}$$

A lower p-Value indicates a less likely occurrence.
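A small numerical sketch of the Mahalanobis distance and its p-Value follows, restricted to q = 2 so that the matrix inverse can be written out by hand. For two degrees of freedom the upper-tail probability of χ²(2) has the closed form exp(−Z²/2); that shortcut does not hold for other q. The mean, covariance, and measurement values are hypothetical.

```python
import math

def mahalanobis_sq_2d(y, mu, cov):
    """Z^2 = (y - mu)^T cov^-1 (y - mu) for two variables (Equation 4)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    # Explicit inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    r = [y[0] - mu[0], y[1] - mu[1]]
    return (r[0] * (inv[0][0] * r[0] + inv[0][1] * r[1])
            + r[1] * (inv[1][0] * r[0] + inv[1][1] * r[1]))

def p_value_2dof(z2):
    """Upper-tail probability of chi-squared with 2 DoF (valid for q = 2 only)."""
    return math.exp(-z2 / 2.0)

# Hypothetical target statistics and one measurement.
mu = [10.0, 5.0]
cov = [[4.0, 1.0], [1.0, 1.0]]
y = [12.0, 6.0]

z2 = mahalanobis_sq_2d(y, mu, cov)
p = p_value_2dof(z2)
print(f"Z^2 = {z2:.4f}, p-Value = {p:.4f}")
```

For realistic q, a numerical library would normally be used for both the inverse and the chi-squared tail; the hand-written versions here only keep the example self-contained.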
In practice, the mean spectral vector μ is estimated by ȳ, the average of n measured spectra after some normalization processes, the population covariance matrix Σ is replaced by the sample covariance matrix S, and the quantity

$$T^2 = (y-\bar{y})^{T}\, S^{-1}\, (y-\bar{y}) \qquad \text{(Equation 6)}$$

follows the Hotelling distribution. The p-Value can be calculated as the cumulative probability from T² to ∞. The computation of S⁻¹ and |S|, however, requires S to be non-singular, which in turn requires at least q measured spectra, still a prohibitively expensive undertaking. Even if such data is available, the fact that all spectra belong to the same material means that many variables are highly correlated. To those skilled in the art of linear algebra, it is obvious that such correlations among variables would make |S| essentially 0, rendering the probability density function and the p-Value unstable or indeterminable.
Therefore, it is a central problem in multivariate analysis to identify and deal with highly correlated variables. One oversimplified approach is to assume that all the variables are independent, so that all off-diagonal elements of S are set to 0 and the computation of S⁻¹ and |S| becomes straightforward. Such simplification is appropriate only if all the variables vary independently of each other, as when variations are limited to random noise such as signal shot noise, detector dark noise, readout noise, etc. In reality, this is rarely the case, as the measurement conditions or the sample itself can impose variations that are highly correlated among certain variables. For example, in Raman spectroscopy, relative peak intensities can be affected by excitation polarization, sample focus position, sample orientation, etc.; in near-infrared spectroscopy, such variations can be induced by sample temperature, particle size, pathlength, etc. As relative peak intensity changes, the intensities of wavelength elements that belong to the same peak often vary in unison. Treating such variations as uncorrelated would produce wrong p-Values.

As an example, consider a simple case where a spectrum consists of 2 peaks, each covering a segment of 10 wavelength elements of equal intensity. As each of the 10 variables within either peak is completely correlated with the other 9, they should be combined into a single variable, resulting in a total of 2 variables, each corresponding to one peak. The Mahalanobis distance Z²(2) follows χ²(2). However, simply treating all 20 variables as independent would result in a Mahalanobis distance Z²(20) = 10·Z²(2) following χ²(20). For Z²(2) = 1.0, the p-Values calculated for χ²(2) and χ²(20) are 0.606 and 0.968, respectively. Using a rejection criterion of α = 0.05, both p-Values pass the test. However, for Z²(2) = 4.0, the p-Values calculated for χ²(2) and χ²(20) are 0.135 (pass) and 0.005 (fail), respectively.
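The two-peak example can be checked numerically. The sketch below relies on the fact that, for an even number of degrees of freedom 2k, the upper-tail probability of χ² at x equals exp(−x/2) · Σ (x/2)ⁱ/i! for i = 0 to k−1; it is not valid for odd DoF. The function name is an assumption for the example. Evaluated this way, the four p-Values come out at roughly 0.607, 0.968, 0.135 and 0.005.

```python
import math

def chi2_sf_even(x, dof):
    """Upper-tail probability of chi-squared(dof) at x, for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form requires a positive even DoF")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i      # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

alpha = 0.05
for z2_grouped in (1.0, 4.0):
    # 2 grouped variables versus 20 variables wrongly treated as independent.
    p_grouped = chi2_sf_even(z2_grouped, 2)
    p_naive = chi2_sf_even(10.0 * z2_grouped, 20)
    print(f"Z^2(2) = {z2_grouped}: "
          f"chi2(2) p = {p_grouped:.4f} ({'pass' if p_grouped > alpha else 'fail'}), "
          f"chi2(20) p = {p_naive:.4f} ({'pass' if p_naive > alpha else 'fail'})")
```

The same measurement therefore passes or fails at α = 0.05 depending only on whether the correlated wavelength elements were grouped, which is the point of the example.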
Various variable reduction techniques exist that identify such correlated variables and group them together as a single component, thus reducing the dimension of the problem to a manageable level. Principal component analysis (PCA) is one such method, well known to those skilled in the art of chemometric spectral analysis. In PCA, a number of spectra of a target material are acquired, and the covariance matrix is used to derive the m eigenvectors corresponding to the m largest eigenvalues. By linearly combining the q variables into m (m << q) mutually orthogonal principal components (PCs) that explain the majority of the variance in the covariance matrix, each original spectrum of dimension q is transformed into a new one of dimension m, represented by m scores. The covariance matrix S is reduced from q×q to m×m. Furthermore, because these PCs are uncorrelated, the new sample covariance matrix S simplifies to a diagonal matrix. The model, consisting of the average spectrum, the m PCs and their eigenvalues, is then tested against any measured spectrum y to determine its p-Value by calculating the new Mahalanobis distance in the score space, now called the score distance (SD). However, a major drawback of PCA is that the loadings of the original q variables in the PCs are heavily weighted toward those that exhibit large variations in the training spectra, and the wavelength regions that exhibit little change are essentially discounted. If a test spectrum happens to have extra peaks in such regions, for example due to contaminants, the p-Value will not decrease significantly, causing false positive errors.
Classification methods such as Soft Independent Modeling of Class Analogy (SIMCA) compensate for this deficiency by considering the orthogonal distance (OD), which is the residual variance not explained by the PCA model. However, since the OD contains contributions from a potentially large number, up to (q − m), of independent variables of different magnitudes, it is impossible to estimate its distribution without a large number of samples. Therefore, there is no established statistical model describing the combination of SD and OD. Pomerantsev proposed that the OD follows a χ² distribution whose DoF is calculated from the mean and standard deviation of the OD over a relatively small number of measurements. In practice, the DoF obtained in this way is often quite large and unstable, making the method untrustworthy.
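The behaviour of the score distance and the orthogonal distance can be illustrated with a toy one-component PCA model. This is only a sketch under stated assumptions: the four-variable spectra are hypothetical, a single PC (m = 1) is extracted by power iteration as a stand-in for a proper eigen-solver, and both distances are reported in squared form.

```python
import math

def mean_spectrum(spectra):
    return [sum(col) / len(spectra) for col in zip(*spectra)]

def covariance(spectra, mean):
    """Sample covariance matrix (q x q) of the training spectra."""
    n, q = len(spectra), len(mean)
    centered = [[x - m for x, m in zip(s, mean)] for s in spectra]
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(q)] for i in range(q)]

def first_pc(cov, iterations=200):
    """Leading eigenvector and eigenvalue by power iteration (assumes cov != 0)."""
    q = len(cov)
    v = [1.0 / math.sqrt(q)] * q
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(q)) for i in range(q)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(q) for j in range(q))
    return v, eigenvalue

def score_and_orthogonal_distance(y, mean, pc, eigenvalue):
    """Squared score distance (in-model) and squared orthogonal distance (residual)."""
    r = [a - b for a, b in zip(y, mean)]
    score = sum(a * b for a, b in zip(r, pc))
    sd2 = score * score / eigenvalue
    residual = [a - score * b for a, b in zip(r, pc)]
    od2 = sum(x * x for x in residual)
    return sd2, od2

# Hypothetical training set: variables 0 and 1 vary in unison, 2 and 3 never change.
training = [[1.0 + d, 2.0 + 2.0 * d, 0.5, 0.5] for d in (-0.2, -0.1, 0.0, 0.1, 0.2)]
mean = mean_spectrum(training)
pc, eigenvalue = first_pc(covariance(training, mean))

in_model = [mean[0] + 0.3, mean[1] + 0.6, mean[2], mean[3]]    # same kind of variation
contaminated = [mean[0], mean[1], mean[2], mean[3] + 1.0]      # extra peak in a quiet region

sd2_in, od2_in = score_and_orthogonal_distance(in_model, mean, pc, eigenvalue)
sd2_c, od2_c = score_and_orthogonal_distance(contaminated, mean, pc, eigenvalue)
print(f"in-model:     SD^2 = {sd2_in:.3f}, OD^2 = {od2_in:.3f}")
print(f"contaminated: SD^2 = {sd2_c:.3f}, OD^2 = {od2_c:.3f}")
```

In this toy case the contaminated spectrum has a score distance of essentially zero, so a p-Value based on the score distance alone would not flag it; only the orthogonal distance registers the extra peak.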
Another problem with PCA-based methods for p-Value calculation is that they can only account for spectral variations that are captured in the training data (the model). Variations that fall outside the model but nevertheless belong to the target material would be considered outliers, resulting in false negatives. To avoid false negatives, a robust PCA model typically requires the collection of a large number of spectra of the target material to capture as much variation as possible.
U.S. Pat. No. 7,254,501 B1 by C. D. Brown et al. disclosed a method that takes into account the precision state of the unknown spectrum Σ_S, the precision state of the library spectrum Σᵢ, and other information such as sample form, color, and odor, collectively codified as Ψ, thereby providing a probability-based matching index. However, Brown's method does not provide a means of variable reduction to deal with highly correlated variables, and will therefore run into the problems described previously, that is, either the singularity problem of the covariance matrix S or unreliable calculated p-Values.
What is needed, therefore, is a spectral analysis method that incorporates a variable reduction technique and can answer the question “what is the probability that the sample has the same composition as a target material”. Specifically, the method shall provide a means of calculating the p-Value that overcomes the aforementioned problems. To be practical, such a method should not require the collection of a large number of spectra of the target material or the test sample. To be useful, it should be robust enough to handle spectral variations of the target material and the test sample, and specific enough to differentiate materials having similar but statistically different spectral signatures.