The present invention relates generally to the evaluation of objects using spectral data, and more particularly, but not exclusively, to a method for evaluating spectrum data to determine whether samples include a particular characteristic.
As used herein, the term xe2x80x9cindexed data setxe2x80x9d or xe2x80x9cspectrumxe2x80x9d refers to a collection of measured values (xe2x80x9cresponsesxe2x80x9d) where each response is related to one or more of its neighbor elements. The relationship between the one or more neighbor elements may be, for example, categorical, spatial, or temporal. In addition, the relationship may be explicitly stated or implicitly understood from knowing the type of response data and/or how such data were obtained. When a unique index, either one-dimensional or multi-dimensional, is assigned (implicitly or explicitly) to each response, the data are considered indexed. One-dimensional indexed data is defined as data in ordered pairs (index value, response). The index values represent values of a parameter such as time, distance, frequency, or category; the response values can include but are not limited to signal intensity, particle or item counts, or concentration measurements. A multi-dimensional indexed data set or spectrum is also ordered data, but with each response indexed to a value for each dimension of a multi-dimensional array. Thus a two-dimensional index has a unique row and column address for each response (index value1, index value2, response).
Spectral/chromatographic data (as that produced by matrix-assisted laser desorption/ionization-mass spectrometry (MALDI-MS) or gas chromatography) is gathered and analyzed to characterize samples. For example, such data sets may be analyzed in an attempt to determine whether or not a known substance is present in the sample. In other applications, the data may be analyzed in an attempt to evaluate whether a chemical or biological process is performing within acceptable bounds. Some existing methods for analysis include pattern recognition techniques and visual interpretation of spectrum plots. Many techniques use principal components analysis, including partial least squares and principal component regression methods.
The identification and/or characterization of significant or useful features in the analysis of indexed data is a classic problem. Often this problem is reduced to separating the desired signal from undesired noise by, for example, identifying peaks that may be of interest. For indexed data, each of such peaks appears as a deviation, that is to say a rise and a fall (or fall and rise), in the responses over consecutive indices. However, background noise can also result in such deviations of responses leading, for example, to the identification of false peaks in indexed data.
Traditionally, peak detection has been based upon identifying responses above a threshold value. Whether this peak detection has been performed manually or by use of an automated tool, threshold selection has been a critical function that has resisted an objectively optimal solution. Thus such previously known methods for threshold selection typically require arbitrary and subjective operator/analyst-dependent decision-making and are therefore an art. The effectiveness of such artful decision-making using these known traditional methods, and peak detection as a result, is also affected by signal-to-noise ratio, signal drift, and other variations in the baseline signal. Consequently, the operator/analyst often has had to apply several thresholds to the responses over different regions of indices to capture as much signal as possible. This approach has been shown to yield results that are not reproducible, to cause substantial signal loss, and to be subject to operator/analyst uncertainty.
An example of the problems with traditional peak detection and characterization algorithms and methods is illustrated by the development of statistical analysis methods for MALDI-MS. The MALDI-MS process begins with an analyte of interest placed on a sample plate and mixed with a matrix. The matrix is a compound selected to absorb specific wavelengths of light that are emitted by a selected laser. Light from the laser is then directed at the analyte mixture causing the matrix material to become ionized. This ionization of the matrix material, in turn, ionizes some molecules of the analyte which become analyte ions 100 (FIG. 1). A charge is applied just beyond the source region to extract ions into flight tube 102 and at a detector 104 to attract analyte ions 100, where detector 104 measures the ionic charge that arrives over a time interval. This measure of charge is converted to an abundance of ions, and the measured flight time of each packet of ions is converted to a mass/charge (m/z) ratio based on flight time measurements of 2-3 known analytes. Since ions 100 arrive at detector 104 in a dispersed packet that spans multiple sampling intervals, ions 100 are binned and counted over several m/z units as illustrated in FIG. 2.
Currently used algorithms require an operator/analyst to specify a detection threshold 200 for the intensities observed so that only peaks 202 that exceed this specified threshold will be detected and characterized. This procedure for setting the detection threshold appears conceptually appealing and suggests that m/z values for which no ions are present will be read as having a baseline relative abundance, while m/z values for which ions are present will result in a peak. However, as a result of this procedure, peaks 202 that are detected for a specific analyte are not only dependent on the MALDI-MS instrument used but also on the skill of the operator/analyst in setting the detection threshold 200 used for the analysis. If such a user-defined threshold 200 is too low, noise might erroneously be characterized as a peak; whereas if threshold 200 is too high, small peaks might erroneously be ignored as noise. Thus the manual setting of detection threshold 200 introduces variability that makes accurate statistical characterization of MALDI-MS spectra difficult. In addition, baseline noise is not constant over the entire data collection window and such variability decreases even further the effectiveness of current peak detection algorithms based on baseline thresholding. Also related to the problem of distinguishing signals from noise is the bounding uncertainty of the signal. It is well known that replicate analyses of a given sample often produce slightly different indexed data due to instrument variability and other factors not tied to an operator/analyst.
The related disclosures cited above provide improved methods of identifying significant features of test spectra. There remains, however, a need for improved methods of testing samples using peak indices and characteristics that are discovered using such techniques. Such applications include qualitative analysis (wherein one attempts to determine whether a sample does or does not contain a particular substance) and process control (wherein one attempts to detect at what point in time a process degrades to an unacceptable state).
The goal of process control in this context is to take sample spectra at given time epochs, and based on those spectra, to determine when the process begins to degrade or fail (i.e., becomes xe2x80x9cout of controlxe2x80x9d). Several techniques have been developed for control of analytical processes. Many of these methods take a series of sample spectra and compare each to the statistical distribution of a reference spectrum to determine if that spectrum falls inside or outside the expected range of variation for an under-control process. Such methods are useful for identifying dramatic changes in a monitored process, but they are generally deficient when processes undergo gradual or subtle changes over time.
An often-used process control method in chemometrics is the Hotelling T2 chart operating on principal components of the original spectrum/chromatogram. (See, e.g., Russell, E. L.; Chiang, L. H.; Braatz, R. D.; Chemometrics and Intelligent Laboratory Systems, vol. 51, pp. 81-93 (2000); Wilkstrom, C.; Albano, C.; Ericksson, L.; Friden, H.; Johansson, E.; Nordahl, A.; Rannar, S.; Sandberg, M.; Kettaneh-Wold, N.; Wold, S.; Chemometrics and Intelligent Laboratory Systems, vol. 42, pp. 221-231 (1998).) The T2 chart is a multivariate alternative to standard univariate process control methods (see, e.g. Zwillinger, Daniel (ed.), Standard Mathematical Tables and Formulae, 30th Ed. (CRC Press, 1996)), which monitor each variable (e.g., principal component or peak) separately. The T2 chart is a powerful alternative to univariate process control methods because a single test can be used to monitor all variables, and correlation between variables can be accounted for using traditional multivariate techniques.
The T2 test is easy to implement, and requires monitoring of only one statistic. The method can be used to monitor dramatic changes in a monitored process, but it is not adept at identifying gradual or subtle changes over time because it has no memory for recent observations-each observation is compared individually to the training sample. The decision for the current observation is not influenced by the test statistic for preceding observations even if they were almost out of control.
Sequential analysis has been developed to overcome this insensitivity to subtle process changes. Conceptually, a sequential test performs a random sample size hypothesis test:
xe2x80x83H0=Spectrum matches reference sample
HA=Spectrum does not match reference sample
on each spectrum in the sequence until an alternate (out of control) decision is made. Whereas the standard chemometrics methods test a single spectrum at a time for degradation or failure, a sequential test relies on the combined information from current and past observations to make this decision. As a result, sequential tests have been proven to be an improvement in the sense that they are more sensitive (i.e., detect process change more rapidly) than their non-sequential counterparts. (See Ghosh, B. K.; Sen, P. K.; Handbook of Sequential Analysis (Marcel Dekker, Inc., N.Y. 1991) for background on sequential testing.)
In 1954, Page developed a univariate sequential test for the purpose of rapidly detecting a change in processes at random time points. (See Page, E.; Biometrika (1954) pp. 100-114.) Page""s test is essentially a modification of Wald""s sequential probability ratio test (SPRT) (cf. Wald, A.; Sequential Analysis (Wiley, 1947)) and is known as the cumulative sum (xe2x80x9cCUSUMxe2x80x9d) procedure. Developed for testing the parameters (such as the mean) of univariate random variables, this procedure has many desirable properties. Nonetheless, success of CUSUTM analysis depends on proper selection of its input parameters and accurate modeling of the underlying random variables. There is, therefore, a need for improved modeling of spectral sequences and analysis thereof in relation to the CUSUM procedure.
One form of the present invention is a unique model for indexed data, applied to determine more accurately whether a sample matches a reference object. Other forms include a unique method for process control based on this unique model wherein a sequence of indexed data sets is analyzed to determine at what point in time the process degrades to an unacceptable state.
In another form of the present invention, one determines whether a sample matches a reference species by selecting N indices lj of peaks in an indexed data set characterizing the reference species; selecting a first set of probabilities pj that peaks will occur at the corresponding indices lj, respectively, of an indexed data set that characterizes the sample when the sample matches the reference species; selecting a second set of probabilities qj that peaks will occur at indices lj, respectively, of an indexed data set that characterizes the sample when the sample does not match the reference species; choosing a threshold Kc; obtaining an indexed observation data set x1, x2, . . . xN, where xj∈{0, 1} and xj=1 if and only if a peak is present in the sample at lj; deciding that the sample matches the reference species if xcexxe2x89xa6Kc where       λ    =                            ∑                      1            ≤            j            ≤            N                          ⁢                  xe2x80x83                ⁢                  log          ⁡                      (                                          1                -                                  p                  j                                                            1                -                                  q                  j                                                      )                              +                        ∑                      1            ≤            j            ≤            N                          ⁢                  xe2x80x83                ⁢                              x            j                    ⁢                      log            ⁡                          [                                                                    p                    j                                    ⁡                                      (                                          1                      -                                              q                        j                                                              )                                                                                        q                    j                                    ⁡                                      (                                          1                      -                                              p                        j                                                              )                                                              ]                                            ;
and deciding that the sample does not match the reference species if xcex greater than Kc.
In another form of the present invention, a method is provided for determining whether a sample matches a reference species. From the peaks in an indexed data set, N indices lj are selected that characterize the reference species. Probabilities pi and qi are selected to reflect that peaks will occur in an indexed data set characterizing the sample when the sample does or does not, respectively, match the reference species. A threshold Kc is chosen, and an indexed observation data set X=x1, x2, . . . xN, where xj∈{0, 1} and xj=1 if and only if a peak is present in the sample at index lj. It is decided that the sample matches the reference species if xcexxe2x89xa6Kc where       λ    =                            ∑                      1            ≤            j            ≤            N                          ⁢                  xe2x80x83                ⁢                  log          ⁡                      (                                          1                -                                  p                  j                                                            1                -                                  q                  j                                                      )                              +                        ∑                      1            ≤            j            ≤            N                          ⁢                  xe2x80x83                ⁢                              x            j                    ⁢                      log            ⁡                          [                                                                    p                    j                                    ⁡                                      (                                          1                      -                                              q                        j                                                              )                                                                                        q                    j                                    ⁡                                      (                                          1                      -                                              p                        j                                                              )                                                              ]                                            ;
and it is decided that the sample does not match the reference species if xcex greater than Kc.
In various implementations of this form the invention, one or more of the xe2x80x9cselectingxe2x80x9d steps is done with iterative proportional scaling, iterative weighed leased squares, or through application of a Lancaster or latent class model.
In another form of the invention, the indices of peaks in an indexed data set characterizing a reference species are selected, and the probabilities that peaks will occur at those indices of an indexed data set that characterizes a sample matching the reference species are selected. A set of probability density functions gi(yi; xcex8i) that characterize a measurable feature yi of the peak at index li (given the presence of a peak at index li) of a data set that characterizes a sample matching the reference species. Probabilities qi that peaks will occur at indices li of an indexed data set that characterizes a non-matching sample, and probability density functions gi(yi; xcexa9i) that characterize the measurable feature yi of the peak at index li (given the presence of a peak at index li) of a data set that characterizes a non-matching sample are selected. A threshold Kc is selected, and an indexed observation data set x1, x2, . . . xN is obtained (where xj ∈{0, 1}, and xj=1 if and only if a peak is present in the sample at lj). A feature data set yi, y2, . . . yN is obtained where yi=0 if no peak is present in the sample at index li, and where yi ∈(0, 1} if a peak is present in the sample at index li. It is decided that the sample matches the reference species if       λ    =                            ∑                      i            =            1                    N                ⁢                  xe2x80x83                ⁢                  [                                    log              ⁢                              xe2x80x83                            ⁢                                                1                  -                                      p                    i                                                                    1                  -                                      q                    i                                                                        +                                          x                i                            ⁢                              {                                                      log                    ⁢                                          xe2x80x83                                        ⁢                                                                                            p                          i                                                ⁡                                                  (                                                      1                            -                                                          q                              i                                                                                )                                                                                                                      q                          i                                                ⁡                                                  (                                                      1                            -                                                          p                              i                                                                                )                                                                                                      +                                      log                    ⁢                                          xe2x80x83                                        ⁢                                                                                            g                          i                                                ⁡                                                  (                                                                                    y                              i                                                        ;                                                          θ                              i                                                                                )                                                                                                                      g                          i                                                ⁡                                                  (                                                                                    y                              i                                                        ;                                                          Ω                              i                                                                                )                                                                                                                    }                                              ]                    ≤              K        c              ,
and it is decided that the sample does not match the reference species if xcex greater than Kc. In various implementations of this invention, gi(xc2x7) is a lognormal, gamma, or Poisson density. In some implementations, the xe2x80x9cmeasurable featurexe2x80x9d is the intensity of the peak at index li, the width of the peak at index li, or a quantification of the skew of the peak at index li.
In another form of the invention, the status of a process at any point t in time is characterized by an indexed observation data set Xt={x1,t, x2,t, . . . xN,t}, where xj,t∈{0, 1}, and xj,t=1 if and only if a peak is present at time t in the sample at index lj. A first set pj and a second set qj of probabilities are selected, where j=1, 2, . . . N, and pj and qj reflect the probabilities that a peak will occur at index lj when the process is or is not (respectively) operating normally. A sequence X1, X2, . . . XT of indexed observation data sets is acquired, and the process is stopped when it is determined that the Cn equals or exceeds a predetermined value A, where
C0=0; 
Cn=Snxe2x88x92min1xe2x89xa6jxe2x89xa6n {Sj} for nxe2x89xa71; and
      S    n    =                    ∑                  1          ≤          j          ≤          N                    ⁢              xe2x80x83            ⁢              log        ⁢                  (                                    1              -                              p                j                                                    1              -                              q                j                                              )                      +                  ∑                  1          ≤          j          ≤          N                    ⁢              xe2x80x83            ⁢                        x                      j            ,            t                          ⁢                              log            ⁡                          [                                                                    p                    j                                    ⁡                                      (                                          1                      -                                              q                        j                                                              )                                                                                        q                    j                                    ⁡                                      (                                          1                      -                                              p                        j                                                              )                                                              ]                                .                    
In one implementation of this form, A is selected as a function of the desired false alarm rate of the test.
In another form of the invention, the status of a process at any point t in time is characterized by an indexed observation data set Xt={x1,t, x2,t, . . . xN,t}, where xj,t ∈{0, 1}, and xj,t=1 if and only if a peak is present at time t in the sample at index lj. A first set pj and a second set qj of probabilities are selected, where j=1, 2, . . . N, and pj and qj reflect the probabilities that a peak will occur at index lj when the process is or is not (respectively) operating normally. A first set of probability density functions gi(yi; xcex8i) and a second set of probability density functions gi(yi; xcexa9i) are selected to characterize a measurable feature yi of the peak at index li (given the presence of such a peak) in a data set that characterizes the sample when the sample matches (or does not match, respectively) the reference species. A sequence X1, X2, . . . XT of indexed observation data sets is acquired, and the process is stopped when it is determined that the Cn equals or exceeds a predetermined value A, where
C0=0;
Cn=Snxe2x88x92min1xe2x89xa6jxe2x89xa6h{Sj} for nxe2x89xa71; and
      S    n    =            ∑              i        =        1            N        ⁢          xe2x80x83        ⁢                  [                              log            ⁢                          xe2x80x83                        ⁢                                          1                -                                  p                  i                                                            1                -                                  q                  i                                                              +                                    x              i                        ⁢                          {                                                log                  ⁢                                      xe2x80x83                                    ⁢                                                                                    p                        i                                            ⁡                                              (                                                  1                          -                                                      q                            i                                                                          )                                                                                                            q                        i                                            ⁡                                              (                                                  1                          -                                                      p                            i                                                                          )                                                                                            +                                  log                  ⁢                                      xe2x80x83                                    ⁢                                                                                    g                        i                                            ⁡                                              (                                                                              y                            i                                                    ;                                                      θ                            i                                                                          )                                                                                                            g                        i                                            ⁡                                              (                                                                              y                            i                                                    ;                                                      Ω                            i                                                                          )                                                                                                        }                                      ]            .      
In one implementation of this form, A is selected as a function of the desired false alarm rate of the test. In some implementations, at least one of the probability density functions gi(xc2x7) is either a normal, lognormal, gamma, or Poisson density function.
Still another form of the present invention is a system for analyzing a sample in comparison with a reference species, comprising a processor, a memory, and a computer-readable medium. The memory stores data indicative of probabilities pj (j=1, 2, . . . N) that peaks will occur at indices lj of an indexed data set that characterizes a sample matching the reference species; data indicative of probabilities qj that peaks will occur at indices lj of an indexed data set that characterizes a sample not matching the reference species; data indicative of a threshold value; and an indexed sample data set xj characterizing the sample, wherein each xj is a binary value that indicates whether or not a peak is present at index li. The computer-readable medium is encoded with programming instructions executable by the processor to calculate a log-likelihood ratio       λ    =                            ∑                      1            ≤            j            ≤            N                          ⁢                  xe2x80x83                ⁢                  log          ⁡                      (                                          1                -                                  p                  j                                                            1                -                                  q                  j                                                      )                              +                        ∑                      1            ≤            j            ≤            N                          ⁢                  xe2x80x83                ⁢                              x            j                    ⁢                      log            ⁡                          [                                                                    p                    j                                    ⁡                                      (                                          1                      -                                              q                        j                                                              )                                                                                        q                    j                                    ⁡                                      (                                          1                      -                                              p                        j                                                              )                                                              ]                                            ,
to generate a first signal when xcex is less than the threshold value, and to generate a second signal when xcex is greater than the threshold value.
In some embodiments of this form, Kc is selected such that, given that the sample matches the reference species, P{xcex greater than Kc}xe2x89xa6xcex1 for a predetermined type I error probability xcex1. In other embodiments, the selecting steps comprise iterative proportional scaling calculations, iterative weighted least squares calculations, or application of a Lancaster model or latent class model.
Yet another form of the present invention is a method of determining whether a sample matches a reference species. The method includes selecting N indices lj of peaks in an indexed data set characterizing the reference species; selecting a first set of probabilities pj that peaks will occur at indices lj, respectively, of an indexed data set that characterizes the sample when the sample matches the reference species; selecting a first set of probability density functions gi(yi; xcex8i) that characterize a measurable feature yi of the peak at index li given the presence of a peak at index li in a data set that characterizes the sample when the sample matches the reference species; selecting a second set of probabilities qj that peaks will occur at indices lj of an indexed data set that characterizes the sample when the sample does not match the reference species; selecting a second set of probability density functions gi(yi; xcexa9i) that characterize the measurable feature yi of the peak at index li given the presence of such a peak in a data set that characterizes the sample when the sample does not match the reference species; selecting a threshold Kc; obtaining an indexed observation data set x1, x2, . . . xN where xi ∈{0, 1} and xi=1 if and only if a peak is present in the sample at li; deciding (a) that the sample matches the reference species if xcexxe2x89xa6Kc where   λ  =            ∑              i        =        1            N        ⁢          xe2x80x83        ⁢          [                        log          ⁢                      xe2x80x83                    ⁢                                    1              -                              p                i                                                    1              -                              q                i                                                    +                              x            i                    ⁢                      {                                          log                ⁢                                  xe2x80x83                                ⁢                                                                            p                      i                                        ⁡                                          (                                              1                        -                                                  q                          i                                                                    )                                                                                                  q                      i                                        ⁡                                          (                                              1                        -                                                  p                          i                                                                    )                                                                                  +                              log                ⁢                                  xe2x80x83                                ⁢                                                                            g                      i                                        ⁡                                          (                                                                        y                          i                                                ;                                                  θ                          i                                                                    )                                                                                                  g                      i                                        ⁡                                          (                                                                        y                          i                                                ;                                                  Ω                          i                                                                    )                                                                                            }                              ]      
or (b) that the sample does not match the reference species if xcex greater than Kc.
In some embodiments of this form, one or more gi(xc2x7) are lognormal densities given by                     g        i            ⁡              (                              y            i                    ;                      θ            i                          )              =                            q          i                ⁡                  (                                                    y                i                            ;                              μ                i                                      ,                          σ              i              2                                )                    =                        1                                    y              i                        ⁢                                          2                ⁢                                  πσ                  2                                                                    ⁢        exp        ⁢                  {                      -                                                            (                                                            log                      ⁢                                              xe2x80x83                                            ⁢                                              y                        i                                                              -                                          μ                      i                                                        )                                2                                            2                ⁢                                  σ                  i                  2                                                              }                      ,            y      i        ≥    0.  
In some other embodiments, one or more gi(xc2x7) are gamma densities given by                     g        i            ⁡              (                              y            i                    ;                      θ            i                          )              =                            q          i                ⁡                  (                                                    y                i                            ;                              α                i                                      ,                          β              i                                )                    =                        1                                    Γ                              α                i                                      ⁢                          β              i                              α                i                                                    ⁢                  y          i                                    α              i                        -            1                          ⁢                  exp          ⁡                      (                                          -                                  y                  i                                            /                              β                i                                      )                                ,            y      i        ≥    0.  
In still other embodiments, one or more gi(xc2x7) are Poisson densities given by                     g        i            ⁡              (                              y            i                    ;                      θ            i                          )              =                            θ                      y            i                          ⁢                  exp          ⁡                      (                          -                              θ                i                                      )                                                y          i                !              ,            y      i        =    0    ,  1  ,  2  ,      …    ⁢          xe2x80x83        .  
In some embodiments of this form of the invention, the measurable feature is the intensity of the peak at index li. In other embodiments, the measurable feature is the width of the peak at index li; while in still other embodiments, the measurable feature is a quantification of the skew of the peak at index li.
In yet another form of the invention, various distinct classes of spectra are created, each with pi and qi (and possibly gi(yi; xcex8i) and gi(yi; xcexa9i)) derived from control spectra. The spectrum for a sample is obtained, and a xcex is calculated for the spectrum relative to the model for each class. The sample is associated with one of the distinct classes using a maximum likelihood approach, i.e., the class for which the xcex obtained.
Still another form of the invention is in the field of cluster analysis, wherein the analysis described herein is applied to identify distinct classes or groups in a set of spectra. In doing so, one may embed the comparison technique disclosed herein in cluster analysis techniques such as k-means, hierarchical, leader, and fuzzy clustering methods.
A further form of the invention is in hypothesis testing, with applications in process control (as discussed in detail herein), simple or composite hypothesis testing, analysis of variance, or other statistical procedures involving multiple comparisons.
Still further forms of the invention will occur to one skilled in the art in light of the disclosure herein.