1. Field of the Invention
This invention relates to a method of analyzing multivariate data generated by an instrument in order to determine whether abnormal features are present. More particularly, this invention relates to an improved method for rapidly identifying instrumentation or process failures in a chemical system.
2. Description of the Prior Art
On-line analytical instrumentation generates data that is used in a wide variety of applications, such as closed-loop control of a process, quality assurance of a product, or environmental and safety functions. Often, this data is in the form of multivariate data such as absorbance readings at various wavelengths, a detector response at various times, or any other set of data that consists of multiple measured values on each individual sample. The reliability of the data depends largely upon the performance of the instrument used to generate the data. If the instrument fails to work properly, the data generated may contain little if any valid information.
Problems with analytical instruments are often first detected when an individual notices that unusual data is being generated. Unexpected peaks, larger than expected noise levels, and baseline offsets are just a few of the features that may lead the individual to question the validity of a chromatogram or spectrum. Monitoring the data as it is generated for the appearance of these unusual features allows for the detection of developing problems before they become severe enough to affect the ongoing analysis. Individuals monitoring the raw chromatogram or spectrum data perform the largely unconscious activity of learning from experience what a "normal" set of data looks like and then deciding whether the present set of data is reasonably similar. Unfortunately, it is not practical to manually monitor all of the data, as many on-line instruments produce more than a thousand sets of data per day.
Repetitive manual tasks, such as monitoring large amounts of produced data, are generally capable of being automated through the use of computers. Furthermore, computers are widely used to collect the data generated by on-line instruments, and so are readily available to perform routine monitoring. Unlike an analyst, however, computers cannot perform any "unconscious" activity. Accordingly, in order to monitor the data for abnormal features, the computer must first be programmed to identify normal features in a spectrum or chromatogram.
The field of study that deals with teaching the computer to emulate the process of learning and recognizing features in data is called pattern recognition. Pattern recognition techniques are typically used to sort sets of data into groups having similar features. In outlier identification, however, only one group is identified: the group defined by the features of a set of multivariate data in which every member is known to be normal.
Outlier identification is accomplished by first teaching the computer to recognize "normal", "acceptable" or "expected" features in multivariate data known to be normal. When a new spectrum or chromatogram is obtained, its features are compared to what is expected. If the data has additional features, or lacks significant features, it is labelled "abnormal", "unacceptable", or an "outlier". Outliers may result from many different causes, such as instrument failures, mechanical problems, or process problems such as impurities in the analyzed materials. Pattern recognition techniques are able to identify any change in the appearance of the data, regardless of its source, whereas simpler systems, which are programmed to signal the operator whenever certain unwanted values are reached, can detect only foreseen problems. Accordingly, when pattern recognition techniques are used, the potential for abnormalities in the data going undetected is reduced.
Principal Component Analysis (PCA) is one procedure that can be used as a pattern recognition technique. PCA will be used below to illustrate the invention, but it should be understood that the present invention can be used with any technique which can model features in the data (e.g., the Partial Least Squares technique; see P. Geladi and B. R. Kowalski, Analytica Chimica Acta, 185, pg. 1, 1986).
One way of describing how PCA works is to think of PCA as reorienting a set of data so that each spectrum or chromatogram becomes a single point in a multidimensional space. The number of measurements which make up the original spectrum or chromatogram defines the number of dimensions in the new coordinate system. A group of calibration chromatograms or spectra which the analyst has determined to be representative of the expected spectra or chromatograms can be placed in this coordinate system, forming a cloud of points in the multidimensional space. PCA mathematically describes this cloud of points using as few dimensions (or principal components) as possible. Residual sets of multivariate data (residuals), which identify the portion of each calibration spectrum or chromatogram that was not contained within the model, are then calculated. The sum of the squares (SS) of the residuals is then compared with the SS of the residuals obtained for unknown samples to see if the unknown samples are within the proper range.
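The residual calculation described above can be illustrated with a minimal sketch. It is not the method of any cited reference. It assumes NumPy, synthetic single-peak "spectra", a one-component model fitted without mean-centering, and an acceptance limit taken as the largest calibration SS. The names fit_loadings, residual_ss, and limit are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_loadings(calibration, n_components):
    """Return the leading principal directions (as rows) of the calibration set.

    Assumption: the data are not mean-centered before the decomposition.
    """
    _, _, vt = np.linalg.svd(calibration, full_matrices=False)
    return vt[:n_components]

def residual_ss(samples, loadings):
    """Sum of squares of the part of each sample the model does not describe."""
    modeled = samples @ loadings.T @ loadings   # projection onto the model
    return ((samples - modeled) ** 2).sum(axis=1)

# Synthetic calibration set: one peak of varying height plus small noise.
rng = np.random.default_rng(0)
channels = np.arange(50)
peak = np.exp(-0.5 * ((channels - 25) / 4.0) ** 2)
heights = rng.uniform(0.8, 1.2, size=(30, 1))
calibration = heights * peak + rng.normal(0, 0.01, size=(30, 50))

loadings = fit_loadings(calibration, n_components=1)
calibration_ss = residual_ss(calibration, loadings)
limit = calibration_ss.max()   # hypothetical acceptance limit

# Two unknowns: one normal, one carrying an unexpected extra peak.
normal_sample = peak + rng.normal(0, 0.01, size=50)
extra_peak = 0.5 * np.exp(-0.5 * ((channels - 10) / 2.0) ** 2)
abnormal_sample = normal_sample + extra_peak

unknown_ss = residual_ss(np.vstack([normal_sample, abnormal_sample]), loadings)
is_outlier = unknown_ss > limit
```

In this sketch the extra peak lies almost entirely outside the one-component model, so it survives in the residual and raises the SS of the second unknown well above the calibration range.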
The current uses of the PCA method for outlier detection are only concerned with the SS of the residual spectra (see Gerd Puchwein and Anton Eibelhuber, "Outlier Detection in Routine Analysis of Agricultural Grain Products by Near-Infrared Spectroscopy", Analytica Chimica Acta, 223, pp. 95-103, 1989). This is analogous to using the square of the Euclidean distance of the residual spectrum from the origin. Theoretically, the PCA model could be constructed to take into account all features of the calibration set. This would result in residual spectra randomly distributed about the origin, as the residual spectra would contain only random noise. In this situation the SS of the residual spectra is an appropriate measure of normality. Experience has shown, however, that when more principal components are added to the model in order to describe every feature in the calibration chromatograms or spectra, the model becomes too close a fit of the calibration set data; the model begins to fit the noise in the data (overfitting). When this happens, unknown samples which should be classified as normal will be classified as outliers because their noise structure will not be identical to the noise structure of the members of the calibration set. Therefore, a better approach would be to use fewer principal components in the modeling, and allow relatively small features to remain in the residual spectra. In this situation the residual spectra are not distributed about the origin, and therefore the SS of the residual spectra is no longer an appropriate measure of the acceptability of the spectra. To take into account the location of the residual spectra relative to the origin, the average of the residual spectra is used as a reference point rather than the origin.
Consequently, this approach avoids overfitting by reducing the number of principal components and increases the sensitivity for detecting abnormal features or outliers by using the average residual spectrum as a reference point.
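The effect of the reference point can be illustrated with a minimal sketch. To keep it short, the residuals are simulated directly rather than obtained from a fitted model: each normal residual is assumed to contain the same small feature plus noise, standing in for a feature deliberately left out of a model with few principal components. The names ss_about, small_feature, limit_origin, and limit_mean are hypothetical, and the limits are simply the largest calibration values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_channels = 40
channels = np.arange(n_channels)
noise_sd = 0.005

# Hypothetical small feature left unmodeled, so it appears in every
# normal residual and shifts the residual cloud away from the origin.
small_feature = 0.05 * np.exp(-0.5 * ((channels - 12) / 2.0) ** 2)

calibration_residuals = small_feature + rng.normal(0, noise_sd, size=(25, n_channels))
mean_residual = calibration_residuals.mean(axis=0)
origin = np.zeros(n_channels)

def ss_about(residual, reference):
    """Sum of squares of a residual measured from a chosen reference point."""
    return float(((residual - reference) ** 2).sum())

# Hypothetical limits: the largest value seen in the calibration set.
limit_origin = max(ss_about(r, origin) for r in calibration_residuals)
limit_mean = max(ss_about(r, mean_residual) for r in calibration_residuals)

# A normal unknown, and an abnormal unknown that lacks the expected feature.
normal_residual = small_feature + rng.normal(0, noise_sd, n_channels)
missing_feature_residual = rng.normal(0, noise_sd, n_channels)

flagged_from_origin = ss_about(missing_feature_residual, origin) > limit_origin
flagged_from_mean = ss_about(missing_feature_residual, mean_residual) > limit_mean
```

Under these assumptions the sample lacking the expected feature sits closer to the origin than any normal residual, so the SS measured from the origin does not flag it, whereas the SS measured from the average residual does.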
Furthermore, the PCA method cited above has been applied only to near-infrared spectra. Further difficulties are encountered when extending the known pattern recognition techniques to chromatographic applications. In chromatographic applications, some features of the data are expected to change over time. For example, in gas chromatography, changes in the baseline become more prevalent as the column ages. If the baseline offset or shape changes, all of the data being produced will be labelled outliers, even though valid peak data is being generated.