1. Field of the Invention
The invention is related to the field of data processing, and in particular, to classifying features in time series data.
2. Statement of the Problem
The analysis of time series data plays a fundamental role in science and engineering. An important analysis step is the identification and classification of various features in the data. Quality control can be viewed as a subclass of general feature identification and classification, for example, differentiating between a true signal and a contaminating signal. Many algorithms exist for the quality control of time series data, such as Fourier or wavelet analysis, as well as robust and standard statistics. However, for other classification problems, image processing techniques have been used to great advantage. Human analysts are adept at feature identification and classification; nevertheless, in many applications it is desired to have an automated algorithm that performs this role.
In time series data, the image that the analyst considers is simply a plot of the time series. Subconsciously, the analyst identifies clusters of points and correlation structures, and also uses a priori knowledge related to the structure of features in the data. Further transformations and subsequent images of the data are often useful in performing these tasks, such as plotting on different scales and creating histograms and correlation scatter plots. Additionally, the analyst tends to think of data quality in terms of a probability, i.e., the level to which a datum is good or bad. Another important technique the analyst uses is a combination of local and global analyses. For instance, an isolated outlier in the data is easily detected by the analyst looking on a local scale. However, for numerous consecutive outliers, the analyst must consider the data over a larger scale to identify the sequence as outliers.
Typical outlier detection and quality control algorithms are Boolean in nature. That is, they indicate that a data point is either good or bad. Data points that are very bad are grouped with data points that fall just below the “good” threshold. Furthermore, typical outlier detection and quality control algorithms tend to use strong a priori assumptions, and usually rely on a single test or method.
Most time series analysis methods perform on either a local or global scale. For instance, the running median is an example of a local algorithm over the scale of the median window, whereas typical histogram methods use the data over a longer time scale. FIGS. 1 and 2 illustrate how an algorithm can work well on one time scale but fail on another. FIG. 1 shows actual time series data where the instrument was failing. The top plot shows the data coded by a confidence index (high confidence to low confidence correlates respectively to circle, square, triangle, and cross). The confidence in this case was calculated using statistics from a global histogram. Notice that the data in the primary mode are given a high confidence value (circles), while the excursions from the main mode are assigned low confidence values (crosses). This algorithm does a good job of flagging the most egregious outliers, but at the same time, valid peaks in the data are given low confidence values. Of course, these peaks can be given higher confidence values by changing parameters in the algorithm; however, this change would also raise the confidence of some of the outliers.
The lower plot in FIG. 1 shows the same data overlaid with a 30 point running median line. The running median does an excellent job of eliminating the outliers in the center right of the plot; however, it fails for the “dropouts” on the left hand side. This results from “saturation” of the filter, i.e., when over half the window length of data are outliers.
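The saturation effect can be reproduced with a minimal sketch. The series below is synthetic (a steady 17 m/s signal with two runs of dropouts near 1 m/s), and the 9 point window, the helper name `running_median`, and the run lengths are illustrative assumptions rather than values taken from FIG. 1.

```python
from statistics import median

def running_median(values, window):
    """Centered running median; the window is truncated at the series ends."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

# Hypothetical series: steady signal with a short and a long run of dropouts.
signal = [17.0] * 60
for i in range(10, 12):   # short run: 2 outliers, well under half the window
    signal[i] = 1.0
for i in range(30, 38):   # long run: 8 outliers, more than half the window
    signal[i] = 1.0

smoothed = running_median(signal, window=9)

# The short run is removed, but inside the long run the filter saturates
# and simply reproduces the dropout value.
print(smoothed[10], smoothed[34])
```

Once the outliers occupy more than half of the window, the median itself is an outlier, which is the failure described for the left hand side of FIG. 1.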
FIG. 2 illustrates two sequences of data which have identical distributions. The upper left hand plot is simply a sigmoid function with small uniform fluctuations. The upper right hand plot is a histogram of this data. The lower left hand plot shows the data from the upper left hand plot re-ordered in a random manner. Suppose a global histogram method was used on these two examples. The algorithm would correctly identify many of the points in the lower left hand plot as outliers; however, for the data in the upper left hand plot, many of the points would incorrectly be identified as outliers.
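The point of FIG. 2 can be illustrated with a short sketch: a global histogram cannot distinguish an ordered sequence from a shuffled copy of it, even though the two look very different as time series. The sigmoid parameters, noise amplitude, bin layout, and helper names below are assumptions chosen for illustration only.

```python
import math
import random

def histogram(values, lo, hi, bins):
    """Count values in equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

def mean_abs_step(values):
    """Average absolute change between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)

rng = random.Random(0)
n = 200
# Sigmoid with small uniform fluctuations (assumed shape and amplitude).
ordered = [1.0 / (1.0 + math.exp(-(i - n / 2) / 10.0)) + rng.uniform(-0.02, 0.02)
           for i in range(n)]
shuffled = ordered[:]
rng.shuffle(shuffled)

h_ordered = histogram(ordered, -0.05, 1.05, 20)
h_shuffled = histogram(shuffled, -0.05, 1.05, 20)

# Identical histograms, very different local behavior.
print(h_ordered == h_shuffled)
print(mean_abs_step(ordered) < mean_abs_step(shuffled))
```

Any score computed only from the global histogram assigns the same value to a given datum in both sequences, so it cannot be right for both.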
The National Center for Atmospheric Research (NCAR) is developing a terrain-induced wind turbulence and wind shear warning system for the aviation community in Juneau, Alaska. As part of this system, pairs of anemometers, which measure the wind every second, are located on nearby peaks and around the runways. For operational purposes, a requirement is to produce reliable one minute averaged wind speeds, wind speed variances, wind speed peak values, and average wind directions. Since these values are updated every minute, it is possible to perform extensive calculations on the data. In general, the anemometers are highly reliable; however, there are cases where the sensors make erroneous measurements. Since the mountain-top sensors are sometimes inaccessible, it is important to differentiate between good and bad data even when an instrument is failing. For example, the strong winds encountered in Juneau have been known to vibrate and then loosen the nuts holding the anemometers in place. An example data set from an anemometer exhibiting this problem is shown in FIG. 3. The actual wind speed as measured by the anemometer varies around the range of about 17 m/s. The horizontal axis is time in seconds. Data “dropouts” caused by the mechanical failure can be seen intermittently in the data, centered near 1 m/s. FIG. 4 is data for the same time interval from a second anemometer in close proximity (3 meters) to the first. As can be seen from the plots, the data dropouts are not present in FIG. 4; hence the dropouts are an artifact of a mechanical failure and not caused by turbulent structures in the wind.
Other failure modes can be caused by icing of the anemometer or shielding from certain wind directions by ice build-up. Furthermore, it is known from video footage that certain wind frequencies excite normal modes of the wind direction head and can cause the device to spin uncontrollably. Data from such a case can be seen in FIG. 5, where the vertical axis is wind direction measured in a clockwise direction from North. The horizontal axis is again time measured in seconds. Between about 500 seconds and 1000 seconds the wind direction measuring device is spinning and the data become essentially a random sample of a uniform distribution between about 50 degrees and 360 degrees. The true wind direction is seen as intermittent data at about 225 degrees, which is in general agreement with the value from the nearby anemometer. FIG. 6 shows the wind direction at another time distinct from that in FIG. 5, where in this example, the true wind direction is around 40 degrees. Notice the suspicious streaks in the time series data near 200 degrees.
In the context of these anemometer examples, the crux of the quality control problem is to determine which data points are “bad” (not part of the atmospheric data) and which data points are “good” (part of the atmospheric data). Separating the good data from the bad can be especially difficult when some bad data points have characteristics of good points. For example, during an episode of highly changing, gusty winds there may be sensor problems that manifest in a way that is similar to valid wind gusts, such as some of the dropout data in FIG. 3. Consequently, the problem is to identify the suspect data without mislabeling similar looking good data.
Time series algorithms such as Auto-Regressive Moving Average (ARMA) may be used to remove isolated outliers in stationary data. Data are used to compute model coefficients and variance estimates. If the point in question is a large distance from the model prediction in terms of the estimated variance, such a point may be called an outlier. A similar technique is the least square adaptive polynomial algorithm (LSAP) or discounted least squares. For data containing more than isolated outliers, it is necessary to use so-called robust techniques to compute the model parameters. This is because numerous outliers may cause a large error in the parameter estimates and an ARMA method for finding outliers could break down. These robust techniques are much less sensitive to numerous outliers in the data. However, even robust methods have what are called breakdown points. For example, if a running median is applied to the data, and more than 50% of the data are outliers, this robust technique could fail. There are other robust techniques, but if a long string of data contains only outliers, for instance when a sensor fails, even a sophisticated technique may fail. Since there are cases in the Juneau data where the assumptions inherent in the aforementioned techniques are violated, a new method is required to correctly quality control these time series data.
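The model-based test described above can be sketched as follows. For brevity the sketch substitutes a first-order autoregressive model fitted by ordinary least squares for a full ARMA model; the synthetic series, the function names `fit_ar1` and `flag_outliers`, and the threshold of four estimated standard deviations are all assumptions made for illustration.

```python
import math
import random

def fit_ar1(x):
    """Ordinary least squares fit of x[t] = c + phi * x[t-1]."""
    prev, curr = x[:-1], x[1:]
    mp, mc = sum(prev) / len(prev), sum(curr) / len(curr)
    sxx = sum((p - mp) ** 2 for p in prev)
    sxy = sum((p - mp) * (q - mc) for p, q in zip(prev, curr))
    phi = sxy / sxx
    return mc - phi * mp, phi

def flag_outliers(x, k=4.0):
    """Flag samples whose prediction error exceeds k estimated std deviations."""
    c, phi = fit_ar1(x)
    resid = [x[t] - (c + phi * x[t - 1]) for t in range(1, len(x))]
    sigma = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))
    return [t for t, r in zip(range(1, len(x)), resid) if abs(r) > k * sigma]

# Hypothetical stationary series near 17 m/s with one isolated dropout.
rng = random.Random(1)
x = [17.0]
for _ in range(299):
    x.append(17.0 + 0.8 * (x[-1] - 17.0) + rng.gauss(0.0, 0.3))
x[150] = 1.0

flagged = flag_outliers(x)
print(flagged)
```

Note that the coefficients and the variance estimate are computed from the contaminated data. An isolated dropout is still caught, but as the number of outliers grows, both estimates degrade, which is the breakdown the paragraph describes.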
A powerful tool for this integration of indicators is fuzzy logic. When creating a fuzzy logic algorithm, the developer must determine what characteristics and rules a human expert might use to categorize the data. These characteristics, or indicators, which are either calculated or measured directly from the data, are input fields for the membership functions. The membership functions return a membership value. In effect, the membership functions rescale the input fields to a common scale so they can be combined effectively by the fuzzy rules. The fuzzy rules are a set of conditional statements that assign a final output value to a fuzzy algorithm given a certain set of input values. Suppose that a fuzzy logic algorithm requires two inputs A and B. A fuzzy rule for this hypothetical algorithm could be: “when membership value A is large and membership value B is small then the output is large.” Additionally, there are other methods that can be used to combine the membership values.
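A minimal sketch of the hypothetical two-input rule follows. It assumes piecewise-linear membership functions, the common minimum operator for "and", and the complement (one minus the membership) for "small"; the input ranges and the names `ramp` and `rule_output` are illustrative and are not taken from the source.

```python
def ramp(x, lo, hi):
    """Piecewise-linear membership: 0 at or below lo, 1 at or above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def rule_output(a_raw, b_raw):
    """'When A is large and B is small then the output is large.'"""
    a_large = ramp(a_raw, 0.0, 10.0)        # rescale indicator A to [0, 1]
    b_small = 1.0 - ramp(b_raw, 0.0, 5.0)   # 'small' as complement of 'large'
    return min(a_large, b_small)            # fuzzy AND taken as the minimum

print(rule_output(10.0, 0.0))   # A large, B small
print(rule_output(10.0, 5.0))   # A large, B large
print(rule_output(5.0, 2.5))    # both halfway
```

Because both indicators are first mapped to the same zero-to-one scale, they can be combined even when their raw units differ. Other combination operators, such as a weighted mean or a product, could be substituted for the minimum.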
A similar method to that outlined above, the NCAR Improved Moments Algorithm (NIMA), has been used to find the atmospheric signal in Doppler wind profiler spectra. A wind profiler is a vertically pointing radar that measures Doppler spectra as a function of range. The spectra indicate the distribution of returned power (vertical axis) as a function of Doppler velocity (horizontal axis). These spectra can be plotted (in log scale) one atop another as shown in FIG. 7. This representation of the data is referred to as a stacked spectra or waterfall plot. The first spectral plot is shown in the bottom left and continues as a function of range up the left column, then starts again at the bottom of the right column and continues to the top of that column. Notice the bimodal signal starting at 1127 meters and continuing through 2062 meters. The signal near zero velocity is from a contaminant (ground clutter from nearby mountains) and the signal centered around +6 m/s is the atmospheric signal. FIG. 8 is a contour plot of the stacked spectra or profiler map (the contour lines represent the log magnitude of the spectra). It is often difficult to grasp the structure of the total signal by looking at the stacked spectra. On the other hand, the profiler contour map more readily reveals the essential visual characteristics of the data to the human analyst. While the data are identical in the stacked spectra and the profiler map, it is clear that the method chosen to render the data is important. It is important to note that the NIMA algorithm was tried on the time series data. However, as with any algorithm, many assumptions were made about the behavior of the data in the development of NIMA that are contrary to the typical behavior of time series data.
Suppose the data from FIG. 3 is broken into overlapping sub regions using a sequence of running windows. For each data window, an estimate of the probability density function (i.e., a normalized histogram) is calculated. This sequence of histograms can be stacked (FIG. 9) as was done for the profiler spectra. The histogram for the first time window is shown in the bottom left; the plots then run up the left column as a function of time and continue from the bottom right plot to the top right plot (the stacked histograms are shown for only the time range that includes the first 555 data points from FIG. 3). Notice that, in this case, the mode associated with the atmospheric data (on the right-hand side) and the data associated with the dropouts (on the left-hand side) are well-separated. A more natural way to plot the stacked histograms might be to plot them across the page, that is, as a function of time (imagine turning FIG. 8 on its side).
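The windowing step can be sketched as below. The series is synthetic (values near 17 m/s with intermittent dropouts at 1 m/s), and the window length, step, bin layout, and the name `histogram_field` are assumptions for illustration; they are not the parameters used to produce FIG. 9.

```python
def histogram_field(values, window, step, lo, hi, bins):
    """Return one normalized histogram per overlapping running window."""
    width = (hi - lo) / bins
    field = []
    for start in range(0, len(values) - window + 1, step):
        counts = [0] * bins
        for v in values[start:start + window]:
            # Clamp so values outside [lo, hi] fall in the end bins.
            k = min(max(int((v - lo) / width), 0), bins - 1)
            counts[k] += 1
        field.append([c / window for c in counts])
    return field

# Hypothetical one-second wind speeds with dropouts between samples 40 and 59.
values = [17.0 + ((i * 7) % 5 - 2) * 0.4 for i in range(120)]
for i in range(40, 60, 2):
    values[i] = 1.0

field = histogram_field(values, window=30, step=10, lo=0.0, hi=20.0, bins=10)

# Each row is a time window; each column is a 2 m/s wide wind speed bin.
for row in field:
    print(" ".join(f"{h:.2f}" for h in row))
```

Rows whose windows overlap the dropouts show a second mode in the lowest bin, well separated from the atmospheric mode in the 16 to 18 m/s bin.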
These stacked histograms can then be plotted as a contour image (FIG. 10), which is called the histogram field. The contour plot in FIG. 10 represents a hypersurface, where the contour lines represent the height of the hypersurface above each point in the time-wind speed plane. As in the case of the profiler, plotting the stacked histograms as a contour image emphasizes the structure inherent in the stacked histograms, i.e., it shows the local continuity in the data, as expected for most time series.
It is natural for a human analyst to look at FIG. 10 and see that there are large clumps (peak regions). Notice though that these clumps do not contain all the data points in the original time series, i.e., there is cluster data and non-cluster data. By inspection the analyst can easily combine these local clusters into larger scale features. For instance in FIG. 10, a human expert might group the large clusters centered around 17 m/s into a feature and the others near 1 m/s into a second feature.
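The grouping a human expert performs by inspection can be approximated with a simple sketch. It assumes each local cluster has already been summarized by a representative time and wind speed, and it merges clusters whose speeds lie within a chosen gap of a feature's mean speed. The cluster values, the 3 m/s gap, and the name `group_into_features` are hypothetical.

```python
def group_into_features(clusters, max_gap):
    """Merge clusters whose speed is within max_gap of a feature's mean speed."""
    features = []
    for cluster in sorted(clusters, key=lambda c: c["speed"]):
        if features and cluster["speed"] - features[-1]["speed"] <= max_gap:
            members = features[-1]["members"] + [cluster]
            mean_speed = sum(m["speed"] for m in members) / len(members)
            features[-1] = {"speed": mean_speed, "members": members}
        else:
            features.append({"speed": cluster["speed"], "members": [cluster]})
    return features

# Hypothetical cluster summaries: time in seconds, wind speed in m/s.
clusters = [
    {"time": 100, "speed": 17.2},
    {"time": 250, "speed": 0.9},
    {"time": 300, "speed": 16.8},
    {"time": 420, "speed": 1.3},
    {"time": 500, "speed": 17.0},
]

features = group_into_features(clusters, max_gap=3.0)
for f in features:
    print(round(f["speed"], 2), [m["time"] for m in f["members"]])
```

With these values the clusters near 17 m/s form one feature and the clusters near 1 m/s form a second, mirroring the grouping described for FIG. 10.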
The invention helps solve the above problems by using image processing technology to classify features in time series data. In atmospheric examples of the invention, the features can be used to detect outliers in the time series data from weather measurement systems. The invention may also be implemented in numerous other areas, such as image recognition and computer-generated video.
Examples of the invention include systems, methods, and software products to classify a feature in time series data. The systems include a processing system and an interface where the interface receives the time series data. The method is for operating a processing system. The software product includes a storage system that stores application software that directs a processing system.
In some examples of the invention, the processing system is configured to: 1) process the time series data with a plurality of membership functions to generate a plurality of hypersurfaces, 2) process the hypersurfaces to generate a composite hypersurface, 3) process the composite hypersurface to identify clusters, and 4) process the clusters to classify the feature.
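The four steps can be illustrated end to end with a small sketch under stated assumptions. Here the two membership functions are running-window histograms at a short and a long time scale, the composite is their cell-by-cell mean, clusters are connected regions above a fixed level (a stand-in for contouring), and classification uses a hypothetical 10 m/s split. None of these specific choices, names, or parameters is taken from the source.

```python
def surface(values, half, step, lo, hi, bins):
    """Membership function: per-window histogram scaled so its tallest bin is 1."""
    width = (hi - lo) / bins
    rows = []
    for centre in range(0, len(values), step):
        counts = [0] * bins
        for v in values[max(0, centre - half): centre + half + 1]:
            counts[min(max(int((v - lo) / width), 0), bins - 1)] += 1
        top = max(counts)
        rows.append([c / top for c in counts])
    return rows

def combine(surfaces):
    """Composite hypersurface: cell-by-cell mean of the input hypersurfaces."""
    n = len(surfaces)
    return [[sum(cells) / n for cells in zip(*rows)] for rows in zip(*surfaces)]

def find_clusters(field, level):
    """Collect 4-connected regions of cells whose height exceeds level."""
    rows, cols = len(field), len(field[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or field[r][c] <= level:
                continue
            seen[r][c] = True
            stack, cells = [(r, c)], []
            while stack:
                i, j = stack.pop()
                cells.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and not seen[ni][nj] and field[ni][nj] > level):
                        seen[ni][nj] = True
                        stack.append((ni, nj))
            clusters.append(cells)
    return clusters

def classify(cluster, lo, width, split=10.0):
    """Hypothetical rule: clusters centred above split m/s are atmospheric."""
    mean_value = sum(lo + (col + 0.5) * width for _, col in cluster) / len(cluster)
    return "atmospheric" if mean_value > split else "failure mode"

# Hypothetical wind speeds near 17 m/s with dropouts to 1 m/s.
values = [17.0 + ((i * 7) % 5 - 2) * 0.4 for i in range(120)]
for i in range(40, 60, 2):
    values[i] = 1.0

local_surface = surface(values, 5, 10, 0.0, 20.0, 10)    # short time scale
global_surface = surface(values, 20, 10, 0.0, 20.0, 10)  # long time scale
composite = combine([local_surface, global_surface])
clusters = find_clusters(composite, 0.2)
labels = [classify(c, 0.0, 2.0) for c in clusters]
print(labels)
```

Scaling each histogram by its tallest bin keeps every hypersurface, and therefore the composite, on a height scale from zero to one.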
In some examples of the invention, the processing system is configured to contour the composite hypersurface to form the clusters.
In some examples of the invention, the processing system is configured to classify the clusters based on a plurality of cluster types, such as an atmospheric cluster type and/or a failure mode cluster type.
In some examples of the invention, the processing system is configured to construct the feature from the clusters based on the cluster classifications.
In some examples of the invention, the processing system is configured to calculate feature membership values for the time series data based on the classified feature and to detect outliers in the time series data based on the feature membership values.
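One possible form of this step is sketched below. It assumes the classified feature has been reduced to a centre value and a half width, uses a triangular membership function, and flags data whose membership falls below 0.5. The function name, the feature parameters, and the threshold are hypothetical and are shown only to make the idea concrete.

```python
def feature_membership(value, centre, half_width):
    """Triangular membership: 1 at the feature centre, 0 beyond half_width."""
    return max(0.0, 1.0 - abs(value - centre) / half_width)

# Hypothetical wind speeds in m/s; the atmospheric feature is centred at 17 m/s.
series = [17.0, 16.5, 1.2, 17.4, 0.9, 18.0]
memberships = [feature_membership(v, centre=17.0, half_width=4.0) for v in series]

# Data with low membership in the atmospheric feature are treated as outliers.
outliers = [i for i, m in enumerate(memberships) if m < 0.5]
print(memberships)
print(outliers)
```

Because the membership is graded rather than Boolean, a downstream user can choose a stricter or looser threshold without recomputing the feature.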
In some examples of the invention, the hypersurfaces and/or the composite hypersurface have a height scale from zero to one.
In some examples of the invention, one of the hypersurfaces indicates confidence values for the time series data.
In some examples of the invention, the processing system is configured to process one of the hypersurfaces to identify additional ones of the clusters.
In some examples of the invention, the processing system is configured to: 1) process the time series data with a membership function to generate a hypersurface, 2) process the hypersurface to identify a cluster, and 3) process the cluster to classify the feature.