The invention relates to a method for determining a window size for outlier detection.
Preprocessing of data is an important task prior to any data analysis. In time series data analysis, one part of preprocessing consists of removing outliers from a data set being analyzed. An outlier is a data point or measurement that falls outside of the range of most of the data points or measurements in the data set. Without outlier handling, traditional data analysis may fail because outliers will distort the variance of other data in a data set. For instance, doing a trend analysis requires detection and removal of outliers. Otherwise a trend prediction will become strongly influenced by a small number of outlier data points that are not at all representative for the complete data set.
Outlier detection is usually not difficult for a human. However, it can be difficult for a computer program to automatically detect outliers. Traditional outlier handling is not well adapted to handle data sets generated by typical database monitoring systems. Although data may be aggregated in a so-called data warehouse on a continuing basis, there are times during normal data warehousing operation when large data sets (data blocks) are created on a recurring basis. The occurrence of data blocks, even if anticipated, nevertheless results in a dramatic workload change for a significant period of time.
Data blocks can be generated when a database system suspends the normal task of aggregating input data in order to perform other tasks (such as data consolidation operations, backup operations, overnight batch jobs, etc.) that may be performed infrequently but that result in the creation of data blocks each time they are performed. Tasks of this type may be performed on a regular, recurring basis (for example, daily, weekly, monthly, etc.) or on an as-needed basis.
Although the analysis of such data blocks can place heavy demands on data processing resources, the detection and removal of outliers must still be performed.
Automatic outlier detection involves establishing a window and detecting if outliers exist inside within the window. The main problem is deciding how big the window should be. The present invention fills a need for a flexible and efficient method for determining an appropriate window size for outlier detection as well as a need for an outlier detection method that can handle blocks of data points with extreme values.
The invention may also be implemented as a computer program product for outlier detection for time series in database systems. The computer program product includes a computer usable medium embodying computer usable program code configured to perform a local search for outliers on a sliding window with a window size (w), code configured to maintain a data structure representing the degree to which a value of a measuring point can be an outlier, code configured to measure an uncertainty in the data structure, code configured to optimize the window size by maximizing the uncertainty, and code configured to detect outliers with a given threshold.