Outlier detection is an important and challenging aspect of data mining. An outlier is an anomaly: an observation that deviates in some computable aspect from the other observations in a random sample of a population. Outliers can be caused by human error, fraudulent behavior, defective instruments, changes in system behavior, or system malfunction. Outlier detection is a critical task in many safety-critical environments, as the mere existence of outliers indicates abnormal running conditions from which significant performance degradation may result.
Applications such as fraud detection, network flow monitoring and telecommunications data management generate unbounded data streams, unlike the bounded data sets found in traditional databases. An unbounded data stream is an ordered sequence of data X∞ = (x1, x2, . . .). Because data arrives continuously, storing all of it would be impractical and would carry enormous storage management costs. Traditional data mining methods cannot be applied effectively or efficiently to streaming data, as they are intended for applications and environments where a finite data set is stored in local memory and each item in the data set is available for repeated reading and processing. Applied to unbounded data streams, moreover, most such methods are computationally expensive and time-consuming.
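The one-pass, bounded-memory constraint can be illustrated with a small sketch: a running mean and variance can be maintained over a stream in constant memory using Welford's online algorithm, so no element of the stream ever has to be stored or re-read. The class below is an illustrative example only and is not part of any method discussed in this document.

```python
class RunningStats:
    """Welford's one-pass algorithm: mean and variance in O(1) memory,
    suitable for unbounded streams where the data cannot be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        """Consume one stream element; each item is read exactly once."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0


rs = RunningStats()
for v in (1.0, 2.0, 3.0, 4.0, 5.0):
    rs.push(v)
# rs.mean is now 3.0 and rs.variance() is 2.0, with no stored history
```

The same idea underlies many stream-processing methods: statistics are updated incrementally per item, so memory use is independent of the stream length.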
Further, due to the dynamic nature of, e.g., human behavior and activities, the statistical characteristics of a data stream of subscriber data change over time. Because of this, what was considered an outlier at one time may be a perfectly coherent observation some time later. Methods that irrevocably evaluate each item once, as soon as it is read, are therefore not useful here.
In “Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream” (CORM) (Manzoor Elahi, Kun Li, Wasif Nisar, Xinjie Lv, Hongan Wang, in Proceedings of the Fifth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 5, pp. 298-304, 2008), the authors address the dynamic and unbounded properties of streaming data. In the CORM method, the data stream is divided into L chunks of n data items each, and each chunk is then clustered into k clusters. L, n and k are required as input from the analyst, who also has to define a distance function and the locations of the initial k cluster centers.
For every cluster, its outliers, its actual mean value and its updated mean value are saved for the following chunks. If the distance between an object from chunk z and the closest cluster center is larger than the updated mean radius, the object is carried forward and clustered again with the next chunk, z+1. The “safe region” grows as the updated mean radius grows. Data still outside the updated mean radius after L chunks have been read are declared outliers.
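As an illustration of how the chunk-and-carry-forward scheme operates, the following is a much-simplified one-dimensional sketch. The fixed cluster centers, the non-shrinking mean radius and all identifiers are our own simplifying assumptions for readability; this is not the published CORM algorithm.

```python
def corm_sketch(stream, n, k, L, centers, dist=lambda a, b: abs(a - b)):
    """Simplified sketch: chunk the stream, absorb points that fall inside
    a cluster's mean-distance "safe region", carry the rest forward, and
    declare a point an outlier once it has survived L chunks outside."""
    counts = [0] * k        # points absorbed per cluster
    dsum = [0.0] * k        # sum of absorbed distances per cluster
    radius = [0.0] * k      # "updated mean" radius = safe region
    carried = []            # candidate outliers: (value, chunks survived)
    outliers = []
    first = True            # first chunk bootstraps the radii
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) < n:
            continue
        items = [(v, 0) for v in chunk] + carried
        chunk, carried = [], []
        for v, age in items:
            c = min(range(k), key=lambda i: dist(v, centers[i]))
            d = dist(v, centers[c])
            if first or d <= radius[c]:
                # inside the safe region: absorb; the region may only grow
                counts[c] += 1
                dsum[c] += d
                radius[c] = max(radius[c], dsum[c] / counts[c])
            elif age + 1 >= L:
                outliers.append(v)   # outside all safe regions for L chunks
            else:
                carried.append((v, age + 1))
        first = False
    return outliers, carried


# Two clusters around 0 and 100; the value 500 never fits a safe region
stream = [2, -2, 98, 102, 1, -1, 500, 99, 2, 101, -1, 98]
out, pending = corm_sketch(stream, n=4, k=2, L=2, centers=[0.0, 100.0])
# out == [500]: only 500 survives L chunks outside every safe region
```

Even in this toy form, the sketch shows the storage the text criticizes: per-cluster state and all carried-forward candidates must be kept between chunks.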
This method, like all clustering methods, is computationally intensive, and it requires more than one pass over the data. The CORM method also requires considerable intermediate data storage, since two different mean values for each cluster, the number of clusters and all candidate outliers must be kept available.
Once the presence of outliers has been established, it is usually desirable to cleanse the data stream of these aberrations, which, if undetected, may lead to incorrect results. One cause of under-detection is that the chosen detection method assumes conditions that do not apply to the population in question. Most methods, such as the Z-score and the Interquartile Range (IQR), are parametric methods that assume a normal distribution. Used on a population with, e.g., a heavy-tailed distribution, or a population that is a mixture of two sub-distributions, outliers, especially intermediate outliers, are likely to remain undetected. Telecom-related data, such as charging data or other subscriber-related data, belongs to this group, typically following some power-law distribution and/or consisting of a mixed population.
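The masking effect on heavy-tailed data can be demonstrated with a small, hypothetical sample: a single extreme value inflates the standard deviation so much that a conventional 3-sigma Z-score test passes an intermediate outlier as normal.

```python
import statistics


def z_outliers(data, threshold=3.0):
    """Flag points whose Z-score exceeds the threshold.
    This is the standard parametric test, which assumes normality."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]


# 50 typical readings of 1.0, one intermediate outlier (50.0)
# and one extreme outlier (1000.0) in the tail
sample = [1.0] * 50 + [50.0, 1000.0]

# The extreme value drives sigma to roughly 137, so the Z-score of
# 50.0 is only about 0.2: it is masked and only 1000.0 is flagged.
print(z_outliers(sample))  # [1000.0]
```

On a normally distributed sample the same test behaves as intended; the failure here comes entirely from the heavy tail violating the test's distributional assumption.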
Moreover, parametric methods are often unsuitable for the large data sets typically handled by Consumer Information Management (CIM) systems, which receive input such as Customer Data Record (CDR) flows. Telecom data sets are huge: a single day of charging system data amounts to approximately 40 GB, so the memory requirements are excessive, and even once the data has been assembled in memory, considerable processing time and effort remain.
Hence there is a need for a method and an arrangement that address or diminish the problems mentioned above.