In recent years, advances in hardware technology have made it possible to collect large amounts of data in many applications. Typically, a database that processes such data is subject to continuous activity over long periods of time, so the database may grow without limit. Examples include supermarket transaction data, multimedia data, and data from telecommunication applications. The volume of data may easily reach millions of records per day, and it is often not possible to store all of it so that standard algorithmic techniques may be applied. Therefore, algorithms designed for such data must take into account that no part of the voluminous data can be revisited, and that only a single scan of the data is allowed during processing. Data of this type is commonly referred to as a data stream.
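The single-scan constraint can be illustrated with a minimal sketch. It assumes a stream of numeric values and uses Welford's running mean and variance as a stand-in summary; the names `RunningStats` and `flag_outliers`, and the threshold of three standard deviations, are hypothetical choices for illustration and are not taken from the cited work.

```python
import math
from typing import Iterable, Iterator


class RunningStats:
    """Mean and population variance maintained in one pass (Welford's method)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Constant-time, constant-memory update; the point x is never stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.n if self.n > 0 else 0.0


def flag_outliers(stream: Iterable[float], threshold: float = 3.0) -> Iterator[float]:
    """Yield points far from the running mean, seeing each point exactly once."""
    stats = RunningStats()
    for x in stream:
        std = math.sqrt(stats.variance)
        # Compare against the summary of earlier points only (assumed rule).
        if stats.n >= 2 and std > 0 and abs(x - stats.mean) > threshold * std:
            yield x
        stats.update(x)


if __name__ == "__main__":
    readings = [10, 12, 10, 12, 10, 12, 10, 12, 50]
    print(list(flag_outliers(readings)))
```

Each point is consumed once and then discarded, so memory use stays constant no matter how long the stream runs. A simple rule of this kind flags any large deviation, however, and cannot by itself separate a rare abnormality of interest from frequent spurious ones.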
Unlike a traditional data source, a stream is a continuous process which requires simultaneous model construction and abnormality reporting. Therefore, it is necessary for a supervision process to work with whatever information is currently available, and to continually update an abnormality detection model as new abnormalities occur.
Considerable research has been conducted in the field of data streams in recent years, see, for example, J. Feigenbaum et al., “Testing and Spot-Checking of Data Streams,” ACM SODA Conference, 2000; J. Fong et al., “An Approximate Lp-Difference Algorithm for Massive Data Streams,” Annual Symposium on Theoretical Aspects in Computer Science, 2000; C. Cortes et al., “Hancock: A Language for Extracting Signatures from Data Streams,” ACM SIGKDD Conference, 2000; S. Guha et al., “Clustering Data Streams,” IEEE FOCS Conference, 2000; and B-K. Yi et al., “Online Data Mining for Co-Evolving Time Sequences,” ICDE Conference, 2000.
Many traditional data mining problems, such as clustering and classification, have recently been re-examined in the context of the data stream environment, see, for example, C. C. Aggarwal et al., “A Framework for Clustering Evolving Data Streams,” VLDB Conference, 2003; P. Domingos et al., “Mining High-Speed Data Streams,” ACM SIGKDD Conference, 2000; and S. Guha et al., “Clustering Data Streams,” IEEE FOCS Conference, 2000.
Abnormality detection is an important problem in the data mining community, see, for example, H. Branding et al., “Rules in an Open System: The Reach Rule System,” First Workshop of Rules in Database Systems, 1993; M. Berndtsson et al., “Issues in Active Real-Time Databases,” Active and Real-Time Databases, pp. 142-157, 1995; T. Lane et al., “An Application of Machine Learning to Anomaly Detection,” Proceedings of the 20th National Information Systems Security Conference, pp. 366-380, 1997; and W. Lee et al., “Learning Patterns from Unix Process Execution Traces for Intrusion Detection,” AAAI Workshop: AI Approaches to Fraud Detection and Risk Management, pp. 50-56, July 1997. However, the models described in these references do not address the prediction of rare abnormalities in the presence of many spurious, but similar, patterns.
For example, in stock market monitoring applications, it may be desirable to find patterns in trading activity which are indicative of a possible stock market crash. The stream of data available may correspond to the real-time data available on the exchange. A stock sell-off may be a relatively frequent occurrence whose effects on the data stream are similar to those of a crash, so one may wish to quickly distinguish the rare crash from a simple sell-off. It may also be desirable to detect particular patterns of trading activity which result in the sell-off of a particular stock, or a particular sector of stocks. Quick detection of such abnormalities is of great value to financial institutions.
In business activity monitoring applications, it may be desirable to find particular aspects of the stream indicative of significant abnormalities in business activity. For example, certain sets of actions of competitor companies may point to the probable occurrence of significant abnormalities in the business. When such abnormalities do occur, it is important to be able to detect them very quickly, so that appropriate measures may be taken.
In medical applications, continuous streams of data from hospitals or pharmacies can be used to detect any abnormal disease outbreaks or biological attacks. Certain diseases caused by biological attacks are often difficult to distinguish from other background diseases. However, it is essential to be able to make such distinguishing judgments in real time in order to create a credible abnormality detection system.
Abnormalities such as disease outbreaks or stock market crashes may occur only rarely over long periods of time. However, the value of abnormality detection is highly dependent on the latency of the detection, because abnormality detection systems are usually coupled with time-critical response mechanisms. Furthermore, because of efficiency considerations, each data point can be examined only once during the entire computation. This creates an additional constraint on how abnormality detection algorithms may be designed.
In most situations, data streams may show abnormal behavior for a wide variety of reasons. It is important for an abnormality detection model to be specific, that is, able to focus on and learn a rare abnormality of a particular type. Furthermore, spurious abnormalities may be significantly more frequent than the rare abnormalities of interest. Such a situation makes the abnormality detection problem even more difficult, since it increases the probability of a false detection.
In many cases, even though multiple kinds of abnormalities may have similar effects on the individual dimensions, the abnormality of interest may be distinguishable only by its relative effect on these dimensions. Therefore, an abnormality detection model needs to be able to quantify such differences.
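One way to picture a relative effect across dimensions is to compare the direction of a deviation vector, rather than its size, against a known pattern. The sketch below is only an illustration under stated assumptions: cosine similarity is a swapped-in measure chosen for simplicity, and the function names, the example signature, and the 0.95 cutoff are all hypothetical rather than drawn from the source.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_signature(
    deviation: Sequence[float],
    signature: Sequence[float],
    min_similarity: float = 0.95,  # assumed cutoff, for illustration only
) -> bool:
    """True when the deviation has nearly the same proportions as the signature."""
    return cosine_similarity(deviation, signature) >= min_similarity


if __name__ == "__main__":
    # Hypothetical signature: the second dimension moves four times as much
    # as the other two when the abnormality of interest occurs.
    signature = [1.0, 4.0, 1.0]
    of_interest = [2.0, 8.5, 1.8]  # every dimension rises, in similar proportions
    spurious = [5.0, 5.0, 5.0]     # every dimension rises, but uniformly
    print(matches_signature(of_interest, signature))
    print(matches_signature(spurious, signature))
```

Both example deviations raise every dimension, so a per-dimension threshold would flag both. Only the proportions between dimensions separate them.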
Since a data stream is likely to change over time, not all features remain equally important for the abnormality detection process. While some features may be more valuable to the detection of an abnormality in a given time period, this characteristic may vary with time. It is important to be able to modify the abnormality detection model appropriately with the evolution of the data stream.