Complex event processing (CEP) applications typically handle transient event data arriving at very high rates. A CEP engine continuously analyzes incoming event streams by means of filtering, aggregation, correlation, etc., to thereby deliver business-relevant patterns on the fly.
The event streams consumed by CEP applications typically do not behave in a stable and constant manner over time. Indeed, the event streams may temporarily deviate from expected behavior. Such deviations may represent either an opportunity or a threat for the corresponding CEP application. Accordingly, the early detection of deviations in event streams may be advantageous (e.g., of high value) for a business that relies on the event streams.
Because of event stream processing requirements, it will be appreciated that it would be desirable to provide well-founded analysis results based on detections of deviations in an event stream. Similarly, it also will be appreciated that it would be desirable to derive such results in an online (e.g., real-time, in-flight, non-stored, etc.) manner. In certain instances, the characteristics of the stream(s) being processed may not be known in advance, so that the detection may have to proceed without prior knowledge of those characteristics.
Currently, there are various conventional techniques for detecting irregularities in a given set of data.
One conventional approach is to store the data set in a database and to explore its characteristics and irregularities there. Using a query language such as, for example, SQL, or with data mining algorithms on top of the database, the data set and its corresponding features can be analyzed. Unfortunately, however, a database approach may not always be feasible in the context of processing a high-volume, low-latency event stream. In such cases, the data typically arrives faster than the database system can process and answer queries. Further, a data mining approach may not be feasible because of typically high computational requirements. Data mining algorithms may require multiple passes over the data, which is typically not possible in a CEP scenario, where the event streams are potentially unbounded and arrive continuously.
CEP applications generally impose rigid processing requirements like a single pass over the stream or limited computational resources. Thus, CEP engines typically process incoming events incrementally. Usually, CEP engines follow a SQL, rule-based, or state-based approach, typically extended by temporal clauses. Those clauses may allow the event stream analysis to be restricted or limited to a temporal window. For example, this allows computing the maximum price of a stock over the previous 10 minutes or other time intervals. Thus, depending on the setting of the time window, a user can place an emphasis on analyzing more recent data.
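By way of illustration only, the temporal-window example above (the maximum price of a stock over the previous 10 minutes) can be sketched in Python as a single-pass, incremental computation. The class name `SlidingWindowMax`, the timestamps in seconds, and the sample prices are assumptions made for this sketch and are not taken from any particular CEP engine.

```python
from collections import deque


class SlidingWindowMax:
    """Maximum of the values seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # (timestamp, value) pairs with strictly decreasing values, so the
        # current maximum is always at the left end of the deque.
        self._candidates = deque()

    def update(self, timestamp, value):
        # Older, smaller-or-equal values can never be the maximum again.
        while self._candidates and self._candidates[-1][1] <= value:
            self._candidates.pop()
        self._candidates.append((timestamp, value))
        # Expire candidates that have fallen out of the time window.
        while self._candidates[0][0] <= timestamp - self.window_seconds:
            self._candidates.popleft()
        return self._candidates[0][1]


# Hypothetical price events: (timestamp in seconds, price).
tracker = SlidingWindowMax(window_seconds=600)
prices = [(0, 101.0), (120, 103.5), (400, 102.0), (700, 99.0), (1100, 98.5)]
maxima = [tracker.update(t, p) for t, p in prices]
print(maxima)  # [101.0, 103.5, 103.5, 103.5, 99.0]
```

Each event is handled once and only candidates that could still become the maximum are retained, which is in keeping with the single-pass and limited-resource constraints described above.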
CEP engines that are SQL-based may utilize specific SQL functionality for deviation detection. SQL provides aggregates including, for example, MIN, MAX, VAR, AVG, etc. To detect a deviation, a continuous SQL query could, for example, compare the current value against the Bollinger bands. The Bollinger bands may define an envelope of two standard deviations around the average. If a new value is outside the bands, it is classified as a deviation. However, this approach has limitations, as it assumes a normal distribution of the data in order to produce reliable analysis results.
Further, the standard SQL aggregates are of limited expressiveness. For instance, they only provide empirical summary measures of the underlying distribution, and cannot be readily used to detect irregularities or multiple modes of the distribution. The approach using the Bollinger bands described above, for example, assumes a normal distribution of the event stream, an assumption that may not hold for arbitrary streams. As a consequence, the results for a non-normal distribution may be of low quality, as they may fail to reflect irregularities or multiple modes of the stream distribution.
CEP engines based on rules or states are also likely to provide simple aggregates like the above-mentioned SQL functionality. The rules or states may then also use those aggregates to detect deviations from the average behavior. Thus, the problems associated with SQL-based CEP engines may also apply to rule-based and state-based engines. Standard aggregates typically only provide summary measures, which do not detect irregularities of the distribution or multiple modes.
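A minimal sketch of such a band check is given below in Python rather than SQL. It assumes a count-based window of the ten most recent values, a population standard deviation, and a warm-up phase during which nothing is flagged; the class name `BollingerDetector` and the sample values are hypothetical.

```python
import math
from collections import deque


class BollingerDetector:
    """Flags values lying outside mean +/- num_std standard deviations."""

    def __init__(self, window_size=10, num_std=2.0):
        self.window = deque(maxlen=window_size)
        self.num_std = num_std

    def is_deviation(self, value):
        """Check `value` against bands built from earlier values, then store it."""
        flagged = False
        # Assumption: only start flagging once the window is full.
        if len(self.window) == self.window.maxlen:
            mean = sum(self.window) / len(self.window)
            variance = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            half_width = self.num_std * math.sqrt(variance)
            flagged = abs(value - mean) > half_width
        self.window.append(value)
        return flagged


detector = BollingerDetector(window_size=10, num_std=2.0)
values = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 14.0, 10.1]
flags = [detector.is_deviation(v) for v in values]
print(flags)
```

In this sketch only the value 14.0 is flagged. Because the flagged value is itself added to the window, it widens the bands for subsequent checks, which is one reason band-based checks can be sensitive to the data not being well behaved.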
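The limitation can be illustrated with a small Python sketch on synthetic data. The bimodal stream below (readings clustered around 10 and around 50) and the probe value of 30 are assumptions chosen purely for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical bimodal stream: half the readings near 10, half near 50.
samples = ([random.gauss(10.0, 1.0) for _ in range(500)]
           + [random.gauss(50.0, 1.0) for _ in range(500)])

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)
lower, upper = mean - 2.0 * std, mean + 2.0 * std

# A probe value in the gap between the two modes.
probe = 30.0
inside_band = lower <= probe <= upper
# Number of observed readings within 5 units of the probe.
nearby = sum(1 for x in samples if abs(x - probe) <= 5.0)

print(f"band = [{lower:.1f}, {upper:.1f}], probe inside band: {inside_band}")
print(f"observations within 5 units of the probe: {nearby}")
```

The probe sits close to the mean and therefore well inside the two-standard-deviation band, even though no observed reading lies anywhere near it. The summary aggregates alone give no indication that the distribution has two modes.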
Another relatively new technique that may hold promise is stream mining. Stream mining analyzes event streams in an online manner. However, work in this area is in its infancy and more work is needed.
Thus, it will be appreciated that there is a need in the art for improved systems and/or methods for detecting event stream deviations, for example, for use by a CEP application or the like.
One aspect of certain example embodiments relates to calculating the deviation in an event stream over at least two windows of time. In certain example embodiments, one of the time periods may encompass the complete event stream.
Another aspect of certain example embodiments relates to estimating deviations in event streams through the use of kernel density estimators (KDEs).
Another aspect of certain example embodiments relates to a notification being sent when a deviation in an event stream occurs.
Yet another aspect of certain example embodiments relates to calculating a deviation between an ideal behavior of an event stream and a short-term calculation of the event stream behavior.
Yet another aspect of certain example embodiments relates to calculating a deviation between an ideal behavior of an event stream and a long-term calculation of the event stream behavior.
Still another aspect of certain example embodiments relates to comparing a deviation of a long-term time window to a deviation in a short-term time window.
Still another aspect relates to comparing a deviation of an event stream with a threshold value.
In certain example embodiments, a deviation detection method for use with a processing system including at least one processor is provided. At least one stream of event data is received at the processing system, with the event data including at least one attribute. A long-term statistic corresponding to a first estimate of a probability density function (PDF) of at least one monitored attribute in the at least one stream of event data over a first time window is calculated. A short-term statistic corresponding to a second estimate of the PDF of the at least one monitored attribute in the at least one stream of event data over a second time window is calculated, with the second time window being of a shorter duration than the first time window. First and second distances between an ideal PDF and the long- and short-term statistics, respectively, are computed. A current deviation is computed based at least in part on the first and second distances. The current deviation is compared to a threshold value. The above is repeated as further monitored events are delivered by the at least one stream of event data.
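The sequence of steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the PDF estimates are simple equal-width histograms, the distance is an integrated absolute (L1) difference, and the current deviation is taken to be the absolute difference between the short- and long-term distances. None of these choices is prescribed by the description above, and the names `TwoWindowDetector`, `histogram_pdf`, and `l1_distance` are hypothetical.

```python
from collections import deque


def histogram_pdf(values, lo, hi, bins):
    """Piecewise-constant density estimate over [lo, hi) with equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[min(bins - 1, int((v - lo) / width))] += 1
    total = max(len(values), 1)
    return [c / (total * width) for c in counts]


def l1_distance(pdf_a, pdf_b, width):
    """Integrated absolute difference between two binned densities."""
    return width * sum(abs(a - b) for a, b in zip(pdf_a, pdf_b))


class TwoWindowDetector:
    def __init__(self, ideal_pdf, lo, hi, long_window, short_window, threshold):
        self.ideal_pdf = ideal_pdf          # ideal density, one value per bin
        self.lo, self.hi = lo, hi
        self.bins = len(ideal_pdf)
        self.width = (hi - lo) / self.bins
        self.long_window = long_window      # duration of the first time window
        self.short_window = short_window    # duration of the second time window
        self.threshold = threshold
        self.events = deque()               # (timestamp, value) in the long window

    def _distance(self, values):
        estimate = histogram_pdf(values, self.lo, self.hi, self.bins)
        return l1_distance(estimate, self.ideal_pdf, self.width)

    def update(self, timestamp, value):
        """Process one monitored event; return (current deviation, flagged)."""
        self.events.append((timestamp, value))
        while self.events[0][0] <= timestamp - self.long_window:
            self.events.popleft()
        long_values = [v for _, v in self.events]
        short_values = [v for t, v in self.events
                        if t > timestamp - self.short_window]
        d_long = self._distance(long_values)      # first distance
        d_short = self._distance(short_values)    # second distance
        deviation = abs(d_short - d_long)         # assumed combination rule
        return deviation, deviation > self.threshold


# Hypothetical stream: one event per second, uniform over [0, 10) for 100 s,
# then stuck at 9.5. The ideal PDF is uniform (density 0.1 in each of 5 bins).
detector = TwoWindowDetector([0.1] * 5, 0.0, 10.0,
                             long_window=100, short_window=10, threshold=0.5)
results = []
for t in range(110):
    value = [1.0, 3.0, 5.0, 7.0, 9.0][t % 5] if t < 100 else 9.5
    deviation, flagged = detector.update(t, value)
    results.append((t, deviation, flagged))

print(results[99], results[109])
```

In this example the short-term estimate reacts quickly once the stream sticks at 9.5, while the long-term estimate changes only slowly, so the difference between the two distances grows past the threshold within a few events.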
In certain example embodiments, a deviation detection method for use with a processing system including at least one processor is provided. At least one stream of event data is received at the processing system. A short-term kernel density estimator (KDE) is maintained, over a first time period, for at least one monitored event in the at least one stream of event data. A long-term KDE is maintained, over a second time period, for the at least one monitored event in the at least one stream of event data. A deviation from at least one predefined probability density function (PDF) is calculated in dependence on the short- and long-term KDEs. The deviation is compared to a threshold to detect an event stream deviation.
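For illustration, a short-term and a long-term KDE might be maintained and compared to a predefined PDF as sketched below in Python. The sketch assumes Gaussian kernels, a normal reference rule of thumb for the bandwidth (h = 1.06 σ n^(-1/5)), count-based windows of 50 and 400 events, a standard normal reference PDF, and a deviation defined as the absolute difference of the two L1 distances. All of these are assumptions made for the sketch, and the KDEs are rebuilt from the window contents on each check rather than updated incrementally.

```python
import math
import random
from collections import deque


def gaussian_kde(samples):
    """Return a function evaluating a Gaussian KDE built from `samples`."""
    samples = list(samples)
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    h = 1.06 * max(std, 1e-6) * n ** (-0.2)   # rule-of-thumb bandwidth
    scale = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return scale * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return pdf


def standard_normal_pdf(x):
    """Predefined reference PDF assumed for this sketch."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def distance_to_reference(pdf, reference_pdf, grid):
    """Approximate the integral of |pdf - reference_pdf| on an even grid."""
    step = grid[1] - grid[0]
    return step * sum(abs(pdf(x) - reference_pdf(x)) for x in grid)


GRID = [-6.0 + 0.1 * i for i in range(151)]   # evaluation points on [-6, 9]
THRESHOLD = 0.7                               # hypothetical threshold

long_window = deque(maxlen=400)    # long-term window (count-based assumption)
short_window = deque(maxlen=50)    # short-term window


def observe(x):
    long_window.append(x)
    short_window.append(x)


def current_deviation():
    d_long = distance_to_reference(gaussian_kde(long_window),
                                   standard_normal_pdf, GRID)
    d_short = distance_to_reference(gaussian_kde(short_window),
                                    standard_normal_pdf, GRID)
    return abs(d_short - d_long)


random.seed(42)
for _ in range(400):               # stream initially follows the reference PDF
    observe(random.gauss(0.0, 1.0))
before = current_deviation()

for _ in range(50):                # stream then shifts to a mean of 3
    observe(random.gauss(3.0, 1.0))
after = current_deviation()

print(f"before shift: {before:.2f}, flagged: {before > THRESHOLD}")
print(f"after shift:  {after:.2f}, flagged: {after > THRESHOLD}")
```

Unlike the summary aggregates discussed earlier, a KDE makes no assumption about the shape of the underlying distribution, so the same comparison could be made against a multimodal or skewed reference PDF.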
There also are provided in certain example embodiments non-transitory computer-readable storage media tangibly storing instructions that, when executed by at least one processor, perform the above-described and/or other methods.
Similarly, there also are provided in certain example embodiments systems that include adapters configured to receive at least one stream of event data and processors configured to execute the above-described and/or other methods. Data stores may be provided in certain example implementations to log information about detected deviations. Such information may include, for example, the time/date of the deviation, the expected value or range of values, the observed value or range of values, etc.
These aspects and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments.