Modern computing systems have to deal with an ever-increasing volume of data. Complex Event Processing (CEP) is a processing paradigm designed to cope with such volumes. CEP aims at processing and analyzing streams of data as the data arrives, so that opportunities or threats can be detected and appropriate actions can be triggered quickly. CEP systems use continuous queries to analyze the streams in real time, derive insights continuously, and forward these insights directly to the corresponding consumer(s). For example, by analyzing streams of credit card transactions in real time, potential fraud attempts can be discovered immediately and the corresponding credit card can be disabled to avoid further damage. Other application scenarios that can benefit from CEP include logistics, surveillance systems, algorithmic trading, web applications, and manufacturing systems.
A CEP system can be connected to data sources that continuously send data, usually equipped with temporal information, so-called events. Examples of events are an airplane landing, the blocking of a credit card transaction, or a temperature reading from a machine. These events stream into a CEP system and are analyzed by continuous queries. Such a query continuously processes incoming events, following a push-based processing paradigm, and the corresponding results are pushed directly to follow-up consumers. One type of CEP system uses SQL-based CEP engines, i.e. continuous queries are described in an SQL dialect. These engines typically resemble the mechanics of a database system: a textual query is translated into a combination of logical operations, and for each of those logical operations a suitable physical implementation is chosen and then activated. Due to the sharing of sub-queries, the entirety of currently running queries constitutes an operator graph in which the nodes refer to operators hosting the physical implementations and the edges refer to the flow of events between operators. Examples of such operators are a filter operator, a join operator, or an aggregation operator. The terms query graph and operator graph are used synonymously hereinafter. FIG. 1 illustrates an exemplary operator graph.
An important aspect of continuous queries is the handling of time as a first-class citizen. Typically, a continuous query is equipped with a sliding time window to which the current results refer. An example is a query that computes the average transaction amount over the last hour. To compute these results, relevant events are temporarily stored in internal main-memory data structures. If the input rates are high and the time window is large, these internal data structures can occupy large amounts of memory.
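The sliding-window behavior described above can be illustrated with a minimal Python sketch. It is not taken from any particular CEP engine: the class name `SlidingWindowAverage`, its method names, and the assumption that events arrive in non-decreasing timestamp order are purely illustrative. The sketch shows why memory grows with input rate and window size: every event inside the window is buffered until it expires.

```python
from collections import deque


class SlidingWindowAverage:
    """Hypothetical operator: average of event values inside a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # buffered (timestamp, value) pairs, oldest first
        self.total = 0.0

    def insert(self, timestamp, value):
        # Assumption: events arrive in non-decreasing timestamp order.
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        # The newest event is always inside the window, so the buffer is non-empty.
        return self.total / len(self.events)


# Average transaction amount over the last hour (3600 seconds).
window = SlidingWindowAverage(window_seconds=3600)
avg_1 = window.insert(0, 100.0)
avg_2 = window.insert(1800, 200.0)
avg_3 = window.insert(3600, 300.0)  # the event at t=0 expires here
```

The buffer length `len(window.events)` is the quantity that grows with higher input rates or longer windows.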
Another important aspect is the computational complexity of an operator, i.e. how much time the actual processing of an event takes. This latency can also be directly influenced by the size of the internal data structures: for example, a theta-join has to traverse all elements in the data structure, which can be time-consuming when the structure is large.
Typically, a CEP system executes a multitude of real-time analyses in parallel over transient streams of incoming data. Due to the volatile characteristics of the input streams, the long-running nature of the analyses, and the need for immediate analysis results, a CEP system is very demanding with respect to system resources such as memory, CPU, and bandwidth. The allocation of CPU and bandwidth resources mainly depends on the characteristics of the input streams and the computational complexity of the queries. The allocation of main memory mainly depends on the reference timeframe of the real-time analyses: the longer the time window of a query, the more data typically has to be kept in main memory. Thus, changing stream characteristics or queries entering/leaving the system have a high impact on the execution of the system and its resource allocation.
Due to the high security and business relevance of the analyses run by a CEP system, the system execution has to be robust and stable. Therefore, an elaborate governance technique for a CEP system, which ensures a stable system execution by monitoring and adjusting it, is of utmost importance. In this context, the term “governance” refers to actions for handling performance issues of the CEP system, comprising monitoring the CEP system in order to detect performance issues during its runtime, and adjusting and/or stabilizing the CEP system in order to resolve detected performance issues. The adjustment and stabilizing actions preferably encompass actions for handling performance issues which have already occurred, as well as performance issues which are likely to occur in the future. Such a technique has to be highly adaptive and scalable in order to adapt quickly to recent changes of the CEP system's workload. Besides the computation of suitable monitoring metrics, a vital aspect is the presentation of the governance status to the user. This visualization has to allow for a simple yet comprehensive presentation of the system status, so that the user can quickly tackle critical queries. Another vital aspect is a robust adaptation of the system load, which ensures that the system remains operational even under heavy load by suitably adapting the current query workload.
In the field of database technology, it is known to provide monitoring capabilities to observe the database status and to detect performance issues. This relates typically to monitoring system statistics, monitoring top SQL statements, monitoring current database sessions etc. This also typically includes the visualization of the acquired monitoring information. There are a multitude of tools for monitoring database systems available today. Nevertheless, as queries usually have a short runtime, analyzing query execution during query runtime is uncommon. Thus, the monitoring tools available for database management systems rely on a fundamentally different processing approach than CEP. A database system is designed for processing ad-hoc queries, which traverse a persistent data set and return all entries that fulfill the query criteria. In that context the response time of such a one-time query is the key metric for a monitoring component. By contrast, the monitoring of CEP systems has completely different requirements. In CEP, queries stay in the system and continuously produce results while transient events are streaming in and out. In that context the throughput of the query, the memory allocation of the internal data structures, and the latency are key metrics. Therefore, known monitoring approaches for database systems are hardly usable in the field of CEP.
A number of CEP engines are nowadays available on the market, including products of Software AG (Apama), StreamBase, ruleCore, IBM, TIBCO, SAP/Sybase/Coral8/Aleri, UC4 Senactive, WestGlobal Vantify, Event Zero, Active Insight, Pion CEP, Esper/EsperTech, Red Hat Drools Fusion, Oracle, Microsoft StreamInsight, Informatica, StarView, OMD OneTick CEP and Vitria M3O. Furthermore, Optimize for Infrastructure is a product of applicant designed to monitor IT products with a focus on webMethods products. It provides a set of preconfigured KPIs which are monitored and analyzed. In case of statistically significant deviations from normal KPI behavior, alerts are raised.
The document “Comprehensive QoS Monitoring of Web Services and Event-Based SLA Violation Detection” of Michlmayr et al. (MW4SOC 2009) evaluates QoS monitoring of web services and the detection of SLA violations. Event processing technology is used to detect corresponding SLA violations and send notifications to consumers.
The document “Reaktives Cloud Monitoring mit Complex Event Processing” of Hoßbach et al. (Datenbankspektrum (2012) 12) discusses a reactive monitoring of cloud environments with Complex Event Processing technologies.
The document “Dynamic Metadata Management for Scalable Stream Processing Systems” of Cammert et al. (SSPS 2007) describes a system for metadata management of stream processing systems, the academic term for CEP systems. Metadata are in this context particularly monitoring metrics such as the input rate of an operator. The document primarily focuses on the architectural integration of sensors that acquire metadata from operator nodes within a query graph, and also discusses metadata dependencies and metadata update concepts.
The document “HOLMES: An event-driven solution to monitor data centers through continuous queries and machine learning” of Teixeira et al. (DEBS 2010) addresses the monitoring of data centers by combining an Event-Driven Architecture, Complex Event Processing, and a specific unsupervised machine learning algorithm. User-defined rules are continuously checked for known problems. Anomalous patterns are computed by a machine learning algorithm that gets data normalized by a CEP engine as input.
The document “Predictive Complex Event Processing: A Conceptual Framework for Combining Complex Event Processing and Predictive Analytics” of Fülöp et al. (BCI 2012) discusses a conceptual framework combining Complex Event Processing and predictive analytics.
The document “Application-Level Performance Monitoring of Cloud Services Based on the Complex Event Processing Paradigm” of Leitner et al. (SOCA 2012) proposes to use Complex Event Processing to specify and monitor high-level performance metrics of applications. In the cloud context an existing cloud middleware is extended by event-based monitoring facilities. Corresponding components in the system emit status events which are then processed by a CEP engine to derive monitoring metrics. The main use is to enable expressive scheduling policies for the applications.
The document “Information System Monitoring and Notifications Using Complex Event Processing” of Nguyen et al. uses CEP in the context of information system monitoring and notifications. The main context is the monitoring of enterprise information systems.
U.S. Pat. No. 7,826,990 B2 discusses real-time monitoring and predictive analytics for an electrical system. A data acquisition component retrieves real-time measurements from the electrical system while a virtual system modeling engine predicts data outputs. The virtual system model is calibrated and synchronized with the real-time data to maintain an up-to-date model of the system and its sensors. An analytics engine checks for differences of real-time and predicted data output. Depending on the difference either an alert is raised or the system is re-calibrated.
U.S. patent application publications No. 2011/0283239 and 2011/0283144 concentrate on the visual analysis and debugging of CEP queries. An Event Flow Debugger is introduced that consists of multiple analysis modules that allow the debugging of a CEP query. An associated analysis UI displays the results of those analysis steps and allows for user interaction.
European patent application 2 560 106 of applicant focuses on the integration of forecasting functionality in the SQL interface of a CEP system.
European patent application 13169119.8 of applicant discusses the self-monitoring of a CEP system. It uses a feedback loop to detect several performance issues and error situations. The basic functionality is implemented by means of continuous SQL queries.
U.S. patent application publication No. 2012/0110599 of applicant discusses Quality of Service with respect to event processing. The event processing system prioritizes the processing of queries and/or events having assigned a QoS boundary like maximum reaction time or priority. The system processing is adapted so that the boundary conditions are met while at the same time increasing the processing rate.
However, none of the prior art has proposed a governance approach that addresses or solves the challenging requirements in the field of monitoring of CEP systems. Overall, the monitoring of a CEP system, as the key component of CEP system governance (i.e. both the detection of performance issues that have occurred or are likely to occur and the initiation of corrective measures), has to handle the following exemplary metrics for an operator/query: input rate, output rate, CPU utilization, latency, and allocated memory. As a CEP system is designed for high-volume, low-latency application scenarios, a corresponding monitoring component has to deal with the following requirements: high volumes of incoming events per second; varying stream characteristics, including sudden load peaks; varying workload in terms of input streams and queries entering/leaving the system; and varying numbers of clients connecting to/disconnecting from the system. Also, a governance component for CEP systems should allow for a sophisticated real-time analysis of system status information and present the results to the user in an intuitive manner. Additionally, the system should react quickly to critical or potentially critical situations by asking for user input or autonomously taking corrective actions.
The technical problem is therefore to provide a technique for handling performance issues of CEP systems which is fast, reliable and flexibly adaptable to the challenging demands of CEP systems, thereby at least partly overcoming the above-explained disadvantages of the prior art.
According to one aspect of the disclosure, this problem is solved by a system for handling performance issues of a production Complex Event Processing, CEP, system during runtime, wherein the production CEP system comprises at least one event source, at least one continuous query and at least one event sink. In the embodiment of claim 1, the system comprises:
a. at least one monitoring sensor adapted for producing a stream of status events relating to the production CEP system; and
b. a monitoring CEP system adapted for executing at least one continuous analysis query on the stream of status events to produce a stream of monitoring events, wherein the stream of monitoring events indicates performance issues of the production CEP system relating to the throughput, the latency, and/or the memory consumption of the production CEP system.
Accordingly, the system of this embodiment is based on the concept of governing a CEP system during its execution (i.e. the “production CEP system”) by means of CEP technology itself, thereby making it possible to take advantage of the powerful capabilities of CEP for the handling of performance issues. To this end, the production CEP system is monitored by a second CEP system, namely the monitoring CEP system. Status information relating to the production CEP system is collected by monitoring sensors and fed into the monitoring CEP system. For example, the at least one monitoring sensor may be attached to an operator of the at least one continuous query of the production CEP system and may be adapted for counting input and/or output events of the operator and/or for computing a memory consumption of the operator. It will be appreciated that the production CEP system can be configured to perform any sort of processing on collected sensor information, such as e.g. a CEP system operating in a logistics, manufacturing or surveillance system, a CEP system for detecting credit card fraud attempts, or the like.
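As a purely illustrative sketch of a monitoring sensor attached to an operator, the following Python code wraps a hypothetical filter operator, counts its input and output events, and emits a status event. The class names `FilterOperator` and `MonitoringSensor`, the dictionary layout of the status event, and the use of `sys.getsizeof` as a rough, shallow memory estimate are assumptions made for this example and are not prescribed by the disclosure.

```python
import sys
import time


class FilterOperator:
    """Hypothetical operator: forwards only events matching a predicate."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.state = []  # a filter keeps no events; present so the sensor can inspect it

    def process(self, event):
        return [event] if self.predicate(event) else []


class MonitoringSensor:
    """Hypothetical sensor wrapping an operator and counting its traffic."""

    def __init__(self, operator_id, operator, clock=time.time):
        self.operator_id = operator_id
        self.operator = operator
        self.clock = clock
        self.input_count = 0
        self.output_count = 0

    def process(self, event):
        self.input_count += 1
        results = self.operator.process(event)
        self.output_count += len(results)
        return results

    def status_event(self):
        # Assumption: shallow size of the operator state as a rough memory estimate.
        return {
            "operator_id": self.operator_id,
            "timestamp": self.clock(),
            "input_count": self.input_count,
            "output_count": self.output_count,
            "memory_bytes": sys.getsizeof(self.operator.state),
        }


sensor = MonitoringSensor(
    "filter-1",
    FilterOperator(lambda e: e["amount"] > 100),
    clock=lambda: 42.0,  # fixed clock to keep the example deterministic
)
for amount in (50, 150, 200):
    sensor.process({"amount": amount})
status = sensor.status_event()
```

Wrapping the operator rather than modifying it keeps the sensor independent of the operator's physical implementation, which is one possible design among several.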
Using the collected status information, the monitoring CEP system is enabled to perform complex monitoring analyses using continuous analysis queries in order to detect performance issues of the production CEP system fast, i.e. nearly in real-time, as the production CEP system executes. Using a monitoring CEP system for monitoring the production CEP system has further advantages, e.g. that additional analysis queries may be added to the monitoring CEP system, so that the monitoring can be flexibly adapted to changed circumstances.
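The idea of a continuous analysis query over status events can be sketched in Python as a generator that consumes a stream of status events and pushes out monitoring events whenever a metric looks problematic. This is a simplified stand-in for a query that a real monitoring CEP system would express in its own query language; the function name `analysis_query`, the event fields, and the threshold values are hypothetical.

```python
def analysis_query(status_events, min_output_rate, max_latency_ms, max_memory_bytes):
    """Hypothetical continuous query: yields a monitoring event per problematic status event."""
    for status in status_events:
        issues = []
        if status["output_rate"] < min_output_rate:
            issues.append("throughput")
        if status["latency_ms"] > max_latency_ms:
            issues.append("latency")
        if status["memory_bytes"] > max_memory_bytes:
            issues.append("memory")
        if issues:
            yield {
                "operator_id": status["operator_id"],
                "timestamp": status["timestamp"],
                "issues": issues,
            }


# Illustrative status events; all values are made up for this example.
status_stream = [
    {"operator_id": "join-1", "timestamp": 1, "output_rate": 500.0,
     "latency_ms": 2.0, "memory_bytes": 10_000},
    {"operator_id": "join-1", "timestamp": 2, "output_rate": 40.0,
     "latency_ms": 75.0, "memory_bytes": 12_000},
    {"operator_id": "agg-1", "timestamp": 2, "output_rate": 300.0,
     "latency_ms": 1.0, "memory_bytes": 900_000},
]
monitoring_events = list(
    analysis_query(status_stream, min_output_rate=100.0,
                   max_latency_ms=50.0, max_memory_bytes=500_000)
)
```

Because the query is an ordinary function over a stream, further analysis queries can be added alongside it without touching the production system, mirroring the flexibility described above.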
In a further aspect of the disclosure, the system may further comprise an analytics component adapted for analyzing the stream of monitoring events using stream mining and for generating at least one statistical model of the performance of the production CEP system. The system may also comprise a statistical model database adapted for storing the at least one statistical model generated by the analytics component, wherein the analytics component may then be further adapted for deriving a forecast of the status of the production CEP system based on the at least one stored statistical model and a current statistical model of the production CEP system. Accordingly, using stream mining techniques to derive statistical models representing the current, past and/or future forecasted status of the production CEP system allows for sophisticated analyses of the production CEP system's performance, as well as its probable future behaviour.
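The disclosure does not fix a particular stream mining technique or forecasting method. As one possible illustration only, the Python sketch below maintains a running mean and variance with Welford's online algorithm as a simple statistical model, keeps an earlier model in a dictionary standing in for the statistical model database, and derives a naive forecast by linear extrapolation from the stored mean to the current mean. The names `RunningModel`, `model_database`, and `forecast_mean`, as well as the extrapolation rule itself, are assumptions.

```python
class RunningModel:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def forecast_mean(stored, current):
    # Assumption: naive linear extrapolation of the change between two models.
    return current.mean + (current.mean - stored.mean)


# Model of a metric (e.g. latency values) built incrementally from a stream.
demo = RunningModel()
for value in (2, 4, 4, 4, 5, 5, 7, 9):
    demo.update(value)

# A dictionary stands in for the statistical model database.
model_database = {}
earlier = RunningModel()
for value in (100.0, 100.0):
    earlier.update(value)
model_database["memory-mb@earlier"] = earlier

now = RunningModel()
for value in (150.0, 150.0):
    now.update(value)
predicted = forecast_mean(model_database["memory-mb@earlier"], now)
```

Welford's algorithm is used here because it updates the model in constant time and memory per event, which suits stream processing; any other incremental model could be substituted.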
In another aspect of the disclosure, the system further comprises a graphical user interface (also referred to as “graph visualizer” hereinafter) adapted for indicating at least one identified performance issue of the production CEP system. The graphical user interface may be adapted for displaying the at least one continuous query of the production CEP system as an operator graph, wherein operators of the operator graph involving a performance issue are indicated. Accordingly, an operator or administrator of the production CEP system is enabled to obtain a comprehensive overview of the status of the production CEP system, which is the basis to take corrective actions in case of performance issues in a fast and reliable manner.
In yet another aspect, the system further comprises a system stabilization component adapted for indicating to a user a recommended action for resolving an identified performance issue of the production CEP system. Accordingly, the system of certain example embodiments might not only indicate the status of the production CEP system and possible performance issues, but also recommend actions for solving such performance issues. The recommended actions for resolving an identified performance issue of the production CEP system might include e.g. stopping the at least one continuous query of the production CEP system, moving the at least one continuous query of the production CEP system to another processing component, and/or modifying the at least one continuous query of the production CEP system. Modifying the at least one continuous query of the production CEP system may comprise reducing a window size of the at least one continuous query, reducing an output rate of the at least one continuous query, and/or removing event attributes not used by the at least one continuous query.
In addition or alternatively, the system stabilization component may be adapted for automatically initiating an action for resolving an identified performance issue of the production CEP system. Actions for resolving an identified performance issue of the production CEP system may be e.g. rejecting new input streams, continuous queries and/or query consumers of the production CEP system, executing a query optimizer, sorting a plurality of continuous queries of the production CEP system by memory consumption and stopping queries and/or moving queries to another processing component until memory consumption is in a reasonable range, and/or sorting input streams of the production CEP system by input rate and sorting a plurality of continuous queries of the production CEP system by output rate and stopping queries and/or moving queries to another processing component until bandwidth consumption is in a reasonable range.
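One of the automatic actions named above, sorting queries by memory consumption and stopping queries until memory consumption is in a reasonable range, can be sketched as follows. The function name `select_queries_to_stop`, the choice to stop the largest consumers first, and the fixed `memory_limit` standing in for a "reasonable range" are assumptions for illustration; the disclosure does not specify them, and moving a query to another processing component would be an alternative to stopping it.

```python
def select_queries_to_stop(memory_by_query, memory_limit):
    """Pick queries to stop, largest memory consumers first, until usage fits the limit."""
    total = sum(memory_by_query.values())
    to_stop = []
    # Assumption: descending order, so that as few queries as possible are stopped.
    ranked = sorted(memory_by_query.items(), key=lambda item: item[1], reverse=True)
    for query_id, memory in ranked:
        if total <= memory_limit:
            break
        to_stop.append(query_id)
        total -= memory
    return to_stop, total


# Illustrative memory consumption per query in megabytes (made-up values).
usage = {"q1": 500, "q2": 300, "q3": 200}
stopped, remaining = select_queries_to_stop(usage, memory_limit=600)
```

The same pattern could be applied to bandwidth by ranking input streams by input rate and queries by output rate.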
Certain example embodiments also provide a method for handling performance issues of a production Complex Event Processing, CEP, system during runtime, wherein the production CEP system comprises at least one event source, at least one continuous query and at least one event sink, wherein the method comprises the following steps: producing a stream of status events relating to the production CEP system by at least one monitoring sensor; and executing, by a monitoring CEP system, at least one continuous analysis query on the stream of status events to produce a stream of monitoring events, wherein the stream of monitoring events indicates performance issues of the production CEP system relating to the throughput, the latency, and/or the memory consumption of the production CEP system.
Further advantageous modifications of embodiments of this method are defined in further dependent claims.
Lastly, a computer program is provided comprising instructions for implementing any of the above-described methods. The computer program may be stored on a non-transitory computer-readable storage medium or the like and, when executed, may perform those and/or other instructions.