Complex computing systems are nowadays often implemented based on the so-called “event-driven architecture” (EDA), which is a software architecture pattern promoting the production, detection, and consumption of, and reaction to, events. The EDA pattern may be applied in the implementation of applications and systems which transmit events among loosely coupled software components and/or services. The components of an event-driven system may act as event producers and/or event consumers. Consumers are responsible for applying a reaction once an event is presented. Building applications and systems around an EDA allows them to be constructed in a manner that facilitates greater responsiveness, because event-driven systems are, by design, better suited to unpredictable and asynchronous environments.
Further, it is known to define quality standards to be met by computing systems and their components, respectively, in so-called service level agreements (SLAs). SLAs can be understood as negotiated agreements between two parties, namely the consumer and provider, and are typically enforced by measuring runtime values (so-called key performance indicators (KPIs)) and comparing the measured values with values specified in the SLAs. SLAs may thus be used to define technical performance requirements to be met by the distributed computing components, thereby defining operational requirements which have to be met to ensure a proper operation of the underlying computing system.
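The comparison of measured KPIs against SLA values described above can be illustrated with a minimal sketch. The KPI names, the limits, and the function `find_sla_violations` are hypothetical, and the sketch assumes that every SLA value is an upper bound; real SLAs may also specify lower bounds or ranges.

```python
from typing import Dict, Tuple


def find_sla_violations(
    sla_limits: Dict[str, float], measured_kpis: Dict[str, float]
) -> Dict[str, Tuple[float, float]]:
    """Return every KPI whose measured value exceeds its SLA limit.

    Assumption: each SLA value is an upper bound on the KPI.
    KPIs without a measurement are skipped rather than reported.
    """
    violations = {}
    for kpi, limit in sla_limits.items():
        value = measured_kpis.get(kpi)
        if value is not None and value > limit:
            violations[kpi] = (value, limit)
    return violations


# Hypothetical SLA and runtime measurements.
sla = {"response_time_ms": 200.0, "error_rate": 0.01}
measured = {"response_time_ms": 250.0, "error_rate": 0.002}
violations = find_sla_violations(sla, measured)
```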
In this context, those skilled in the art will appreciate that monitoring a computing system, e.g. by detecting the violation and/or fulfillment of SLAs, is a difficult task given the vast number of interacting computing components in real-life systems.
One technique for the monitoring and processing of events is commonly known under the term complex event processing (CEP). Generally, event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen and deriving a conclusion from them. Complex event processing (CEP) is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of CEP is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
CEP introduces several important paradigm changes with respect to classical data processing technology, where the data is relatively static and various queries can be formulated in order to retrieve the desired data. In CEP, however, the queries are comparatively fixed and fed with continuously arriving streams of data (input events). CEP queries (also referred to as “continuous queries”) typically correlate multiple input data items, look for patterns, and produce output events in the form of alerts or error messages when a certain pattern is observed.
In summary, CEP applications have to deal with highly transient event data that arrives continuously at very high rates and have to produce the corresponding output events/alerts as soon as possible, ideally nearly in real-time. This includes push-based processing (also known as data driven processing) within main memory, which denotes a data flow approach where data to process is not requested (pulled) by the processing operator on demand or using certain scheduling techniques, but is directly provided (pushed) to the processing operator when it becomes available. The U.S. Pat. No. 7,676,461 B2 and U.S. Pat. No. 7,457,728 B2 provide further background information about complex event processing (CEP).
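The push-based data flow described above might be sketched as follows. The class `PushOperator`, the dictionary-based event format, and the temperature pattern are assumptions made for illustration only and do not correspond to any particular CEP product. The point of the sketch is that the operator never requests data; the producer hands each event to it as soon as the event becomes available.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class PushOperator:
    """Hypothetical push-based operator holding one fixed pattern."""

    def __init__(
        self, pattern: Callable[[Event], bool], emit: Callable[[Event], None]
    ) -> None:
        self._pattern = pattern
        self._emit = emit

    def push(self, event: Event) -> None:
        # Called by the producer; the operator itself never polls.
        if self._pattern(event):
            self._emit({"type": "alert", "cause": event})


alerts: List[Event] = []
overheat = PushOperator(lambda e: e["temperature"] > 100, alerts.append)

# The producer pushes each reading as soon as it exists.
for reading in ({"temperature": 80}, {"temperature": 120}):
    overheat.push(reading)
```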
In the prior art, two traditional monitoring solutions have been proposed, namely local monitoring and distributed monitoring. In local monitoring, the monitoring system runs on the same local machine as the system to be monitored (cf. FIG. 1, upper half). In contrast, in distributed monitoring only local agents are integrated in the system to be monitored, while the monitoring system itself is separated from the system to be monitored (cf. FIG. 1, lower half). In this case, the monitoring system can take many distributed systems into account, wherein the agents typically do the local processing and the monitoring system polls for information from these agents (cf. the arrows “check( )” and “reply( )” in FIG. 1, lower half). Common providers of such traditional monitoring systems are for example CA Technologies, BMC, HP Software (Open View) and IBM Tivoli.
However, the two above summarized prior art solutions have several disadvantages. For example, it is required to install and run the monitoring system (which monitors the system to be monitored) as a separate component. In particular in the context of complex systems to be monitored with a huge amount of computing components, this creates a vast amount of additional workload, e.g. due to the need for constant polling for status information by the monitoring system, which could distort the behavior of the system to be monitored, thereby leading to false monitoring results. Furthermore, the monitoring system requires detailed knowledge about the system to be monitored, e.g. logic to check for the existence of processes, to properly call the component(s) to be monitored, to collect status information from reply messages, to properly analyze log files, and for the interpretation of information obtained from the system to be monitored.
Another disadvantage is that these monitoring solutions do not give any information about query responsiveness and cannot take event-specific semantics into account, unless they implement their own event-based application and monitor its behavior. In this case, from the perspective of the event processing engine, it is a normal client application. For example, for querying event streams where events and query processing support a clear semantics on time, it is important to have the events in a clearly defined order. That is, once an event e1 with timestamp t1 has been processed, an event e2 with timestamp t2<t1 cannot be processed anymore, since from the viewpoint of e2 the event e1 would have been already processed in the future and it can no longer be ensured that the processing of e2 solely uses events that did not occur after e2 has occurred. For solving such “out of sequence” scenarios, two approaches have been proposed in the prior art: ignoring e2 and optionally writing an error message into a log file, or defining a delay time and buffering events until the delay time has elapsed, so that an event e2 that arrives after e1 but carries an earlier timestamp t2 can still be processed if t1−t2 is smaller than the delay time. However, the key problems remain in these approaches: (a) the delay time only provides a buffer, so that events can still arrive too late, albeit with reduced probability; and (b) the delay time introduces latency into the query processing.
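The delay-time approach and its two remaining problems can be made concrete with a small sketch. The class `DelayBuffer` is hypothetical, and the sketch assumes that the passage of time is approximated by the largest event timestamp seen so far rather than by a wall clock. Events are held back until they are at least `delay` older than the newest event, then released in timestamp order; an event older than the last released one is too late and is dropped.

```python
import heapq
from typing import Any, List, Tuple


class DelayBuffer:
    """Hypothetical reordering buffer driven by event timestamps."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.dropped: List[Tuple[float, Any]] = []
        self._heap: List[Tuple[float, int, Any]] = []
        self._arrival = 0  # tie-breaker so payloads are never compared
        self._max_seen = float("-inf")
        self._last_released = float("-inf")

    def offer(self, timestamp: float, payload: Any) -> List[Tuple[float, Any]]:
        """Accept one event and return the events now safe to process."""
        if timestamp < self._last_released:
            # Problem (a): the buffer was not large enough for this event.
            self.dropped.append((timestamp, payload))
            return []
        heapq.heappush(self._heap, (timestamp, self._arrival, payload))
        self._arrival += 1
        self._max_seen = max(self._max_seen, timestamp)
        released = []
        # Problem (b): nothing is released before it is `delay` old.
        while self._heap and self._heap[0][0] <= self._max_seen - self.delay:
            ts, _, item = heapq.heappop(self._heap)
            self._last_released = ts
            released.append((ts, item))
        return released


buf = DelayBuffer(delay=5)
first = buf.offer(10, "e1")   # held back
second = buf.offer(8, "e2")   # out of sequence, but within the delay
third = buf.offer(16, "e3")   # releases e2 and e1 in timestamp order
fourth = buf.offer(7, "e4")   # older than the last released event
```

In this run, `e2` is reordered correctly because it arrives within the delay, whereas `e4` ends up in `buf.dropped`, and no event is released until a sufficiently newer one has arrived.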
An exemplary use case for the above explained scenario is depicted in FIG. 2, which shows an event-based system in a logistics environment. Trucks carrying goods from one place to another emit events about their current status and position (e.g. by means of a GPS sensor and a wireless connection). Because of the distributed nature of this system, an event e1 from one truck (cf. “Truck no. 1” in FIG. 2) might get delayed, e.g. due to network latency, and a later event e2 from another truck (cf. “Truck no. 2” in FIG. 2) might reach the event processing engine before the event e1 of truck no. 1. If a query makes use of both events e1 and e2, the latter event does not fit into the time sequence, i.e. it arrives out of sequence. FIG. 3 illustrates a similar use case where the event-based system is used to monitor and drive a business process. In this example, “Process Step 1” in the main process forks two independent sub processes with “Step 2.1” and “Step 2.2”, which can be executed in parallel. Similar to the use case of FIG. 2, events coming from one step can overtake events from the other step, leading again to “out of sequence” situations.
An approach for monitoring Quality of Service (QoS) aspects in event-based systems is described in “Quality of Service in Event-based Systems” of Appel et al. (cf. http://ceur-ws.org/Vol-581/gvd2010_4_3.pdf). The document proposes a QoS architecture (which is shown in FIG. 4 of the document) with a variety of layers assigned to respective quality categories in the context of a Message-Oriented Middleware (MOM) providing publish/subscribe capabilities for QoS functionalities. The events regarded in this document comprise quality data (cf. FIG. 5 of the document). Each event producer or consumer publishes events and statistical information, while the event transporting middleware itself can publish data accounting for QoS. The document discloses that the middleware can also act as a message consumer, however, the messages referred to in the document are limited to QoS information, rather than using the regular output of an event processing system to evaluate the correct functioning of the system.
Further, international patent application publication WO 2013/016246 A1 provides general background information for event-based monitoring systems. The document discloses a network monitoring and testing system comprising an event processing system. The event processing system is responsible for processing data events using a KPI configuration and for generating KPI data events.
In summary, the prior art approaches for monitoring complex distributed computing systems run the risk of errors, such as failures in transmission (e.g. incorrect or missing status information) or installation set ups (e.g. missing computing components or overload of components/interactions) of the monitoring system. These errors might in turn lead to an inefficient and/or incorrect monitoring of the computing components of the system to be monitored. SLA violations or other types of violations might be detected too late, or the detection itself might be erroneous. As explained above, a further important disadvantage is the lacking capability of prior art systems to deal with event-based semantics.
It is therefore the technical problem underlying certain example embodiments to provide an event-based monitoring system which facilitates a more efficient yet reliable monitoring, thereby at least partly overcoming the above explained disadvantages of the prior art.
According to one aspect of certain example embodiments, this problem is solved by an event-based system adapted for self-monitoring. In the embodiment of claim 1, the event-based system comprises:
    a. a complex event processing, CEP, engine, adapted for consuming and producing events in accordance with at least one continuous query;
    b. wherein the CEP engine comprises a first continuous query, adapted for producing events of a first event type and for consuming the events of the first event type;
    c. wherein the CEP engine is adapted for detecting performance issues based on the first continuous query.
Accordingly, the embodiment defines a system for runtime governance of event-based systems, i.e. for monitoring the behavior of a system in order to detect performance issues. The event-based system comprises a CEP engine which is capable of executing any number of continuous queries deployed thereto. As is known in the art, the CEP engine to this end consumes events (preferably received via an event bus), executes one or more continuous queries thereon, and produces one or more output events (which are preferably again published on the event bus). As is likewise known in the prior art, continuous queries deployed to the CEP engine are used for the processing of events, wherein various types of queries can be defined depending on the application requirements of the event-based system.
Importantly, the event-based system of certain example embodiments is adapted for self-monitoring, i.e. the embodiment is based on the concept that the monitoring system is at the same time the system to be monitored. To this end, the CEP engine comprises a first continuous query which produces events of a first event type, wherein these events are also consumed by this first continuous query. This creates a feedback loop, which is the basis for providing insight into the operational behavior of the event-based system, for ensuring that certain quality measures are met and/or counter actions can be taken when these quality measures are likely to run out of acceptable values.
Contrary to the prior art approaches described further above, the inventive system thereby avoids the need for implementing the monitoring logic in external components and thus the need for constant status polling. As a result, the system of certain example embodiments does not require any external monitoring system. The feedback loop established by means of the first continuous query serves as the basis for monitoring and detecting a variety of different types of performance issues, such as situations where event round-trip times exceed certain thresholds and situations where the event-based system suddenly becomes inoperable, e.g. due to a crash of the system. These types of error situations and their timely detection in the event-based system of certain example embodiments will be described in more detail further below.
In one aspect of certain example embodiments, events of the first event type indicate a creation time of the event, a sequence number, a host identifier of the CEP engine and/or a process identifier of the CEP engine. Accordingly, the events propagated through the feedback loop established by the first continuous query comprise data that allows an ordered sequence of so-called “ping” events to be established by means of the contained creation time and sequence number. Furthermore, host identifiers and/or process identifiers of the CEP engine executing the first continuous query might be taken into account to make sure that the first continuous query deployed to the CEP engine only consumes events which it produced itself.
In another aspect of certain example embodiments, the first continuous query is adapted for consuming an event of the first event type, and for producing an event of the first event type indicating a subsequent sequence number. Accordingly, the events of the first event type are constantly running through the feedback loop. When an event enters the CEP engine with a certain sequence number, the subsequently produced event is assigned the subsequent sequence number to guarantee the correct ordering of the events of the first event type (i.e. the “ping” events).
Preferably, the event of the first event type indicating a subsequent sequence number is produced only after a pre-defined waiting time has elapsed, the value of which may be stored in a configuration file. Accordingly, it is possible to define the pace at which events for establishing the feedback loop are created in order not to overload the event-processing system. The waiting time can be adjusted as desired, e.g. whenever a particularly timely detection of performance issues is desired, the waiting time between two feedback loop events can be decreased. Conversely, whenever the system is busy and possibly running inefficiently due to overload, the waiting time can be increased for the sake of better performance, but at the cost of less timely detection of performance issues.
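The feedback loop formed by the first continuous query may be sketched as follows. The names `PingEvent` and `PingQuery` are hypothetical, and plain Python stands in for an actual continuous query language. The sketch further assumes that the waiting time is modeled by giving the next event a creation time in the future, which the caller is expected to honor before publishing it; no real timer is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PingEvent:
    """Event of the first event type (field names are assumptions)."""
    creation_time: float
    sequence_number: int
    host_id: str
    process_id: int


class PingQuery:
    """Hypothetical stand-in for the first continuous query."""

    def __init__(self, host_id: str, process_id: int, waiting_time: float) -> None:
        self.host_id = host_id
        self.process_id = process_id
        self.waiting_time = waiting_time

    def first_ping(self, now: float) -> PingEvent:
        # Seeds the feedback loop.
        return PingEvent(now, 0, self.host_id, self.process_id)

    def on_event(self, event: PingEvent, now: float) -> Optional[PingEvent]:
        # Only consume pings that this engine instance produced itself.
        if (event.host_id, event.process_id) != (self.host_id, self.process_id):
            return None
        # The successor is due only after the waiting time has elapsed.
        return PingEvent(
            now + self.waiting_time,
            event.sequence_number + 1,
            self.host_id,
            self.process_id,
        )


query = PingQuery("host-a", 42, waiting_time=2.0)
ping0 = query.first_ping(now=100.0)
ping1 = query.on_event(ping0, now=100.5)
foreign = query.on_event(PingEvent(100.0, 7, "host-b", 1), now=101.0)
```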
In yet another aspect of certain example embodiments, the CEP engine comprises a second continuous query, adapted for consuming an event of the first event type, for determining a round-trip time of the event, and for producing an event of a second event type indicating a performance issue, if the round-trip time exceeds a pre-defined threshold value. Accordingly, in this example aspect, a further (second) continuous query is deployed to the CEP engine. This second continuous query consumes the “ping” events of the first event type, measures the round trip-time of the event, and whenever the measured round-trip time exceeds a pre-defined threshold value, e.g. defined in a configuration file, an event of a second event type is produced which indicates a violation of the round-trip time. Events of the second event type may indicate the time of detection of the performance issue, the determined round-trip time, a host identifier of the CEP engine and/or a process identifier of the CEP engine.
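A minimal sketch of the round-trip check performed by the second continuous query is given below. The names `PingEvent`, `RoundTripViolation`, and `check_round_trip` are hypothetical, and the sketch assumes that the round-trip time is the difference between the time at which the ping is received and its creation time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PingEvent:
    """Event of the first event type (field names are assumptions)."""
    creation_time: float
    sequence_number: int
    host_id: str
    process_id: int


@dataclass(frozen=True)
class RoundTripViolation:
    """Event of the second event type (field names are assumptions)."""
    detection_time: float
    round_trip_time: float
    host_id: str
    process_id: int


def check_round_trip(
    ping: PingEvent, received_at: float, threshold: float
) -> Optional[RoundTripViolation]:
    """Produce a violation event if the ping took longer than the threshold."""
    round_trip = received_at - ping.creation_time
    if round_trip > threshold:
        return RoundTripViolation(
            received_at, round_trip, ping.host_id, ping.process_id
        )
    return None


ping = PingEvent(100.0, 3, "host-a", 42)
slow = check_round_trip(ping, received_at=100.5, threshold=0.25)
fast = check_round_trip(ping, received_at=100.125, threshold=0.25)
```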
In a further aspect of certain example embodiments, the CEP engine is adapted for producing an event of a third event type indicating a performance issue, if an event is received out of sequence by the CEP engine. Preferably, the CEP engine is to this end adapted for determining that an event is received out of sequence by comparing a time stamp of the event to a time stamp of a preceding event. Accordingly, the event-based system of certain example embodiments is in this aspect also capable of detecting events that arrive out of sequence. Generally, a sequence of events can be considered as the pre-defined ordering of incoming and/or outgoing events e1, e2, . . . , en. The ordering of events can be defined via a timestamp and/or sequence number for each event (t1, t2, . . . , tn). In other words, whenever an event is received by the CEP engine with a wrong timestamp and/or sequence number, the CEP engine produces an event of the third event type which indicates a violation of the sequence order. This is of importance in order for the events to be processed correctly. It should be mentioned that for detecting sequencing issues based on the sequence number, the event emitters need to be synchronized in order to impart a valid sequence number to the events. If this type of violation is not detected or detected too late, an event e2 depending on an event e1 might be processed incorrectly, thereby leading to errors in the event-based system.
Preferably, events of the third event type indicate the time of detection of the performance issue, the event type of the event that was received out of sequence, event data, a host identifier of the CEP engine and/or a process identifier of the CEP engine.
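The timestamp comparison used for detecting out-of-sequence events might look like the sketch below. The names `SequenceChecker` and `OutOfSequenceEvent` and the dictionary-based event data are hypothetical. The sketch compares each incoming timestamp only with that of the preceding in-order event; a check based on sequence numbers would additionally require synchronized event emitters, as noted above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class OutOfSequenceEvent:
    """Event of the third event type (field names are assumptions)."""
    detection_time: float
    event_type: str
    event_data: Dict[str, Any]
    host_id: str
    process_id: int


class SequenceChecker:
    """Hypothetical detector comparing each timestamp with the preceding one."""

    def __init__(self, host_id: str, process_id: int) -> None:
        self.host_id = host_id
        self.process_id = process_id
        self._last_timestamp: Optional[float] = None

    def check(
        self, event_type: str, timestamp: float, data: Dict[str, Any], now: float
    ) -> Optional[OutOfSequenceEvent]:
        if self._last_timestamp is not None and timestamp < self._last_timestamp:
            # The event is older than its predecessor: report, do not advance.
            return OutOfSequenceEvent(
                now, event_type, data, self.host_id, self.process_id
            )
        self._last_timestamp = timestamp
        return None


checker = SequenceChecker("host-a", 42)
# Truck no. 2's later event arrives first; truck no. 1's event was delayed.
in_order = checker.check("position", 20.0, {"truck": 2}, now=21.0)
late = checker.check("position", 15.0, {"truck": 1}, now=22.0)
```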
The event-based system may further comprise a monitoring client, adapted for consuming events of the first event type, and for detecting that the CEP engine is out of operation. Accordingly, it might be of further interest to detect that the event-based system is no longer operating or has shut down, e.g. due to a power failure. Therefore, a monitoring client might be employed which consumes events of the first event type (i.e. “ping” events) and detects when the event-based system is out of operation (e.g. when the monitoring client has not received a “ping” event within a pre-defined waiting time).
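The monitoring client can be understood as a simple watchdog, sketched below. The class `MonitoringClient` and its method names are hypothetical, and the current time is passed in explicitly so that the example does not depend on a real clock. The sketch also assumes that, before the first ping arrives, the client's start time serves as the reference point.

```python
class MonitoringClient:
    """Hypothetical watchdog that consumes ping events."""

    def __init__(self, waiting_time: float, started_at: float) -> None:
        self.waiting_time = waiting_time
        # Until the first ping arrives, measure silence from start-up.
        self._last_ping_at = started_at

    def on_ping(self, now: float) -> None:
        """Record the arrival of an event of the first event type."""
        self._last_ping_at = now

    def engine_out_of_operation(self, now: float) -> bool:
        """True if no ping was received within the waiting time."""
        return now - self._last_ping_at > self.waiting_time


client = MonitoringClient(waiting_time=5.0, started_at=0.0)
client.on_ping(now=2.0)
alive = client.engine_out_of_operation(now=6.0)   # 4.0 since last ping
down = client.engine_out_of_operation(now=8.0)    # 6.0 since last ping
```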
In another aspect of certain example embodiments, the event-based system is further adapted for measuring the processing time of processing a received event, and for allocating additional processing resources if the measured processing time exceeds a pre-defined threshold value. Accordingly, the event-based system can flexibly adapt its processing resources, in case it is noticed that it takes too long to process the events.
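One way to picture this aspect is the sketch below. The class `ElasticProcessor` is hypothetical, and the allocation of additional processing resources is represented merely by incrementing a worker counter; how resources are actually allocated is not specified here. The clock is injectable so that the example is deterministic, with `time.perf_counter` as the default.

```python
import time
from typing import Any, Callable


class ElasticProcessor:
    """Hypothetical processor that scales up when processing is too slow."""

    def __init__(
        self,
        handler: Callable[[Any], Any],
        threshold: float,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self.handler = handler
        self.threshold = threshold
        self.clock = clock
        self.workers = 1  # stand-in for the allocated processing resources

    def process(self, event: Any) -> Any:
        start = self.clock()
        result = self.handler(event)
        elapsed = self.clock() - start
        if elapsed > self.threshold:
            # Placeholder for allocating additional processing resources.
            self.workers += 1
        return result


# Fake clock: the first event takes 0.5 s, the second 0.125 s.
fake_clock = iter([0.0, 0.5, 1.0, 1.125]).__next__
processor = ElasticProcessor(str.upper, threshold=0.25, clock=fake_clock)
first_result = processor.process("e1")
workers_after_first = processor.workers
second_result = processor.process("e2")
workers_after_second = processor.workers
```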
In yet another aspect, events of the first event type are produced only after a pre-defined waiting time has elapsed, and the monitoring system is adapted for automatically adjusting the pre-defined waiting time. The adjusting of the pre-defined waiting time may be based on performance statistics of the CEP engine, such as the number of queries presently executing. Accordingly, the waiting times employed in the present system can be flexibly and automatically adjusted e.g. by calculating performance statistics.
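The automatic adjustment of the waiting time could, for instance, follow a policy like the one sketched below. The function `adjust_waiting_time`, the doubling and halving rule, and the default bounds are assumptions chosen for illustration; the only statistic used is the number of queries presently executing.

```python
def adjust_waiting_time(
    current: float,
    active_queries: int,
    busy_threshold: int,
    min_wait: float = 0.5,
    max_wait: float = 60.0,
) -> float:
    """Return a new waiting time between two feedback loop events.

    Assumed policy: double the waiting time while the engine is busy,
    halve it otherwise, and keep the result within [min_wait, max_wait].
    """
    if active_queries > busy_threshold:
        proposed = current * 2   # busy: ping less often
    else:
        proposed = current / 2   # idle: detect issues sooner
    return max(min_wait, min(max_wait, proposed))


busy = adjust_waiting_time(2.0, active_queries=50, busy_threshold=20)
idle = adjust_waiting_time(2.0, active_queries=5, busy_threshold=20)
```

The bounds reflect the trade-off stated above: a shorter waiting time gives more timely detection, whereas a longer one reduces the load on a busy system.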
Certain example embodiments relate to a method for self-monitoring an event-based system by operating the event-based system in accordance with any of the above-described aspects. Certain example embodiments provide a computer program comprising instructions for implementing the above-described method.