Traditional relational database management systems (DBMSs) have been researched for over thirty years and are used for a wide range of applications. One of their key features is the storage of data as a collection of persistent “relations”, often referred to as tables. A relation is defined as a set of tuples that have the same attributes, each tuple comprising an ordered set of one or more data elements. In a DBMS, a table (or relation) is organised into rows and columns. Each row of the table represents a tuple and each column represents an attribute common to all tuples (rows).
Another key feature of a DBMS is a set of well-defined operations (or “queries”) that can be issued by any DBMS client in order to read, write, delete or modify the stored data. Structured Query Language (SQL) is the most widespread query language for this purpose, although it is often enriched with proprietary add-ons.
The conventional DBMS is also characterised by having highly optimised query processing and transaction management components, as illustrated in FIG. 1. A query from a DBMS client 1 is received by the DBMS 2, parsed by a query parsing unit 3 of the DBMS, and analysed in order to verify that it is both syntactically and semantically correct. Once this is done, a query plan is generated by the DBMS's query planner 4. A query plan is a set of step-by-step instructions defining how the query is to be executed, whose details depend on how the concrete DBMS is implemented. The query plan aims to optimise, for example, the number of accesses to the physical storage device 5 (e.g. a hard disk) in order to speed up the execution time. Transaction management secures the so-called “ACID” properties (i.e. “Atomicity, Consistency, Isolation and Durability”).
Queries that are processed by a traditional DBMS are termed “ad hoc” queries. That is, the query is sent to the DBMS and the response to that query, which is both valid at that specific moment and complete, is sent back. Traditional (ad hoc) queries are typically specified in a particular format, optimised, and evaluated once over a “snapshot” of a database; in other words, over a static view of the data in the database. The stored data which is to be operated on during processing of the query must be stable, i.e. not subject to any other ongoing database transaction since, for example, a high ratio of write queries can harm the performance of the DBMS serving read queries.
However, in recent years, there has emerged another class of data intensive applications (such as those intended for sensor data processing, network management in telecommunications networks and stock trading) that need to process data at a very high input rate. Moreover, these applications need to process data that is typically received continuously over long periods of time in the form of a data stream. As a result, the amount of data to be processed can be unbounded. In principle, stream data could be processed by a traditional database management system, by loading incoming stream data into persistent relations and repeatedly executing the same ad hoc queries over these relations.
However, there are several problems with this approach. Firstly, the storage of stream data, indexing (as needed) and querying would add considerable delay (or latency) in response time, which may not be acceptable to many stream-based applications. At the core of this mismatch is the requirement that data needs to be persisted on a secondary storage device 5, such as a hard disk typically having a high storage capacity and high latency, before it can be accessed and processed by a DBMS 2 implemented in main memory, such as a RAM-based storage device having a lower latency but typically lower storage capacity.
In addition, the above-described “snapshot” approach to evaluating stream data may not always be appropriate since the changes in values over an interval can be important for stream processing applications, for example where the application needs to make a decision based on changes in a monitored temperature.
Furthermore, the inability to specify Quality of Service (QoS) requirements for processing a query (such as latency or response time) to a traditional DBMS makes its usage less acceptable for stream-based applications.
It will therefore be appreciated that the characteristics of the conventional DBMS (i.e. the passive role it plays, the need for standardised query formats and associated predefined query plans, stable data, etc.) make the DBMS unsuitable for serving applications that require the processing of huge amounts of data. An example is an application performing Complex Event Processing (CEP) over a stream of data arriving periodically or continuously, from one or a plurality of data sources (e.g. sensors emitting their measured values, servers sending real-time stock rates, etc.), whose number is unpredictable.
Hence, the techniques developed for DBMSs need to be re-examined to meet the requirements of applications that use stream data. This re-examination has given rise to a paradigm shift along with new approaches and extensions to current techniques for query modelling, optimization, and data processing in order to meet the requirements of an increasing number of stream-based applications. Systems that have been developed to process data streams to meet the needs of stream-based applications are widely known as data stream management systems (DSMSs).
FIG. 2 shows a DSMS 10 together with a DSMS client 20. Queries for DSMS 10 are also expressed in a standard language similar to SQL (e.g. Continuous Query Language (CQL) and its derivatives) and a query plan is also produced, by a query parsing/planning unit 6. However, the queries executed in a DSMS are termed “continuous queries” (CQs) and differ from their DBMS counterparts principally by being specified once (commonly via provisioning, e.g. via operation and maintenance interfaces) and then evaluated repeatedly against new data over a specified life span or as long as there is data in the input stream(s) 11.
More specifically, a continuous query can be regarded as a query plan which consists of detailed algorithms for implementing a (typically large) number of relational operators, such as “select”, “project”, “join” and other “aggregation” operators, which are interconnected in a network. These operators act on data elements as they arrive and cannot assume the data stream to be finite. Some operators, for example “select” and “project”, can act on data in a stream in turn to produce an output continuously. On the other hand, other operators, such as “join” and “sort”, naturally operate on complete data sets and will therefore produce no output until the data stream ends, thus acting as “blocking” operators; in order to output results continuously, such blocking operators need to be converted into non-blocking operators, and this is often achieved by employing the concept of a “window” to produce time-varying, finite relations out of a stream.
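The role of a window in turning a blocking aggregation into a non-blocking one can be illustrated with a minimal Python sketch. It assumes a time-based tumbling (non-overlapping) window and tuples that arrive in timestamp order; the function name `tumbling_average`, the `(timestamp, value)` tuple layout and the sample readings are hypothetical and are not taken from any particular DSMS.

```python
from typing import Iterable, Iterator, Tuple


def tumbling_average(
    stream: Iterable[Tuple[float, float]], width: float
) -> Iterator[Tuple[float, float]]:
    """Yield (window_start, average) for each non-empty time window.

    Assumption: `stream` yields (timestamp, value) pairs in non-decreasing
    timestamp order. A window's result is emitted as soon as a tuple that
    belongs to a later window arrives, so output is produced without
    waiting for the (possibly unbounded) stream to end.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in stream:
        start = (timestamp // width) * width
        if current_start is not None and start != current_start:
            # The previous window is complete: emit it and reset the state.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += value
        count += 1
    # Only reached for a finite input: flush the last, partially filled window.
    if count:
        yield current_start, total / count


# Hypothetical readings: (timestamp in seconds, temperature).
readings = [(0.0, 20.0), (1.0, 22.0), (2.5, 30.0), (3.0, 26.0), (4.2, 28.0)]
window_averages = list(tumbling_average(readings, width=2.0))
```

Only the running sum and count of the current window are held in memory, which is why the operator remains usable on an unbounded stream.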
Thus, the query plan associated with a continuous query is usually a complex entity consisting of a large number of operators, each operator being associated with a memory queue (or buffer) for buffering tuples during bursty input periods (in order not to lose incoming or partially processed data), and often requiring resources (primarily main memory) to hold state information to perform window-based computations successfully. For example, the “symmetric hash join” operator requires hash tables for its two relations for the duration of the window.
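The state held by a symmetric hash join can be sketched as follows. This is a simplified illustration that assumes the two inputs arrive interleaved as `(side, key, payload)` events and that omits window expiry (a real operator would evict entries once they fall outside the window); the function name, the event layout and the sample payloads are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Assumed event layout: (side, join_key, payload), where side is "L" or "R".
Event = Tuple[str, Any, Any]


def symmetric_hash_join(events: Iterable[Event]) -> Iterator[Tuple[Any, Any, Any]]:
    """Yield (key, left_payload, right_payload) as soon as a match exists.

    One hash table is kept per input. Each arriving tuple is first inserted
    into its own side's table and then probes the opposite table, so results
    are produced incrementally instead of after either input has ended.
    Window-based eviction is deliberately left out of this sketch.
    """
    tables: Dict[str, Dict[Any, List[Any]]] = {
        "L": defaultdict(list),
        "R": defaultdict(list),
    }
    for side, key, payload in events:
        other = "R" if side == "L" else "L"
        tables[side][key].append(payload)
        # .get avoids creating empty entries in the probed table.
        for match in tables[other].get(key, []):
            if side == "L":
                yield key, payload, match
            else:
                yield key, match, payload


# Hypothetical interleaving of a temperature stream (L) and a humidity stream (R).
events = [
    ("L", 1, "t=21"),
    ("R", 1, "h=40"),
    ("R", 2, "h=55"),
    ("L", 2, "t=25"),
    ("L", 1, "t=22"),
]
joined = list(symmetric_hash_join(events))
```

Both tables grow with every arriving tuple, which makes concrete why such an operator needs main memory for the duration of the window.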
Thus, continuous queries are essentially long-running queries that produce output continuously. The input data stream(s) 11 are received by an input adapter 12 and then passed to the continuous query processor 13. The result of executing the continuous queries is output via the output adapter 14 as an output data stream 15, possibly with differing rates and schema as compared to the corresponding input data stream(s). The data items in the input data stream(s) 11 can be regarded as “raw events” while those in the output streams 15, which generally convey more abstract information as a result of the CQ execution, can be regarded as “computed events”.
Accordingly, a DSMS is not required to store in a permanent manner all the data from the input streams (although it might store some of the received data in certain cases, at least temporarily, for example whenever historical data is needed). Data is extracted and processed by a DSMS as it is received continuously from the incoming streams, and output streams are produced as a result of the execution of CQs in a substantially continuous manner. Thus, in contrast to the traditional DBMS, a DSMS assumes an active role, in that it does not need to receive an (explicit) read query from a database client before sending data to the client based on the stream data the DSMS currently holds.
Incoming streams 11 to, and outgoing streams 15 from, the DSMS 10 can be regarded as an unbounded sequence of data items that are usually ordered either explicitly by a time-based reference such as a time stamp, or by the values of one or more data elements (e.g. the packet sequence identifier in an IP session). A data item of a data stream can be regarded as a tuple of a relation. In this context, tuples comprise a known sequence of fields and essentially correspond with application-specific information. Hereinafter, the terms “data item” and “tuple” are used interchangeably.
One example of tuples that can be received by a DSMS within incoming data streams is shown in FIG. 3. In this case, a sensor having a unique ID sends, in a continuous manner (e.g. every second), a measure of the temperature, humidity and CO level of its surroundings. This constitutes a stream of data. A large number of sensors (even hundreds of thousands) can feed a DSMS, which can produce one or more output data streams based on the received incoming data streams. For example, the CQ execution by a DSMS over incoming data streams comprising tuples as illustrated in FIG. 3 can produce an output data stream for a certain DSMS client application that contains the sensor identity, CO level and time information, only when the monitored temperature exceeds a certain threshold.
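The continuous query described in this example can be sketched in Python as a generator that is specified once and then applied to every tuple as it arrives. The field names of `Reading` and `CoAlert`, the function name `co_when_hot`, the threshold of 30.0 and the sample values are all hypothetical; the actual tuple layout is the one shown in FIG. 3.

```python
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """Assumed layout of an input tuple (field names are illustrative)."""
    sensor_id: str
    timestamp: int
    temperature: float
    humidity: float
    co_level: float


class CoAlert(NamedTuple):
    """Assumed layout of an output tuple: identity, CO level and time."""
    sensor_id: str
    co_level: float
    timestamp: int


def co_when_hot(stream: Iterable[Reading], threshold: float) -> Iterator[CoAlert]:
    """Select tuples whose temperature exceeds `threshold` and project
    them onto (sensor_id, co_level, timestamp)."""
    for reading in stream:
        if reading.temperature > threshold:
            yield CoAlert(reading.sensor_id, reading.co_level, reading.timestamp)


# Hypothetical input: one tuple per sensor per second.
incoming = [
    Reading("s1", 1, 24.5, 40.0, 3.0),
    Reading("s2", 1, 31.0, 35.0, 9.0),
    Reading("s1", 2, 32.5, 38.0, 4.0),
]
alerts = list(co_when_hot(incoming, threshold=30.0))
```

Because the generator keeps no per-tuple state, it corresponds to the non-blocking “select” and “project” operators and can run over an unbounded input.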
A typical DSMS deployment is illustrated in FIG. 4, where the DSMS receives data from one or more incoming data streams 11, executes a continuous query against the received data and sends at least some of the processing results to a plurality of DSMS clients 20-1 to 20-N. Each DSMS client applies its own application logic to process the received data stream, and triggers one or more actions when the processing results satisfy predetermined criteria (e.g. the values reported by one or more sensors depart from certain pre-determined ranges, or an average value of a monitored variable exceeds a threshold). An action can comprise sending a message to another application server. For example, the DSMS client may issue an instruction for sending an SMS or activating an alarm, or a message towards a certain device to change an operational parameter of the device. The actions taken by the client applications 20-1 to 20-N may have to fulfil strict requirements in terms of latency after a combination of input events.
In almost all practical applications the characteristics of the input streams are unpredictable. According to configured QoS settings, the DSMS deployment shown in FIG. 4 might produce output events even in the case of incomplete or out-of-order sequences of input events, or in cases where the rate of events generated by different inputs 11 is very diverse.
The bursty nature of the incoming stream(s) can prevent DSMSs from producing correlated outputs when the bursts of different inputs are not synchronised. Even in the case of a single input, this can produce a sparse stream. For example, a temperature sensor might store locally a number of temperature readings for transmission in order to save battery power, instead of producing a periodic sequence of data items. High-volume, high-speed data streams may overwhelm the capabilities of the stream processing system.
This circumstance will force a DSMS to wait for the arrival of input data items to process in order to be able to produce an output. This, in turn, might prevent the DSMS from satisfying certain QoS requirements. In the example of the temperature sensor provided above, an application expecting to trigger an alarm when a temperature reading is above a given threshold might produce an outdated alarm, depending on the period at which the temperature sensor sends sets of temperature readings.
Two key parameters for processing continuous, potentially unbounded data streams are: (i) the amount of memory available; and (ii) the processing time required by the query processor, as will now be explained.
(i) Memory is a precious resource and constitutes an important design constraint. As noted above, a DSMS uses the concept of a “window”, which is essentially a time-based or tuple-based buffer in which incoming data items are stored until all the data required have become available. In some cases the defined windows are not wide enough to collect all the information required to build the tuples. In reality, the probability of collecting complete tuples decreases with the number of input streams and the degree to which the frequencies of the data items in the different streams diverge. This may lead to an inability to collect enough information to perform the data analysis, and a consequent potential compromise of the results.
(ii) Response time (in other words, the latency introduced by processing incoming data streams and producing the results after executing the corresponding query) is another crucial characteristic that a DSMS tends to manage more effectively than competing technologies (e.g. in-memory databases). When the available resources are limited and time is critical, minimizing the response time is a must.
In many real-world streams, corrections or updates to previously processed data are available only after the fact. Stream sources (such as sensors, a web server, etc.) as well as the communication infrastructure connecting them to the DSMS can be highly volatile and unpredictable. As a result, data may arrive late or out of time, or even go missing during its transmission. In all these cases, applications would need to deal with incomplete input data, and may produce imperfect results unless an alternative mechanism is available.
In some scenarios, for example those involving databases which serve telecommunication networks where the various data streams have a range of data arrival rates, the performance of current data stream analysis systems is unsatisfactory. Waiting for all data to become available introduces some latency into the CQ execution process. For some applications, such as those where response time is more important than high accuracy, a delay in generating the CQ result can be impractical and even risky.
A possible solution to address this problem is to use so-called “sketches” associated with each input data stream. An example of this approach is provided in U.S. Pat. No. 7,483,907 B2. According to this approach, when a data stream is arriving late, a sketch summarising the data stream is used instead. These sketches approximate the underlying streams with reasonable accuracy. Another possible method is to generate histograms that describe the distribution of each data stream.
An example of a DSMS which handles input streams having disparate data arrival rates with the use of sketches will now be described with reference to FIG. 5.
FIG. 5 shows a deployment of a DSMS for controlling the temperature of a computer room so as to avoid an overheating of IT equipment therein and reduce the risk of a fatal fire. In this example, the DSMS 10 receives data from input data streams 11-1, 11-2 and 11-3 via input adapter 12, and generates and outputs two output data streams, 15-1 and 15-2, via output adapter 14 by executing continuous queries CQ1 and CQ2 over the input data streams. The output data streams 15-1 and 15-2 are provided to applications App1 and App2 on an application server or a user terminal 20. In the present example, the output data streams 15-1 and 15-2 may be used by the DSMS client 20 to indicate an emergency situation (e.g. by sending an SMS to a mobile terminal, or activating a bell within a room) and/or to generate commands for operating a cooling device.
The DSMS deployment of the present example comprises a DSMS 10 and three sensors that are provided at appropriate locations inside the room, namely sensor 1, sensor 2 and sensor 3. The DSMS 10 monitors parameters concerning the atmospheric conditions within the room, as received via the input data streams 11-1 to 11-3. In particular, two of the sensors, namely sensor 1 and sensor 2, monitor the temperature inside the room and generate respective data streams comprising the temperature readings, i.e. streams 11-1 and 11-2, respectively. The remaining sensor, sensor 3, records the ambient humidity (expressed as a percentage) and provides the humidity readings to the DSMS via data stream 11-3. The sensors are connected by any suitable means to the DSMS and thus send their respective data streams continuously to the DSMS 10.
The DSMS processes the received information and generates two output data streams. In particular, the DSMS 10 checks whether or not the measured temperature within the room exceeds a certain limit and, if so, generates an output data stream, 15-1, that causes an air conditioning system within the room to be activated (or its thermostat setting to be reduced). However, if the temperature is much higher than the limit, the DSMS client 20 performs the same action and additionally raises an alarm alerting the user or another application to the possibility of a fire in the room being monitored. Another possible action is to increase the speed of the fan responsible for circulating fresh air into the room.
Although the DSMS deployment of the present example involves three input data streams, a typical DSMS will have to execute continuous queries using data items received simultaneously via a substantially higher number of input data streams, and output more than two data streams.
As noted above, the illustrated DSMS analyses the incoming data streams 11-1 to 11-3 by executing continuous queries CQ1 and CQ2 against them. In the present example, the continuous queries are expressed in pseudocode (using no particular formalism) as follows:
CQ1: If Avg(Tsensor_1, Tsensor_2) > 26° AND Hsensor_3 < 80% Then Activate Cooling
CQ2: If Avg(Tsensor_1, Tsensor_2) > 28° AND Hsensor_3 < 30% Then Raise Alarm
Thus, continuous query CQ1 requires that whenever the average of the temperatures Tsensor_1 and Tsensor_2 recorded by the two temperature sensors is greater than 26°, and the humidity Hsensor_3 inside the room (as recorded by sensor 3) is lower than 80%, a data stream is to be generated for causing the DSMS client 20 to switch ON the cooling system until the average temperature decreases to below 26°.
Continuous query CQ2 requires that whenever the average temperature recorded by the two temperature sensors (i.e. sensor 1 and sensor 2) is higher than 28° and the humidity level measured by sensor 3, i.e. Hsensor_3, is lower than 30%, the DSMS 10 is to generate a data stream which causes the DSMS client 20 to alert a user of a possible fire in the room being monitored.
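For illustration, the two continuous queries can be written as a small Python function that is applied to one set of simultaneous readings. The function name `evaluate_queries` and the action labels `ACTIVATE_COOLING` and `RAISE_ALARM` are hypothetical; the thresholds are those given for CQ1 and CQ2 above. The sketch evaluates a single combination of readings and does not model how long the cooling system stays switched on.

```python
from typing import List


def evaluate_queries(
    t_sensor_1: float, t_sensor_2: float, h_sensor_3: float
) -> List[str]:
    """Return the actions triggered by CQ1 and CQ2 for one set of readings.

    Temperatures are in degrees and humidity is a percentage, as in the
    pseudocode. The action labels are illustrative only.
    """
    average_temperature = (t_sensor_1 + t_sensor_2) / 2
    actions = []
    # CQ1: average temperature above 26 degrees and humidity below 80 %.
    if average_temperature > 26 and h_sensor_3 < 80:
        actions.append("ACTIVATE_COOLING")
    # CQ2: average temperature above 28 degrees and humidity below 30 %.
    if average_temperature > 28 and h_sensor_3 < 30:
        actions.append("RAISE_ALARM")
    return actions


warm_room = evaluate_queries(27.0, 27.0, 50.0)
hot_dry_room = evaluate_queries(29.0, 30.0, 20.0)
cool_room = evaluate_queries(25.0, 25.0, 50.0)
```

Note that every call needs all three arguments: neither query can be evaluated while the humidity reading is still outstanding.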
However, a problem arises when, for example, stream 11-3 provides its data at a lower data rate than streams 11-1 and 11-2. In this case, continuous queries CQ1 and CQ2 cannot be executed before the humidity level readings have become available to the DSMS 10.
Sketching techniques summarise all the tuples as a small number of random variables. Thus, they project the value of an input stream using, for example, random functions. A suitable sketch for this example (considering that the sketch predicts the value of stream 11-3 with a certain level of accuracy) would be the average of the previously seen five values of that stream. It is noted that the selection of five values is only an example, and a more accurate approach would estimate a function using statistical techniques.
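The simple sketch mentioned above, i.e. the average of the five most recently seen values, might be implemented as follows. The class name `MovingAverageSketch`, its method names and the sample humidity values are hypothetical, and the code covers only this plain moving average rather than a statistically fitted function.

```python
from collections import deque
from typing import Optional


class MovingAverageSketch:
    """Estimate a missing stream value as the mean of the last `size` values."""

    def __init__(self, size: int = 5) -> None:
        # A bounded deque discards the oldest value automatically.
        self._recent = deque(maxlen=size)

    def observe(self, value: float) -> None:
        """Record a value that actually arrived on the stream."""
        self._recent.append(value)

    def estimate(self) -> Optional[float]:
        """Return the estimate, or None if nothing has been observed yet."""
        if not self._recent:
            return None
        return sum(self._recent) / len(self._recent)


# Hypothetical humidity readings (%) received on stream 11-3 so far.
sketch = MovingAverageSketch(size=5)
for humidity in [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]:
    sketch.observe(humidity)

# Used in place of a humidity reading that has not yet arrived.
estimated_humidity = sketch.estimate()
```

The estimate only stands in for the late value; the continuous queries still have to be executed with it afterwards.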
Nevertheless, using sketching techniques to generate estimated values of missing data elements in input data streams does not preclude executing the corresponding continuous query/queries afterwards, which entails a substantial amount of data processing by the DSMS.
A further problem is that histograms and random sampling are useful for performing data summarization and selectivity estimation for only one input stream or parameter. For instance, in the example of FIG. 5, the estimated value for a missing data element from the stream 11-3 can be provided by considering an average of previously seen values. However, considering that data stream applications typically monitor multiple input streams and aggregations at the same time, this approach would require using many different types of sketches (one for each stream), and therefore introduces a large overhead. Furthermore, the probability of failing on the estimations increases since multiple items of input data are estimated separately, without considering the rest of the input streams. In the present example, this situation might appear when two streams (for example, stream 11-1 and stream 11-3) are missing and the corresponding sketches available for each of them are used instead.
In addition, sketching methods work well with numerical values. However, in the present example, if data in stream 11-3 (which provides humidity readings) take the form of text labels (e.g. “very-high”, “high”, “normal”, “low”, “dry”, “very-dry” etc.) then statistical methods based on numerical calculations alone will not be sufficient to predict future values, as extra interpretation logic will be required for the processing semantics of non-numeric values.
Thus, there remains a considerable need (especially in time-critical DSMS applications) to reduce data processing latencies in the DSMS in order to provide a fast response, particularly in instances where one or more values from one or more input data streams are not available for executing the one or more continuous queries.