Traditional relational database management systems (DBMSs) have been researched for over thirty years and are used for a wide range of applications. One of their key features is the storage of data as a collection of persistent “relations”, often referred to as tables. A relation is defined as a set of tuples that have the same attributes, each tuple representing a data element and the information about that element. In a DBMS, a table (or relation) is organized into rows and columns. Each row of the table represents a tuple and each column represents an attribute common to all tuples (rows).
Another key feature of a DBMS is a set of well-defined operations (or “queries”) that can be issued by any DBMS client in order to read, write, delete or modify the stored data. Structured Query Language (SQL) is the most widespread query language for this purpose, although it is often enriched with proprietary add-ons.
The conventional DBMS is also characterised by having highly optimized query processing and transaction management components, as illustrated in FIG. 1. A query from a DBMS client 1 is received by the DBMS 2, and is processed by a query parsing unit 3 of the DBMS to parse the query and analyse it in order to verify that it is both syntactically and semantically correct. Once this is done, a query plan is generated by the DBMS's query planner 4. A query plan is a set of step-by-step instructions defining how the query is to be executed, whose details depend on how the concrete DBMS is implemented. The query plan aims to optimise, for example, the number of accesses to the physical storage device 5 (e.g. a hard disk) in order to speed up the execution time. Transaction management secures the so-called “ACID” properties (i.e. “Atomicity, Consistency, Isolation and Durability”).
Queries that are processed by a traditional DBMS are termed “ad hoc” queries. That is, the query is sent to the DBMS and the response to that query, which is both valid at that specific moment and complete, is sent back. Traditional (ad hoc) queries are typically specified in a particular format, optimized, and evaluated once over a “snapshot” of a database; in other words, over a static view of the data in the database. The stored data which is to be operated on during processing of the query must be stable, i.e. not subject to any other ongoing database transaction since, for example, a high ratio of write queries can harm the performance of the DBMS serving read queries.
In recent years, there has emerged another class of data intensive applications (such as those intended for sensor data processing, network management in telecommunications networks and stock trading) that need to process data at a very high input rate. Moreover, these applications need to process data that is typically received continuously over long periods of time in the form of a data stream. As a result, the amount of data to be processed can be unbounded. In principle, stream data could be processed by a traditional database management system, by loading incoming stream data into persistent relations and repeatedly executing the same ad hoc queries over these relations.
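The approach mentioned at the end of the preceding paragraph (loading stream data into a persistent relation and repeatedly executing the same ad hoc query) can be sketched as follows. This is only an illustrative sketch: the table name `readings`, its columns, and the function `load_and_query` are assumptions introduced here, and an in-memory SQLite database stands in for a disk-backed DBMS purely to keep the example self-contained.

```python
import sqlite3


def load_and_query(readings, threshold):
    """Store each incoming reading, then re-run the same ad hoc query.

    `readings` is an iterable of (sensor_id, temperature) pairs (hypothetical
    schema). Returns the query result obtained after each insertion.
    """
    # In-memory database used only for illustration; a real DBMS would
    # persist the relation on secondary storage.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, temperature REAL)")
    results = []
    for sensor_id, temperature in readings:
        # Step 1: the stream item is written into the persistent relation.
        conn.execute(
            "INSERT INTO readings VALUES (?, ?)", (sensor_id, temperature)
        )
        conn.commit()
        # Step 2: the same ad hoc query is evaluated again over the
        # current snapshot of the relation.
        count = conn.execute(
            "SELECT COUNT(*) FROM readings WHERE temperature > ?",
            (threshold,),
        ).fetchone()[0]
        results.append(count)
    conn.close()
    return results
```

Every arriving item triggers a write followed by a full re-evaluation of the query, which illustrates why this store-then-query pattern adds work per item as the input rate grows.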
However, there are several problems with this approach. Firstly, the storage of stream data, indexing (as needed) and querying would add considerable delay (or latency) in response time, which may not be acceptable to many stream-based applications. At the core of this mismatch is the requirement that data needs to be persisted on a secondary storage device 5, such as a hard disk typically having a high storage capacity and high latency, before it can be accessed and processed by a DBMS 2 implemented in main memory, such as a RAM-based storage device having a lower latency but typically lower storage capacity.
In addition, the above-described “snapshot” approach to evaluating stream data may not always be appropriate since the changes in values over an interval can be important for stream processing applications, for example where the application needs to make a decision based on changes in a monitored temperature. Furthermore, the inability to specify Quality of Service (QoS) requirements for processing a query (such as latency or response time) to a traditional DBMS makes its usage less acceptable for stream-based applications.
It will therefore be appreciated that the characteristics of the conventional DBMS (i.e. the passive role it plays, the need for standardised query formats and associated predefined query plans, stable data, etc.) make the DBMS unsuitable for serving applications that require the processing of huge amounts of data. An example is an application performing Complex Event Processing (CEP) over a stream of data arriving periodically or continuously, from one or a plurality of data sources (e.g. sensors emitting their measured values, servers sending real-time stock rates, etc.), whose number is unpredictable.
Hence, the techniques developed for DBMSs need to be re-examined to meet the requirements of applications that use stream data. This re-examination has given rise to a paradigm shift along with new approaches and extensions to current techniques for query modelling, optimization, and data processing in order to meet the requirements of an increasing number of stream-based applications. Systems that have been developed to process data streams on a real-time basis to meet the needs of stream-based applications are widely known as Data Stream Management Systems (DSMSs).
FIG. 2 shows a DSMS 10 together with a DSMS client 20. Queries for DSMS 10 are also expressed in a standard language similar to SQL (e.g. Continuous Query Language (CQL) and its derivatives) and a query plan is also produced. However, the queries executed in a DSMS are termed “continuous queries” (CQs) and differ from their DBMS counterparts principally by being specified once (commonly via provisioning, e.g. via operation and maintenance interfaces) and then evaluated repeatedly against new data over a specified life span or as long as there is data in the input stream(s) 11. A continuous query can be regarded as a set of one or more logical operations (e.g. filter operations, join operations etc.) that are applied to data in the input data stream(s). Thus, continuous queries are long-running queries that produce output continuously. The result of executing a CQ is therefore an output data stream 12, possibly with differing rates and schema as compared to the corresponding input data stream(s). The data items in the input data stream(s) 11 can be regarded as “raw events” while those in the output stream, which generally convey more abstract information as a result of the CQ execution, can be regarded as “computed events”.
Accordingly, a DSMS is not required to store in a permanent manner all the data from the input streams (although it might store some of the received data in certain cases, at least temporarily, for example whenever historical data is needed). Data is extracted and processed by a DSMS as it is received continuously from the incoming streams (taking the order of data arrival into account), and output streams are produced as a result of the execution of CQs in a substantially continuous manner. Thus, in contrast to the traditional DBMS, a DSMS assumes an active role in that it does not need to receive an (explicit) read query from a database client for sending some data to the client based on the stream data the DSMS currently holds.
Incoming streams 11 to, and outgoing streams 12 from, the DSMS can be regarded as an unbounded sequence of data items that are usually ordered either explicitly by a time-based reference such as a time stamp, or by the values of one or more data elements (e.g. the packet sequence identifier in an IP session). A data item of a data stream can be regarded as a tuple of a relation. In this context, tuples comprise a known sequence of fields and essentially correspond with application-specific information. Hereinafter, the terms “data item” and “tuple” are used interchangeably.
One example of tuples that can be received by a DSMS within incoming data streams is shown in FIG. 3. In this case, a sensor having a unique ID sends, in a continuous manner (e.g. every second), a measure of the temperature, and the humidity and CO levels of its surroundings. This constitutes a stream of data. A large number of sensors (even hundreds of thousands) can feed a DSMS which can produce one or more output data streams based on the received incoming data streams. For example, the CQ execution by a DSMS over incoming data streams comprising tuples as illustrated in FIG. 3 can produce an output data stream for a certain DSMS client application that contains the sensor identity, CO level and time information, only when the monitored temperature exceeds a certain threshold.
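The continuous query described in the preceding example can be sketched as a generator that is specified once and then evaluated against each tuple as it arrives. This is a minimal sketch under stated assumptions: the field names of `SensorReading` and `Alert`, and the function `hot_sensor_query`, are hypothetical and are not taken from FIG. 3; a real DSMS would express the query in a language such as CQL.

```python
from typing import Iterable, Iterator, NamedTuple


class SensorReading(NamedTuple):
    """Raw event from an input stream (hypothetical field layout)."""
    sensor_id: str
    timestamp: int
    temperature: float
    humidity: float
    co_level: float


class Alert(NamedTuple):
    """Computed event placed on the output stream."""
    sensor_id: str
    co_level: float
    timestamp: int


def hot_sensor_query(
    stream: Iterable[SensorReading], threshold: float
) -> Iterator[Alert]:
    """Filter on temperature, then project to (sensor id, CO level, time)."""
    for reading in stream:
        # Filter operation: only tuples above the temperature threshold pass.
        if reading.temperature > threshold:
            # Projection: the output schema differs from the input schema.
            yield Alert(reading.sensor_id, reading.co_level, reading.timestamp)
```

Because the function is a generator, it can consume an unbounded input and emits each output tuple as soon as the corresponding input tuple has been processed, so the output stream generally has a lower rate and a different schema than the input stream.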
A more typical DSMS deployment is illustrated in FIG. 4, where the DSMS 10 receives data from one or more incoming data streams 11, executes a continuous query against the received data and sends at least some of the received data to a plurality of DSMS clients 20-1 to 20-N. In a typical application, the DSMS 10 will process a vast number of input data streams 11 comprising data from e.g. sensor networks (e.g. devices sensing and transmitting values of parameters such as temperature, pressure, humidity, etc.), a telecom operator network or network traffic monitors, among other possibilities.
Each DSMS client applies its own application logic to process the received data stream, and triggers one or more actions when the processing results satisfy predetermined criteria (e.g. the values reported by one or more sensors depart from certain pre-determined ranges, or an average value of a monitored variable exceeds a threshold). An action can comprise sending a message to another application server. For example, the DSMS client may issue an instruction for sending an SMS or activating an alarm, or message towards a certain device to change an operational parameter of the device.
The DSMS 10 and the corresponding client applications 20-1 to 20-N are normally deployed in different nodes. This is done partly for performance reasons, since the assurance mechanisms implemented by the DSMSs (if any), as well as DSMS scheduling policies, would be affected if the DSMS platform also implemented the applications' logic. In this case, the CPU or memory consumption would depend not only on the CQ execution but also on other variables that are unknown or at least difficult to calculate.
The data sources generating the input data streams 11 are push-based, meaning that they are not programmed to provide data on demand or even to store some data until it is requested, but to release it as soon as new data becomes available. The DSMS 10 therefore has no direct control over the data arrival rates, which can change in unpredictable ways, getting bursty at times.
The bursty nature of the incoming stream(s) can prevent DSMSs from maintaining the required tuple processing rate whilst data is being received at a high rate. As a result, a large number of unprocessed or partially processed tuples can become backlogged in the system, causing the tuple processing latency to increase and the value of the received stream data to therefore diminish. In other words, the data arrival rates can get so high that the demand on the DSMS system resources (such as CPU processing capacity, memory, and/or network bandwidth) may exceed the available capacity. In this case, the DSMS will be overloaded and will not be able to process input tuples as fast as they are received. Thus, when the DSMS is overloaded, data arrives via the input data stream(s) at a higher rate than it can be processed by the DSMS using the processing resources available to it, in order to maintain a QoS required by at least one client of the DSMS. Since the DSMS is required to have some capacity for handling occasional bursts of data in one or more of the (inherently unpredictable) input data streams, the DSMS can also be regarded as being overloaded (and thus incapable of handling incoming bursts of data) when data arrives via the input data stream(s) at a rate which is more than a certain fraction of the rate at which incoming data can be processed by the DSMS using the processing resources available to it, in order to maintain the required QoS. For example, the DSMS could be considered overloaded when the input data rate is more than e.g. 80%, or more than 90%, of the maximum rate at which it can be processed by the DSMS, although this fraction will depend on the volatility of the data stream sources in any given DSMS application.
Unless the overload problem is resolved, tuples will continue accumulating in queues, latencies will continuously grow, and latency-based QoS will degrade. Due to the predefined QoS requirements of a CQ, query results that violate the QoS requirements may become useless, or even cause major problems as the DSMS client applications could execute wrong or inappropriate actions if they receive outdated data.
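The fraction-based overload criterion described above can be written as a simple comparison. The following is a sketch only; the function name `is_overloaded` and its parameter names are assumptions introduced for illustration, and the default of 0.8 merely mirrors the 80% example given in the text.

```python
def is_overloaded(input_rate, max_processing_rate, headroom_fraction=0.8):
    """Return True when the input rate exceeds the allowed fraction of capacity.

    input_rate          -- observed arrival rate (e.g. tuples per second)
    max_processing_rate -- maximum rate at which the DSMS can process tuples
                           while still meeting the required QoS
    headroom_fraction   -- fraction of capacity above which the system is
                           treated as overloaded (hypothetical default: 0.8)
    """
    if max_processing_rate <= 0:
        raise ValueError("max_processing_rate must be positive")
    if not 0.0 < headroom_fraction <= 1.0:
        raise ValueError("headroom_fraction must be in (0, 1]")
    # Capacity above the threshold is reserved for absorbing bursts.
    return input_rate > headroom_fraction * max_processing_rate
```

With a capacity of 1000 tuples per second, an input rate of 850 is treated as an overload at the 80% setting but not at the 90% setting, which reflects the point that the chosen fraction depends on how volatile the data sources are.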
Each DSMS is responsible for monitoring data to detect critical situations. Since such overload situations are usually unforeseeable and immediate attention is critical, adapting the system capacity to the increased load by adding more resources may not be feasible or economically meaningful. An alternative way of handling overload situations is therefore required.
One known approach to dealing with such data overload situations and reducing the demand on available resources is so-called “load shedding”. When the DSMS is overloaded with data from the input data stream(s), load shedding is performed, i.e. at least some of the data items (tuples) as received by the DSMS or partially processed by the DSMS are discarded in order to reduce the processing burden of the DSMS in generating its output data stream(s). In other words, load shedding involves selecting which of the tuples should be discarded, and/or in which phase of the CQ execution the tuple(s) should be dropped. The overload may be caused by an excessively high data input rate, as noted above, or any other situation arising that causes a degradation of the QoS required by a DSMS application, such as a degradation of performance conditions within the DSMS.
In any case, certain threshold limits may be predefined in the DSMS which, with regard to the data rate from the input data stream(s), establish the point at which a degradation of its QoS performance in executing CQs can occur and thus prejudice the production of the corresponding output data stream(s). Accordingly, a DSMS can activate a “load shedding” mechanism when (among other factors that can cause an overload or malfunction of its resources) the data rate from the input data stream(s) exceeds the configured limit, and deactivate it otherwise.
The discarding of tuples from the system during load shedding preferably minimises an error in the result of the CQ execution. Such discarding of tuples is often acceptable as many stream-based applications can tolerate approximate results. However, load shedding poses a number of problems in DSMS systems.
A random load shedder simply sheds tuples at random. Although this kind of shedder is easy to implement, it has the drawback of failing to discriminate between meaningful tuples and those with no impact on the QoS provided to an application.
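A random load shedder of the kind just described can be sketched in a few lines. This is an illustrative sketch rather than any particular product's implementation; the function name `random_shed` and the `drop_probability` parameter are assumptions introduced here.

```python
import random


def random_shed(tuples, drop_probability, rng=None):
    """Drop each tuple independently with the given probability.

    tuples           -- list of incoming data items (any type)
    drop_probability -- value in [0, 1]; 0 keeps everything, 1 drops everything
    rng              -- optional random.Random instance, e.g. seeded for tests
    """
    if not 0.0 <= drop_probability <= 1.0:
        raise ValueError("drop_probability must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1); a tuple survives when the draw
    # is at or above the drop probability. The tuple's content is never
    # inspected, which is exactly the weakness noted in the text.
    return [t for t in tuples if rng.random() >= drop_probability]
```

Since the decision ignores the content of each tuple, a tuple that would have triggered an alarm is as likely to be discarded as one that carries no useful information.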
A semantic shedder, on the other hand, bases its decision on whether to drop a tuple on the tuple's relevance according to information statically configured in the DSMS. This requires the DSMS to be configured with a relationship between the “value” of a certain received tuple (i.e. as received from incoming streams) and its relevance for a particular client application, which is determined by a corresponding so-called “utility function”. The utility function defines a static relation between the value of a tuple and the corresponding impact on the system QoS figures (latency, CPU, memory etc.) that are imposed by the DSMS client application. The utility function needs to be entered manually into the DSMS by the DSMS administrator. However, the system administrator manually configuring semantic load shedding instructions in a DSMS is required to have a deep knowledge of the client applications that receive data from the DSMS. This can be unfeasible in scenarios comprising a large and/or varying number of applications, and furthermore the application logic can change over time.
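By way of illustration, a semantic shedder driven by a statically configured utility function might look like the following sketch. The function name `semantic_shed`, the `keep_fraction` parameter, and the policy of keeping the highest-scoring tuples are all assumptions made for this example; the text does not prescribe how the utility values are turned into drop decisions.

```python
def semantic_shed(tuples, utility, keep_fraction):
    """Keep the most relevant tuples according to a utility function.

    tuples        -- list of incoming data items
    utility       -- callable mapping a tuple to a relevance score; in the
                     scenario described, this would be entered manually by
                     the DSMS administrator
    keep_fraction -- fraction of tuples to retain, in [0, 1]
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    n_keep = int(len(tuples) * keep_fraction)
    # Rank tuple positions by descending utility (ties keep arrival order,
    # because Python's sort is stable).
    ranked = sorted(
        range(len(tuples)), key=lambda i: utility(tuples[i]), reverse=True
    )
    keep = set(ranked[:n_keep])
    # Emit survivors in their original arrival order.
    return [t for i, t in enumerate(tuples) if i in keep]
```

For example, with tuples of the form `(sensor_id, temperature)` and a utility function that simply returns the temperature, the shedder retains the hottest readings. The quality of the result depends entirely on how well the configured utility function matches what each client application actually needs, which is the configuration burden the paragraph above describes.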
In order to use the built-in mechanisms provided by the DSMS (if any), it is necessary to express the requirements (e.g. QoS requirements) of the DSMS's client application(s) with regard to the DSMS output stream(s). Specification of an appropriate utility function is a difficult task in many cases.
Firstly, some DSMS products do not include load shedding support as a built-in function. Even if they do, there are usually numerous clients in a typical practical application of a DSMS, with many clients using differing sets of output data streams. Furthermore, the client application logic might not be known when the DSMS is deployed or configured by the administrator, and can be complex and subject to frequent changes. For example, the logic of a client might also depend on data received by the client other than that received via the DSMS output stream (e.g. configuration variables), and vary with time as the client application is repeatedly updated.
In view of the considerable difficulties summarised above, several different approaches have been taken to adapting a DSMS to reliably and consistently deliver improved QoS to a variety of client applications whilst implementing a load shedding process.
One of these approaches, which is particularly applicable to multi-query processing systems executing CQs with different QoS requirements, has been to improve resource allocation by developing effective scheduling strategies. A number of scheduling strategies have been developed, some more useful for catering for the needs of a particular type of application (in terms of tuple latency, total memory requirement etc.) than others. However, the scheduling problem in a DSMS is a very complex one and efforts are ongoing to develop strategies with reduced scheduling overhead.
A further approach is to deploy several DSMS servers or instances in order to evenly distribute the incoming load among them and avoid congestion situations. However, apart from the increased deployment cost, this solution brings about synchronization and/or configuration issues. For example, since an output stream can be a result of a DSMS processing one or more input streams, devices sending input streams towards the DSMS servers would then have to be re-arranged every time a DSMS server is added. Moreover, splitting a CQ execution among several nodes is not a straightforward task (since some operators implementing the CQ execution logic might need to store a sequence of tuples) and might affect the overall QoS figures.
Despite these efforts and others, there still remains a great need to provide an improved DSMS which can reliably deliver improved QoS to a variety of client applications while implementing a load shedding process.