Traditional relational database management systems (DBMSs) have been researched for over thirty years and are used for a wide range of applications. One of their key features is the storage of data as a collection of persistent “relations”, often referred to as tables. A relation is defined as a set of tuples that have the same attributes, each tuple representing a data element and the information about that element. In a DBMS, a table (or relation) is organized into rows and columns. Each row of the table represents a tuple and each column represents an attribute common to all tuples (rows).
Another key feature of a DBMS is a set of well-defined operations (or “queries”) that can be issued by any DBMS client in order to read, write, delete or modify the stored data. Structured Query Language (SQL) is the most widespread query language for this purpose, although it is often enriched with proprietary add-ons.
The conventional DBMS is also characterised by having highly optimized query processing and transaction management components, as illustrated in FIG. 1. A query from a DBMS client 1 is received by the DBMS 2, parsed by a query parsing unit 3 of the DBMS, and analyzed in order to verify that it is both syntactically and semantically correct. Once this is done, a query plan is generated by the DBMS's query planner 4. A query plan is a set of step-by-step instructions defining how the query is to be executed, whose details depend on how the concrete DBMS is implemented. The query plan aims to optimise, for example, the number of accesses to the physical storage device 5 (e.g. a hard disk) in order to reduce the execution time. Transaction management secures the so-called “ACID” properties (i.e. “Atomicity, Consistency, Isolation and Durability”).
Queries that are processed by a traditional DBMS are termed “ad hoc” queries. That is, the query is sent to the DBMS and the response to that query, which is both valid at that specific moment and complete, is sent back. Traditional (ad hoc) queries are typically specified in a particular format, optimized, and evaluated once over a “snapshot” of a database; in other words, over a static view of the data in the database. The stored data which is to be operated on during processing of the query must be stable, i.e. not subject to any other ongoing database transaction since, for example, a high ratio of write queries can harm the performance of the DBMS serving read queries.
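By way of illustration only, the evaluation of an ad hoc query once over a static snapshot can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for a traditional DBMS; the table name `readings`, its columns and the function `run_ad_hoc_query` are hypothetical and are not taken from any particular system.

```python
import sqlite3


def run_ad_hoc_query(rows, threshold):
    """Load rows into a persistent-style relation, then query it once.

    `rows` is a list of (sensor_id, temperature) pairs. The table layout
    is a hypothetical example chosen for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, temperature REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    # The query sees only the data stored at this moment (the snapshot).
    # Rows arriving later are not reflected unless the query is re-issued.
    result = conn.execute(
        "SELECT sensor_id, temperature FROM readings "
        "WHERE temperature > ? ORDER BY sensor_id",
        (threshold,),
    ).fetchall()
    conn.close()
    return result
```

The response is complete and valid only for the data held when the query was issued, which is the behaviour contrasted below with stream processing.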
However, in recent years, there has emerged another class of data intensive applications (such as those intended for sensor data processing, network management in telecommunications networks and stock trading) that need to process data at a very high input rate. Moreover, these applications need to process data that is typically received continuously over long periods of time in the form of a data stream. As a result, the amount of data to be processed can be unbounded. In principle, stream data could be processed by a traditional database management system, by loading incoming stream data into persistent relations and repeatedly executing the same ad hoc queries over these relations.
However, there are several problems with this approach. Firstly, the storage of stream data, indexing (as needed) and querying would add considerable delay (or latency) in response time, which may not be acceptable to many stream-based applications. At the core of this mismatch is the requirement that data needs to be persisted on a secondary storage device 5, such as a hard disk typically having a high storage capacity and high latency, before it can be accessed and processed by a DBMS 2 implemented in main memory, such as a RAM-based storage device having a lower latency but typically lower storage capacity.
In addition, the above-described “snapshot” approach to evaluating stream data may not always be appropriate since the changes in values over an interval can be important for stream processing applications, for example where the application needs to make a decision based on changes in a monitored temperature.
Furthermore, the inability to specify Quality of Service (QoS) requirements for processing a query (such as latency or response time) to a traditional DBMS makes its usage less acceptable for stream-based applications.
It will therefore be appreciated that the characteristics of the conventional DBMS (i.e. the passive role it plays, the need for standardised query formats and associated predefined query plans, stable data, etc.) make the DBMS unsuitable for serving applications that require the processing of huge amounts of data. An example is an application performing Complex Event Processing (CEP) over a stream of data arriving periodically or continuously, from one or a plurality of data sources (e.g. sensors emitting their measured values, servers sending real-time stock rates, etc.), whose number is unpredictable.
Hence, the techniques developed for DBMSs need to be re-examined to meet the requirements of applications that use stream data. This re-examination has given rise to a paradigm shift along with new approaches and extensions to current techniques for query modelling, optimization, and data processing in order to meet the requirements of an increasing number of stream-based applications. Systems that have been developed to process data streams to meet the needs of stream-based applications are widely known as Data Stream Management Systems (DSMSs).
FIG. 2 shows a DSMS 10 together with a DSMS client 20. Queries for DSMS 10 are also expressed in a standard language similar to SQL (e.g. Continuous Query Language (CQL) and its derivatives) and a query plan is also produced. However, the queries executed in a DSMS are termed “continuous queries” (CQs) and differ from their DBMS counterparts principally by being specified once (commonly via provisioning, e.g. via operation and maintenance interfaces) and then evaluated repeatedly against new data over a specified life span or as long as there is data in the input stream(s) 11. Thus, continuous queries are long-running queries that produce output continuously. The result of executing a CQ is therefore an output data stream 12, possibly with differing rates and schema as compared to the corresponding input data stream(s). The data items in the input data stream(s) 11 can be regarded as “raw events” while those in the output stream 12, which generally convey more abstract information as a result of the CQ execution, can be regarded as “computed events”.
Accordingly, a DSMS is not required to store in a permanent manner all the data from the input streams (although it might store some of the received data in certain cases, at least temporarily, for example whenever historical data is needed). Data is extracted and processed by a DSMS as it is received continuously from the incoming streams, and output streams are produced as a result of the execution of CQs in a substantially continuous manner. Thus, in contrast to the traditional DBMS, a DSMS assumes an active role in that it does not need to receive an (explicit) read query from a database client in order to send data to the client based on the stream data the DSMS currently holds.
Incoming streams 11 to, and outgoing streams 12 from, the DSMS can be regarded as an unbounded sequence of data items that are usually ordered either explicitly by a time-based reference such as a time stamp, or by the values of one or more data elements (e.g. the packet sequence identifier in an IP session). A data item of a data stream can be regarded as a tuple of a relation. In this context, tuples comprise a known sequence of fields and essentially correspond with application-specific information. Hereinafter, the terms “data item” and “tuple” are used interchangeably.
One example of tuples that can be received by a DSMS within incoming data streams is shown in FIG. 3. In this case, a sensor having a unique ID sends, in a continuous manner (e.g. every second), a measure of the temperature, humidity and CO level of its surroundings. This constitutes a stream of data. A large number of sensors (even hundreds of thousands) can feed a DSMS which can produce one or more output data streams based on the received incoming data streams. For example, the CQ execution by a DSMS over incoming data streams comprising tuples as illustrated in FIG. 3 can produce an output data stream for a certain DSMS client application that contains the sensor identity, CO level and time information, only when the monitored temperature exceeds a certain threshold.
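The filtering behaviour described above can be sketched, purely as an illustration, with a Python generator standing in for a continuous query. The field names of `SensorReading`, the function `continuous_query` and the threshold parameter are assumptions made for this sketch; the actual tuple layout is the one shown in FIG. 3.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class SensorReading(NamedTuple):
    """Hypothetical layout of one input tuple (a "raw event")."""
    sensor_id: str
    timestamp: float
    temperature: float
    humidity: float
    co_level: float


def continuous_query(
    stream: Iterable[SensorReading], temperature_threshold: float
) -> Iterator[Tuple[str, float, float]]:
    """Emit (sensor_id, co_level, timestamp) for each hot reading.

    The query is specified once and evaluated against every tuple as it
    arrives; it ends only when the input stream ends.
    """
    for reading in stream:
        if reading.temperature > temperature_threshold:
            yield (reading.sensor_id, reading.co_level, reading.timestamp)
```

Because the function is a generator, output tuples ("computed events") are produced one at a time as input tuples are consumed, without the whole input stream being stored.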
A more typical DSMS deployment is illustrated in FIG. 4, where the DSMS 10 receives data from one or more incoming data streams 11, executes a continuous query against the received data and sends at least some of the received data to a plurality of DSMS clients 20-1 to 20-N. Each DSMS client applies its own application logic to process the received data stream, and triggers one or more actions when the processing results satisfy predetermined criteria (e.g. the values reported by one or more sensors depart from certain pre-determined ranges, or an average value of a monitored variable exceeds a threshold). An action can comprise sending a message to another application server. For example, the DSMS client may issue an instruction for sending an SMS or activating an alarm, or a message towards a certain device to change an operational parameter of the device.
The DSMS 10 and the corresponding client applications 20-1 to 20-N are normally deployed in different nodes. This is done partly for performance reasons, since the QoS assurance mechanisms implemented by the DSMSs (if any), as well as DSMS scheduling policies, would be affected if the DSMS platform also implemented the applications' logic. In this case, the CPU or memory consumption would depend not only on the CQ execution but also on other variables that are unknown or at least difficult to calculate. Another reason for deploying the DSMS and client applications in different nodes is the tendency for commercial DSMSs to be optimized for particular hardware platforms, which are not necessarily optimal for deploying the client applications.
The bursty nature of the incoming stream(s) can prevent DSMSs from maintaining the required tuple processing rate whilst data is received at a high input rate. As a result, a large number of unprocessed or partially processed tuples can become backlogged in the system, causing the tuple processing latency to increase without bound. Due to the predefined QoS requirements of a CQ, query results that violate the QoS requirements may become useless, or even cause major problems as the DSMS client applications could execute wrong or inappropriate actions if they receive outdated data.
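The growth of the backlog during a burst can be illustrated with a simple discrete model. This is a sketch under stated assumptions only: time is divided into equal intervals, the DSMS is assumed to process at most a fixed number of tuples per interval, and the function name `simulate_backlog` is hypothetical.

```python
from typing import List, Sequence


def simulate_backlog(arrivals: Sequence[int], capacity: int) -> List[int]:
    """Return the number of unprocessed tuples after each interval.

    `arrivals[i]` is the number of tuples received in interval i and
    `capacity` is the assumed maximum number processed per interval.
    """
    backlog = 0
    history = []
    for arrived in arrivals:
        # Tuples that cannot be processed in this interval carry over.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history
```

For example, with a capacity of 1000 tuples per interval and arrivals of 500, 1500, 1500 and 500 tuples, the backlog after each interval is 0, 500, 1000 and 500 tuples. While the arrival rate stays above the capacity the backlog, and hence the waiting time of newly arrived tuples, keeps increasing.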
One known approach to dealing with such data overload situations is so-called “load shedding”. When the DSMS is overloaded with data from the input data stream(s), load shedding is performed, i.e. at least some of the data items (tuples) as received by the DSMS or partially processed by the DSMS are discarded in order to reduce the processing burden of the DSMS in generating the output data stream. In other words, load shedding involves selecting which of the tuples should be discarded, and/or in which phase of the CQ execution the tuple(s) should be dropped. The overload may be caused by an excessively high data input rate or any other situation arising that causes a degradation of the QoS required by a DSMS application, such as a degradation of performance conditions within the DSMS.
In any case, threshold limits can be predefined in the DSMS which establish, with regard to the data rate from the input data stream(s), when a degradation of its Quality of Service performance in executing CQs can occur and thus prejudice the production of the corresponding output data stream(s). Accordingly, a DSMS can activate a “load shedding” mechanism when (among other factors that can cause an overload or malfunction of its resources) the data rate from the input data stream(s) exceeds a configured limit, and deactivate it otherwise.
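One possible way of activating and deactivating load shedding on the basis of a configured input rate limit is sketched below. The class name `LoadSheddingSwitch`, the sliding-window rate estimate and the parameter names are assumptions made for illustration; an actual DSMS may measure the input rate differently and may take other overload factors into account.

```python
from collections import deque


class LoadSheddingSwitch:
    """Report whether the measured input rate exceeds a configured limit."""

    def __init__(self, rate_limit: float, window_seconds: float = 1.0):
        self.rate_limit = rate_limit          # tuples per second
        self.window_seconds = window_seconds  # length of sliding window
        self._arrivals = deque()              # timestamps inside the window

    def _expire(self, now: float) -> None:
        # Forget arrivals that have fallen out of the sliding window.
        while self._arrivals and self._arrivals[0] <= now - self.window_seconds:
            self._arrivals.popleft()

    def record_arrival(self, timestamp: float) -> None:
        self._arrivals.append(timestamp)
        self._expire(timestamp)

    def is_active(self, now: float) -> bool:
        """True while load shedding should be applied."""
        self._expire(now)
        rate = len(self._arrivals) / self.window_seconds
        return rate > self.rate_limit
```

In this sketch shedding is deactivated automatically once enough old arrivals have left the window for the estimated rate to fall back to the limit or below.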
The discarding of tuples from the system during load shedding preferably minimises an error in the result of the CQ execution. Such discarding of tuples is often acceptable as many stream-based applications can tolerate approximate results. However, load shedding poses a number of problems in a DSMS.
A random load shedder simply sheds tuples at random. Although this kind of shedder is easy to implement, it has the drawback of failing to discriminate between meaningful tuples and those with no impact on the QoS provided to an application.
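A random load shedder of this kind can be sketched in a few lines. The function name `random_shed` and the parameters `drop_probability` and `seed` are hypothetical and are used here for illustration only.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def random_shed(
    stream: Iterable[T], drop_probability: float, seed: Optional[int] = None
) -> Iterator[T]:
    """Pass each tuple through with probability 1 - drop_probability.

    The content of a tuple is never inspected, so relevant and
    irrelevant tuples are equally likely to be discarded.
    """
    if not 0.0 <= drop_probability <= 1.0:
        raise ValueError("drop_probability must be between 0 and 1")
    rng = random.Random(seed)
    for item in stream:
        # rng.random() is uniform on [0, 1), so a probability of 0 keeps
        # every tuple and a probability of 1 drops every tuple.
        if rng.random() >= drop_probability:
            yield item
```

The sketch makes the drawback visible: the decision depends only on the random draw, not on what the tuple contains.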
A semantic shedder, on the other hand, bases its decision on whether to drop a tuple on the tuple's relevance. This requires the DSMS to be configured with a relationship between the “value” of a certain received tuple (i.e. as received from incoming streams) and its relevance for a particular client application, which is determined by a corresponding so-called “utility function”. The utility function is a relation between the value of a tuple and the corresponding impact on the system QoS figures (latency, CPU, memory etc.) that are imposed by the DSMS client application. The utility function needs to be entered manually into the DSMS by the DSMS administrator.
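By contrast, a semantic shedder can be sketched as follows, assuming for illustration that the utility function is supplied as a callable mapping each tuple to a numeric relevance score and that the DSMS can afford to process only a fixed number of tuples from each batch. The names `semantic_shed`, `utility` and `capacity` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def semantic_shed(
    batch: Sequence[T], utility: Callable[[T], float], capacity: int
) -> List[T]:
    """Keep the `capacity` highest-utility tuples, in arrival order."""
    if capacity >= len(batch):
        return list(batch)
    # Rank tuple positions by utility, highest first.
    ranked = sorted(
        range(len(batch)), key=lambda i: utility(batch[i]), reverse=True
    )
    keep = set(ranked[:capacity])
    # Preserve the original ordering of the surviving tuples.
    return [item for i, item in enumerate(batch) if i in keep]
```

For instance, if the tuples are (sensor identity, temperature) pairs and the utility is taken to be the temperature itself, a batch of four readings with temperatures 20, 80, 35 and 90 and a capacity of two retains the readings with temperatures 80 and 90. The quality of the result depends entirely on how well the supplied utility function reflects the needs of the client application.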
In order to use the built-in mechanisms provided by the DSMS (if any), it is necessary to express the requirements (e.g. QoS requirements) of the DSMS's client application(s) with regard to the DSMS output stream(s). Specification of an appropriate utility function is a difficult task in many cases.
Firstly, some DSMS products do not include load shedding support as a built-in function. Even if they do, there are usually numerous clients in a typical practical application of a DSMS, with many clients using differing sets of output data streams. Furthermore, the client application logic might not be known when the DSMS is deployed or configured by the administrator, and can be complex and subject to frequent changes. For example, the logic of a client might also depend on data received by the client other than that received via the DSMS output stream (e.g. configuration variables), and vary with time as the client application is repeatedly updated.
In view of the considerable difficulties summarised above, several different approaches have been taken to adapting a DSMS to reliably and consistently deliver improved QoS to a variety of client applications whilst implementing a load shedding process.
One of these approaches, which is particularly applicable to multi-query processing systems executing CQs with different QoS requirements, has been to improve resource allocation by developing effective scheduling strategies. A number of scheduling strategies have been developed, some more useful for catering for the needs of a particular type of application (in terms of tuple latency, total memory requirement etc.) than others. However, the scheduling problem in a DSMS is a very complex one and efforts are ongoing to develop strategies with reduced scheduling overhead.
A further approach is to deploy several DSMS servers in order to evenly distribute the incoming load among them and avoid congestion situations. However, apart from the increased deployment cost, this solution brings about synchronization and/or configuration issues. For example, since an output stream can be a result of a DSMS processing one or more input streams, devices sending input streams towards the DSMS servers should then be (re)arranged every time a DSMS server is added. Moreover, splitting a CQ execution among several nodes is not a straightforward task (since some operators implementing the CQ execution logic might need to store a sequence of tuples) and might affect the overall QoS figures.
Despite these efforts and others, there still remains a great need to provide an improved DSMS which can reliably deliver improved QoS to a variety of client applications whilst implementing a load shedding process.