Traditional relational database management systems (DBMSs) have been researched for over thirty years and are used for a wide range of applications. One of their key features is the storage of data as a collection of persistent “relations”, often referred to as tables. A relation is defined as a set of tuples that have the same attributes, each tuple comprising an ordered set of one or more data elements. In a DBMS, a table (or relation) is organised into rows and columns. Each row of the table represents a tuple and each column represents an attribute common to all tuples (rows).
Another key feature of a DBMS is a set of well-defined operations (or “queries”) that can be issued by any DBMS client in order to read, write, delete or modify the stored data. Structured Query Language (SQL) is the most widespread query language for this purpose, although it is often enriched with proprietary add-ons.
The conventional DBMS is also characterised by having highly optimised query processing and transaction management components, as illustrated in FIG. 1. A query from a DBMS client 1 is received by the DBMS 2, parsed by a query parsing unit 3 of the DBMS, and analysed in order to verify that it is both syntactically and semantically correct. Once this is done, a query plan is generated by the DBMS's query planner 4. A query plan is a set of step-by-step instructions defining how the query is to be executed, whose details depend on how the concrete DBMS is implemented. The query plan aims to optimise, for example, the number of accesses to the physical storage device 5 (e.g. a hard disk) in order to speed up the execution time.
Queries that are processed by a traditional DBMS are termed “ad hoc” queries. That is, the query is sent to the DBMS and the response to that query, which is both valid at that specific moment and complete, is sent back. Traditional (ad hoc) queries are typically specified in a particular format, optimised, and evaluated once over a “snapshot” of a database; in other words, over a static view of the data in the database. The stored data which is to be operated on during processing of the query must be stable, i.e. not subject to any other ongoing database transaction since, for example, a high ratio of write queries can harm the performance of the DBMS serving read queries.
However, in recent years, there has emerged another class of data intensive applications (such as those intended for sensor data processing, network management in telecommunications networks and stock trading) that need to process data at a very high input rate. Moreover, these applications need to process data that is typically received continuously over long periods of time in the form of a data stream, for example to identify interesting changes or patterns in a timely manner. These applications typically differ from traditional DBMS applications with respect to data arrival rates, update frequency, processing requirements, Quality of Service (QoS) needs, and notification support.
In principle, stream data could be processed by a traditional database management system, by loading incoming stream data into persistent relations and repeatedly executing the same ad hoc queries over these relations. However, there are several problems with this approach. Firstly, the storage of stream data, indexing (as needed) and querying would add considerable delay (or latency) in response time, which may not be acceptable to many stream-based applications. At the core of this mismatch is the requirement that data needs to be persisted on a secondary storage device 5, such as a hard disk typically having a high storage capacity and high latency, before it can be accessed and processed by a DBMS 2 implemented in main memory, such as a RAM-based storage device having a lower latency but typically lower storage capacity.
In addition, the above-described “snapshot” approach to evaluating stream data may not always be appropriate since the changes in values over an interval can be important for stream processing applications, for example where the application needs to make a decision based on changes in a monitored temperature. Furthermore, the inability to specify QoS requirements for processing a query (such as latency or response time) to a traditional DBMS makes its usage less acceptable for stream-based applications.
Hence, the techniques developed for DBMSs need to be re-examined to meet the requirements of applications that use stream data. This re-examination has given rise to a paradigm shift along with new approaches and extensions to current techniques for query modelling, optimisation, and data processing in order to meet the requirements of an increasing number of stream-based applications. Systems that have been developed to process data streams to meet the needs of stream-based applications are widely known as data stream management systems (DSMSs). An overview of data stream management systems is provided in “Stream Data Processing: A Quality of Service Perspective” by S. Chakravarthy and Q. Jiang (ISBN: 978-387-71002-0). In the following, a brief review of certain key aspects of DSMSs that are necessary for understanding the concepts described herein is provided.
FIG. 2 shows a DSMS 10 together with a DSMS client 20. Queries for DSMS 10 are also expressed in a standard language similar to SQL (e.g. Continuous Query Language (CQL) and its derivatives) and a query plan is also produced, by a query parsing/planning unit 6. The logical query plan is analogous to a query tree used in a conventional DBMS. However, the queries executed in a DSMS are termed “continuous queries” (CQs) and differ from their DBMS counterparts principally by being specified once (commonly via provisioning, e.g. via operation and maintenance interfaces) and then evaluated repeatedly against new data over a specified life span or as long as there is data in the input stream(s) 11. In some DSMS systems, the query language is specified at such a low level that it might be directly handled as a query plan by itself.
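By way of illustration only, the difference between an ad hoc query and a continuous query may be sketched as follows. The function names `ad_hoc_max` and `continuous_max`, and the choice of a running maximum as the query, are assumptions made for this sketch and are not taken from any particular DSMS or from CQL.

```python
from typing import Iterable, Iterator, List, Optional


def ad_hoc_max(snapshot: List[float]) -> float:
    """Ad hoc query: evaluated once over a static snapshot of the data."""
    return max(snapshot)


def continuous_max(stream: Iterable[float]) -> Iterator[float]:
    """Continuous query: specified once, then re-evaluated for every new
    data item for as long as the input stream delivers data."""
    current: Optional[float] = None
    for value in stream:
        if current is None or value > current:
            current = value
        yield current  # an updated result is emitted per incoming item


snapshot_result = ad_hoc_max([3.0, 1.0, 4.0, 2.0])
continuous_results = list(continuous_max([3.0, 1.0, 4.0, 2.0]))
```

The ad hoc variant returns a single answer for the snapshot, whereas the continuous variant produces a sequence of answers, one per arriving data item.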
Incoming streams 11 to, and outgoing streams 12 from, the DSMS 10 can be regarded as an unbounded sequence of data items that are usually ordered either explicitly by a time-based reference such as a time stamp, or by the values of one or more data elements (e.g. the packet sequence identifier in an IP session). A data item of a data stream can be regarded as a tuple of a relation. In this context, tuples comprise a known sequence of fields and essentially correspond with application-specific information. Hereinafter, the terms “data item” and “tuple” are used interchangeably.
The query plan generated on the basis of the specified CQ consists of algorithms for implementing a (typically large) number of relational operators, such as “select”, “project”, “join” and “aggregation” operators. As will be explained in the following, the inputs and outputs of these operators are interconnected to process a data stream in the desired way to yield the query result, such that the operators in a query plan may be regarded as nodes of a network. The operators act on data elements as they arrive and cannot assume the data stream to be finite.
In a DSMS, the query operators can be classified as being either “stateless” operators (also referred to in the literature as “non-blocking” operators) or “stateful” (or “blocking”) operators.
Generally, stateless operators process data elements from a data stream individually (i.e. a single data element at a time) and thus do not impose any special requirement on data streams, since their logic can be executed in a rather straightforward way. One example of a stateless operator is an operator implementing a “filter” function. For example, a filter operator may process a received data stream by allowing data elements whose values exceed a predetermined value to pass through, and discarding other data elements. Other examples of stateless operators include the aforementioned “select” and “project” operators.
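A minimal sketch of such a stateless filter operator is given below. The name `filter_operator`, the threshold of 20.0 and the sample readings are hypothetical and serve only to illustrate that each data element is handled on its own, without reference to earlier elements.

```python
from typing import Iterable, Iterator


def filter_operator(stream: Iterable[float], threshold: float) -> Iterator[float]:
    """Stateless filter: pass through each element whose value exceeds
    the threshold and discard all other elements."""
    for element in stream:
        if element > threshold:
            yield element


readings = [18.5, 22.1, 19.9, 25.3]  # hypothetical sensor values
passed = list(filter_operator(readings, threshold=20.0))
```

Because the operator keeps no state between elements, it can emit output as soon as each element arrives.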
On the other hand, there are operators, such as the “join” and “sort” operators, that naturally operate on complete data sets and will therefore produce no output until the data stream ends. Of course, if the result to be produced by such a “stateful” operator had to be obtained by processing the whole data stream, the result would (likely) never be produced. In order to output results continuously, stateful (blocking) operators need to be converted into stateless (non-blocking) operators, and this is often achieved by employing the concept of a “window” to produce finite relations out of a stream. The window is used to define a finite subset of the data in the data stream, which is to be processed by the stateful operator. The window can be specified in a number of different ways, for example as a function of time (e.g. 3 seconds) or as a function of the number of received data items (e.g. 40 data items in a row).
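The two ways of specifying a window mentioned above may be sketched as follows. This sketch assumes non-overlapping (“tumbling”) windows, which is only one possible window type; the function names, the window sizes and the sample data are likewise assumptions. For brevity, a trailing window that is still incomplete when the input ends is not emitted.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def count_window(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Window defined by a number of received data items."""
    window: List[str] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []


def time_window(stream: Iterable[Tuple[float, str]],
                length_s: float) -> Iterator[List[str]]:
    """Window defined by elapsed time. Items are (timestamp, value) pairs
    assumed to arrive in timestamp order."""
    window: List[str] = []
    window_end: Optional[float] = None
    for timestamp, value in stream:
        if window_end is None:
            window_end = timestamp + length_s
        while timestamp >= window_end:
            yield window  # may be empty if no item fell in the interval
            window = []
            window_end += length_s
        window.append(value)


by_count = list(count_window(["a", "b", "c", "d", "e", "f", "g"], size=3))
timed = [(0.0, "a"), (1.0, "b"), (3.5, "c"), (4.0, "d"), (6.2, "e")]
by_time = list(time_window(timed, length_s=3.0))
```

Each emitted list is a finite relation over which an operator such as “sort” or “join” can be evaluated in the ordinary way.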
Multiple queries can be executed at the same time within the DSMS 10, and each single query plan can share operators, or even part of its query plan, with other queries. Moreover, more than one DSMS client application 20 can be registered to the same query, and more than one input stream 11 can be part of the same query.
By way of a simplified example, FIG. 3 shows two CQs being executed in the same DSMS. In this example, the DSMS receives data via two input data streams (illustrated as “Stream 1” and “Stream 2”) and produces two output data streams towards different destinations outside the DSMS, namely a first server implementing a first application “App1”, and a second server implementing a second application “App2”. In FIG. 3, “On” represents a query operator, and query operators linked by the continuous lines execute a first CQ (i.e. implement the query plan of said first CQ). The query operators linked by the dashed lines execute a second CQ. A query operator can implement different operations, such as a “filter”, based on received values from input data streams (for example, so that only certain data elements matching a certain value, or alternatively exceeding or being below a given threshold, are processed further) and/or a “join” of values received from one or more input data streams (for example, so that only certain data elements coming from a first data stream are considered for further processing, depending on certain matching values received from a second data stream).
For example, “Stream 1” could be received from a first telecommunications node providing a location registration service (such as a “Home Subscriber Server”, HSS), which sends towards the DSMS a data stream containing information about registration events of users from their terminals (e.g. the data stream comprising data identifying a user, identifying the roaming access network to which a user currently attaches, whether a user registers or deregisters, etc.). “Stream 2” could be received e.g. from a second telecommunications node providing multimedia communications services to a plurality of users (such as a “Proxy-Call Session Control Function”, P-CSCF), which sends towards the DSMS information about communication service events related to said users (e.g. identifying the user, session initiation/termination events, the kind of communication service established, and so on).
In many practical applications, the query plan associated with a continuous query is usually a highly complex entity consisting of a large number of stateful and stateless operators, wherein each operator is associated with a memory queue (or buffer) for buffering tuples during bursty input periods (in order not to lose incoming or partially processed data), with the stateful operators requiring resources (primarily main memory) to hold state information to perform window-based computations successfully. For example, the “symmetric hash join” operator requires hash tables for its two relations for the duration of the window.
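To illustrate why such an operator needs to hold state, a simplified sketch of a symmetric hash join is given below. The class name `SymmetricHashJoin`, its methods and the sample events are hypothetical. The sketch keeps one hash table per input and leaves it to the caller to invoke `clear()` when the window ends, which is an assumption made for brevity.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class SymmetricHashJoin:
    """Keeps one hash table per input; each arriving tuple is stored in its
    own table and immediately probed against the other table."""

    def __init__(self) -> None:
        self.tables: Dict[str, DefaultDict[str, List[str]]] = {
            "left": defaultdict(list),
            "right": defaultdict(list),
        }

    def insert(self, side: str, key: str, value: str) -> List[Tuple[str, str, str]]:
        other = "right" if side == "left" else "left"
        self.tables[side][key].append(value)
        matches = self.tables[other].get(key, [])
        if side == "left":
            return [(key, value, match) for match in matches]
        return [(key, match, value) for match in matches]

    def clear(self) -> None:
        """Discard all state, e.g. when the window expires."""
        for table in self.tables.values():
            table.clear()


join = SymmetricHashJoin()
joined: List[Tuple[str, str, str]] = []
joined += join.insert("left", "user-1", "register")
joined += join.insert("right", "user-1", "session-start")
joined += join.insert("right", "user-2", "session-start")
joined += join.insert("left", "user-2", "register")
```

Both hash tables grow with every tuple received during the window, which is the main-memory cost referred to above.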
In such applications, and where the processing of huge amounts of data that are received at high rates via a large number of data streams is required (as is often the case in practice), the processing burden placed on an implementation of the DSMS in a single piece of hardware may exceed the hardware capabilities, leading to an unacceptable drop in the QoS. Furthermore, a stand-alone DSMS may not be able to handle the required number of input connections. In these cases, it becomes necessary to deploy the DSMS query plan in a distributed environment, where the incoming data streams and/or CQ processing load are/is shared among separate data processing hardware components.
The CQ processing load may be distributed by replicating the same query plan on separate data stream management systems, so that each DSMS executes the same query plan on a respectively assigned subset of the data streams that are to be processed. In this kind of distributed data stream processing environment, the data processing load is shared among a plurality of separate DSMSs that execute the same CQ. This approach can also be taken where the stream sources belong to different administrative domains, so that a separate DSMS server (or other DSMS hardware implementation) is provided to handle the data tied to each domain.
In these kinds of scenarios, and under certain premises, the query execution logic might not require all of the data coming from each possible data source to be gathered in order to come up with the final result. For instance, in user-centric networks (i.e. those in which the data includes a user identifier, such as the MSISDN or IMSI in a telecommunications network), the query logic usually depends on specific behaviours of users or groups of users (for example, the query could stipulate: “determine the number of voice calls that users categorized as ‘gold’ subscribers place in the next two hours”). Thus, as long as data belonging to different users can be processed independently, it is possible to replicate the corresponding query as many times as needed.
An example of a distributed data stream processing system comprising a plurality of DSMSs that replicate the query plan is illustrated in FIG. 4. In this example, two DSMSs, namely DSMS #1 and DSMS #2, are arranged to receive data streams from a subset #1 of the available data sources and a subset #2 of the data sources, respectively, and to use the same query plan to process the received data streams. Depending on the query logic, this strategy might need a further post-processing stage that merges the partial results obtained by each DSMS, as also shown in FIG. 4.
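The replication and merging described above may be sketched as follows, assuming a simple query that counts call events per user. The functions `count_calls` and `merge_partials`, the event format and the user names are hypothetical. Summing the partial counts is a valid merge only because this particular query is additive; other queries may require different post-processing.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[str, str]  # (user identifier, event type)


def count_calls(events: Iterable[Event]) -> Counter:
    """Query plan replicated on every DSMS: count call events per user."""
    return Counter(user for user, kind in events if kind == "call")


def merge_partials(partials: List[Counter]) -> Counter:
    """Post-processing stage: combine the partial results of each DSMS."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


subset_1 = [("alice", "call"), ("bob", "call"), ("alice", "register")]
subset_2 = [("alice", "call"), ("carol", "call")]

partial_1 = count_calls(subset_1)  # result computed by DSMS #1
partial_2 = count_calls(subset_2)  # result computed by DSMS #2
merged = merge_partials([partial_1, partial_2])
```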
Despite the above-described distribution of the CQ processing workload, one or more of the DSMSs may still lack adequate computing resources to fulfil performance requirements, due to the over-demanding requirements of specific query operators. For example, an operator storing large amounts of historical data may demand large amounts of memory. That is, available computing resources (CPU, memory, etc.) are limited and might not fit the overall query demands. Furthermore, even if the computing resources do fit the specific query demands, the number of queries being simultaneously executed may be so high that it is not possible to assure that all of them will meet their QoS requirements.
Under these circumstances, it may be desirable to further distribute the CQ processing workload by partitioning a query plan in one or more of the DSMSs of the distributed data stream processing system into different parts, each containing at least one query operator, and deploying each of the parts on a different data processing apparatus. This approach is illustrated in FIG. 5, where the query plan of FIG. 3 is partitioned into four parts, namely Part #1, Part #2, Part #3 and Part #4, and each of the parts is assigned to a different data processing apparatus (e.g. a stand-alone computer such as a server) for execution.
However, even when the above-described measures have been taken to distribute the CQ processing workload, the bursty nature of the incoming streams can still prevent the DSMSs from maintaining the required tuple processing rate whilst data is being received at an excessively high rate. As a result, a large number of unprocessed or partially processed tuples can become backlogged in the data stream processing system, causing the tuple processing latency to increase without bound. Due to the predefined QoS requirements of a CQ, query results that violate the QoS requirements may become useless.
One known approach to dealing with such data overload situations is so-called “load shedding”. When the DSMS 10 is overloaded with data from the input data stream(s) 11, load shedding is performed, i.e. at least some of the data items (tuples) as received by the DSMS or partially processed by the DSMS are discarded in order to reduce the processing burden of the DSMS 10 in generating the output data stream. In other words, load shedding involves selecting which of the tuples should be discarded, and/or in which phase of the CQ execution the tuple(s) should be dropped. Such discarding of tuples is often acceptable as many stream-based applications can tolerate approximate results.
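One simple load shedding policy may be sketched as follows: an operator's input queue has a fixed capacity, and tuples arriving while the queue is full are discarded. This policy, the class name `SheddingQueue` and the capacity of three are assumptions chosen for illustration; practical systems may instead drop tuples at random, by priority, or at a later stage of the query plan.

```python
from collections import deque
from typing import Deque, Optional


class SheddingQueue:
    """Bounded input queue that discards arriving tuples when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: Deque[int] = deque()
        self.dropped = 0  # number of tuples shed so far

    def offer(self, item: int) -> bool:
        """Enqueue the item, or shed it if the queue is at capacity."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(item)
        return True

    def poll(self) -> Optional[int]:
        """Remove and return the oldest queued item, if any."""
        return self.queue.popleft() if self.queue else None


shedding_queue = SheddingQueue(capacity=3)
accepted = [shedding_queue.offer(i) for i in range(5)]  # burst of 5 tuples
first_processed = shedding_queue.poll()
```

The counter of dropped tuples gives a rough measure of how approximate the query result has become as a consequence of shedding.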
In a conventional distributed data stream processing system as described above, wherein the query plan in each DSMS of the data stream processing system includes a window-based (stateful) operator, the situation can arise where reception of a windowed portion of the data stream switches at least once from one DSMS to another, so that two or more of the DSMSs in the system receive data from the windowed portion of the data stream. Under these circumstances, the accuracy of the respective query results obtained by those DSMSs may be poor, owing to the relatively small set of data that is received by each DSMS (as compared to the case where no switching occurs). For example, if a query operator stores the top five visited URLs in a time window covering the last two hours, and an input data stream switches from being received by a first DSMS to a second DSMS when one hour and thirty minutes have already elapsed, then the result provided by the second DSMS will likely be less accurate than that provided by the first DSMS.
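The loss of accuracy can be illustrated with a small worked example. For brevity, the sketch below computes the top two rather than the top five URLs, and the URL names and visit counts are invented for illustration. The first portion of the window is assumed to be received by a first DSMS and the remainder by a second DSMS.

```python
from collections import Counter
from typing import List


def top_k(urls: List[str], k: int) -> List[str]:
    """Return the k most frequently visited URLs, most frequent first."""
    return [url for url, _ in Counter(urls).most_common(k)]


# Visits within one window, split at the moment the stream switches.
first_portion = ["a"] * 5 + ["b"] * 4 + ["c"]       # seen by the first DSMS
second_portion = ["d"] * 3 + ["c"] * 2 + ["a"]      # seen by the second DSMS

true_top = top_k(first_portion + second_portion, 2)  # whole window
first_top = top_k(first_portion, 2)                  # first DSMS only
second_top = top_k(second_portion, 2)                # second DSMS only
```

In this example the second DSMS, having seen only the last part of the window, reports two URLs that do not appear in the true result at all, whereas the first DSMS happens to agree with it.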
Moreover, even in cases where it is possible to replicate the query without affecting the validity of the final result, it would still be necessary to develop a post-processing stage capable of gathering the results from the DSMSs that have received and processed the windowed portion of the data stream. However, this approach would require a specific design/implementation for every issued query, which might require a deep understanding of the query logic (that can be very complex) as well as advanced programming skills. Moreover, this logic should also take into account the query QoS constraints (e.g. execution time). Furthermore, in case the post-processing stage is common to every query being executed in the system, the complexity would increase. Further still, if the post-processing stage is implemented on a per-query basis, the usage of computing resources will not be optimal (e.g. the outcome of two different queries might share several output streams). The time needed for tuning the post-processing stage might be unacceptable in many cases; the requested queries should be executed almost immediately, in order to obtain results as soon as possible. Finally, if the number of DSMSs in the distributed data stream processing system changes over time, the post-processing logic would likely be affected.