Data Stream Management Systems (DSMS) address some of the main problems faced by data-intensive applications. Such applications require fast analysis of large volumes of data arriving simultaneously from different data sources. Examples include applications that take decisions based on figures (e.g. temperature, humidity, etc.) measured by remote sensing devices, where e.g. the application decides that an alarm is to be issued depending on a combination of values received from different sensing devices; and applications processing data reported from network nodes, such as telecommunication network nodes (e.g. events related to services of their users), where a further management measure with respect to nodes of said network can be taken depending on the reported data.
In particular, DSMS technology allows a DSMS server to process, in real time, a plurality of input data arriving continuously from a plurality of data sources, and to produce output data by executing logical operations (e.g. filter operations, join operations, etc.) on the input data received from the data sources. The resulting output data produced by the DSMS are sent, also in a real-time/continuous manner, to one or more servers implementing applications of the kind recited above. Accordingly, a DSMS server relieves a further application server implementing a certain application service of the need to process in real time data coming from a plurality of sources, so that said further application server only receives information, via the DSMS server, upon certain conditions (i.e. as determined by the logical operations performed by the DSMS).
According to “Stream Data Processing: a Quality of Service Perspective” (Springer, ISBN: 978-0-387-71002-0; e-ISBN: 978-0-387-71003-7; Ref [1]), the task of defining a query to be provisioned within a DSMS by a skilled person (e.g. a system administrator) requires said person to be familiar with details about the data sources that send input data streams to said DSMS, as well as with the nature of the data conveyed by each of said input data streams. Furthermore, in case the streams to be produced by said query need to fulfill some kind of QoS (e.g. in terms of precision, latency, etc.), said person is also required, at least, to be acquainted with the reporting configuration of the data sources and, possibly, to modify the reporting configuration of one or more of the data sources whose input data streams are involved in the execution of said query.
Traditional relational database management systems (DBMSs), consisting of a set of persistent relations, a set of well-defined operations, and highly optimized query processing and transaction management components, have been subject to intense research and are used for a wide range of applications.
Typically, data processed by a DBMS is not very frequently updated, and a snapshot of the database is used for processing queries.
FIG. 1A serves to illustrate the DBMS paradigm. A storage 100 receives updates 91 and queries 92 and outputs processed queries 93. Updates are not very frequent and DBMS queries are executed once over a snapshot of the database.
In recent years, another class of data intensive applications has emerged, such as sensor data processing, network management in telecommunications networks and stock trading that need to process data at a high input-rate. These applications need to process data continuously over long periods of time and the data is typically received in the form of a data stream. As a result, the amount of data to be processed can be unlimited. At the same time, these applications need processing capabilities for continuously computing and aggregating incoming data for identifying changes or patterns in a timely manner.
These applications are different from traditional DBMS applications with respect to data arrival rates, update frequency, processing requirements, Quality of Service (QoS) needs, and notification support. Queries that are processed by a traditional DBMS are (typically) specified, optimized, and evaluated once over a snapshot of a database (“DBMS queries”).
In contrast, queries in a stream processing environment are specified once and evaluated repeatedly against new data over a specified life span or as long as there exists data in the stream. They are long-running queries that produce output continuously. The result is also assumed to be a stream, possibly with differing rates and schema (as compared to the input). These queries are termed “continuous queries” (CQs). FIG. 1B serves to illustrate the DSMS paradigm. A storage 100 receives incoming data 91′ of real time feeds 94. Queries 92′ for particular output data streams are received and output data streams 93′ are provided. Queries executed in DSMS are termed continuous queries since they are continuously executed over new incoming data.
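The difference between a DBMS query and a continuous query can be sketched as follows. This is a minimal illustration only: the record layout (`sensor`, `temp`) and the function names are assumptions made for the example and do not come from any particular DBMS or DSMS product.

```python
from typing import Dict, Iterable, Iterator, List

Reading = Dict[str, object]  # hypothetical record, e.g. {"sensor": "s1", "temp": 31.0}


def snapshot_query(table: List[Reading], threshold: float) -> List[Reading]:
    """DBMS-style query: evaluated once over the rows stored at call time."""
    return [row for row in table if row["temp"] > threshold]


def continuous_query(stream: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """DSMS-style query: specified once, then evaluated on every newly
    arriving item for as long as the input stream delivers data."""
    for item in stream:
        if item["temp"] > threshold:
            yield item  # the result is itself a stream
```

The snapshot query returns a finite result and terminates, whereas the continuous query is a generator that keeps producing output items as input items arrive.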
Although traditional DBMSs might be used in stream processing scenarios, the procedure would then require the steps of loading the incoming data streams into persistent relations and then executing the same DBMS queries over these relations repeatedly. The main problem with this approach is that the storage of stream data, indexing (as needed) and querying will add considerable delay (or latency) in the response time that may not be acceptable to many stream applications.
The requirement that data needs to be persisted on a secondary storage device (that has high latency) before it can be accessed and processed by a DBMS in main memory (that has low latency) is at the core of this mismatch. In addition, the “snapshot” approach for evaluating stream data may not always be appropriate, as the values over an interval might be important (e.g., temperature changes) for stream processing applications. Furthermore, the inability to specify quality of service requirements (such as latency or response time) in most traditional DBMSs makes their use less acceptable for stream applications.
Hence, the techniques developed for DBMSs need to be re-examined to meet the requirements of applications that use stream data. This re-examination has given rise to a paradigm shift along with new approaches and extensions to current techniques for query modeling, optimization, and data processing to meet the requirements of an increasing number of stream-based applications. Systems that have been developed to process data streams to meet the needs of stream based applications are termed Data Stream Management Systems (DSMSs) in the literature.
A traditional Database Management System (DBMS) is reactive in the sense that it executes a query only if a request is received from another server, and over a “snapshot” of the data it stores. A DSMS, by contrast, can be active in the sense that it executes queries (i.e. the so-called “continuous queries”, CQ) in a continuous manner on data contents of a set of input data streams that it continuously receives, and produces, as a result, a set of output data streams which are sent from the DSMS to one or more further servers. The latter sending of the output data streams can also be effected in a continuous manner.
Whenever a new continuous query (CQ) is entered into a DSMS, a query plan must be generated (in a similar way as traditional DBMSs actually do), although in some DSMSs the query language is specified at such a low level that it might be directly handled as a query plan by itself.
A query plan could be understood as a sequence of basic (pre-defined) operators yielding the expected query result. For example, when a SQL query is sent to a traditional database (i.e. a database managed by a DBMS), the DBMS, after parsing the query, generates this sequence of basic operators implementing the query logic. The nature of these operators depends on the specific vendor.
In a DSMS the kind of basic operators into which a query is decomposed can comprise “stateless” as well as “stateful” query operators. Generally, “stateless” operators do not impose any special requirement on data streams, since their logic can be executed in a rather straightforward way. One case of a “stateless” operator can comprise an operator implementing a “filter”; for example, data whose value exceeds a predetermined value would go through, whilst data not reaching the value would be discarded.
However, “stateful” operators involve some internal storage in order to come up with a final result. As data streams are unbounded in nature, stateful operators should work only upon a finite subset of the data stream. One example would be an operator computing the average value of the previously received data (e.g.: in a certain interval, once a certain number of data have been received, etc.). If the final value to be produced by a “stateful” operator had to take the whole data stream into consideration, the result would (likely) never be produced.
It is thus necessary for the CQ to specify the subset of data for which the average value is to be calculated. This subset is called a “window” and it is normally specified as a function of time (e.g. 3 seconds), or as a function of the number of received data items (e.g. 40 consecutive data items). In this way, a result is continuously produced. Multiple queries can be executed at the same time within the DSMS, and each single query plan can share operators, or even part of its query plan, with other queries. Moreover, more than one application can be registered to the same query and more than one input stream can be part of the same query.
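The two kinds of window mentioned above can be illustrated with a stateful averaging operator. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a sliding (rather than tumbling) window is chosen arbitrarily, and the time-based variant assumes items arrive as (timestamp in seconds, value) pairs with non-decreasing timestamps.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def count_window_avg(stream: Iterable[float], size: int) -> Iterator[float]:
    """Average over the last `size` items (count-based sliding window)."""
    window: Deque[float] = deque(maxlen=size)  # oldest item drops out automatically
    for value in stream:
        window.append(value)
        if len(window) == size:  # emit only once the window is full
            yield sum(window) / size


def time_window_avg(
    stream: Iterable[Tuple[float, float]], span: float
) -> Iterator[float]:
    """Average over items received within the last `span` seconds
    (time-based sliding window); `span` is assumed to be positive."""
    window: Deque[Tuple[float, float]] = deque()
    for ts, value in stream:
        window.append((ts, value))
        # Evict items that are `span` seconds old or older.
        while window[0][0] <= ts - span:
            window.popleft()
        yield sum(v for _, v in window) / len(window)
```

In both variants the operator keeps only the window contents as internal state, so a result can be produced continuously even though the input stream is unbounded.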
As an illustrating example, FIG. 2 shows (in a simplified manner) two CQs being executed in the same DSMS. In the example, the DSMS receives data via two input data streams (referenced with 901 and 902, respectively), and produces two output data streams towards different destinations outside the DSMS (illustrated as a first server implementing a first application 31, and a second server implementing a second application 32). In FIG. 2 Op1 to Op8 stand for query operators, wherein the query operators linked by the continuous lines execute a first CQ (i.e. implement the query plan of said first CQ), and wherein the query operators linked by the broken lines execute a second CQ.
A query operator can implement different operations, such as a “filter” based on received values from input data streams (e.g. only certain data matching a certain value, or exceeding/below a given threshold, are further processed), and/or a “join” of values received from one or more input data streams (e.g. only certain data coming from a first data stream are considered for further processing depending on certain matching values received from a second data stream).
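A “join” of this kind can be sketched as follows. This is an illustrative simplification, not the operator of any specific DSMS: the two input streams are assumed to arrive merged into one sequence of (source, event) pairs tagged "A" or "B", events are plain dictionaries, and the name `join_on_key` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Dict[str, str]


def join_on_key(tagged: Iterable[Tuple[str, Event]], key: str) -> Iterator[Event]:
    """Pass on an event from stream "B" only if stream "A" has already
    delivered an event with the same `key` value, merging both records."""
    latest_a: Dict[str, Event] = {}  # state: latest "A" event per key value
    for source, event in tagged:
        if source == "A":
            latest_a[event[key]] = event
        elif source == "B" and event[key] in latest_a:
            yield {**latest_a[event[key]], **event}
```

Note that this sketch keeps one entry per key value for the whole lifetime of the query; an operator running over unbounded streams would normally bound this state, for instance with a window.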
For the sake of illustration, stream 901 could be received e.g. from a first telecommunications node providing a location registration service (such as a “Home Subscriber Server”, HSS) which sends towards the DSMS a data stream containing information about registration events of users from their terminals (e.g. the data stream comprising data identifying a user, identifying the roaming access network to which a user currently attaches, whether a user registers or deregisters, etc.), and stream 902 could be received e.g. from a second telecommunications node providing multimedia communications services to a plurality of users (such as a “Proxy-Call Session Control Function”, P-CSCF) which sends towards the DSMS information about communication service events related to said users (e.g. identifying the user, session initiation/termination events, the kind of communication service established, etc.).
Input streams constitute a key element in every query plan since they provide the raw data that should be further processed in the query execution. According to the conventional art, every query registered into a DSMS needs to explicitly indicate the one or more input data stream(s) from which the corresponding data should be extracted and analyzed, as well as the specific criteria to build up the corresponding query plan.
As a result, the query plan derived from a CQ executed by a DSMS contains operators whose main task consists of extracting the data coming from the input streams. In a next step, these data are sent out to the corresponding operators implementing the query logic. As an example, operators Op1 and Op6 in FIG. 2 extract the data coming from input streams 901 and 902 and send them out to operators Op2/Op4 and Op8, respectively.
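The way an extraction operator feeds several downstream operators can be sketched with a small push-based operator graph. This is only an illustration under assumptions: the `Operator` class, the threshold values and the sink lists are invented for the example, and the operator names merely echo how Op1 feeds Op2 and Op4 in FIG. 2 without reproducing the logic of those operators.

```python
from typing import Any, Callable, List, Optional, Tuple


class Operator:
    """A query plan node that pushes each result to its downstream operators."""

    def __init__(self, name: str, fn: Callable[[Any], Optional[Any]]) -> None:
        self.name = name
        self.fn = fn
        self.downstream: List["Operator"] = []

    def to(self, *ops: "Operator") -> None:
        """Connect this operator's output to one or more operators."""
        self.downstream.extend(ops)

    def push(self, item: Any) -> None:
        result = self.fn(item)
        if result is None:  # None means the item is discarded here
            return
        for op in self.downstream:
            op.push(result)


def build_plan() -> Tuple[Operator, List[float], List[float]]:
    """Wire one extraction operator to two branches (hypothetical logic)."""
    high: List[float] = []
    low: List[float] = []
    op1 = Operator("Op1", lambda x: x)                       # extracts input data
    op2 = Operator("Op2", lambda x: x if x > 30 else None)   # filter: above 30
    op4 = Operator("Op4", lambda x: x if x < 0 else None)    # filter: below 0
    sink_high = Operator("sink_high", lambda x: high.append(x))
    sink_low = Operator("sink_low", lambda x: low.append(x))
    op1.to(op2, op4)  # the extraction operator is shared by both branches
    op2.to(sink_high)
    op4.to(sink_low)
    return op1, high, low
```

Each input item is extracted once by the shared operator and then offered to both branches, which is the kind of operator sharing between query plans described above.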
Input data streams are fed into a DSMS coming from a single data source, or coming from a plurality of data sources. Examples of data sources comprise a sensor sending certain measured data (e.g. related to a current condition, such as a measured temperature, a measured geographical position, etc.), or a telecommunications node sending information about service usage by a certain user (e.g. by means of the so-called “call detail records”, CDRs).
Data sources might support different event reporting configurations. Each configuration is normally tied to a different event reporting granularity. For example, a telecommunications node can be configured in such a way that only one specific type of multimedia session is reported towards the DSMS. But it would also be possible to configure said node in order to send more detailed information, covering e.g. other multimedia session types as well as lower level information of said sessions, and/or to report information about other kinds of events.
In any case, the way in which reporting is configured in a data source (i.e. the quantity and/or frequency of data said data source has to send towards a DSMS) can seriously impact the performance of said data source, mainly in the case where said data reporting task is not the main task that is to be performed by said data source. For example, the main task of a node implementing HSS functionality in a telecommunications network is to reply in a very short time to messages coming from other nodes in said network, which request to store location information of a terminal registered for a user, and/or which request said location information when a service is requested towards said user. Accordingly, the performance of the basic functionalities which are to be performed by the HSS node can be harmed by its duty to report events to a DSMS (e.g. events related to user registration, location, terminating services, etc.).
Conventionally, whenever a person (such as a system administrator) registers within a DSMS a continuous query (also referred to hereinafter as a “business query”), he/she must clearly specify the corresponding input streams received by the DSMS that convey the data on which said query has to operate. However, this might result in several drawbacks:
First of all, this kind of approach requires the person (e.g. a system administrator) who defines the business query or queries to be registered within the DSMS (so as to be executed therein as CQ/s) to be familiar with all the data sources that send input data streams to said DSMS, as well as with the nature of the data conveyed by each of said input data streams.
Secondly, in case one or more of the data sources involved in a CQ become unavailable (because, e.g., a data source crashes, or is overloaded by its main duties), the CQ will likely fail to produce any result (at least with a “good enough” QoS). This kind of failure event can barely be predicted by the person who manually configures business queries in the form of CQs into a DSMS.
Last but not least, data sources might have different reporting configurations (e.g., event data models, notification frequency, etc.). Modifying the reporting configuration in a certain data source (such as a telecommunications node assigned to perform a certain main functionality in addition to said data reporting) can impact its performance in a way that might not be easily assessed in advance by the person that provisions CQs in the DSMS.
For example, in the case of a data source being a telecommunications node performing a specific function within a telecommunications system, said impact can depend e.g. on the traffic actually handled by said node performing its basic duties with respect to said telecommunications system. For example, if the processing and/or communication resources of said node are almost fully consumed by its basic duties, then no room would be available within the node for event reporting activities towards a DSMS. However, after some time, the node might have sufficient resources for event reporting tasks. That is, for the same business query (CQ), the optimal implementation can change over time with respect to the data sources that provide the data for said query, and their respective data reporting configurations.
Moreover, a CQ to be provisioned in a DSMS can specify QoS requirements for executing said CQ within the DSMS. In short, the QoS specified for a CQ is usually related to the particular needs of the specific application hosted by the server that will be the destination of the subsequently produced output data streams. The QoS specified for a CQ can comprise metric values indicative of, e.g., frequency or delay for the data conveyed by the corresponding output stream (for example, whether said data can be sent in a bursty manner or on a regular basis), or a value indicative of the accuracy of said data (for example, in terms of error tolerance, which can be useful in case the CQ involves “stateful operators” and, thus, “execution windows”), etc.
The task of defining a CQ to be provisioned within a DSMS by a system administrator may thus require that he/she is familiar with details about the data sources that send input data streams to said DSMS, as well as with the nature of the data conveyed by each of said input data streams. Furthermore, in case the streams to be produced by said CQ need to fulfill some kind of QoS (e.g. in terms of precision, latency, etc.), said person is also required to be acquainted with the reporting configuration of the data sources, and possibly to modify the reporting configuration of one or more of the data sources whose input data streams are involved in the execution of a CQ.
However, whilst this kind of manually based solution can be assumed to work well for a simple data reporting scenario comprising just a few data sources with well-defined data reporting schemas, such solutions cannot scale well when facing more complex scenarios; for example, scenarios comprising a plurality of data sources (which can be a significant number of data sources), as well as a plurality of eventual applications (which can be a significant number of applications) that might require to collect data, according to criteria that can even vary frequently, from a plurality of data sources, whose number and/or nature can also vary.