1. Field of the Invention
The present invention relates to a technology for processing continuously generated time series data and, in particular, to a technology for continuously executing general data processing, including recursive processing, in real time with stable low latency and at a high rate in stream data processing.
2. Description of the Related Art
Stream data processing, which implements real-time processing of high-rate data, has attracted interest against a background of advancing technology for analyzing, in real time, information that is continuously generated at a high rate and for acting on it instantly; examples include the automation of stock trading, the enhancement of traffic information processing, and the analysis of click streams. Because stream data processing is a general-purpose middleware technology that can be applied to a variety of data processing, it allows data from the real world to be reflected in business transactions in real time while responding to sudden changes in the business environment, a need that building a separate system for each application may not be sufficient to meet. The principle and implementation method of stream data processing were disclosed in B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom, “Models and issues in data stream systems”, in: Proc. of PODS 2002, pp. 1-16. (2002).
Stream data processing takes as input streams, each of which is a series of data items located at points on a time axis, and converts them by a window operator into a relation, which is a set of data items each having a survival period. A relational operator is applied to the relation, and the resulting relation is converted back into a stream by a streaming operator and then output. The relation is thus an intermediate state in stream data processing. Each data item on a stream is called a stream tuple. Like a record in a relational database, a stream tuple takes a combination of a plurality of columns as its value; in addition, it has a time stamp as an attribute. The stream tuples on a stream are input to the stream data processing in ascending order of time stamp.
For example, consider a series of six stream tuples with time stamps t1 to t6. The value of each tuple consists of two columns, a character string id and an integer val, and the values are (a, 1), (a, 2), (b, 1), (a, 1), (a, 2), and (b, 3), respectively. As the window operator, a row based window, which limits the maximum number of simultaneously surviving tuples, is applied; here the limit is three. The first tuple is then converted into data that survives for a period starting at time t1 and ending at time t4, when the fourth tuple arrives. The end point itself is not included in the survival period. Other window operators include a time window, which sets the survival period to a prescribed length of time, and a partition window, which groups stream tuples having the same values in specific columns and limits the maximum number of simultaneously surviving tuples for each group.
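The survival periods produced by the row based window in this example can be reproduced with a short sketch. The function name `row_window_periods` and the use of the integers 1 to 6 as stand-ins for t1 to t6 are assumptions made for illustration only; `None` marks a tuple that is still surviving when the input ends.

```python
from typing import List, Optional, Tuple


def row_window_periods(timestamps: List[int], rows: int) -> List[Tuple[int, Optional[int]]]:
    """Return a (start, end) survival period for each tuple.

    The end point is exclusive: a tuple stops surviving at the moment the
    tuple `rows` positions later arrives. None means "not yet ended".
    """
    periods = []
    for i, start in enumerate(timestamps):
        j = i + rows
        end = timestamps[j] if j < len(timestamps) else None
        periods.append((start, end))
    return periods


# Integers 1..6 stand in for the time stamps t1..t6 (illustrative assumption).
timestamps = [1, 2, 3, 4, 5, 6]
periods = row_window_periods(timestamps, rows=3)
for start, end in periods:
    print(f"survives from t{start} until", f"t{end} (exclusive)" if end else "further input")
```

With a limit of three, the first tuple receives the period from t1 up to, but not including, t4, as stated above.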
As an example of a relational operator on a relation, that is, on a set of data with survival periods defined as above, consider applying the summing operator SUM to the column val. For a relational operator in stream data processing, when the input relation and the result relation are each cut at any time on the time axis, the two resulting sets of intersection points stand in the same relationship as the input and the result of the corresponding operator in a conventional relational database. For example, the data values of the intersection points where the relation of the above example is cut at time t4 are {(a, 2), (b, 1), (a, 1)}, and the data values of the intersection points where the result relation is cut at the same time are {(4)}. Applying the summing operator SUM(val) of a conventional relational database to the former set of data values yields the latter set. The same relationship holds at any time.
When, for any two relations, the sets of data values of the intersection points are the same at every time, the two relations are said to be congruent. The results of applying a relational operator to congruent relations are also congruent.
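The cut at time t4 and the application of SUM(val) to it can be checked with the following sketch. The relation is written out by hand as (value, start, end) triples with an exclusive end point; the helper names `cut` and `sum_val` and the integer time stamps are illustrative assumptions.

```python
# Relation produced by the row based window (limit 3) in the running example.
# Each entry is (value, start, end); end is exclusive, None means still surviving.
relation = [
    (("a", 1), 1, 4),
    (("a", 2), 2, 5),
    (("b", 1), 3, 6),
    (("a", 1), 4, None),
    (("a", 2), 5, None),
    (("b", 3), 6, None),
]


def cut(rel, t):
    """Data values whose survival period contains time t."""
    return [value for value, start, end in rel if start <= t and (end is None or t < end)]


def sum_val(rows):
    """Conventional SUM(val) over a set of (id, val) rows: one row with one column."""
    return [(sum(val for _id, val in rows),)]


cut_at_t4 = cut(relation, 4)
result_at_t4 = sum_val(cut_at_t4)
print(cut_at_t4)      # [('a', 2), ('b', 1), ('a', 1)]
print(result_at_t4)   # [(4,)]
```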
Consider next an example in which an operator called IStream is applied as the streaming operator to the result of the foregoing relational operator. Whenever the set of data values of the intersection points of the relation increases or decreases at some time, a streaming operator outputs the increased or decreased data values as stream tuples whose time stamp is that time. IStream outputs the increased data values. Other streaming operators include DStream, which outputs the decreased data values, and RStream, which outputs the data values surviving at each prescribed time interval. Applying IStream to the above example outputs the stream tuples {(1)}, {(3)}, {(4)}, and {(6)} at times t1, t2, t3, and t6, respectively. No stream tuple is output at times t4 and t5, because the intersection points of the result relation cut at any time from t3 up to t6 are always {(4)}, a set with a single element whose value does not change. Because a streaming operator works on the increase and decrease of data values in this way, it guarantees that the same stream is generated from congruent relations. On the other hand, this imposes a limitation: the result tuples for a given time cannot be output until the increases and decreases of the relation at that time have all been determined.
Next, a method of defining queries in stream data processing and a general method of execution control will be described. The mechanism used here is based on a declarative language called the continuous query language (CQL). The grammar of CQL adds mechanisms for the window operator and the streaming operator to SQL, the query language based on relational algebra that is used as the standard for relational databases. CQL is disclosed in A. Arasu, S. Babu and J. Widom, “The CQL Continuous Query Language: Semantic Foundations and Query Execution”, (2005).
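The IStream output for the running example can be reproduced with the sketch below. It assumes that the result relation only changes at the tuple arrival times t1 to t6, so comparing the cuts at those times is sufficient; the function name `istream` and the integer time stamps are illustrative, not part of any disclosed implementation.

```python
from collections import Counter

vals = [1, 2, 1, 1, 2, 3]   # val column of the six input tuples
ROWS = 3                    # row based window limit

# Cut of the SUM(val) result relation at each arrival time t1..t6.
cuts = [(t, [(sum(vals[max(0, t - ROWS):t]),)]) for t in range(1, len(vals) + 1)]


def istream(cuts):
    """Emit (time, value) for every data value that is newly present at a cut."""
    out = []
    previous = Counter()
    for t, rows in cuts:
        current = Counter(rows)
        # Counter subtraction keeps only values whose count increased.
        for value in (current - previous).elements():
            out.append((t, value))
        previous = current
    return out


output = istream(cuts)
print(output)   # [(1, (1,)), (2, (3,)), (3, (4,)), (6, (6,))]
```

Nothing is emitted at t4 and t5 because the cut is {(4)} before and after each of those times, so the set of data values does not increase.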
The following is an example of query definition.
REGISTER STREAM s1(id VARCHAR(30), val INT);
REGISTER STREAM s2(id VARCHAR(30), val INT);
REGISTER QUERY q
RSTREAM[30 SECOND] (
  SELECT s1.id AS id1, s2.id AS id2, s1.val
  FROM s1[RANGE 5 MINUTE], s2[ROWS 1]
  WHERE s1.val = s2.val);
Here, the two commands starting with “REGISTER STREAM” define input streams that receive data from data sources. The first command defines an input stream named s1 and specifies that data received on it has the columns id and val, whose types are a character string and an integer, respectively. The second command defines an input stream named s2 with the same column definitions as the input stream s1. The third command defines a query named q.
In the portion enclosed by the parentheses “(” and “)”, the relational operator on the relations is defined in the same grammar as SQL, the data processing language of relational databases. The example specifies that the streams s1 and s2 are joined on equality of the column val. The FROM clause specifies the name of an input stream or the name of a separately defined query. The portion that follows a stream name or query name and is enclosed by “[” and “]” specifies the window operator. “s1[RANGE 5 MINUTE]” in the example specifies that a time window converts the stream tuples of the input stream s1 into data whose survival period is 5 minutes. “s2[ROWS 1]” specifies that a row based window limits the simultaneously surviving data of the input stream s2 to the latest one tuple. In addition, there are [PARTITION BY NUMBER OF COLUMN NAME LIST ROWS], which specifies the partition window, and [NOW], which limits the survival period to a logical instant shorter than any real length of time.
The portion placed before the portion enclosed by the parentheses “(” and “)” specifies the streaming operator. “RSTREAM[30 SECOND]” in the example specifies the use of RStream, which outputs the data values of the relation surviving at every 30 seconds as stream tuples. In addition, there are “ISTREAM”, which specifies IStream, and “DSTREAM”, which specifies DStream.
In stream data processing, a query defined by the foregoing mechanism is converted into a data structure called an operator tree and then processed. The operator tree is a tree structure in which operators, each executing an elementary data processing step, are connected by data queues; processing is carried out by passing data between the operators in a pipelined manner. Because data on a relation has a survival period, two tuples are passed for each data item, one indicating the start of survival and one indicating its end. The former is called a plus tuple and the latter a minus tuple.
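As a minimal sketch of how two operators connected by a data queue might exchange plus and minus tuples, the following code chains a row based window to a SUM(val) operator for the running example. The class names `RowWindow` and `SumVal`, the tuple layout (sign, time stamp, value), and the driver loop are all assumptions made for illustration and do not describe any particular engine.

```python
from collections import deque

PLUS, MINUS = "+", "-"


class RowWindow:
    """Row based window: emits a minus tuple for the oldest data when full."""

    def __init__(self, rows, out_queue):
        self.rows = rows
        self.out = out_queue
        self.alive = deque()

    def push(self, ts, value):
        if len(self.alive) == self.rows:
            # The oldest data stops surviving when the new tuple arrives.
            self.out.append((MINUS, ts, self.alive.popleft()))
        self.alive.append(value)
        self.out.append((PLUS, ts, value))


class SumVal:
    """SUM(val) maintained incrementally from plus and minus tuples."""

    def __init__(self, in_queue):
        self.inp = in_queue
        self.total = 0

    def run(self):
        while self.inp:
            sign, _ts, (_id, val) = self.inp.popleft()
            self.total += val if sign == PLUS else -val


queue = deque()                 # data queue connecting the two operators
window = RowWindow(3, queue)
summer = SumVal(queue)

stream = [(1, ("a", 1)), (2, ("a", 2)), (3, ("b", 1)),
          (4, ("a", 1)), (5, ("a", 2)), (6, ("b", 3))]
results = []
for ts, value in stream:
    window.push(ts, value)
    summer.run()                # drain everything queued for this time stamp
    results.append((ts, summer.total))

print(results)   # [(1, 1), (2, 3), (3, 4), (4, 4), (5, 4), (6, 6)]
```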
In processing the operator tree, time order guaranteeing control is performed so that data is processed in time stamp order. For example, an operator that takes two relations as its operands, such as the join in the query example, becomes a two-input operator on the operator tree. Such an operator must compare the time stamps of the tuples at the heads of its left and right input queues and process the earlier tuple first. If the arrival of data from one of the two data sources is delayed, the comparison cannot be performed, and the processing of data from the other data source is also held up. This phenomenon is called a stall. To prevent stalls, a widely recognized method in stream data processing is to have the operators at the leaves (inputs) of the operator tree transmit heartbeat tuples, which signal that time is progressing even while no data arrives from the data source. The execution control method using the heartbeat is disclosed in T. Johnson, S. Muthukrishnan, V. Shkapenyuk and O. Spatscheck, “A Heartbeat Mechanism and its Application in Gigascope”, in: Proc. of VLDB 2005, pp. 1079-1088.
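The stall and its resolution by a heartbeat can be illustrated with a small sketch of a two-input operator. The class `TwoInputOperator`, the `HEARTBEAT` marker, and the rule that ties go to the left input are assumptions chosen for illustration; the sketch only orders tuples and does not perform a join.

```python
from collections import deque

HEARTBEAT = object()   # payload marker for a heartbeat tuple (illustrative)


class TwoInputOperator:
    """Releases tuples from two input queues in time stamp order."""

    def __init__(self):
        self.left = deque()
        self.right = deque()

    def step(self):
        """Process as many tuples as the time order allows and return them."""
        released = []
        # Both heads are needed for the comparison; otherwise the operator stalls.
        while self.left and self.right:
            if self.left[0][0] <= self.right[0][0]:   # ties go left (assumption)
                side, queue = "left", self.left
            else:
                side, queue = "right", self.right
            ts, payload = queue.popleft()
            if payload is not HEARTBEAT:              # heartbeats only advance time
                released.append((ts, side, payload))
        return released


op = TwoInputOperator()
op.left.append((10, "x"))
stalled = op.step()              # right input is empty: nothing can be released
op.right.append((12, HEARTBEAT))
resumed = op.step()              # heartbeat shows the right input has reached 12

print(stalled)   # []
print(resumed)   # [(10, 'left', 'x')]
```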
The heartbeat tuple is needed not only by binary operators but also by operators that output tuples when a time limit fires, namely the time window and RStream. In the query example, suppose that the time window operator for the input stream s1 receives a plus tuple at 9:03′10; it must then output the minus tuple 5 minutes later, at 9:08′10. If no data arrives on the input stream s1, the minus tuple cannot be output. The heartbeat solves this problem. If the transmission interval of the heartbeat tuple is 1 minute, the minus tuple can be output when the heartbeat tuple of 9:09′00 arrives. The same applies to the RStream in the query example. Because a tuple is specified to be output every 30 seconds, the stream tuple of 9:02′30, for example, is output upon arrival of the heartbeat tuple of 9:03′00. The stream tuple of 9:03′00 cannot be output at this timing. As described above, a streaming operator is subject to the limitation that results for a given time (in this case, 9:03′00) cannot be output until all tuples for that time have arrived; since a tuple with the time stamp 9:03′00 may still arrive after the heartbeat tuple, output at this timing is not permitted.
Stream data processing includes not only processing that needs precise time control, such as binary operators, the time window, and RStream, but also processing such as data filtering, which simply processes the tuples received from a single input and passes them on. The heartbeat tuple serves to inform each operator of the time up to which processing can be executed, that is, the executable time.
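The two timing examples above can be traced with the following sketch. The classes `TimeWindow` and `RStreamClock`, the helper `at`, and the arbitrary calendar date are illustrative assumptions. Both classes conservatively act only on times strictly earlier than the heartbeat; for RStream this follows from the text, while for the time window it is a simplifying assumption.

```python
from datetime import datetime, timedelta


def at(h, m, s):
    """Time of day on an arbitrary fixed date (the date itself is irrelevant)."""
    return datetime(2000, 1, 1, h, m, s)


class TimeWindow:
    """Time window that emits minus tuples when a heartbeat passes their expiry."""

    def __init__(self, span):
        self.span = span
        self.pending = []                      # (expiry time, value)

    def on_plus(self, ts, value):
        self.pending.append((ts + self.span, value))

    def on_heartbeat(self, ts):
        expired = [(e, v) for e, v in self.pending if e < ts]
        self.pending = [(e, v) for e, v in self.pending if e >= ts]
        return expired


class RStreamClock:
    """Tracks which periodic RStream output times a heartbeat has made safe."""

    def __init__(self, first_tick, interval):
        self.next_tick = first_tick
        self.interval = interval

    def on_heartbeat(self, ts):
        ticks = []
        while self.next_tick < ts:             # a tick equal to ts must still wait
            ticks.append(self.next_tick)
            self.next_tick += self.interval
        return ticks


window = TimeWindow(timedelta(minutes=5))
window.on_plus(at(9, 3, 10), ("a", 1))
minus_log = []
for minute in range(4, 10):                    # heartbeats at 9:04:00 ... 9:09:00
    hb = at(9, minute, 0)
    for expiry, value in window.on_heartbeat(hb):
        minus_log.append((hb, expiry, value))

rstream = RStreamClock(first_tick=at(9, 2, 30), interval=timedelta(seconds=30))
ticks = rstream.on_heartbeat(at(9, 3, 0))

print(minus_log)   # minus tuple stamped 9:08:10, released by the 9:09:00 heartbeat
print(ticks)       # only the 9:02:30 output time; 9:03:00 is held back
```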
B. Babcock, S. Babu, M. Datar, R. Motwani, and D. Thomas, “Operator Scheduling in Data Stream Systems”, (2005) discloses, as algorithms that search the operator tree for an executable operator based on this time information, a simple round robin and a technique of first executing the operator that outputs the earliest executable tuple.
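The difference between the two selection policies named above can be seen in a toy sketch. It is not the algorithm of the cited reference: each operator is reduced to a queue of the time stamps of its executable tuples, and the names `run_round_robin`, `run_earliest_first`, `op_a`, and `op_b` are hypothetical.

```python
from collections import deque


def make_queues():
    """Two hypothetical operators, each with the time stamps of its ready tuples."""
    return {"op_a": deque([1, 4]), "op_b": deque([2, 3])}


def run_round_robin(queues):
    """Visit the operators cyclically and run each one that has a ready tuple."""
    order = []
    while any(queues.values()):
        for name in queues:
            if queues[name]:
                order.append((name, queues[name].popleft()))
    return order


def run_earliest_first(queues):
    """Always run the operator whose next ready tuple has the earliest time stamp."""
    order = []
    while any(queues.values()):
        ready = [name for name in queues if queues[name]]
        name = min(ready, key=lambda n: queues[n][0])
        order.append((name, queues[name].popleft()))
    return order


round_robin = run_round_robin(make_queues())
earliest_first = run_earliest_first(make_queues())

print(round_robin)      # [('op_a', 1), ('op_b', 2), ('op_a', 4), ('op_b', 3)]
print(earliest_first)   # [('op_a', 1), ('op_b', 2), ('op_b', 3), ('op_a', 4)]
```

In this toy input, round robin executes the tuple with time stamp 4 before the one with time stamp 3, whereas the earliest-first policy executes tuples in time stamp order.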