This invention relates to a method of generating time control information within a stream data processing system and between such systems.
There has been an increasing demand for a data processing system which carries out real-time processing for data continuously arriving at a database management system (hereinafter referred to as “DBMS”), which processes data stored in a storage system.
Data which continuously arrives is defined as stream data, and a stream data processing system has been proposed as a data processing system suitable for real-time processing of the stream data. For example, R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma: “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, In Proc. of the 2003 Conf. on Innovative Data Systems Research (CIDR), (online), January 2003, (retrieved on Oct. 15, 2008), discloses a stream data processing system called “STREAM”.
In the stream data processing system, unlike in the conventional DBMS, queries are first registered to the system and are then executed continuously each time data arrives. The above-mentioned STREAM employs a concept referred to as a sliding window, which cuts out a part of the stream data and thereby imparts a lifetime to the data so that the stream data can be processed efficiently. A preferred example of a query description language including a sliding window specification is the continuous query language (CQL) disclosed in the above-mentioned paper by R. Motwani et al. The CQL includes an extension for specifying the sliding window by using parentheses following a stream name in a FROM clause of a structured query language (SQL), which is widely used for the DBMS.
SQL is described in, for example, C. J. Date, Hugh Darwen: “A Guide to SQL Standard (4th Edition)”, the United States, Addison-Wesley Professional, Nov. 8, 1996, ISBN: 021964260. There are two typical methods for specifying the sliding window: (1) a method of specifying the number of data rows to be cut, and (2) a method of specifying a time interval containing the data rows to be cut. For example, “Rows 50 Preceding” described in a second paragraph of the above-mentioned paper by R. Motwani et al. is a preferred example of the item (1), in which data corresponding to 50 rows is cut to be processed, and “Range 15 Minutes Preceding” is a preferred example of the item (2), in which data for 15 minutes is cut to be processed. In the case of the item (1), the lifetime of a piece of data lasts until 50 further pieces of data have arrived. In the case of the item (2), the lifetime of a piece of data is 15 minutes. The stream data cut by the sliding window is held in memory and is used for the query processing.
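The two window types can be illustrated with a minimal Python sketch. The class names `RowWindow` and `RangeWindow` are hypothetical and do not appear in the cited references, and the sketch assumes that timestamps are given in seconds and that a piece of data expires exactly when its age reaches the window span.

```python
from collections import deque


class RowWindow:
    """Row-based window (item (1)): holds only the most recent `size` rows."""

    def __init__(self, size):
        # A deque with maxlen silently drops the oldest row on overflow.
        self.rows = deque(maxlen=size)

    def insert(self, timestamp, value):
        self.rows.append((timestamp, value))
        return list(self.rows)


class RangeWindow:
    """Time-based window (item (2)): holds rows younger than `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.rows = deque()

    def insert(self, timestamp, value):
        self.rows.append((timestamp, value))
        # Assumption: a row expires once its age reaches `span`.
        while self.rows and self.rows[0][0] <= timestamp - self.span:
            self.rows.popleft()
        return list(self.rows)
```

Under these assumptions, `RowWindow(50)` corresponds to “Rows 50 Preceding” and `RangeWindow(15 * 60)` to “Range 15 Minutes Preceding”. Note that the time-based window in this sketch expires data only when a new row arrives, so during a period with no arrivals its contents do not change.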
In stream data processing, processing such as event extraction through an analysis in which a plurality of data sources are combined, or extraction of events that have occurred within a given period of time, requires a heartbeat tuple (hereinafter abbreviated as HBT) to be generated and processed regularly within the data processing system. The HBT serves to advance time during a period in which no data is generated. Each HBT has an HBT flag, which indicates that the tuple is an HBT, and time information, which indicates the time at which the HBT was generated.
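As an illustration only, the following Python sketch shows one possible representation of a tuple carrying an HBT flag and time information, together with a generator that inserts HBTs into periods in which no data is generated. The names `StreamTuple` and `with_heartbeats`, the integer timestamps, and the fixed `interval` are assumptions made for this sketch. The sketch replays recorded timestamps rather than reading a real clock.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass(frozen=True)
class StreamTuple:
    timestamp: int          # time information
    is_hbt: bool = False    # HBT flag
    payload: Any = None     # left empty for an HBT


def with_heartbeats(data: Iterable[StreamTuple], interval: int,
                    end_time: int, start_time: int = 0) -> Iterator[StreamTuple]:
    """Yield time-ordered data, adding an HBT whenever more than
    `interval` time units would otherwise pass without a tuple."""
    last = start_time
    for item in data:
        # Fill the idle period before this data tuple with HBTs.
        while last + interval < item.timestamp:
            last += interval
            yield StreamTuple(timestamp=last, is_hbt=True)
        yield item
        last = item.timestamp
    # Keep time advancing after the last data tuple, up to end_time.
    while last + interval <= end_time:
        last += interval
        yield StreamTuple(timestamp=last, is_hbt=True)
```

For data tuples at times 1 and 7, with `interval=2` and `end_time=10`, the generator yields tuples at times 1, 3, 5, 7, and 9, of which those at 3, 5, and 9 are HBTs.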
To give an example, in a join operation where data from two or more inputs is joined, the data of each input is obtained in chronological order. In the case where data has arrived at the first input but no data has arrived at the second input, data having an earlier time than that of the data at the first input may still arrive at the second input. Because of this possibility, the data at the first input cannot be processed, resulting in a wait. In such a case, the wait is resolved if an HBT having time information that is newer than that of the data at the first input is input to the second input, thereby enabling the system to process the first input.
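The wait described above, and its resolution by an HBT, can be sketched in Python as follows. The class `TwoInputOperator` is hypothetical. For brevity it only emits data tuples in time order and does not evaluate a join condition, and it discards HBTs after use, whereas an actual operator might forward them downstream.

```python
from collections import deque


class TwoInputOperator:
    """Processes tuples from two time-ordered inputs in global time order."""

    def __init__(self):
        self.queues = (deque(), deque())
        self.output = []

    def push(self, side, timestamp, payload=None, is_hbt=False):
        self.queues[side].append((timestamp, payload, is_hbt))
        self._drain()

    def _drain(self):
        first, second = self.queues
        # A tuple may be processed only while BOTH inputs have a head tuple;
        # otherwise an earlier tuple could still arrive on the empty input.
        while first and second:
            src = first if first[0][0] <= second[0][0] else second
            timestamp, payload, is_hbt = src.popleft()
            if not is_hbt:
                self.output.append((timestamp, payload))
```

A short usage example under the same assumptions:

```python
op = TwoInputOperator()
op.push(0, 10, "a")            # first input has data; second input is empty
print(op.output)               # [] : the operator waits
op.push(1, 12, is_hbt=True)    # HBT newer than time 10 on the second input
print(op.output)               # [(10, 'a')] : the wait is resolved
```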
One of the known methods is disclosed in Yijian Bai, Hetal Thakkar, Haixun Wang, Carlo Zaniolo: “Optimizing Timestamp Management in Data Stream Management Systems”, IEEE 23rd International Conference on Data Engineering 2007, ICDE 2007, 15-20, April 2007, pp. 1334-1338, where each query maintains two states, a yield state (which means that there is data in an output queue) and a more state (which means that there is data in an input queue), to help determine an operator to be executed next. The method disclosed in this paper executes an execution tree starting from its input side and going as far along the execution tree as possible, and then tracks the execution tree back to an operator that may be executed next. When the execution tree is tracked back to the input of stream data, Enabling Time-Stamps (ETSs, corresponding to HBTs) are propagated.
Another known method is disclosed in US 2008/0072221, where one logical time period is assigned to a plurality of physical time periods in dispersed input sources, and event streams are rearranged within a buffer according to their “output bookmark” values.