Field
The example implementations described herein are related generally to computer systems and, more particularly, to a stream data processing method with time adjustment.
Related Art
Stream data processing is widely used in the related art. There has been an increasing demand for a data processing system which carries out real-time processing for continuously arriving data, in contrast to a database management system (hereafter, referred to as “DBMS”), which carries out processes for data already stored in a storage system. For example, in a system for trading stocks, how fast the system can react to changes in stock prices is one of the most important factors. A conventional DBMS first stores stock data in a storage system and then searches the stored data. Such a method cannot keep pace with the changes in stock prices, and may result in lost business opportunities.
For example, the related art involves a mechanism which issues stored queries periodically. However, it is difficult to apply this mechanism to real-time data processing, in which a query must be executed immediately after data such as stock prices is input.
Data which continuously arrives is defined as stream data, and there has been proposed a stream data processing system as a data processing system suitable for the real-time processing for the stream data.
In the stream data processing system, queries are first registered to the system and are then executed continuously each time data arrives, which is different from the related art DBMS. The related art implementations employ a sliding window, which cuts out a portion of the stream data and thereby imparts a lifetime to the data, so that the stream data can be processed efficiently. As an example of a query description language including a sliding window specification, there is a continuous query language (CQL) in the related art. The CQL includes an extension for specifying the sliding window by using brackets following a stream name in a FROM clause of a structured query language (SQL), which is widely used for DBMS in the related art.
There are two types of related art methods for specifying the sliding window: (1) a method of specifying the number of data rows to be cut, and (2) a method of specifying a time interval containing data rows to be cut. For example, “Rows 50 Preceding” is a related art example of item (1), in which data corresponding to 50 rows is cut to be processed, and “Range 15 Minutes Preceding” is a related art example of item (2), in which data for 15 minutes is cut to be processed. In the case of item (1), the lifetime of a piece of data lasts until 50 further pieces of data arrive. In the case of item (2), the data lifetime is defined to be 15 minutes. The stream data cut by the sliding window is retained in memory, and is used for the query processing.
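The two window types can be illustrated with a minimal Python sketch. The function names `rows_window` and `range_window`, the sample stream, and the small window sizes are hypothetical and chosen only for illustration; they are not part of CQL or of any related art system.

```python
from collections import deque

def rows_window(tuples, n):
    """Yield the window contents after each arrival, keeping the last n rows
    (in the spirit of "Rows n Preceding")."""
    window = deque(maxlen=n)  # the oldest row is dropped once n rows are held
    for t in tuples:
        window.append(t)
        yield list(window)

def range_window(tuples, seconds):
    """Yield the window contents after each arrival, keeping rows whose
    lifetime of `seconds` has not yet ended (in the spirit of "Range ... Preceding").
    Each tuple is (timestamp_in_seconds, value)."""
    window = deque()
    for ts, val in tuples:
        window.append((ts, val))
        # Expire rows whose lifetime [ts_old, ts_old + seconds) has ended.
        while window and window[0][0] + seconds <= ts:
            window.popleft()
        yield list(window)

# Hypothetical stream: one tuple per second.
stream = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
rows_result = list(rows_window(stream, 2))
range_result = list(range_window(stream, 3))
```

In this sketch the row-based window always holds at most two tuples regardless of timing, while the time-based window holds whatever arrived during the last three seconds.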
In stream data, data sometimes arrives with a delay depending on the state of a network, a device, or the like. For example, a sensor node does not transmit data if the network is disconnected, and transmits the data collectively when a connection is again established.
Developers may write CQL queries that retain stream data for a certain period in order to monitor sensor status, detect abnormal points, and predict future failures.
Related art stream data processing servers process stream data based on a data arrival timestamp. When the data arrives with a delay, aggregation results within a certain period based on an arrival timestamp are different from the results based on a data source timestamp.
Some types of stream data processing servers have a capability to process stream data based on a data source timestamp. However, such a server must wait until all data has arrived at the server. The processing latency becomes longer as a result.
In FIG. 1, sensors 101, 102, and 103 are connected with stream data processing server 121 by network (NW) 111. Development client 131 sends query 151 written in CQL to stream data processing server 121. Stream data processing server 121 processes the stream data based on the queries sent by development client 131. Visualization client 132 displays the results processed in stream data processing server 121. File server 133 stores the results processed in stream data processing server 121. Tuples 141, 142, and 143 are sent by sensor 101. Tuples 144, 145, and 146 are sent by sensor 102. These tuples are processed into corresponding tuples 171-176.
For example, sensor 101 sends tuple 141 (a tuple being a record in the stream data) with timestamp “9:00:01” to stream data processing server 121. Sensor 102 also sends tuple 144 with timestamp “9:00:01” to stream data processing server 121. A 3-second summation of the value of sensor 101 is calculated as “1+2+3=6” at 9:00:03 (tuple 173).
When tuple 146 arrives at stream data processing server 121 at “9:00:04” because of a delay caused by the state of the NW, the 3-second summation of the sensor 102 value is calculated as “1+2=3”, although developer 161 expects the result “1+2+3=6”.
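The discrepancy for sensor 102 can be reproduced with a minimal Python sketch. The values 1, 2, and 3 and the late arrival of tuple 146 at 9:00:04 follow the example above. The on-time arrival of tuples 144 and 145, the source timestamp of tuple 145, and all function names are assumptions made for illustration.

```python
# (source_ts, arrival_ts, value) in seconds after 9:00:00 for sensor 102.
# Tuple 146 is generated at 9:00:03 but arrives at 9:00:04.
# Assumed: tuples 144 and 145 arrive on time at 9:00:01 and 9:00:02.
tuples_102 = [(1, 1, 1), (2, 2, 2), (3, 4, 3)]

def sum_by_arrival(tuples, now, window=3):
    """Sum values whose arrival timestamp lies in (now - window, now]."""
    return sum(v for _, arr, v in tuples if now - window < arr <= now)

def sum_by_source(tuples, now, window=3):
    """Sum values whose data source timestamp lies in (now - window, now]."""
    return sum(v for src, _, v in tuples if now - window < src <= now)

def earliest_complete(tuples, now, window=3):
    """Earliest time at which every tuple of the source-time window has arrived."""
    return max(arr for src, arr, _ in tuples if now - window < src <= now)

arrival_sum = sum_by_arrival(tuples_102, now=3)   # 1 + 2
source_sum = sum_by_source(tuples_102, now=3)     # 1 + 2 + 3
ready_at = earliest_complete(tuples_102, now=3)   # second 4, i.e. 9:00:04
```

Under these assumptions, the arrival-based summation at 9:00:03 yields 3, while the source-based summation yields 6 but cannot be completed before 9:00:04, which illustrates the added latency of waiting for delayed data.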
FIG. 2 shows a time chart of Query 151, “rstream [1 second] (select id, sum(val) from S1[range 3 second] group by id)”. This means that stream data processing server 121 keeps three seconds of stream data S1, calculates the summation for each group “id”, and outputs the current id and summation data (“id, sum(val)”) every second.
Tuples 201-209 and 211-219 are sent at various time intervals, are processed by the Range 3 second function into tuples 221-229 and 231-239, undergo the sum(val) function at 241-251 and 261-271, and are output by RStream as tuples 281-289 and 291-299. Here, tuple 201 arrives at 9:00:01. Tuple 204 arrives after 9:00:04, although it has a data source timestamp of 9:00:04.
Each black circle, each white circle, and each line connecting the two circles indicates a predetermined lifetime (three seconds in this example) of each tuple. For example, it is indicated that the tuple 221 has the values (data source timestamp, sensor ID, value)=(9:00:01, a, 1), and the lifetime thereof is from 9:00:01 until 9:00:04. It should be noted that the black circle means that the corresponding point in time is included in the lifetime, while the white circle means that the corresponding point in time is excluded.
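The lifetime described above corresponds to a half-open time interval, which can be expressed as a minimal Python sketch. The helper names `is_alive` and `at` are hypothetical.

```python
from datetime import datetime, timedelta

LIFETIME = timedelta(seconds=3)  # range of the window in this example

def is_alive(start, now, lifetime=LIFETIME):
    """Return True while `now` lies in [start, start + lifetime):
    the start is included (black circle), the end is excluded (white circle)."""
    return start <= now < start + lifetime

def at(clock):
    """Parse a wall-clock string such as "09:00:01" (the date part is irrelevant here)."""
    return datetime.strptime(clock, "%H:%M:%S")

start_221 = at("09:00:01")  # lifetime start of tuple 221
alive_at_start = is_alive(start_221, at("09:00:01"))
alive_at_3 = is_alive(start_221, at("09:00:03"))
alive_at_4 = is_alive(start_221, at("09:00:04"))
```

In this sketch, tuple 221 is alive at 9:00:01 and 9:00:03 but no longer at 9:00:04.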
For example, the summation of sensor “a” at 9:00:03 is 6 (tuple 243) because tuples 221, 222, and 223 are within their lifetimes. At 9:00:04 the lifetime of tuple 221 ends, and the summation changes to 5 (tuple 244). After tuple 204 arrives, the summation changes again to 9. Based on the data source timestamp, the summation at 9:00:04 should be 9. However, the actual result is 5. In the same manner, the result at 9:00:07 is 22 (tuple 248), though the result based on the data source timestamp should be 18.
RStream [1 second] outputs the current summation results every second. Tuple 283 (a,6) is sent at 9:00:03 and tuple 284 (a,5) is sent at 9:00:04. Based on the data source timestamp, tuple 284 should be (a,9). However, the actual result is (a,5) due to the delay. In the same manner, the result at 9:00:07 is tuple 287 (a,22), though the result based on the data source timestamp should be (a,18). As a result, it may be difficult for a stream data processing server that operates on arrival timestamps to provide results based on data source timestamps.
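The time chart for sensor “a” can be approximated with a minimal Python sketch that evaluates the 3-second summation once per second. The tuple values 1 through 9 (one per second) and the half-second delay of tuple 204 are assumptions chosen to be consistent with the summations stated above, and the function names are hypothetical. The source-based output is the ideal result and ignores the waiting time that a real server would need.

```python
# (source_ts, arrival_ts, value) for sensor "a", in seconds after 9:00:00.
# Assumed: values 1..9, one tuple per second; tuple 204 (source 9:00:04)
# arrives half a second late; all other tuples arrive on time.
TUPLES_A = [(s, 4.5 if s == 4 else float(s), s) for s in range(1, 10)]
RANGE = 3  # seconds, as in S1[range 3 second]

def window_sum(tuples, now, use_source):
    """Sum the values of tuples whose lifetime [start, start + RANGE) contains `now`.
    The lifetime starts at the source timestamp or at the arrival timestamp."""
    total = 0
    for src, arr, val in tuples:
        start = src if use_source else arr
        if start <= now < start + RANGE:
            total += val
    return total

def rstream(tuples, use_source, seconds=range(1, 10)):
    """Emit (id, sum(val)) once per second, similar to rstream [1 second]."""
    return [("a", window_sum(tuples, now, use_source)) for now in seconds]

by_arrival = rstream(TUPLES_A, use_source=False)
by_source = rstream(TUPLES_A, use_source=True)

# Index 3 corresponds to 9:00:04 and index 6 to 9:00:07.
at_4 = (by_arrival[3], by_source[3])
at_7 = (by_arrival[6], by_source[6])
```

Under these assumptions, the arrival-based output is (a,5) at 9:00:04 and (a,22) at 9:00:07, whereas the source-based output is (a,9) and (a,18), respectively. The delayed tuple is missing from the earlier window and lingers in the later one.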