Recent years have seen the use of data processing systems to collect and analyze large amounts of data and extract useful information from the collected data. Some of those large-scale data processing systems are designed to manipulate data sequentially as it arrives in the form of a continuous stream. Large-scale stream data processing systems of this kind are often used for real-time analysis of data. As an example, one system collects records of credit card payments and detects credit cards that appear to be used illegally, such as those used repeatedly in different stores within a short time period. As another example, there is a system that estimates traffic congestion on the basis of real-time vehicle speed data collected with sensors deployed on the roads.
The large-scale stream data processing discussed above may also be called “complex event processing” (CEP). For example, the user of CEP creates a program module that describes data patterns to be detected, as well as a data processing method and other things suitable for those patterns. When executed, this program module keeps track of continuously arriving data and extracts and processes data that matches any of the patterns. The term “query” may be used in the context of CEP to refer to such a program module or to an individual runtime instance of that program module.
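As an illustration only, a query of this kind may be sketched as follows, using the credit card example given earlier. The class name `FraudQuery`, its parameters `window_seconds` and `min_stores`, and the event fields are hypothetical and do not come from any particular CEP product. The sketch assumes that events arrive in timestamp order.

```python
from collections import defaultdict, deque


class FraudQuery:
    """Hypothetical query: flag a card seen in many stores within a short period."""

    def __init__(self, window_seconds, min_stores):
        self.window_seconds = window_seconds
        self.min_stores = min_stores
        # card number -> recent payments as (timestamp, store) pairs
        self.recent = defaultdict(deque)

    def on_event(self, timestamp, card, store):
        """Process one payment record; return True if the pattern matches."""
        history = self.recent[card]
        history.append((timestamp, store))
        # Discard payments that fall outside the sliding time window.
        while history and timestamp - history[0][0] > self.window_seconds:
            history.popleft()
        distinct_stores = {s for _, s in history}
        return len(distinct_stores) >= self.min_stores
```

In this sketch the pattern (several distinct stores within the window) and the processing method (returning a flag) are both contained in one module, which corresponds to what the text calls a query.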
To deal with many pieces of data arriving per unit time, a large-scale stream data processing system may be implemented as a distributed system having a plurality of computers (physical machines). Such distributed systems are capable of executing identical program modules (e.g., copies of a single program module) on different computers in parallel. The incoming data is distributed to a plurality of processes (instances) that are running on different computers according to such a program module. Data processing operations described in the program module are parallelized in this way, and the system provides enhanced performance.
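One conceivable way of distributing incoming data to parallel instances is to hash a key of each record, as in the following minimal sketch. The text above does not specify a distribution scheme; hash partitioning is used here only as an assumption, and the function names `assign_instance` and `distribute` are hypothetical.

```python
import zlib
from collections import defaultdict


def assign_instance(key, num_instances):
    """Map a record key to an instance index (assumed hash partitioning)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_instances


def distribute(records, num_instances):
    """Group (key, value) records by the instance that would process them."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[assign_instance(key, num_instances)].append((key, value))
    return dict(buckets)
```

With this scheme, records sharing the same key always reach the same instance, while records with different keys may be processed on different computers at the same time.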
One proposed distributed system synchronizes a plurality of nodes in terms of their time of day (TOD). Specifically, this distributed system selects a master node out of the plurality of nodes and causes the master node to broadcast a TOD packet containing the current TOD value to the other nodes. The receiving nodes update their local TOD values on the basis of the received TOD packet.
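The TOD synchronization described above may be pictured with the following simplified sketch. It is not the method of the cited publication; the `Node` class and the `broadcast_tod` function are hypothetical, and network transmission delay is ignored as a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    tod: float  # local time-of-day value, in seconds

    def on_tod_packet(self, master_tod):
        """Update the local TOD from a received TOD packet."""
        self.tod = master_tod


def broadcast_tod(master, others):
    """Send the master's current TOD value to every other node."""
    for node in others:
        node.on_tod_packet(master.tod)
```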
See, for example, Japanese Laid-open Patent Publication No. 2001-297071.
Users may create a program module that relies on a timer to execute data manipulation according to the time elapsed since a specific point at which a certain condition is met. Suppose, for example, that a program module accumulates data during a period of N seconds for later analysis, where N is a positive integer. The timer starts measuring this N-second period when a certain type of data arrives for the first time.
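A timer-reliant program module of this kind may be sketched as follows. The class name `WindowedAccumulator` and its method are hypothetical, and the sketch assumes that each piece of data carries a timestamp in seconds and that the period is half-open, so data arriving exactly N seconds after the start is not accumulated.

```python
class WindowedAccumulator:
    """Accumulate data for N seconds, starting at the first arrival."""

    def __init__(self, n_seconds):
        self.n_seconds = n_seconds
        self.start = None  # set when the first piece of data arrives
        self.buffer = []

    def on_data(self, timestamp, value):
        """Store value if it falls within the N-second period; return True if stored."""
        if self.start is None:
            # The condition is met: the timer starts with the first arrival.
            self.start = timestamp
        if timestamp - self.start < self.n_seconds:
            self.buffer.append(value)
            return True
        return False
```

Note that the period is defined relative to the first arrival observed by this particular instance, not relative to any absolute time.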
The above timer-reliant program module may be executed as is on a plurality of computers for parallel data processing. This simple parallelization, however, has some problems described below. Data is manipulated in a distributed manner by a plurality of processes running on different computers. It is therefore possible that different processes may recognize their “certain condition” (e.g., their reception of the first data) at different times. Further, the computers usually manage their respective timers independently of one another, and the plurality of processes of the timer-reliant program module could run with some misalignment of time bases, thus producing data processing results that are different from those produced in the case of serial data processing.
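The discrepancy between serial and simple parallel execution may be illustrated with the following sketch. The function `accumulate_first_window`, the sample events, and the round-robin distribution to two instances are all assumptions made for illustration; timer misalignment between computers is not modeled, so the difference shown arises solely from each instance recognizing its first data at a different time.

```python
def accumulate_first_window(events, n_seconds):
    """Collect values arriving within n_seconds of the first event seen."""
    start = None
    collected = []
    for timestamp, value in events:
        if start is None:
            start = timestamp  # this instance's own "first data" time
        if timestamp - start < n_seconds:
            collected.append(value)
    return collected


# (timestamp in seconds, value), in arrival order
events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (5, "f")]

# Serial execution: one instance sees every event; the period starts at t=0.
serial = accumulate_first_window(events, 3)

# Simple parallel execution: events alternate between two instances,
# so the second instance starts its period at t=1 instead of t=0.
instance_0 = accumulate_first_window(events[0::2], 3)
instance_1 = accumulate_first_window(events[1::2], 3)
parallel = sorted(instance_0 + instance_1)

print(serial)    # ['a', 'b', 'c']
print(parallel)  # ['a', 'b', 'c', 'd']
```

The second instance accepts the value "d" because its own period runs from t=1 to t=4, whereas the serial execution closes the period at t=3.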
One possible solution is to create a program module taking into consideration the possibility of parallel execution on multiple computers. In other words, the user is to write a parallel-executable program module in an explicit manner. This approach, however, imposes a greater load on the user and also makes it difficult for the system to manage its parallelism.