There is an increasing demand for data processing systems that process, in real time, large amounts of continuously arriving data. Examples include automated stock trading, car probe systems, Web access monitoring, and manufacturing monitoring.
Conventionally, a database management system (hereinafter referred to as DBMS) has been positioned at the center of data management in enterprise information systems. A DBMS stores the data to be processed in storage and achieves highly reliable processing, such as transaction processing, on the stored data. However, a DBMS has difficulty satisfying the real-time requirement described above because it performs retrieval processing over all data each time new data arrives. For example, in a financial application supporting stock transactions, one of the most important issues is how quickly the system can respond to a change in stock price. With a DBMS, a business opportunity may be lost because data retrieval processing cannot keep up with the speed of stock price changes.
A stream data processing system has been proposed as a data processing system suited to such real-time data processing. For example, a stream data processing system “STREAM” is disclosed in R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein and R. Varma: “Query Processing, Resource Management, and Approximation in a Data Stream Management System”, In Proc. of the 2003 Conf. on Innovative Data Systems Research (CIDR), January 2003. Unlike a conventional DBMS, the stream data processing system registers a query in the system in advance and executes the query continuously as data arrives. Because the queries to be executed are known in advance, high-speed processing can be achieved by processing, upon the arrival of new data, only the differences from the previous processing results. Stream data processing can therefore analyze in real time data that occurs at a high rate, such as stock transaction data, to detect and make use of events that are useful for business.
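The difference-based processing described above can be illustrated with a minimal sketch. It assumes a row-based sliding window and a sum aggregate; the names SlidingSum, insert, and recompute_sum are hypothetical and are not taken from STREAM or from any cited system.

```python
from collections import deque


class SlidingSum:
    """Keeps the sum of the most recent `size` values up to date incrementally."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # values currently inside the window
        self.total = 0         # previous result, reused on every arrival

    def insert(self, value):
        # Apply only the difference: add the arriving value ...
        self.window.append(value)
        self.total += value
        # ... and subtract the value that falls out of the window, if any.
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total


def recompute_sum(values, size):
    """Baseline for comparison: rescans the last `size` values on each call."""
    return sum(values[-size:])


if __name__ == "__main__":
    prices = [10, 20, 30, 40]
    agg = SlidingSum(size=3)
    for i, price in enumerate(prices, start=1):
        incremental = agg.insert(price)
        full_scan = recompute_sum(prices[:i], 3)
        print(price, incremental, full_scan)
```

Each call to insert does a constant amount of work regardless of the window size, whereas the baseline rescans the window on every arrival; both produce the same result.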
To process a large amount of data quickly, stream data processing requires distributed processing by plural computers (nodes). Two methods of distributed processing are known: a method (hereinafter referred to as a pipeline parallelism method) in which each operator constituting a query is processed in a different node, and a method (hereinafter referred to as a data parallelism method) in which the data for a single operator is divided and processed in plural nodes. In particular, the data parallelism method can significantly increase throughput because, in comparison with the pipeline parallelism method, its communication overhead does not increase as noticeably with an increase in the number of nodes.
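The two methods can be contrasted with a small single-process sketch in which plain lists stand in for nodes. The operators op_filter and op_project, the price threshold, and the round-robin split are assumptions made for illustration; no real network communication is modeled, so the sketch shows only how work is assigned, not the difference in overhead.

```python
def op_filter(t):
    """First operator of a hypothetical query: keep tuples priced above 100."""
    return t["price"] > 100


def op_project(t):
    """Second operator: keep only the symbol and the price."""
    return (t["symbol"], t["price"])


def pipeline_parallel(tuples):
    # One node per operator: node A runs only the filter and forwards its
    # output to node B, which runs only the projection.
    node_a_out = [t for t in tuples if op_filter(t)]
    node_b_out = [op_project(t) for t in node_a_out]
    return node_b_out


def data_parallel(tuples, num_nodes):
    # Every node runs the same operators, each on its own share of the data
    # (round-robin split, which is adequate for these stateless operators).
    shares = [tuples[i::num_nodes] for i in range(num_nodes)]
    outputs = [[op_project(t) for t in share if op_filter(t)] for share in shares]
    return [row for out in outputs for row in out]


if __name__ == "__main__":
    stream = [
        {"symbol": "A", "price": 90},
        {"symbol": "B", "price": 120},
        {"symbol": "C", "price": 150},
        {"symbol": "D", "price": 80},
        {"symbol": "E", "price": 110},
    ]
    print(pipeline_parallel(stream))
    print(sorted(data_parallel(stream, num_nodes=2)))
```

Both assignments produce the same set of result tuples; under data parallelism the output order depends on how the data was split, which is why the usage example sorts it.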
In the data parallelism method, the way data is allocated to each node is derived from the way each operator processes data. Stream data processing is often described in languages similar to SQL (Structured Query Language), which is widely used in DBMS, such as CQL (Continuous Query Language) disclosed in A. Arasu, S. Babu and J. Widom: “The CQL continuous query language: semantic foundations and query execution”, The VLDB Journal, Volume 15, Issue 2, pp. 121-142 (June 2006). A data partitioning method can therefore be derived in the same manner as for a relational database (RDB). For example, like SQL, CQL has Join and Aggregation operators, for which the way data is partitioned is determined by the join conditions and the unit of aggregation, respectively, as in an RDB. The data parallelism method is disclosed in: US Patent US2007/0288635; US Patent US2008/0168179; T. Johnson, M. S. Muthukrishnan, V. Shkapenyuk, O. Spatscheck: “Query-aware partitioning for monitoring massive network data streams”, SIGMOD, 2008; M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, M. J. Franklin: “Flux: an adaptive partitioning operator for continuous query systems”, ICDE, 2003; and M. Ivanova, T. Risch: “Customizable parallel execution of scientific stream queries”, VLDB, 2005.
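As a rough illustration of partitioning by the unit of aggregation, the sketch below routes each tuple to a node according to a hash of its grouping key, so that every group is aggregated entirely on one node. The function names, the use of CRC32 as the hash, and the sample trade tuples are assumptions made for this example and do not reproduce the method of any of the cited references.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Map a grouping key to a node index with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(tuples, key_field, num_nodes):
    """Route each tuple to the node that owns its grouping key."""
    nodes = defaultdict(list)
    for t in tuples:
        nodes[node_for(t[key_field], num_nodes)].append(t)
    return nodes


def aggregate_per_node(node_tuples, key_field, value_field):
    """Sum value_field per key, using only the tuples held by one node."""
    sums = defaultdict(int)
    for t in node_tuples:
        sums[t[key_field]] += t[value_field]
    return dict(sums)


if __name__ == "__main__":
    trades = [
        {"symbol": "AAA", "volume": 100},
        {"symbol": "BBB", "volume": 50},
        {"symbol": "AAA", "volume": 25},
        {"symbol": "CCC", "volume": 10},
        {"symbol": "BBB", "volume": 5},
    ]
    for node_id, held in sorted(partition(trades, "symbol", 2).items()):
        print(node_id, aggregate_per_node(held, "symbol", "volume"))
```

Because all tuples with the same symbol reach the same node, the per-node sums are already final, and the overall result is simply the union of the per-node results. The same idea applies to an equality join when both input streams are partitioned on the join key.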