This invention relates to parallel distributed processing, and to a computer system in which a plurality of distributed computers process a large quantity of sequence data in parallel.
In recent years, sensor devices utilizing, for example, radio frequency identification (RFID) or integrated circuit (IC) cards have come into use in a wide range of settings, and a large quantity of sequence data can be acquired from these sensor devices. Sequence data is a set of pieces of data, each of which includes a plurality of data items each holding a value, arranged according to the value of a given data item (referred to as the sequence data item). Generally, the sequence data is accumulated in a system in order of the sequence data item, and is taken out from the system and used with that order maintained.
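As a minimal illustrative sketch of the definition above, the following Python fragment models each piece of data as a record with several data items and arranges the records by one of them. The `Record` type, its field names, and the choice of a timestamp as the sequence data item are assumptions made only for this example and do not appear in the original description.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class Record:
    """One piece of data; all field names are hypothetical."""
    timestamp: int   # assumed sequence data item
    sensor_id: str   # ordinary data item
    value: float     # ordinary data item


def to_sequence(records):
    """Arrange the records in order of the sequence data item."""
    return sorted(records, key=attrgetter("timestamp"))


# Readings as they might arrive, not yet in order.
raw = [
    Record(30, "s1", 71.5),
    Record(10, "s1", 70.0),
    Record(20, "s2", 12.3),
]

sequence = to_sequence(raw)
timestamps = [r.timestamp for r in sequence]
```

Once arranged in this way, the records can be read back in the same order in which they were accumulated, which is the property the later analysis relies on.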
Attempts have been made to analyze changes and trends in the sequence data and to use the analyzed results for business activities. For example, a plurality of sensor devices may be installed on a construction machine, and the state of the construction machine may be analyzed based on trends and changes in the time-series data acquired from the sensor devices, so that the analyzed results can be used for maintenance of the construction machine. In such an attempt, an analysis application of the batch processing type is generally used to apply grouping and filtering to the large quantity of sequence data acquired from the sensor devices, and then to apply aggregation processing to the data with attention to the order of the sequence data item.
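The batch analysis described above (grouping, filtering, then aggregation that depends on the order of the sequence data item) can be sketched as follows. This is only an illustration under stated assumptions: the sensor names, the validity rule, and the choice of "largest change between consecutive readings" as the order-dependent aggregate are all hypothetical.

```python
from collections import defaultdict

# Hypothetical readings: (timestamp, sensor name, measured value).
readings = [
    (1, "engine_temp", 80.0),
    (1, "oil_pressure", 3.1),
    (2, "engine_temp", 82.0),
    (2, "oil_pressure", 3.0),
    (3, "engine_temp", 90.0),
    (3, "oil_pressure", -1.0),  # assumed invalid reading, removed by the filter
    (4, "oil_pressure", 2.5),
]


def max_step_change(readings, is_valid):
    """Group by sensor, filter, then aggregate in timestamp order."""
    groups = defaultdict(list)
    for ts, sensor, value in readings:
        if is_valid(value):                      # filtering
            groups[sensor].append((ts, value))   # grouping

    result = {}
    for sensor, series in groups.items():
        series.sort()  # restore the order of the sequence data item
        # Order-dependent aggregation: change between consecutive readings.
        steps = [abs(b[1] - a[1]) for a, b in zip(series, series[1:])]
        result[sensor] = max(steps, default=0.0)
    return result


changes = max_step_change(readings, is_valid=lambda v: v >= 0.0)
```

The aggregate here is meaningful only if each group is processed in order of the sequence data item, which distinguishes this kind of analysis from an order-insensitive sum or count.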
As a technology for realizing the processing carried out by the analysis application, MapReduce is known, as disclosed in US 2008/0086442 and in “MapReduce: Simplified Data Processing on Large Clusters,” Jeffrey Dean and Sanjay Ghemawat, Google Inc., OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, Calif., Dec. 6, 2004. MapReduce is a programming model which simplifies the analysis processing applied to the data into group extraction processing (Map processing) and data aggregation processing (Reduce processing). The group extraction processing is processing of grouping divided data using a data item (key) for extracting specific groups, and outputting the results as intermediate datasets. The data aggregation processing is processing of aggregating the data by merging the intermediate datasets output by the group extraction processing, and outputting the results.
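The two kinds of processing described above can be sketched in a single Python process as follows. This is a simplified illustration that follows the description in this text, not the execution engine of any actual MapReduce implementation; the function names `map_phase` and `reduce_phase`, the sample data, and the averaging aggregate are assumptions introduced for the example.

```python
from collections import defaultdict


def map_phase(split, key_of):
    """Group extraction: group one division of the data by key and
    return the result as an intermediate dataset."""
    intermediate = defaultdict(list)
    for record in split:
        intermediate[key_of(record)].append(record)
    return intermediate


def reduce_phase(intermediates, aggregate):
    """Data aggregation: merge the intermediate datasets key by key,
    then aggregate the merged records of each key."""
    merged = defaultdict(list)
    for intermediate in intermediates:
        for key, records in intermediate.items():
            merged[key].extend(records)
    return {key: aggregate(records) for key, records in merged.items()}


# Two divisions of hypothetical (sensor, value) data.
splits = [
    [("s1", 70.0), ("s2", 12.0)],
    [("s1", 74.0), ("s2", 14.0), ("s1", 72.0)],
]

# Each division could be handled by a different computer.
intermediates = [map_phase(s, key_of=lambda r: r[0]) for s in splits]

averages = reduce_phase(
    intermediates,
    aggregate=lambda records: sum(v for _, v in records) / len(records),
)
```

Because each call to `map_phase` touches only its own division of the data, the calls are independent of one another, which is what allows an execution engine to distribute them over a plurality of computers.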
Because the analysis processing is simplified in this way, an execution engine of MapReduce can determine the unit of division for the analysis application and can control the parallel processing. Moreover, the processing can be allocated dynamically to a plurality of computers, and thus the execution engine of MapReduce is suitable for a system having a large-scale parallel configuration using a large number of computers. Further, for a developer, there is an advantage in that it is not necessary to be aware of how the distributed processing is carried out among the plurality of computers, and it is only necessary to define a method for the group extraction processing and a method for the data aggregation processing. Still further, for an operator, there is an advantage in that flexible sizing and scheduling are possible in a large-scale environment.