Data processing may be performed using a distributed processing system that includes a plurality of nodes (for example, computers or other information processing apparatuses) connected to a network. By splitting data, assigning the resulting pieces to a plurality of nodes, and using those nodes in parallel, higher-speed data processing may be achieved. Such parallelization is employed for processing large amounts of data, for example, for analyzing access logs that record accesses to a server apparatus.
To support the creation of programs for parallel data processing, frameworks such as MapReduce have been proposed. A data processing method defined in MapReduce includes a Map phase and a Reduce phase. In the Map phase, input data is split into data blocks, which are then processed using a plurality of nodes. In the Reduce phase, the results obtained in the Map phase are aggregated using one or more nodes according to keys or the like. The results obtained in the Reduce phase may be given to the next Map phase. The framework may perform the data splitting and the aggregation automatically.
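The Map and Reduce phases described above can be illustrated with a minimal single-machine sketch. It is not the API of any actual MapReduce framework: the function names (`split_into_blocks`, `map_block`, `reduce_pairs`, `map_reduce`), the use of worker threads in place of nodes, and the sample access log are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, num_blocks):
    """Split the input records into roughly equal data blocks."""
    size = max(1, -(-len(records) // num_blocks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_block(block):
    """Map phase: emit a (key, 1) pair for every record in one block."""
    return [(url, 1) for url in block]


def reduce_pairs(pairs):
    """Reduce phase: aggregate the mapped values according to their keys."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def map_reduce(records, num_workers=3):
    """Run the Map phase in parallel (threads stand in for nodes), then reduce."""
    blocks = split_into_blocks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_block, blocks))
    all_pairs = [pair for block_pairs in mapped for pair in block_pairs]
    return reduce_pairs(all_pairs)


# Hypothetical access log: one requested path per access.
access_log = ["/index", "/about", "/index", "/cart", "/index", "/about"]
counts = map_reduce(access_log)
print(counts)
```

In this sketch the caller supplies only the per-block Map function and the aggregation; splitting and collecting the intermediate results are handled by `map_reduce`, which mirrors the division of labor between a program and the framework.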
There has been proposed a distributed processing system that checks how the amount of data changes between before and after processing, and sets a higher distribution degree when the amount of data decreases or a lower distribution degree when the amount of data increases, thereby preventing communication between nodes from becoming a bottleneck. In addition, to speed up electromagnetic analysis simulation for electric circuits, there has been proposed a method in which the analysis results of a main part are stored and, when an additional patch is inserted, electromagnetic analysis is performed only on the additional patch, using the stored analysis results of the main part.
Japanese Laid-open Patent Publication No. 2010-244470
Japanese Laid-open Patent Publication No. 2003-296395
Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Proc. of the 6th Symposium on Operating Systems Design and Implementation, pp. 137-150, December 2004
In some distributed processing systems, data is split into blocks, the blocks are processed through first-stage data processing using a plurality of nodes, and the results of the first-stage data processing are then processed through second-stage data processing. However, in conventional distributed processing systems in which given data is automatically split and processed in parallel, the first-stage data processing may be performed on the entire data each time data is input, so that the results of previous data processing are wasted.
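The waste described above can be made concrete with a small sketch. It is a hypothetical two-stage pipeline, not the behavior of any particular system: `first_stage`, `second_stage`, `run_pipeline`, the call counter, and the sample blocks are illustrative assumptions. The pipeline runs once on three blocks and then again after a single block is added.

```python
first_stage_calls = 0  # counts how often first-stage processing is executed


def first_stage(block):
    """First-stage processing of one data block (here: a partial sum)."""
    global first_stage_calls
    first_stage_calls += 1
    return sum(block)


def second_stage(partial_results):
    """Second-stage processing that combines the first-stage results."""
    return sum(partial_results)


def run_pipeline(blocks):
    """Process every block from scratch; nothing is kept between runs."""
    return second_stage([first_stage(block) for block in blocks])


blocks = [[1, 2], [3, 4], [5, 6]]
first_result = run_pipeline(blocks)  # 3 first-stage calls

# One new block arrives, yet all four blocks are processed again.
blocks_with_update = blocks + [[7, 8]]
second_result = run_pipeline(blocks_with_update)  # 4 more first-stage calls

print(first_result, second_result, first_stage_calls)
```

Of the four first-stage calls in the second run, three repeat work whose results were already computed in the first run; only the call for the newly added block is new.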