The present invention pertains to a parallel distributed processing method and a computer system for processing large amounts of series data in parallel using a plurality of distributed computers.
In recent years, big data processing, which analyzes and processes large amounts of data to obtain and make use of findings that were previously unobtainable, has been attracting attention. In the field of big data, sensor data obtained from a device, for example, takes the form of series data. Series data is a data set in which a plurality of data pieces, each consisting of values for a plurality of data items, are arranged in accordance with respective sequential labels.
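As an illustration only, the definition of series data given above can be sketched in Python as follows. The class name DataPiece, the function make_series, and the data items "temperature" and "pressure" are hypothetical and are introduced solely for this example; a timestamp-like integer is assumed as the sequential label.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataPiece:
    """One data piece: a sequential label plus values for several data items."""
    label: int                 # sequential label (assumed here to be a timestamp-like integer)
    values: Dict[str, float]   # data item name -> value


def make_series(pieces: List[DataPiece]) -> List[DataPiece]:
    """Arrange the data pieces in accordance with their sequential labels."""
    return sorted(pieces, key=lambda piece: piece.label)


# Sensor readings that arrive out of order are arranged into series data.
series = make_series([
    DataPiece(3, {"temperature": 21.5, "pressure": 101.2}),
    DataPiece(1, {"temperature": 20.9, "pressure": 101.3}),
    DataPiece(2, {"temperature": 21.1, "pressure": 101.1}),
])
labels = [piece.label for piece in series]
```

In this sketch, each element of series corresponds to one data piece, and the ordering by label is what distinguishes series data from an unordered collection of records.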
In existing distributed analysis systems, when large amounts of data need to be analyzed, a system must be designed for each type of processing, which results in a high cost of system construction.
To address this issue, the MapReduce framework has been proposed as a core technology for implementing analysis processing with ease, as described in Patent Literature 1 and Non-Patent Literature 1. The MapReduce framework is a programming model in which a data analysis procedure is written in two parts: an extraction procedure (Map procedure) that extracts desired data from a data store, and an aggregation procedure (Reduce procedure) that transforms the extracted data into a readily usable form or into statistical information. This allows the execution engine of the MapReduce framework to determine the units into which an analysis application is divided and to control parallel processing.
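The two-part structure described above may be sketched, under stated assumptions, as the following single-process Python example. It is not the implementation of Patent Literature 1 or Non-Patent Literature 1; the names map_procedure, reduce_procedure, and run_map_reduce and the record fields "sensor" and "value" are hypothetical, and the grouping step that a real execution engine would distribute over many machines is performed here in one loop.

```python
from collections import defaultdict


def map_procedure(record):
    """Extraction procedure: emit (key, value) pairs for the desired records only."""
    if record["value"] is not None:          # skip records that carry no measurement
        yield record["sensor"], record["value"]


def reduce_procedure(key, values):
    """Aggregation procedure: turn the extracted values into a statistic (the mean)."""
    return key, sum(values) / len(values)


def run_map_reduce(records, mapper, reducer):
    """Stand-in for the execution engine: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                   # each call to mapper is independent,
        for key, value in mapper(record):    # so an engine may run them in parallel
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


records = [
    {"sensor": "a", "value": 1.0},
    {"sensor": "b", "value": 10.0},
    {"sensor": "a", "value": 3.0},
    {"sensor": "a", "value": None},
]
result = run_map_reduce(records, map_procedure, reduce_procedure)
```

Because the author of the analysis writes only map_procedure and reduce_procedure, the division of work and the control of parallel processing can be left to the engine, which is represented here by run_map_reduce.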
However, the MapReduce framework was originally intended for describing the processing of unstructured and non-sequential data, as in a Web search system. Thus, the MapReduce framework cannot be expected to improve the processing performance for series data. For example, the extraction procedure is executed simultaneously as a plurality of tasks on a large number of infrastructures and thus contributes greatly to enhancing the processing speed; however, it is difficult to apply to it analysis methods commonly used for series data, such as moving average calculation and Fourier transformation.
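One way to see the difficulty noted above is the following minimal Python sketch, which assumes a simple moving average over a fixed window and a division of the series into two independently processed parts. The function name moving_average and the sample values are hypothetical and serve only as an illustration.

```python
def moving_average(values, window):
    """Simple moving average over every full window of consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
window = 3

# Moving average computed over the whole series in sequential order.
whole = moving_average(series, window)

# The same calculation applied independently to each part, as independent
# extraction tasks would do without access to neighbouring data pieces.
parts = [series[:3], series[3:]]
per_part = [avg for part in parts for avg in moving_average(part, window)]

# Windows that span the boundary between the two parts are lost.
missing = [avg for avg in whole if avg not in per_part]
```

In this example, the windows that straddle the boundary between the two parts cannot be computed by either task alone, so the per-part result contains fewer values than the result for the whole series.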
In the MapReduce framework, such processing is written in the aggregation procedure; however, it is difficult to enhance the processing speed of the aggregation processing by increasing the number of infrastructures that perform it.
To address this issue, a technique that speeds up the aggregation processing by using a stream processing infrastructure is known, as described in Non-Patent Literature 2. However, even when a stream processing infrastructure is used, a waiting time arises until all data has been extracted in the extraction processing, and transmitting the extracted data directly to another server via a network increases the communication load. Further, in the aggregation processing of series data, the amount of data is not always sufficiently reduced when the result is written, and the relocation of large amounts of data increases the communication and processing loads or reduces the speed.