As cloud computing has become widespread, distributed processing systems have come into use that distribute big data, that is, a large amount of data stored on a cloud, among a plurality of servers and execute processing on the distributed data. Such systems are also used to analyze big data for use in various services. For example, customers' purchasing preferences may be analyzed from point card attribute information and Point of Sale (POS) system data.
As an example of such a distributed processing system, Hadoop® is known, which adopts Hadoop Distributed File System (HDFS) and MapReduce as basic technologies. HDFS is a file system that distributes and stores data across a plurality of servers. MapReduce is a mechanism for distributing and processing data on HDFS in units called tasks, and executes a Map process, a Shuffle sort process, and a Reduce process. For analysis processes, Message Passing Interface (MPI), which is a communication library for parallel computation, is known.
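The three MapReduce stages named above can be illustrated with a minimal single-process sketch. It is not Hadoop code; the function names and the sample records (customer identifier and purchase amount) are hypothetical and chosen only to show how the Map process, the Shuffle sort process, and the Reduce process hand data to one another.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for customer_id, amount in records:
        yield customer_id, amount


def shuffle_sort(pairs):
    """Shuffle sort: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped}


def run_job(records):
    """Chain the three stages in the order MapReduce executes them."""
    return reduce_phase(shuffle_sort(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical POS records: (customer id, purchase amount).
    print(run_job([("c1", 100), ("c2", 50), ("c1", 25)]))
```

In an actual distributed system, each stage would run as tasks on a plurality of servers, and the Shuffle sort process would move data between them; the sketch keeps everything in one process for clarity.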
For example, a distributed processing system that executes a MapReduce process and an MPI process includes a master server that operates as a name node and a job tracker, and slave servers that each operate as a task tracker and a data node and also execute the MPI process. In the MapReduce process, each slave server executes the Map process, the Shuffle sort process, and the Reduce process on an input file in the comma separated value (CSV) format, converts the input file into a binary format, and writes the resulting binary data to a local file on that slave server. The data written in each local file is then combined on the HDFS. In the MPI process, each slave server reads the binary data from the HDFS and executes a principal component analysis.
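The conversion from the CSV format to a binary format may be sketched as follows. The binary layout is not specified above, so the sketch assumes, purely for illustration, that every CSV field is numeric and is packed as a little-endian 64-bit float; the function names are likewise hypothetical.

```python
import csv
import io
import struct


def csv_to_binary(csv_text):
    """Pack each numeric CSV row as little-endian float64 values (assumed layout)."""
    out = bytearray()
    n_cols = None
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        values = [float(field) for field in row]
        if n_cols is None:
            n_cols = len(values)
        out += struct.pack(f"<{len(values)}d", *values)
    return bytes(out), n_cols


def binary_to_rows(blob, n_cols):
    """Read the packed rows back, as an analysis process would before computing."""
    row_size = 8 * n_cols  # 8 bytes per float64
    return [
        list(struct.unpack_from(f"<{n_cols}d", blob, offset))
        for offset in range(0, len(blob), row_size)
    ]


if __name__ == "__main__":
    blob, n_cols = csv_to_binary("1.0,2.0\n3.5,4.5\n")
    print(len(blob), binary_to_rows(blob, n_cols))
```

A fixed-width binary layout of this kind lets a reader locate any row by offset, which is one reason numerical analysis processes commonly prefer it to parsing text.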
Related techniques are disclosed in, for example, Japanese National Publication of International Patent Application No. 2013-545169 and Japanese Laid-Open Patent Publication No. 2011-150503.
However, with the above technique, the amount of communication tends to be large when data is read in the MPI process. As a result, the overall processing efficiency tends to deteriorate.
For example, each slave server of the distributed processing system writes an execution result of the MapReduce process in a local data node in accordance with an instruction of the name node. The data written in the local files of the slave servers is then combined on the HDFS. Consequently, when a slave server executes the MPI process, the slave server may have to acquire the data to be processed via a network. Data acquisition via the network is strongly affected by the bandwidth and load of the network, which may lead to a processing delay.
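The communication cost described above can be made concrete with a small accounting sketch. It is a simplified model rather than an HDFS or MPI interface: the block names, server names, and sizes are hypothetical, and a block is treated as a local read only when the server that processes it also stores a copy.

```python
def remote_bytes(block_locations, block_sizes, reader_of_block):
    """Sum the bytes that readers must fetch from other servers via the network."""
    total = 0
    for block, reader in reader_of_block.items():
        if reader not in block_locations[block]:
            total += block_sizes[block]  # no local copy: network transfer needed
    return total


if __name__ == "__main__":
    # Hypothetical placement: block b0 is stored on server s1, block b1 on server s2.
    locations = {"b0": {"s1"}, "b1": {"s2"}}
    sizes = {"b0": 64, "b1": 64}
    # Both blocks are processed on s1, so b1 has to cross the network.
    print(remote_bytes(locations, sizes, {"b0": "s1", "b1": "s1"}))
```

When every block is processed on a server that stores it, the count is zero; each mismatch between placement and processing adds the full block size to the network traffic.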