In recent years, large volumes of data are processed in an increasing number of cases. The processing capacity of the information processing devices used for data processing has also been greatly improved. However, when a single information processing device is caused to process a large volume of data, the processing takes a very long time. Therefore, when large volumes of data are required to be processed in a shorter time, distributed processing, in which a plurality of information processing devices are caused to perform data processing in parallel, is normally employed.
Currently, in an increasing number of cases, a plurality of virtual machines (VMs) are created on a high-performance information processing device. This is because the following advantages are achieved: virtual machines do not interfere with one another; any operating system (OS) and software may be employed; and the number of information processing devices that are to be used may be reduced. Each virtual machine may be used as a single information processing device (a node) that is caused to perform distributed processing.
In distributed processing, data is allocated to each node, and the processing that is to be performed using the allocated data is specified. Hadoop, for example, may be used as a platform for such distributed processing.
Hadoop is an open source implementation of MapReduce, which is a distributed parallel processing framework, and of a distributed file system (the Hadoop Distributed File System, or HDFS). In Hadoop, data is divided into data blocks, and nodes are divided into master nodes and slave nodes.
A master node determines a task that is to be allocated to each slave node and requests the slave node to process the determined task. Thus, actual data processing is performed by the slave node. Therefore, in distributed processing using Hadoop, an increase in data volume may be addressed by increasing the number of slave nodes.
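The division of data into data blocks and the allocation of tasks by a master node may be illustrated by the following minimal sketch. It is not Hadoop code: the function names `split_into_blocks` and `assign_tasks`, the slave names, and the round-robin allocation rule are assumptions introduced only for illustration. The sketch shows that, for the same number of data blocks, adding a slave node reduces the number of tasks each slave node processes.

```python
from collections import defaultdict


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_tasks(num_blocks: int, slaves: list[str]) -> dict[str, list[int]]:
    """Hypothetical master-side step: one task per block, allocated round-robin."""
    assignment = defaultdict(list)
    for block_id in range(num_blocks):
        assignment[slaves[block_id % len(slaves)]].append(block_id)
    return dict(assignment)


# 24 bytes with a block size of 4 gives 6 data blocks.
blocks = split_into_blocks(b"x" * 24, block_size=4)

# The same blocks spread over two slave nodes, then over three.
two_slaves = assign_tasks(len(blocks), ["slave-1", "slave-2"])
three_slaves = assign_tasks(len(blocks), ["slave-1", "slave-2", "slave-3"])

print(two_slaves)    # each slave node receives 3 tasks
print(three_slaves)  # each slave node receives 2 tasks
```

In this sketch the master node only decides which slave node handles which data block; the processing of the block contents is left to the slave node, as described above.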
In a VM environment in which virtual machines are created on a plurality of information processing devices, each of the master node and the slave node is normally built on a single virtual machine. Unless specifically stated otherwise, each of the terms “master node” and “slave node” is used herein to represent a “node” built on a virtual machine.
Each slave node (task tracker) executes a task (processing) by using allocated data. In many cases, data communication between information processing devices is performed in order to obtain the allocated data. As the volume of data that is to be communicated between information processing devices increases, the processing time required for completing execution of a task increases. For this reason, scheduling in which a task is allocated to each slave node (which herein includes placement (allocation) of data) is normally performed so as to reduce the time spent for communication between the information processing devices. By reducing the time spent for communication between the information processing devices, the processing time of the entire distributed processing may also be reduced.
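The effect of such scheduling on the volume of data communicated between information processing devices may be illustrated by the following minimal sketch. It is not the scheduler of Hadoop: the host names, the slave names, the block sizes, and the functions `cross_device_bytes`, `round_robin_assign`, and `locality_aware_assign` are all hypothetical. The sketch compares an allocation that ignores where each data block is stored with one that prefers a slave node on the device that already holds the block.

```python
def cross_device_bytes(assignment, block_host, slave_host, block_size):
    """Total size of the blocks that must be sent between devices."""
    return sum(
        block_size[block]
        for block, slave in assignment.items()
        if block_host[block] != slave_host[slave]
    )


def round_robin_assign(block_host, slave_host):
    """Allocate tasks in turn, ignoring where each block is stored."""
    slaves = sorted(slave_host)
    return {b: slaves[i % len(slaves)] for i, b in enumerate(sorted(block_host))}


def locality_aware_assign(block_host, slave_host):
    """Prefer the least-loaded slave on the device that holds the block."""
    load = {slave: 0 for slave in slave_host}
    assignment = {}
    for block in sorted(block_host):
        local = [s for s in slave_host if slave_host[s] == block_host[block]]
        candidates = local or list(slave_host)  # fall back to any slave
        chosen = min(candidates, key=lambda s: (load[s], s))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical layout: two devices, one slave node (VM) on each.
slave_host = {"slave-1": "host-A", "slave-2": "host-B"}
block_host = {0: "host-A", 1: "host-A", 2: "host-B", 3: "host-B"}
block_size = {0: 64, 1: 64, 2: 64, 3: 64}  # assumed sizes, in megabytes

naive = round_robin_assign(block_host, slave_host)
aware = locality_aware_assign(block_host, slave_host)

naive_traffic = cross_device_bytes(naive, block_host, slave_host, block_size)
aware_traffic = cross_device_bytes(aware, block_host, slave_host, block_size)
print(naive_traffic, aware_traffic)
```

Under these assumed values, the allocation that ignores data placement moves two of the four blocks between devices, whereas the locality-aware allocation moves none.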
In reality, even when data block placement and task allocation are optimally performed for each slave node, the processing time of the entire distributed processing is still long. In the future, a VM environment is expected to be used for distributed processing in many cases, and the data that is to be processed will certainly increase in size. Therefore, it will be important to enable distributed processing that is executed in a VM environment to be performed at higher speed.
Related techniques are disclosed in Japanese Laid-open Patent Publication No. 2010-218307, International Publication Pamphlet No. WO 2008-062864, Japanese Laid-open Patent Publication No. 2012-108816, and Japanese Laid-open Patent Publication No. 2012-198631.