1. Field
The present disclosure relates to parallel computing technology for processing a large amount of data, and more particularly, to a system and method for accelerating a mapreduce operation.
2. Discussion of Related Art
“Mapreduce” is a software framework developed to support distributed computing for processing a large amount of data. This framework has been developed to support parallel processing of petabytes of data, or more, in a cluster environment, i.e., an environment in which a plurality of computers cooperate to perform one or more processing tasks.
A current system for processing a mapreduce operation performs the operation through (a) a map operation process of generating key-value pairs from original data, and (b) a reduce operation process of converting the generated key-value pairs into other key-value pairs. In this process, the intermediate data of each operation step is (1) stored in a local file system of a node constituting the distributed computing system, (2) transmitted to the node that requires it for the next operation, and (3) stored in a local file system of that node, after which the node performs the next operation.
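The map and reduce processes described above can be illustrated with a minimal single-process sketch. A word count is used here purely as an assumed example workload, and the names map_operation, reduce_operation, and run_mapreduce are hypothetical; they are not part of any particular mapreduce framework. The sketch keeps all intermediate data in memory and therefore omits the storage and transmission steps (1) to (3).

```python
from collections import defaultdict


def map_operation(record):
    """Map step: generate (key, value) pairs from one record of original data."""
    return [(word.lower(), 1) for word in record.split()]


def reduce_operation(key, values):
    """Reduce step: convert the values grouped under one key into a new pair."""
    return (key, sum(values))


def run_mapreduce(records):
    """Run map over every record, group the pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_operation(record):
            grouped[key].append(value)
    return dict(reduce_operation(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "b c"]))
```

In a distributed system, the grouping step in run_mapreduce corresponds to the point at which intermediate key-value pairs leave the map node and arrive at a reduce node.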
Such a mapreduce operation method has an advantage in that it is possible to process a large amount of data while minimizing data movement on a network, but also has a drawback in that the performance deteriorates due to the following five performance bottlenecks occurring in the operation process:
(1) Performance bottleneck caused by low disk input/output (I/O) speed when a map node reads original data to be used in a map operation from a local file system,
(2) Performance bottleneck caused by low disk I/O speed in a process of storing temporary key-value pair data generated after the map operation in the local file system before the temporary key-value pair data is transmitted to a reduce node,
(3) Network delay occurring in a process of transmitting a data block present in the local file system through a network when the data is transmitted to the remote reduce node,
(4) Performance bottleneck caused by low disk I/O speed in a process of temporarily storing the map key-value pair data transmitted to the reduce node in a local file system of the reduce node, and
(5) Performance bottleneck of disk I/O occurring in a process of merging and storing the map key-value pair data in the local file system again in the form of a data block for the purpose of a reduce operation.
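To make the location of the five bottlenecks concrete, the following sketch walks a small word count through the same sequence of disk and transfer steps on a single machine. It is an illustration under stated assumptions, not a description of any real framework: the map node and reduce node are represented by two hypothetical directories, the network transfer is stood in for by a local file copy, and the function name staged_word_count and all file names are invented for the example.

```python
import json
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def staged_word_count(lines, workdir):
    """Word count that touches disk at each of the five bottleneck points."""
    map_dir = Path(workdir) / "map_node"        # assumed local file system of the map node
    reduce_dir = Path(workdir) / "reduce_node"  # assumed local file system of the reduce node
    map_dir.mkdir(parents=True, exist_ok=True)
    reduce_dir.mkdir(parents=True, exist_ok=True)
    stages = []

    # Setup only: place the original data on the map node's disk.
    source = map_dir / "input.txt"
    source.write_text("\n".join(lines))

    # Bottleneck 1: the map node reads the original data from local disk.
    records = source.read_text().splitlines()
    stages.append("read original data on map node")

    # Map operation, then bottleneck 2: store temporary pairs on local disk.
    pairs = [[word, 1] for record in records for word in record.split()]
    map_output = map_dir / "map_output.json"
    map_output.write_text(json.dumps(pairs))
    stages.append("store map output on map node")

    # Bottlenecks 3 and 4: transmit the block (a file copy stands in for the
    # network) and store it temporarily on the reduce node's disk.
    received = reduce_dir / "received.json"
    shutil.copyfile(map_output, received)
    stages.append("transmit map output to reduce node")
    stages.append("store received data on reduce node")

    # Bottleneck 5: merge (sort by key) and store again as one data block.
    merged = reduce_dir / "merged.json"
    merged.write_text(json.dumps(sorted(json.loads(received.read_text()))))
    stages.append("merge and store data block on reduce node")

    # Reduce operation over the merged block.
    totals = defaultdict(int)
    for key, value in json.loads(merged.read_text()):
        totals[key] += value
    return dict(totals), stages


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        totals, stages = staged_word_count(["a b a", "b c"], workdir)
        print(totals)
        for number, stage in enumerate(stages, start=1):
            print(number, stage)
```

Every entry in stages corresponds to a disk or network access that sits between the map computation and the reduce computation, which is why the intermediate data path dominates the cost in this arrangement.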