Advances in computer networks and their connectivity have resulted in increased popularity of distributed computer systems. Such systems permit a user operating a low-performance local machine to leverage the vast resources of computer clusters and grids that make up a typical distributed computer system. The computers or machines in such distributed systems are conventionally referred to as nodes.
In particular, the user can access data in the network and perform computationally intensive operations on it. Frequently, the data that is processed in such distributed environments is also spread across the nodes belonging to the network. In other words, the data is stored across various storage resources available to the network nodes in the form of a distributed file system (DFS).
One approach to handling vast amounts of distributed information for large-scale data analytics involves the use of batch jobs. Of these, the most popular are map-reduce jobs, which are supported within Hadoop clusters. Map-reduce is a relatively young framework that allows a user to specify the processing logic by designing their own map and reduce operations or functions. The map and reduce operations can be written in a general-purpose language (e.g., Java or Python), which makes the framework relatively user-friendly.
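As an illustration of such user-defined operations, the following minimal Python sketch counts words. The names `map_fn`, `reduce_fn` and `run_job` are hypothetical and chosen for this example only; the single-process driver merely stands in for the cluster and is not the Hadoop API.

```python
from collections import defaultdict


def map_fn(key, value):
    """User-defined map: emit (word, 1) for every word in one line of text."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """User-defined reduce: sum all counts collected for one word."""
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Toy single-process driver (an assumption, not a real framework).

    Applies map_fn to every input record, groups the intermediate
    values by key, then applies reduce_fn to each group.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


# Input records are key-value pairs: (line number, line text).
records = [(0, "the quick fox"), (1, "the lazy dog")]
result = run_job(records, map_fn, reduce_fn)
```

The user supplies only the two functions; splitting, scheduling and data movement are left to the framework.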
A map-reduce job is performed on input files that exhibit a certain minimum structure. In particular, suitable input files are commonly formatted in key-value pairs. The value portion of each pair is usually some static data, i.e., not a program, and it may contain logs, database entries or general list entries.
Map-reduce itself consists of several phases. A job tracker that runs on the cluster's master node manages the entire map-reduce job. During the map phase, the input data in the form of key-value pairs is divided into a number of splits. A task tracker schedules the splits onto map nodes, which apply the user-defined map operations to them. Generally, the map operations are run in multiple waves and produce a large amount of intermediate data. Many operations, such as collect, spill and merge, have to be performed during the map phase, frequently in multiple rounds, to deal with the intermediate data generated by applying the map operation to large amounts of input data. All of these operations constitute the pre-shuffle phase of a map-reduce job.
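The collect, spill and merge operations of the map phase can be sketched in Python as follows. This is a simplified model under stated assumptions: the names `make_splits`, `word_map` and `map_task` and the `buffer_limit` parameter are hypothetical, and spills are kept as in-memory lists here, whereas an actual cluster would write them to local storage.

```python
import heapq


def make_splits(records, split_size):
    """Divide the input records into fixed-size splits."""
    return [records[i:i + split_size]
            for i in range(0, len(records), split_size)]


def word_map(key, value):
    """Example user-defined map operation: emit (word, 1) per word."""
    for word in value.split():
        yield word, 1


def map_task(split, map_fn, buffer_limit):
    """Run one map task over a split (illustrative only).

    collect: append each intermediate pair to a bounded buffer
    spill:   when the buffer is full, sort it and set it aside
    merge:   combine all sorted spills into one sorted run
    """
    buffer, spills = [], []
    for key, value in split:
        for pair in map_fn(key, value):
            buffer.append(pair)                  # collect
            if len(buffer) >= buffer_limit:
                spills.append(sorted(buffer))    # spill
                buffer = []
    if buffer:
        spills.append(sorted(buffer))            # final partial spill
    return list(heapq.merge(*spills))            # merge


splits = make_splits([(0, "b a"), (1, "c a"), (2, "d")], split_size=2)
map_outputs = [map_task(s, word_map, buffer_limit=2) for s in splits]
```

A smaller `buffer_limit` produces more spills and therefore more merge work, which is one reason these operations may need several rounds on large inputs.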
In the next phase, which is most frequently referred to as the shuffle phase, the intermediate data is transferred from the map nodes to the reduce nodes. The shuffle phase is the period of most intense network traffic and is typically an all-to-all (or many-to-many) type operation. In fact, the shuffle phase often stresses the bandwidth of the network interconnections.
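The all-to-all character of the shuffle can be illustrated with a small Python sketch. It assumes hash partitioning of intermediate keys, which is a common choice but is not stated in the text above; the names `partition` and `shuffle` are hypothetical, and CRC32 is used only to obtain a hash that is stable across runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reduce node for a key (assumed hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route intermediate pairs from map nodes to reduce nodes.

    map_outputs maps a map-node id to its list of (key, value) pairs.
    Returns the per-reducer inputs and a count of pairs moved over
    each (map node, reduce node) link.
    """
    reducer_inputs = defaultdict(list)
    transfers = defaultdict(int)
    for map_node, pairs in map_outputs.items():
        for key, value in pairs:
            reducer = partition(key, num_reducers)
            reducer_inputs[reducer].append((key, value))
            transfers[(map_node, reducer)] += 1
    return dict(reducer_inputs), dict(transfers)


map_outputs = {
    "map-0": [("apple", 1), ("pear", 1), ("plum", 1)],
    "map-1": [("apple", 1), ("fig", 1), ("kiwi", 1)],
}
reducer_inputs, transfers = shuffle(map_outputs, num_reducers=2)
```

Because every map node may hold keys destined for every reduce node, the number of active links can grow to the product of the two node counts, which is why this phase tends to load the network interconnections.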
The final phase of map-reduce involves merging the sorted fragments of intermediate data obtained from the different map nodes to form the input for the reduce nodes. The latter apply the user-specified reduce operation to this input to produce the final output data. The typical output is in the form of a list that may be further compressed and written back to the DFS (e.g., Hadoop DFS or HDFS).
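The merge of sorted fragments followed by the reduce operation can be sketched as below. This is a minimal illustration, assuming each fragment arrives already sorted by key; `reduce_task` and `sum_reduce` are hypothetical names, and writing the output back to the DFS is omitted.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sum_reduce(key, values):
    """Example user-specified reduce operation: sum the values."""
    return key, sum(values)


def reduce_task(sorted_fragments, reduce_fn):
    """Merge sorted fragments from several map nodes, then reduce.

    Each fragment is a list of (key, value) pairs sorted by key.
    heapq.merge keeps the combined stream sorted, so all values for
    one key are adjacent and can be grouped in a single pass.
    """
    merged = heapq.merge(*sorted_fragments, key=itemgetter(0))
    output = []
    for key, group in groupby(merged, key=itemgetter(0)):
        output.append(reduce_fn(key, [value for _, value in group]))
    return output


fragments = [
    [("a", 1), ("c", 1)],   # from one map node
    [("a", 2), ("b", 5)],   # from another map node
]
final_output = reduce_task(fragments, sum_reduce)
```

The result is a list of key-value pairs, matching the list-like output described above.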
Many skilled artisans have recognized that it is the shuffle phase, rather than the pre- and post-shuffle phases, that presents a bottleneck in the map-reduce framework. For this reason, many of them have studied this phase and proposed various methods for quantifying the dataflow and ameliorating the intense traffic. For example, Herodotou H., “Hadoop Performance Models”, Technical Report, Duke University CS Dept., May 2011, pp. 1-19 teaches a number of mathematical performance models for describing the execution of map-reduce jobs on Hadoop. The goal is to estimate performance and find optimal configuration settings when running map-reduce jobs.
Furthermore, methods for optimizing the management of intermediate data in map-reduce jobs are also discussed by Moise D., et al., “Optimizing Intermediate Data Management in MapReduce Computations”, CloudCP 2011, 1st. Intl. Workshop on Cloud Computing Platforms, ACM SIGOPS Eurosys 11 Conference, Apr. 1, 2011. The same group also teaches the use of BlobSeer as a storage backend for map-reduce jobs to enable higher throughput. The corresponding teaching is provided by Nicolae B., et al., “BlobSeer: Next Generation Data Management for Large Scale Infrastructures”, Journal of Parallel and Distributed Computing, 71, 2, Aug. 24, 2010, pp. 168-184. Still others teach alternative methods for pre-fetching and/or pre-shuffling of data in order to alleviate the traditional network traffic bottlenecks encountered during the shuffle phase of map-reduce.
Yet another approach to optimizing the shuffle phase involves making an appropriate selection of storage resources for the intermediate data. Such selection, as noted by others, becomes especially important when the storage resources available to the cluster are heterogeneous. In response to this problem, Kim M. and Shim K., “Shuffling Optimization in Hadoop M/R”, Fall CS 492 Presentation, South Korea, Dec. 15, 2008, pp. 1-13 teach the addition of an in-memory file system for storing certain intermediate data. In other words, rather than writing that intermediate data to a local disk file system, it is kept in an in-memory file system.
Although much effort has been devoted to finding methods for managing intermediate data during the shuffle phase, there is a need for further improvement. Many existing solutions deliver speed-ups of just a few percent and encounter limitations when implemented in practice on data of varying degrees of importance or popularity.