The present invention relates generally to the field of distributed parallel processing using MapReduce, and more particularly to optimizing intermediate result shuffle performance for reduce tasks.
Processing very large data sets can be greatly improved by using a large number of computers, or nodes, and distributing the work to be done by processing smaller blocks of data in parallel. The large number of nodes is collectively referred to as a cluster if all the nodes are on the same local network and use similar hardware. If the nodes are shared across geographically and administratively distributed systems and use heterogeneous hardware, the collective nodes are referred to as a grid. A framework model that processes data in this manner is MapReduce, which splits large data sets into small records of key-value pairs, such that the pairs can be processed in parallel. A MapReduce job is a schedulable object comprised of one or more stages of map tasks and reduce tasks, which are scheduled by a scheduler that is a system service or software component in a grid. In general, there can be multiple stages of map-reduce-reduce tasks. An initial map stage contains multiple map tasks that read their inputs from initial data input sources and write their partitioned outputs to tasks of subsequent reduce stages. An intermediate reduce stage contains multiple tasks that act as reduce tasks to fetch their partitioned and shuffled inputs from task outputs of previous stages but act as map tasks to write their partitioned outputs to tasks of subsequent reduce stages. A final reduce stage contains multiple reduce tasks that fetch their partitioned and shuffled inputs from task outputs of previous stages and write their outputs to final data output sinks. In the simplest case, a job has only two stages: a map stage of map tasks and a subsequent reduce stage of reduce tasks, where the map tasks' outputs are partitioned and shuffled to the reduce tasks. Map task outputs are also called intermediate results if there is more than one stage. MapReduce aids in processing and analyzing large volumes of structured and unstructured data.
Application examples include indexing and search, graph analysis, text analysis, machine learning, data transformation, and so forth. These types of applications are often difficult to implement using the standard SQL employed by relational database management systems (DBMSs).
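As a minimal illustration of the simplest case described above (a single map stage followed by a single reduce stage), the following Python sketch counts words in lines of text. The function names `map_task`, `reduce_task`, and `run_job` are hypothetical and chosen only for this example. Everything runs in one process, whereas an actual MapReduce framework would distribute the tasks across nodes.

```python
from collections import defaultdict


def map_task(record):
    """Emit one (word, 1) key-value pair per word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]


def reduce_task(key, values):
    """Sum all counts that were shuffled to this key."""
    return key, sum(values)


def run_job(records):
    # Map stage: each input record is processed independently.
    intermediate = []
    for record in records:
        intermediate.extend(map_task(record))

    # Shuffle: group the intermediate values by their output key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce stage: one call per distinct key.
    return dict(reduce_task(key, values) for key, values in grouped.items())


counts = run_job(["the quick fox", "the lazy dog", "The end"])
```

Here `counts` maps each distinct lowercase word to its number of occurrences across all input lines.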
Each computer node within the cluster or grid can run multiple mappers, multiple reducers, and a shuffler. A mapper or reducer is an operating software component that runs map tasks or reduce tasks, respectively. In the case of multiple stages of map-reduce-reduce as described previously, the same software component can act as a reducer to fetch input data from the previous stages but act as a mapper to write output data to subsequent stages. A mapper or reducer may be reused to run more than one map task or reduce task, respectively. A shuffler is a system service or software component, one per computer node, that functions to shuffle partition segments of map task outputs (intermediate results) as inputs to reduce tasks.
A map task processes input key/value pair data and generates, as output, an intermediate result comprised of partition segments, also in the form of key/value pairs. The output key of a map task can be the same as or different from the input key of the map task. The intermediate results are partitioned by the map task output key. The number of partitions equals the number of reduce tasks in the subsequent stages to which the intermediate results are shuffled, one partition per reduce task. Because the total size of intermediate results on a computer node can be greater than the physical memory size of the node, the intermediate results are serialized into files so that they can be stored on disk for reduce tasks to fetch at their own pace.
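The partitioning of a map task's output by output key can be sketched as follows. The text states only that results are partitioned by key, one partition per reduce task. The use of a CRC32 hash modulo the number of reduce tasks is an assumption made for this example, as are the names `partition_for` and `partition_map_output` and the restriction to string keys.

```python
import zlib


def partition_for(key, num_reduce_tasks):
    # Assumption: a stable hash of the key, modulo the number of reduce
    # tasks, selects the partition. Keys are assumed to be strings.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_map_output(pairs, num_reduce_tasks):
    """Split one map task's output into one segment per reduce task."""
    segments = {p: [] for p in range(num_reduce_tasks)}
    for key, value in pairs:
        segments[partition_for(key, num_reduce_tasks)].append((key, value))
    return segments


pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
segments = partition_map_output(pairs, num_reduce_tasks=3)
```

Because the partition depends only on the key, every pair with the same key lands in the same partition, so a single reduce task sees all values for that key.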
Reduce tasks process the intermediate results. Because one reduce task needs to process its corresponding partition of the intermediate results from multiple map tasks, a piece of data fetched for a reduce task from one map task output is called a partition segment. A reduce task needs to fetch a collection of such segments for its partition from every map task in the job. The data partition segments are shuffled from map tasks to the reduce tasks, which may run on different computers than those on which the map tasks run. The reducer of a reduce task fetches segments of its partition from every corresponding map task in the job and processes the fetched intermediate results to generate its results.
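The relationship between map task outputs, partitions, and partition segments can be sketched as below. The nested-dictionary layout of `map_outputs`, the helper names `collect_partition` and `reduce_partition`, and the summing reduce function are illustrative assumptions, not a described implementation.

```python
def collect_partition(map_outputs, partition_id):
    """Gather one segment of the given partition from every map task."""
    segments = []
    for map_task_id in sorted(map_outputs):
        # A map task that produced nothing for this partition
        # contributes an empty segment.
        segment = map_outputs[map_task_id].get(partition_id, [])
        segments.append((map_task_id, segment))
    return segments


def reduce_partition(segments):
    """Example reduce function: sum the values for each key."""
    totals = {}
    for _map_task_id, segment in segments:
        for key, value in segment:
            totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical outputs of two map tasks, each split into two partitions.
map_outputs = {
    "m0": {0: [("a", 1), ("c", 2)], 1: [("b", 1)]},
    "m1": {0: [("a", 3)], 1: [("b", 4), ("d", 1)]},
}

segments_for_0 = collect_partition(map_outputs, partition_id=0)
result_0 = reduce_partition(segments_for_0)
```

The reduce task for partition 0 receives two segments, one from `m0` and one from `m1`, and combines them into a single result.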
For load balancing purposes, fetch requests from a reducer to a shuffler come in rounds in which a reducer fetches up to a configurable number of segments of its partition from one shuffler, and then requests a fetch from the next shuffler, and so on, in a round-robin or random sequence. The reducer requests a fetch from the shuffler of each node in the grid that is processing data for the same job.
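The round-based fetching described above can be sketched for the round-robin case as follows. The function name `plan_fetch_rounds`, the parameter `max_segments_per_round`, and the node and map task identifiers are hypothetical. The sketch only plans the order of requests and does not perform any network transfer.

```python
from collections import deque


def plan_fetch_rounds(pending_by_shuffler, max_segments_per_round):
    """Return a list of (shuffler, [map_task_ids]) fetch rounds."""
    # Each queue entry pairs a shuffler with the map task IDs whose
    # segments the reducer still has to fetch from it.
    queue = deque(
        (shuffler, deque(ids))
        for shuffler, ids in pending_by_shuffler.items()
        if ids
    )
    rounds = []
    while queue:
        shuffler, pending = queue.popleft()
        batch_size = min(max_segments_per_round, len(pending))
        batch = [pending.popleft() for _ in range(batch_size)]
        rounds.append((shuffler, batch))
        # Move on to the next shuffler; come back later if segments remain.
        if pending:
            queue.append((shuffler, pending))
    return rounds


pending = {
    "node-a": ["m0", "m1", "m2"],
    "node-b": ["m3"],
    "node-c": ["m4", "m5"],
}
rounds = plan_fetch_rounds(pending, max_segments_per_round=2)
```

With a limit of two segments per round, the reducer visits `node-a`, `node-b`, and `node-c` in turn and then returns to `node-a` for its remaining segment, so no single shuffler serves one reducer continuously.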
A shuffler receives a fetch request from a reducer that includes the specific job ID, the reduce task ID, which corresponds to the partition ID, and the ID of the map task that produced the intermediate results contained in the segment to be fetched for the reduce task. The shuffler responds to the reducer's fetch request by shuffling the requested intermediate results output by the map task to the reducer.
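The three identifiers carried by a fetch request, and the way a shuffler might use them to locate a segment, can be sketched as below. The `FetchRequest` and `Shuffler` classes, their method names, and the in-memory dictionary are assumptions for illustration. A real shuffler would serve segments from files or from the page cache over a network connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRequest:
    job_id: str
    reduce_task_id: int  # corresponds to the partition ID
    map_task_id: str     # map task that produced the segment


class Shuffler:
    """Hypothetical per-node service that hands out partition segments."""

    def __init__(self):
        # (job_id, map_task_id, partition_id) -> segment data
        self._segments = {}

    def register_segment(self, job_id, map_task_id, partition_id, data):
        self._segments[(job_id, map_task_id, partition_id)] = data

    def handle_fetch(self, request):
        # The reduce task ID doubles as the partition ID.
        key = (request.job_id, request.map_task_id, request.reduce_task_id)
        return self._segments.get(key)  # None if no such segment exists


shuffler = Shuffler()
shuffler.register_segment("job-1", "m0", 0, [("a", 1), ("c", 2)])
shuffler.register_segment("job-1", "m0", 1, [("b", 1)])

segment = shuffler.handle_fetch(
    FetchRequest(job_id="job-1", reduce_task_id=1, map_task_id="m0")
)
```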
If the intermediate results of map tasks have been written to a file, the operating system (OS) may initially cache the results in its page cache in memory, but as additional intermediate results are generated, the OS may have to write the cached results to disk and free the memory for other uses. If the reducer requests a fetch of intermediate results that have been written to disk and flushed (cleaned up) from memory, the shuffler has to read the data from the disk and send it to the reducer, which is significantly slower than reading the results from memory.