Increasingly inter-connected, global computing systems are generating an enormous amount of irregular, unstructured data. Mining such data for actionable business intelligence can give an enterprise a significant competitive advantage. High-productivity programming models that enable programmers to write small pieces of sequential code to analyze massive amounts of data are particularly valuable in mining this data.
Over the last several years, MapReduce has emerged as an important programming model in this space. In this model, the programming problem is broken up into specifying mappers (map operation) and reducers (reduce operation). A mapper takes a small chunk of data (typically in the form of (key,value) pairs), and produces zero or more additional (key,value) pairs. Multiple mappers are executed in parallel on all the available data, resulting in a large collection of (key,value) pairs. These pairs are then sorted and shuffled. Briefly, moving map outputs to the reducers is referred to as shuffling. Another piece of programmer-supplied code (the reducer) is used to reduce the set of values associated with a given key. Multiple reducers operate in parallel, one for each generated key. Many software applications implement MapReduce, for example by providing a programming or software framework, or application programming interfaces, that allow users to program MapReduce functionality. An example is Apache™ Hadoop™. Hadoop™ MapReduce (HMR) is a software framework for writing distributed applications.
In HMR, input is usually taken from (and output is written to) a distributed, resilient file system (such as HDFS, the Hadoop Distributed File System). A partitioned input key/value (KV) sequence I is operated on by mappers to produce another KV sequence J, which is then sorted and grouped (“shuffled”) into a sequence of pairs of key/list of values. The list of values for each key is then operated upon by a reducer, which may contribute zero or more KV pairs to the output sequence. If the involved data sets are large, they are automatically partitioned across multiple nodes and the operations are applied in parallel.
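The map, shuffle, and reduce steps described above can be illustrated with a minimal single-process sketch. It is not the HMR implementation: the function names (`run_map_reduce`, `word_count_mapper`, `word_count_reducer`) are hypothetical, everything runs in memory, and no partitioning across nodes is modeled.

```python
from collections import defaultdict

def run_map_reduce(inputs, mapper, reducer):
    """Apply mapper to each input KV pair, group by key, then apply reducer."""
    # Map phase: each input pair may yield zero or more intermediate pairs
    # (the sequence J in the text).
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))

    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: visit keys in sorted order; each reducer call may
    # contribute zero or more pairs to the output sequence.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def word_count_mapper(offset, line):
    # Hypothetical mapper: emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def word_count_reducer(word, counts):
    # Hypothetical reducer: sum the counts collected for one word.
    yield (word, sum(counts))
```

For example, calling `run_map_reduce([(0, "the cat"), (8, "the dog")], word_count_mapper, word_count_reducer)` groups the two occurrences of "the" under one key before the reducer sums them.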
This model of computation has many remarkable properties. First, it is simple. The HMR application programming interface (API) specifies a few entry points for the application programmer: mappers, reducers/combiners, and partitioners, together with input and output formatters. Programmers merely need to fill out these entry points with (usually small) pieces of sequential code. Briefly, a combiner receives the outputs of mappers as its inputs, and the outputs of a combiner are sent to a reducer. Partitioning refers to sending specific key/value pairs to specific reducers; a partitioner performs this partitioning.
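As a rough illustration of partitioning, the sketch below routes each key/value pair to a reducer index computed from a hash of the key. The names `partition` and `route` are hypothetical, and the CRC32-modulo scheme is an assumption chosen only because it is deterministic across runs; it is not presented as the partitioner used by HMR.

```python
import zlib

def partition(key, num_reducers):
    """Map a string key to a reducer index in [0, num_reducers)."""
    # A stable hash is used so the same key always goes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Distribute key/value pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the index depends only on the key, all values for a given key arrive at the same reducer, which is what allows each reducer to process its keys independently.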
Second, it is widely applicable. A very large class of parallel algorithms (on structured, semi-structured or unstructured data) can be cast in this map/shuffle/reduce style.
Third, the framework is parallelizable. If the input data sequence is large, the framework can run mappers/shufflers/reducers in parallel across multiple nodes, thus harnessing the computing power of the cluster to deliver scalable throughput.
Fourth, the framework is scalable: it can be implemented on shared-nothing clusters of several thousand commodity nodes, and can deal with data sets whose size is bounded by the available disk space on the cluster. This is because mappers/shufflers/reducers operate in streaming mode, thus supporting “out of core” computations. Data from the disk is streamed into memory (in implementation-specified block sizes), operated on, and then written out to disk.
Fifth, the framework is resilient. A job controller tracks the state of execution of the job across multiple nodes. If a node fails, the job controller has enough information to restart the computation allocated to this node on another healthy node and knit this new node into the overall computation. There is no need to restart the entire job. This holds within limits, of course: if there are a large number of failures, the job controller may give up. The job controller itself is a single point of failure, but techniques can be applied to make it resilient. Key to resiliency is that the programmer-supplied pieces of code are assumed to be functional in nature, i.e., when applied to the same data the code produces the same result.
Because of these properties, the HMR engine is now widely used, both as a framework against which people directly write code (e.g., for Extract/Transform/Load tasks) and as a compiler target for higher-level languages.
The design point for the HMR engine is offline (batch), long-lived, resilient computations on large commodity clusters. To support this design point, HMR makes many decisions that have a substantial effect on performance. The HMR API supports only single-job execution, with input/output being performed against an underlying file system (HDFS). This means that if a higher-level task is to be implemented with multiple jobs, each job in this sequence must write out its state to disk and the next job must read it in from disk. This incurs I/O cost as well as (de)serialization cost. Mappers and reducers for each job are started in new JVMs (JVMs typically have high startup cost). An out-of-core shuffle implementation is used: the output of mappers is written to local disk; a file transfer protocol is used to move these files to their target nodes; and an out-of-core sorting algorithm is used to group the records by key for the reduce phase.
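The cost of chaining jobs through the file system can be made concrete with a small sketch. It assumes a hypothetical `run_job` that reads JSON-lines input and writes JSON-lines output, and a hypothetical `run_chain` that strings several such jobs together on a local temporary directory (standing in for HDFS). Every stage boundary forces a serialize, write, read, and deserialize round trip.

```python
import json
import os
import tempfile

def run_job(input_path, output_path, transform):
    """Run one job: read and deserialize input, transform, serialize output."""
    with open(input_path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    results = transform(records)
    with open(output_path, "w", encoding="utf-8") as f:
        for record in results:
            f.write(json.dumps(record) + "\n")

def run_chain(initial_records, transforms):
    """Run jobs in sequence; each job's only link to the next is a file."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "stage0.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for record in initial_records:
                f.write(json.dumps(record) + "\n")
        for i, transform in enumerate(transforms, start=1):
            next_path = os.path.join(workdir, f"stage{i}.jsonl")
            run_job(path, next_path, transform)  # state crosses via disk
            path = next_path
        with open(path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

In this sketch no in-memory object survives from one job to the next, which mirrors the I/O and (de)serialization overhead described above.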
As discussed above, the MapReduce programming model can be made resilient to failure. A monitoring process keeps track of all the mapper processes. If a mapper process fails, the monitoring process starts another process with the same input data that it had given the failed process. The JobTracker in the Hadoop™ MapReduce framework is an example of such a monitoring process. The JobTracker is responsible for distributing, scheduling, and monitoring the tasks. When a mapper process fails, the JobTracker starts the replacement process on a different node in the network and gives it the same input chunk of data that it had given the failed process. Under the mild assumption that the same mapper code will produce the same result on the same input when run more than once, this new process will produce output identical to what the old process would have produced.
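The re-execution strategy described above can be sketched as a retry loop around a deterministic task. The helper `run_with_retries` and its `max_attempts` parameter are hypothetical; a real monitoring process would also choose a different node for the new attempt, which this single-process sketch does not model.

```python
def run_with_retries(task, chunk, max_attempts=3):
    """Re-run a deterministic task on the same input chunk until it succeeds."""
    last_error = None
    for _ in range(max_attempts):
        try:
            # The same input chunk is handed to every attempt, so a
            # deterministic task yields the same output whichever attempt wins.
            return task(chunk)
        except Exception as err:  # a simulated task or node failure
            last_error = err
    # Too many failures: give up, as a job controller eventually would.
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

The correctness of this approach rests on the assumption stated in the text: the task must produce the same result when applied to the same input more than once.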
MapReduce jobs are typically executed on one or more nodes running in a cluster. The Hadoop MapReduce engine, for instance, implements a MapReduce job as follows.
The client prepares a job configuration object specifying the classes to be used during execution, the number of reducers used to run the job, the location of the HMR jobtracker, etc. This configuration object is threaded throughout the program (and passed to user classes), and can hence be used to communicate information of use to the program. The job configuration object is submitted in a call to JobClient.submitJob. This library function obtains a jobid from the Hadoop jobtracker, and writes out the necessary job information to the jobtracker's filesystem (in a jobid-relative path), including the job configuration object and the user code to be run. The user's InputFormat is instantiated, and asked to produce InputSplits, metadata that describes where each “chunk” of input resides. These are also written out to the job's directory. Finally, the jobtracker is notified that a new job with the given jobid has been submitted.
FIG. 3 presents a high-level view of the data flow for a single Hadoop job (each mapper and reducer box represents multiple processes). Dotted lines represent cheap in-memory communication. Solid black lines represent expensive out-of-memory (disk or network) operations. The jobtracker schedules the job to run, allocating map and reduce tasks on available task trackers. The map tasks (allocated close to their corresponding InputSplits) must next read input data. If the data is in HDFS (the common case), reading requires network communication with the namenode (which stores the file metadata). Reading the actual data requires file system I/O (which may not require disk I/O if the data is in kernel file system buffers), and may require network I/O (if the mapper is not on the same machine as the one hosting the data). The map tasks deserialize the input data to generate a stream of key/value pairs that is passed into the mapper. The mapper outputs key/value pairs, which are immediately serialized and placed in a buffer. While the pairs are in the buffer, Hadoop™ may run the user's combiner to combine values associated with equivalent keys. When the buffer fills up, its contents are sorted and pushed out to local disk.
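The map-side behavior described above (serialize each output pair into a buffer, then sort and spill when the buffer fills) can be approximated by the following sketch. The class `SpillingBuffer`, its `max_bytes` threshold, and the use of `pickle` as the serialization format are all assumptions made for illustration; they do not reflect Hadoop's actual buffer layout or file format.

```python
import os
import pickle

class SpillingBuffer:
    """Hold serialized map output in memory; sort and spill to disk when full."""

    def __init__(self, spill_dir, max_bytes):
        self.spill_dir = spill_dir
        self.max_bytes = max_bytes
        self.records = []      # serialized (key, value) pairs
        self.size = 0          # total serialized bytes currently buffered
        self.spill_paths = []  # files written so far, in spill order

    def emit(self, key, value):
        # Map output is serialized immediately on emission.
        blob = pickle.dumps((key, value))
        self.records.append(blob)
        self.size += len(blob)
        if self.size >= self.max_bytes:
            self.spill()

    def spill(self):
        if not self.records:
            return
        # Deserialize, sort by key, and write one sorted run to local disk.
        pairs = sorted((pickle.loads(b) for b in self.records), key=lambda kv: kv[0])
        path = os.path.join(self.spill_dir, f"spill{len(self.spill_paths)}.bin")
        with open(path, "wb") as f:
            pickle.dump(pairs, f)
        self.spill_paths.append(path)
        self.records, self.size = [], 0
```

Each spill file is a sorted run, so a later stage can merge several runs without re-sorting them from scratch.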
Once map output has been pushed out to disk, reducer tasks start fetching their input data. This requires disk and network I/O. Each reducer performs an out-of-core sort of its input data. After all of the mappers have completed and the data is sorted, each reducer starts processing its input. Each reducer outputs a (possibly empty) sequence of key/value pairs that is sent to the OutputFormat (and its attendant RecordWriter) for output. Typically, this involves serializing the data and writing it out to disk. The namenode is contacted to update the required file system metadata. Since Hadoop assumes that data nodes and compute nodes are co-located, writing out the actual data does not involve network communication.
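The reduce-side merge can be sketched as a k-way merge of already-sorted runs followed by grouping on the key. The function name `reduce_sorted_runs` is hypothetical, and the runs here are in-memory lists for brevity; because `heapq.merge` consumes its inputs lazily, the same structure would apply to iterators that stream records from fetched files.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_sorted_runs(runs, reducer):
    """Merge key-sorted runs of (key, value) pairs and reduce each key group."""
    # k-way merge: only the head of each run is needed at any moment.
    merged = heapq.merge(*runs, key=itemgetter(0))
    output = []
    # Equal keys are now adjacent, so groupby yields one group per key.
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)
        # Each reducer call may emit zero or more output pairs.
        output.extend(reducer(key, values))
    return output
```

The reducer receives its values as a stream rather than a materialized list, which is in keeping with the streaming, out-of-core style the text describes.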
The shuffle phase is the part of the MapReduce programming model that handles communication. To optimize this phase, the user may specify a combiner, which acts like a reducer but runs only on some fraction of the data emitted from a single mapper to a given reducer. The intention is to cut down the amount of data that must be transmitted over the network. The combiner may run additional times, to fold fresh input from the mapper into previously combined output. To implement this, the mapper outputs key/value pairs, which are immediately serialized and placed in an in-memory buffer. When this buffer reaches a certain size, Hadoop™ may run the user's combiner on the data in the buffer, to combine values associated with equivalent keys. To do this, it must deserialize the buffer into the original in-memory representations, run the combiner, then reserialize the combiner output back into the buffer to replace the original data. When the buffer fills with data that cannot be further combined, the key/value pairs are pushed out to local disk. These disk files are served by a daemon at each mapper. Each reducer contacts every mapper to gather together the pieces of its input that are distributed across the various local filesystems, aggregates the pieces, and presents the sorted result as input to the user's reduce code.
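The deserialize, combine, and reserialize cycle described above can be sketched as follows. The function `combine_buffer` is hypothetical, `pickle` stands in for whatever serialization the framework uses, and the buffer is modeled as a list of byte strings rather than a single contiguous byte array.

```python
import pickle
from collections import defaultdict

def combine_buffer(buffer, combiner):
    """Run a combiner over a buffer of serialized (key, value) pairs."""
    # Step 1: deserialize every record back into its in-memory form.
    groups = defaultdict(list)
    for blob in buffer:
        key, value = pickle.loads(blob)
        groups[key].append(value)

    # Step 2: run the user's combiner on the values of each key.
    # Step 3: reserialize the combiner output to replace the original data.
    combined = []
    for key, values in groups.items():
        for out_key, out_value in combiner(key, values):
            combined.append(pickle.dumps((out_key, out_value)))
    return combined
```

With a summing combiner, three buffered records for keys "a", "b", and "a" shrink to two records, at the price of one full deserialization and reserialization pass over the buffer.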
There are two performance implications of this design. Firstly, because out-of-core execution is supported, the combiner must operate on serialized byte buffers. Serializing and then deserializing this data wastes central processing unit (CPU) cycles. Secondly, Hadoop™ exposes multi-core execution by running many JVM instances, and it is not possible to combine output across JVM instances. Although Hadoop allows re-use of JVMs from one task to the next, this serves only to avoid JVM init/teardown costs; there is no re-use of heap data between tasks.