MapReduce is a programming methodology to perform parallel computations over distributed (typically, very large) data sets. Some theory regarding the MapReduce programming methodology is described in “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, appearing in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, Calif., December, 2004 (hereafter, “Dean and Ghemawat”). A similar, but not identical, presentation is also provided in HTML form at the following URL: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html (hereafter, “Dean and Ghemawat HTML”).
FIG. 1 is a simplified illustration of the architecture of a map-reduce system 100. Basically, a “map” function 102 maps key-value pairs to new (intermediate) key-value pairs. A “reduce” function 104 reduces all mapped (intermediate) key-value pairs sharing the same key to a single key-value pair or a list of values. The “map” function 102 and “reduce” function 104 are typically user-provided.
In general, a map function (which may actually be a group of map functions, each operating on a different computer) iterates over a list of independent elements, performing an operation on each element as specified by the map function. The map function generates intermediate results. A reduce operation takes these intermediate results via an iterator and combines elements as specified by the reduce function.
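By way of a non-limiting illustration, the iteration described above may be sketched in Python as follows. The names run_map and run_reduce, and the example operations (string length and addition), are assumptions made for this sketch only and are not taken from Dean and Ghemawat.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def run_map(map_fn: Callable[[T], U], elements: Iterable[T]) -> Iterator[U]:
    """Apply map_fn to each element independently, yielding intermediate results."""
    for element in elements:
        yield map_fn(element)


def run_reduce(reduce_fn: Callable[[U, U], U], intermediate: Iterator[U], initial: U) -> U:
    """Consume the intermediate results through an iterator and combine them."""
    accumulator = initial
    for value in intermediate:
        accumulator = reduce_fn(accumulator, value)
    return accumulator


# Hypothetical example: map each string to its length, then sum the lengths.
lengths = run_map(len, ["map", "reduce", "split"])
total = run_reduce(lambda a, b: a + b, lengths, 0)
```

Because each element is processed independently of the others, the map step in such a sketch could be spread over several computers without changing the result.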
It is useful to consider the data within a map-reduce system as being characterized by key/value pairs. For example, both the input dataset and the output of the reduce function may be thought of as sets of key/value pairs. The programmer specifies the map function, which processes input key/value pairs and produces a set of intermediate key/value pairs. The set of intermediate pairs is not explicitly represented in FIG. 1. The reduce function combines all intermediate values for a particular key and produces a set of merged output values for that key, usually just one.
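The key/value view may be illustrated with a minimal word-count sketch in Python. The names map_fn, reduce_fn, and map_reduce, the sample documents, and the in-memory grouping step are all assumptions made for illustration; an actual map-reduce system would perform the grouping across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map one input pair (document name, text) to intermediate (word, 1) pairs."""
    for word in value.split():
        yield (word.lower(), 1)


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Combine all intermediate values for one key into a single output pair."""
    return (key, sum(values))


def map_reduce(inputs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    # Group intermediate values by key; this stands in for the step
    # between map and reduce that FIG. 1 does not show explicitly.
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in inputs:
        for mid_key, mid_value in map_fn(in_key, in_value):
            groups[mid_key].append(mid_value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Hypothetical input dataset of (document name, text) pairs.
documents = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = map_reduce(documents)
```

In this sketch the input, the intermediate data, and the output are each a set of key/value pairs, and the reduce function produces exactly one merged value per key.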
While the map function and the reduce function have each been discussed above as being a single function, the map function may, in implementation, be accomplished by multiple map sub-functions, each operating on a different split of the input dataset. In any case, however, the input dataset is homogeneous in that the entire input dataset is characterized by a schema according to which all of the multiple map sub-functions operate. Similarly, even if multiple reduce sub-functions operate on different partitions of the mapper output(s), the intermediate dataset is homogeneous in that the entire intermediate dataset is characterized by a schema according to which all of the reduce sub-functions operate.
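The use of multiple sub-functions over a homogeneous dataset may be sketched in Python as follows. The two-partition layout, the (record id, text) input schema, the (word, count) intermediate schema, and the CRC32-based partitioning rule are assumptions chosen for this sketch; they are not prescribed by the description above.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reduce partitions


def map_sub_function(split):
    """One map sub-function. Every split shares the same (record id, text) schema."""
    for _record_id, text in split:
        for word in text.split():
            yield (word, 1)


def partition_for(key, num_partitions):
    """Deterministically assign an intermediate key to one reduce partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reduce_sub_function(partition):
    """One reduce sub-function. Every partition shares the same (word, count) schema."""
    grouped = defaultdict(list)
    for key, value in partition:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input dataset, divided into two splits with an identical schema.
splits = [
    [(1, "a b a")],
    [(2, "b c"), (3, "a")],
]

# The same map sub-function runs over each split; its output is routed by key.
partitions = [[] for _ in range(NUM_REDUCERS)]
for split in splits:
    for key, value in map_sub_function(split):
        partitions[partition_for(key, NUM_REDUCERS)].append((key, value))

# The same reduce sub-function runs over each partition.
outputs = [reduce_sub_function(partition) for partition in partitions]
merged = {key: value for output in outputs for key, value in output.items()}
```

Because all pairs sharing a key are routed to the same partition, each key is reduced by exactly one reduce sub-function, and the per-partition outputs can be merged without conflict.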