MapReduce is a programming methodology for performing parallel computations over distributed (typically, very large) data sets. Some theory regarding the MapReduce programming methodology is described in “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, appearing in OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, Calif., December, 2004 (hereafter, “Dean and Ghemawat”). A similar, but not identical, presentation is also provided in HTML form at the following URL: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html (hereafter, “Dean and Ghemawat HTML”).
Basically, a “map” function maps key-value pairs to new (intermediate) key-value pairs. A “reduce” function reduces all mapped (intermediate) key-value pairs sharing the same key to a single key-value pair or a list of values. The “map” and “reduce” functions are typically user-provided. The map function iterates over a list of independent elements, performing the specified operation on each element, and thereby generates intermediate results. The reduce function receives these intermediate results via a single iterator and combines the elements as specified by the reduce function.
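By way of illustration only, the map and reduce roles described above may be sketched as a single-process word count in Python. The sketch is not taken from Dean and Ghemawat. The names `map_word_count`, `reduce_word_count`, and `run_mapreduce`, as well as the sample input, are hypothetical, and the in-memory grouping step merely stands in for the distributed shuffle that an actual MapReduce system would perform across machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical user-provided map: emit an intermediate (word, 1) pair per word."""
    for word in value.split():
        yield (word.lower(), 1)


def reduce_word_count(key: str, values: Iterator[int]) -> Tuple[str, int]:
    """Hypothetical user-provided reduce: combine all values that share one key."""
    return (key, sum(values))


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterator[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Single-process stand-in for a MapReduce run (assumption: no distribution)."""
    # Map phase: each input record is processed independently of the others.
    intermediate = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            # Group intermediate values by intermediate key.
            intermediate[ikey].append(ivalue)
    # Reduce phase: each key's values are handed to reduce via a single iterator.
    return [
        reduce_fn(ikey, iter(ivalues))
        for ikey, ivalues in sorted(intermediate.items())
    ]


# Illustrative input: (document name, document text) pairs.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
result = run_mapreduce(documents, map_word_count, reduce_word_count)
```

In this sketch, `result` is a list of (word, count) pairs sorted by word. Because each call to the map function depends only on its own input record, the map calls could in principle be executed in parallel on separate machines, which is the property that the MapReduce methodology exploits.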