Massively parallel distributed data processing systems are becoming commonplace in data extraction, transformation and loading (ETL) functions used to support data analytics operations at today's large online organizations. One such system developed by Google™ uses a MapReduce programming model for processing and generating large data sets. MapReduce is a programming methodology for performing parallel computations over distributed (typically, very large) data sets. Some theory regarding the MapReduce programming methodology is described in “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, appearing in OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, Calif., December, 2004 (hereafter, “Dean and Ghemawat”). A similar, but not identical, presentation is also provided in HTML form at the following URL: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html (hereafter, “Dean and Ghemawat HTML”).
Basically, a “map” function maps key-value pairs to new (intermediate) key-value pairs. A “reduce” function reduces all mapped (intermediate) key-value pairs sharing the same key to a single key-value pair or a list of values. The “map” and “reduce” functions are typically user-provided. The map function iterates over a list of independent elements, performing an operation on each element as specified by the map function, and generates intermediate results. The reduce operation takes these intermediate results via a single iterator and combines elements as specified by the reduce function.
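By way of illustration only, the map and reduce roles described above might be sketched in Python as follows. This is a minimal single-process sketch, not the distributed system described in Dean and Ghemawat; the word-counting task, the function names (map_word_count, reduce_word_count, run_mapreduce) and the sample records are all hypothetical and chosen purely for the example.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical user-provided map: emit an intermediate (word, 1) pair per word."""
    for word in value.split():
        yield (word.lower(), 1)


def reduce_word_count(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Hypothetical user-provided reduce: combine all values that share one key."""
    return (key, sum(values))


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Single-process stand-in for the framework (an assumption for illustration)."""
    # Map phase: each input record is independent of the others,
    # so a real system could process the records in parallel.
    groups = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in map_fn(key, value):
            # Group intermediate values by their intermediate key.
            groups[intermediate_key].append(intermediate_value)

    # Reduce phase: each key's values are handed to reduce via a single iterator.
    return [
        reduce_fn(intermediate_key, iter(intermediate_values))
        for intermediate_key, intermediate_values in sorted(groups.items())
    ]


# Hypothetical input: (document name, document contents) pairs.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
```

In this sketch the grouping of intermediate pairs by key is done with an in-memory dictionary; in a distributed setting that step would be handled by the framework across many machines, which is outside the scope of the example.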