MapReduce and Hadoop are software frameworks used for the processing of a job in a distributed computing environment. MapReduce is described in U.S. Pat. No. 7,650,331 titled “System and method for efficient large-scale data processing.” At its core, a MapReduce is centered around two functions: a map function and a reduce function. The framework converts the input data (i.e., the job) into key and value pairs ([K_1,V_1]) which the map function then translates into new output pairs ([K_2,V_2]). The framework then groups all values for a particular key together and uses the reduce function to translate this group to a new output pair ([K_3,V_3]).
Hadoop is implemented in conjunction with MapReduce to introduce the concept of “rack awareness” into the distributed computing architecture. Specifically, Hadoop allows for the geographical proximity of the various worker nodes, and resources associated therewith, to be taken into account when sub-jobs are distributed. By minimizing the geographical distance between worker nodes and data nodes, network bandwidth is conserved. (See <http://hadoop.apache.org/>.)
In modern distributed computing architectures, such as Infrastructure-as-a-Service (IaaS) or “cloud” systems, users rent compute cycles, storage, and bandwidth with small minimum billing units (e.g., an hour or less for compute and per-MB for storage and bandwidth) and almost instant provisioning latency (minutes or seconds). These systems contrast with traditional data center co-location centers (colos), where equipment leases span months and provisioning resources can take days or longer. In such cloud systems, the efficient usage of computing resources results in less rental costs.