MapReduce is the name of several software frameworks used to support scalable distributed processing on large data sets stored in a file system over a large set of computing nodes of a distributed processing system. Many enterprises rely on timely analysis of the MapReduce framework and its open-source implementation Hadoop as a platform choice for efficient processing and advanced analytics over large amounts of unstructured information.
MapReduce includes map and reduce functions that are defined differently than those terms are understood in functional programming. As part of a map function, a master node receives an input, divides the input into smaller projects and distributes the projects to the worker nodes. The worker nodes process the projects and return the answer to the master node. As part of the reduce function, the master node collects the answers and combines them to provide an output. Map and reduce functions are performed using different types of resources including map and reduce slots that execute map and reduce tasks respectively. The MapReduce model includes a barrier between map and reduce stages. The reduce stage is executed after the map stage is completed. Thus, the execution of consecutive jobs in a MapReduce environment is pipelined. Once a first job finishes its map stage, the second job can start its map stage such that the reduce stage of the first job can overlap with the map stage of the second job.