Large-scale data processing generally involves extracting data of interest from raw data contained in one or more input datasets and processing the extracted data into a useful product. One known technique of data processing is MapReduce, which refers to a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on distributed servers (nodes), collectively referred to as a cluster. A MapReduce program generally comprises a Map( ) procedure that performs filtering and sorting, and a Reduce( ) procedure that performs a summary operation. The MapReduce framework orchestrates the data processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
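To illustrate the model described above, the following is a minimal, single-process sketch of the MapReduce pattern applied to word counting. The function names (map_fn, reduce_fn, mapreduce) and the word-count task are illustrative assumptions, not part of any particular framework; a real deployment would distribute the map, shuffle/sort, and reduce phases across cluster nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map(): filter the input and emit intermediate (key, value) pairs.
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce(): perform a summary operation over all values for one key.
    return word, sum(counts)

def mapreduce(documents):
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for doc in documents:
        intermediate.extend(map_fn(doc))
    # Shuffle/sort phase: group intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn to each key's grouped values.
    return dict(
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(intermediate, key=itemgetter(0))
    )

print(mapreduce(["the quick fox", "the lazy dog"]))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a distributed implementation, each map and reduce invocation would run as a task on a worker node, with the framework handling the data transfer between phases.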
The MapReduce framework is traditionally executed on large clusters of hardware computer servers (i.e., on the order of hundreds or thousands of machines) connected together by switched Ethernet, in which machine failures are common and even expected. For example, each worker node may be a thread or process executing on a commodity personal computer (PC) having directly-attached, inexpensive hard drives. However, such specialized implementations are unorthodox compared to traditional data centers and may be costly or difficult to deploy, particularly for users with light to medium data processing needs.