The present invention relates generally to the field of parallel, distributed programming, and more particularly to execution optimization of parallel, distributed programs.
MapReduce is a generic programming model for processing parallelizable problems. MapReduce applications can process large data sets in parallel by coordinating the resources of a large number of physical and/or virtual computers, known collectively as a cluster or grid. In the MapReduce programming paradigm, a job submitted for processing is broken down into pieces known as tasks. These tasks are scheduled to run on the various nodes in the MapReduce cluster, with task assignments made such that each node can work on its piece of the job in parallel with the work being done by other nodes.
As the name implies, each task in a MapReduce job is typically of one of two types: a map task or a reduce task. As a simple example, a MapReduce job might be to process all the words in a collection of books, counting the number of times each word occurs. A set of map tasks might be created, one for each book in the collection, with each task recording the frequency of occurrences of every word found in the book associated with that task. The output produced by these map tasks is then used as input to a set of reduce tasks. In this case, each word might have an associated reduce task, the job of which is to sum the frequencies of that word produced by all the map tasks. The distribution of work provided by MapReduce enables map tasks and reduce tasks to run on small subsets of larger sets of data, which both lowers processing latency and provides a high degree of scalability. Because of the potentially large size of MapReduce jobs and the ability to take advantage of custom-scaled processing, it may be attractive to run MapReduce jobs in a cloud environment (discussed further below).
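The word-counting example above can be illustrated with a minimal, single-process sketch. The function names (map_task, group_by_word, reduce_task, run_word_count) are hypothetical and chosen only for illustration; they do not correspond to any particular MapReduce framework. The sketch also runs every task sequentially, whereas in an actual cluster the map tasks and reduce tasks would be scheduled on different nodes.

```python
from collections import Counter, defaultdict


def map_task(book_text):
    """Map task: record how often each word occurs in one book."""
    return Counter(book_text.lower().split())


def group_by_word(map_outputs):
    """Collect, for each word, the counts produced by all map tasks."""
    grouped = defaultdict(list)
    for counts in map_outputs:
        for word, count in counts.items():
            grouped[word].append(count)
    return grouped


def reduce_task(word, counts):
    """Reduce task: sum the per-book counts for a single word."""
    return word, sum(counts)


def run_word_count(books):
    """Run one map task per book, then one reduce task per word."""
    map_outputs = [map_task(text) for text in books]
    grouped = group_by_word(map_outputs)
    return dict(reduce_task(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Two tiny "books" stand in for a real collection.
    books = ["the cat sat", "the dog sat on the mat"]
    print(run_word_count(books))
```

Because each map task reads only its own book and each reduce task reads only the counts for its own word, the tasks share no state, which is the property that allows them to be distributed across nodes.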