Map-reduce, or MapReduce, is a software framework for solving distributable problems using a large number of computing nodes, collectively referred to as a cluster. In the “map” step, a master node takes the input, divides it into smaller sub-problems, and distributes the sub-problems to worker nodes. Each worker node processes its sub-problem and passes the answer back to the master node. In the “reduce” step, the master node takes the answers to all the sub-problems and combines them to produce the output, which is the answer to the problem it was originally trying to solve. The reduce operation can be executed in parallel over partitions of data. A map-reduce operation typically utilizes parallelism for both the map and reduce steps.
FIG. 1 illustrates processing operations 100 associated with map-reduce. Input data 105 is mapped 110 into individual tasks 115, 120, 125, which are subsequently executed. A reduce function 130 combines the results to produce output data 135.
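By way of illustration only, the map and reduce steps described above can be sketched in a few lines of Python. The word-count problem, the helper names (`split_input`, `map_task`, `reduce_results`, `map_reduce`), and the use of a local thread pool in place of worker nodes are all assumptions made for this sketch; they are not taken from any particular map-reduce implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, n_tasks):
    """Divide the input into n_tasks sub-problems (hypothetical helper)."""
    return [lines[i::n_tasks] for i in range(n_tasks)]


def map_task(chunk):
    """Worker side: emit a (word, 1) pair for every word in one sub-problem."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_results(mapped):
    """Combine the workers' answers into a single output."""
    totals = defaultdict(int)
    for pairs in mapped:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


def map_reduce(lines, n_tasks=3):
    """Run the map tasks in parallel, then reduce their results."""
    chunks = split_input(lines, n_tasks)
    # A thread pool stands in for the worker nodes of a cluster (assumption).
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        mapped = list(pool.map(map_task, chunks))
    return reduce_results(mapped)


input_data = ["the quick brown fox", "the lazy dog", "the fox"]
output_data = map_reduce(input_data)
```

Here `input_data` plays the role of input data 105, the three chunks correspond to the individual tasks, and `output_data` corresponds to output data 135. In this sketch the reduce step runs on a single node for brevity.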
FIG. 2 illustrates the implementation of these processing operations in a network 200. A client 205 specifies input data, which may be passed over a local area network 215 to a master host 210. The master host 210 produces a query plan specifying the map and reduce operations. Individual tasks are distributed to a set of segment hosts 225, 230, 235 and 240 via an interconnect 220. The segment hosts compute their tasks and reduce results. A final output may be passed to client 205, if specified by the output specification.
The advantage of map-reduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice the parallelism is limited by the data source and/or the number of nodes near the data. Similarly, a set of “reducers” can perform the reduction phase; all that is required is that all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, map-reduce can be applied to significantly larger datasets than typical servers can handle. The parallelism also offers some possibility of recovering from partial failure of servers or storage. That is, if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
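The two requirements noted above, namely that map outputs sharing a key reach the same reducer and that failed work can be rescheduled, can be sketched as follows. This is a minimal, single-process illustration: the CRC32-based `partition` function, the `shuffle` and `run_with_retry` helpers, and the fixed retry count are assumptions chosen for clarity rather than features of any specific framework.

```python
import zlib
from collections import defaultdict


def partition(key, n_reducers):
    """Route a key to one reducer deterministically (assumed hashing scheme)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def shuffle(map_outputs, n_reducers):
    """Group map outputs so that all values for a key land in one bucket."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in map_outputs:
        buckets[partition(key, n_reducers)][key].append(value)
    return buckets


def run_with_retry(task, arg, max_attempts=3):
    """Reschedule a failed task, assuming its input is still available."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the final attempt


def reduce_bucket(bucket):
    """Reducer: sum the values presented for each key."""
    return {key: sum(values) for key, values in bucket.items()}


map_outputs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
buckets = shuffle(map_outputs, n_reducers=2)

results = {}
for bucket in buckets:
    results.update(run_with_retry(reduce_bucket, bucket))
```

Because `partition` depends only on the key, each bucket can be reduced independently and in parallel, and a bucket whose reducer fails can be handed to `run_with_retry` again without affecting the others.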
One problem with existing map-reduce implementations is that a common source format is required. Therefore, different forms of data are normalized to the common source format. For example, one may need to export data from a relational database into files or vice versa to achieve a common source format. It would be desirable to directly operate on a data source in its native format.
Another problem with existing map-reduce implementations is that a programmer shoulders the burden of data management operations. For example, data access routines must be specified. Similarly, remote connectivity and coordination between nodes must be specified. A single programmer typically does not have all of the skills required to specify an efficient query plan. For example, map-reduce operations are commonly implemented by general software developers working with files, while database processing operations are commonly implemented by enterprise application programmers with expertise in accessing transactional records using a query language, such as Structured Query Language (SQL). It would be desirable to remove barriers between programming styles and expertise so that a single programmer could effectively implement map-reduce operations.