The present invention relates generally to the field of database systems, and more specifically to database systems that follow a MapReduce framework.
MapReduce is a programming model for processing large data sets, and also the name of an implementation of the model by Google. MapReduce is typically used for distributed computing on clusters of computers. The model is inspired by the “map” and “reduce” functions commonly used in functional programming. MapReduce comprises a “Map” step wherein the master node divides a problem into map tasks that each handle a particular sub-problem and assigns these map tasks to worker nodes. For this, a scheduling master splits the problem input data and assigns each input data part to a map task. An input part is often referred to as a split. The worker nodes process the sub-problems according to a map( ) function provided by a user, and notify the master node upon map task completion. MapReduce further comprises a “Reduce” step wherein the master node assigns a “reduce” operation to some worker nodes, which collect the answers to all the sub-problems and analyze them, using a reduce( ) function provided by the user, to form the output, that is, the answer to the problem that the job was originally submitted to solve.
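By way of illustration only, the Map and Reduce steps described above can be sketched in a single process as follows. The sketch assumes a word-count problem; the names split_input, map_fn, reduce_fn and run_job are hypothetical and chosen for this example, and a real deployment would distribute the map and reduce tasks across worker nodes rather than run them in a loop.

```python
from collections import defaultdict


def split_input(records, num_splits):
    """Divide the input records into splits, one per map task."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_fn(record):
    """Example user-provided map(): emit a (word, 1) pair for each word."""
    for word in record.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Example user-provided reduce(): sum the counts for one word."""
    return key, sum(values)


def run_job(records, num_splits=2):
    """Run the Map step, group intermediate pairs by key, then Reduce."""
    splits = split_input(records, num_splits)

    # Map step: each split stands in for one map task on a worker node.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # Collect all values that share a key so one reducer sees them together.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce step: combine the answers to the sub-problems into the output.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

In this sketch the grouping of intermediate pairs by key takes the place of the collection of sub-problem answers by the worker nodes that perform the reduce operation.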
MapReduce allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, the maps can be performed in parallel. Similarly, a set of ‘reducers’ can perform the reduction phase. While this process can appear inefficient compared to more sequential algorithms, MapReduce can be applied to significantly larger datasets than a single “commodity” server can handle. For example, a large server farm can use MapReduce to sort a petabyte of data in only a few hours, which makes MapReduce well suited to the handling of ‘big data’. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
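By way of illustration only, the parallel execution of independent map tasks and the rescheduling of a failed task can be sketched as follows. The sketch assumes that a thread pool stands in for the worker nodes and that each split remains available for a retry; the function run_map_tasks and its parameters max_attempts and workers are hypothetical names chosen for this example and do not describe any particular MapReduce implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_tasks(splits, map_task, max_attempts=3, workers=4):
    """Run one map task per split in parallel and reschedule failed tasks.

    Assumes the map tasks are independent and that each input split is
    still available when a task has to be retried.
    """
    results = [None] * len(splits)
    pending = list(range(len(splits)))
    attempts = 0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending and attempts < max_attempts:
            attempts += 1
            # Submit every outstanding task; they run concurrently.
            futures = {i: pool.submit(map_task, splits[i]) for i in pending}
            failed = []
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except Exception:
                    # Simulated worker failure: keep the split for a retry.
                    failed.append(i)
            pending = failed

    if pending:
        raise RuntimeError(f"map tasks failed for splits {pending}")
    return results
```

Only the failed tasks are resubmitted in each round, so work that has already completed is not repeated.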
A significant design challenge associated with large complex systems that run MapReduce jobs is the efficient utilization of system resources, principally CPU cycles and memory, on a spectrum of jobs that vary greatly in their size and nature.