The present invention relates generally to distributed networks, and more specifically, to management of a map reduce service on a distributed network.
Large data processing jobs require the availability of commensurately large computational, networking, and storage resources. An example of a data processing technique that is applied to relatively large data sets is the map reduce paradigm. Map reduce is a programming model for processing large data sets using a parallel algorithm on a cluster of computers. Map reduce allows scalability over hundreds or thousands of nodes that process data in parallel. Numerous nodes comprising relatively inexpensive, lower-capability resources, as opposed to a few nodes comprising expensive and specialized resources, may be used for parallel processing of such large data processing jobs. The parallel execution can be in lockstep or in a loosely parallel manner. The nodes can be in close proximity (e.g., on the same network and in the same building) and use near-identical hardware, in which case the nodes may be referred to as a cluster; or the nodes can be geographically distributed and use more heterogeneous hardware, in which case the nodes constitute a grid.
The map reduce framework includes two distinct phases: the map function and the reduce function. The map function takes input data organized as (key, value) pairs. For a data pair with a type in one domain, the map function transforms the data pair into a list of pairs in another domain. The map function is applied in parallel to every data pair in the input dataset, producing a list of pairs for each call. After the map function is complete, the overall framework collects all pairs with the same key from all lists and groups them together, creating one group for each key. The reduce function is then applied in parallel to each group, producing a collection of values in the same domain. The reduce function results are collected as the desired result list. Thus, the map reduce framework transforms a list of (key, value) pairs into a list of values.
One example of a typical map reduce job takes an input data set comprising a series of sensor data files giving the maximum daily temperature over a month in a set of cities. In order to find the maximum temperature for each city across all of the data files for the month, map reduce is applied as follows: as many map tasks are assigned as there are files, and each map task finds the maximum temperature for each city listed in its input file over the one month. Then, the reduce step gathers all data for each city (i.e., the cities are the keys for the reduce function) from the map task outputs into a group, and determines the maximum temperature for each city over the month from the data group for that city. The output after completion of the reduce step is a list of cities and the maximum temperature for each city over the one month.
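The temperature example above can be illustrated with a minimal sketch, which is not a definitive implementation. It assumes each input file is already parsed into a list of (city, temperature) pairs. The function names map_task, shuffle, reduce_task, and run_job, as well as the sample data, are hypothetical and chosen only for illustration. The tasks run sequentially here for clarity; a real map reduce framework would distribute the map and reduce tasks across nodes.

```python
from collections import defaultdict


def map_task(records):
    """Map: find the maximum temperature per city within one input file."""
    local_max = {}
    for city, temp in records:
        if city not in local_max or temp > local_max[city]:
            local_max[city] = temp
    # Emit one (key, value) pair per city seen in this file.
    return list(local_max.items())


def shuffle(map_outputs):
    """Group all emitted values by key (city) across every map output."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for city, temp in pairs:
            groups[city].append(temp)
    return groups


def reduce_task(city, temps):
    """Reduce: determine the maximum temperature for one city's group."""
    return city, max(temps)


def run_job(files):
    """Run one map task per file, then one reduce task per city."""
    map_outputs = [map_task(records) for records in files]
    groups = shuffle(map_outputs)
    return dict(reduce_task(city, temps) for city, temps in groups.items())


# Hypothetical sample data: three "files" of (city, max daily temperature).
FILES = [
    [("Austin", 31), ("Boston", 22), ("Austin", 35)],
    [("Boston", 27), ("Austin", 33)],
    [("Chicago", 19), ("Boston", 25)],
]

RESULT = run_job(FILES)
print(RESULT)
```

Because each map task returns only one pair per city, the amount of data passed to the grouping step is bounded by the number of cities per file rather than by the number of readings.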
Data processing may be performed on any sort of computing or entertainment device, such as desktop computers, game consoles, tablet computers, smart phones, set-top boxes, and internet protocol (IP) streaming devices. A typical household might have any number of such devices in any combination. The devices may vary widely in terms of compute, memory, storage, and networking capabilities and may have relatively low utilization (i.e., be idle most of the time). Further, such devices are capable of internet connectivity and are not turned off when not in use, so that the devices are connected to the network most of the time.