MapReduce is a well-known programming model that enables development of scalable parallel applications to process a large amount of data on clusters of commodity computing machines. In general, through an interface with two functions, a map function and a reduce function, the MapReduce model can facilitate parallel implementation of many real-world tasks such as data processing for search engines.
A commodity machine (or a portion of a commodity machine) performing map functions may be referred to as a map node. Likewise, a commodity machine (or a portion of a commodity machine) performing reduce functions may be referred to as a reduce node. It is possible for a commodity machine to perform both map functions and reduce functions, depending on the required computations and the capacity of the machine. Typically, each map node has a one-to-one connection with a corresponding first-level reduce node. Multiple levels of reduce nodes are generally required for processing large data sets.
An initial large dataset to be processed in a MapReduce model is generally a fixed or static data set. A large fixed dataset is first divided into smaller data sets by the map nodes. The smaller data sets are then sent to first level reduce nodes for performing a reduce function on the smaller data sets. The reduce functions generate a smaller set of values which will be re-reduced at the next levels until a final result or measurement is attained. A visual representation of a MapReduce architecture may comprise nodes arranged in a funnel shape wherein an initial data set is incrementally reduced at each level until a final result exits the tip of the funnel.
It would be beneficial to provide a system capable of distributed processing of dynamic or streaming data without needing multiple levels of reduce nodes.