Many companies and other organizations operate computer networks that interconnect numerous computing systems to support their operations, such as with the computing systems being co-located or instead located in multiple distinct geographical locations (e.g., connected via one or more private or public intermediate networks). For example, data centers housing significant numbers of interconnected computing systems have become commonplace, such as private data centers that are operated by and on behalf of a single organization and public data centers that are operated by entities as businesses to provide computing resources to customers. Some public data center operators provide network access, power, and secure installation facilities for hardware owned by various customers, while other public data center operators provide “full service” facilities that also include hardware resources made available for use by their customers. As the scale and scope of typical data centers have increased, the tasks of provisioning, administering, and managing the physical computing resources have become increasingly complicated.
Examples of such large-scale systems include online merchants, internet service providers, online businesses such as photo processing services, corporate networks, cloud computing services, web-based hosting services, etc. These entities may maintain computing resources in the form of large numbers of computing devices (e.g., thousands of hosts) which are hosted in geographically separate locations and which are configured to process large quantities (e.g., millions) of transactions daily or even hourly. Such large-scale systems may collect vast amounts of data that require processing.
One conventional approach to processing data is the MapReduce model for distributed, parallel computing. In a MapReduce system, a large set of data may be split into smaller chunks, and the smaller chunks may be distributed to multiple compute nodes for an initial “map” stage of processing. Multiple nodes may also carry out a second “reduce” stage of processing based on the results of the map stage. The results from the map stage may be “shuffled” to other nodes to use as input for the reduce stage. In other words, the intermediate results may be reorganized and grouped differently for the reduce stage and may be transferred across a network to the reduce nodes. The use of network resources in this manner may be expensive, and the shuffle operation may be time-consuming.
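By way of illustration only, the map, shuffle, and reduce stages described above may be sketched in a single process as follows. The sketch assumes a word-count workload, a CRC32-based assignment of keys to reduce nodes, and in-memory lists standing in for compute nodes; the function names (split_into_chunks, map_stage, partition_for, shuffle, reduce_stage, run) are hypothetical and are not drawn from any particular MapReduce implementation.

```python
import zlib
from collections import defaultdict


def split_into_chunks(records, num_chunks):
    """Split the input into smaller chunks, one per (simulated) map node."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def map_stage(chunk):
    """Map stage (assumed workload): emit a (word, 1) pair for every word."""
    return [(word, 1) for line in chunk for word in line.split()]


def partition_for(key, num_reduce_nodes):
    """Choose a reduce node for a key using a deterministic hash (an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_nodes


def shuffle(map_outputs, num_reduce_nodes):
    """Regroup intermediate pairs by key and route each key to one reduce node.

    In a real cluster this step moves data across the network; here the
    transfer is simulated by appending to per-node dictionaries.
    """
    reduce_inputs = [defaultdict(list) for _ in range(num_reduce_nodes)]
    for pairs in map_outputs:
        for key, value in pairs:
            target = partition_for(key, num_reduce_nodes)
            reduce_inputs[target][key].append(value)
    return reduce_inputs


def reduce_stage(grouped):
    """Reduce stage (assumed workload): sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run(records, num_map_nodes=2, num_reduce_nodes=2):
    """Run all stages and return one result dictionary per reduce node."""
    chunks = split_into_chunks(records, num_map_nodes)
    map_outputs = [map_stage(chunk) for chunk in chunks]
    reduce_inputs = shuffle(map_outputs, num_reduce_nodes)
    return [reduce_stage(grouped) for grouped in reduce_inputs]
```

In this sketch every intermediate pair passes through the shuffle function, which is the step that corresponds to the network transfer between map nodes and reduce nodes.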
Within a MapReduce implementation, most computations involve applying a user-specified map operation to each logical “record” in the input in order to compute a set of intermediate key/value pairs, and then applying a user-specified reduce operation to all the values that share the same key. Thus, the MapReduce computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: a map function and a reduce function. The user-written map function receives an input key/value pair and produces a set of intermediate key/value pairs. The reduce function, also written by the user, receives and merges together the intermediate values that share the same key.
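As a non-limiting example, the division of labor between the library and the two user-written functions may be sketched as follows. The sketch assumes an inverted-index computation in which each input pair is a document identifier and its text; the names map_reduce, index_map, and index_reduce are hypothetical, and the single-process map_reduce driver merely stands in for a distributed library.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical library driver: apply map_fn, group by key, apply reduce_fn.

    inputs is an iterable of (key, value) pairs. The result is a list of
    output (key, value) pairs, ordered by intermediate key.
    """
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        # The map function may emit any number of intermediate pairs.
        for out_key, out_value in map_fn(in_key, in_value):
            intermediate[out_key].append(out_value)

    output = []
    for key in sorted(intermediate):
        # The reduce function sees every value that shares this key.
        output.extend(reduce_fn(key, intermediate[key]))
    return output


def index_map(doc_id, text):
    """User-written map function: emit (word, doc_id) for each word."""
    for word in text.lower().split():
        yield word, doc_id


def index_reduce(word, doc_ids):
    """User-written reduce function: merge document identifiers for one word."""
    yield word, sorted(set(doc_ids))


if __name__ == "__main__":
    documents = [("d1", "Map reduce"), ("d2", "reduce shuffle reduce")]
    for word, doc_ids in map_reduce(documents, index_map, index_reduce):
        print(word, doc_ids)
```

Only index_map and index_reduce are specific to the computation; a different pair of functions may be passed to the same driver without changing it.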
Although Java™ code is commonly used with MapReduce, virtually any programming language may be used to implement the map and reduce functions.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope as defined by the appended examples. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the examples. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning “having the potential to”), rather than the mandatory sense (i.e., meaning “must”). Similarly, the words “include,” “including,” and “includes” mean “including, but not limited to.”