MapReduce is a programming model and an associated implementation for processing and generating large data sets, and it is applicable to a broad class of compute tasks. A Map function processes input data to generate sets of intermediate key/value pairs, and a Reduce function merges all intermediate values associated with the same intermediate key. Multiple MapReduce infrastructures implement the MapReduce framework. Some of these infrastructures are self-deployed, requiring the user to own the vast number of computers required; but most new implementations are hosted, such that the computational resources are borrowed via a web service for the duration of the compute job.
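As a non-limiting, hypothetical illustration (the classic word-count example, with function names chosen here for exposition rather than taken from any particular implementation), the Map and Reduce functions might take the following form, with a simplified single-machine driver standing in for the framework's shuffle step:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for each word in the input."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values associated with one key."""
    return (word, sum(counts))

def run(documents):
    """Simplified driver: group intermediate pairs by key (the 'shuffle'),
    then apply the Reduce function to each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())
```

In a real MapReduce infrastructure, the grouping step above is performed by the framework across many machines; the user supplies only the two functions.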
MapReduce is being applied to an increasing range of “big data” analysis tasks in fields involving web traffic, advertising, financial data, medical research, census data, etc.
In the MapReduce paradigm, the user supplies computer code for two simple algorithms—Map and Reduce—that are specific to the compute task. The MapReduce infrastructure then deploys the custom Map and Reduce algorithms on a large number (e.g., thousands) of machines, monitoring them to completion and restarting them as needed, and delivering, sorting, and collecting input/output data from the machines as required by the typical MapReduce paradigm.
The compute job is said to be “sharded,” with “shards” (portions) of data being fed to compute shards (machines (e.g., computer systems or devices) executing multiple instances of the custom Map and Reduce algorithms).
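A minimal sketch of how intermediate data might be assigned to compute shards (hypothetical; the hash-based routing shown here is illustrative and not drawn from any cited infrastructure):

```python
import hashlib

def shard_for_key(key, num_shards):
    """Assign an intermediate key to one of num_shards reduce shards
    using a stable hash, so that every occurrence of the same key
    is routed to the same shard regardless of which mapper emitted it."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route intermediate (key, value) pairs to reduce shards.
pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
routed = {}
for key, value in pairs:
    routed.setdefault(shard_for_key(key, num_shards=4), []).append((key, value))
```

Because the assignment depends only on the key, all values for a given key arrive at a single reduce shard, which is what allows each Reduce instance to see the complete set of intermediate values for its keys.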
A MapReduce infrastructure allows the programmer to accomplish the custom compute task by programming the custom and relatively simple Map and Reduce algorithms, without needing to manage the execution and coordination of the large number of required computers.
The MapReduce framework is described by Jeff Dean and Sanjay Ghemawat in their paper titled “MapReduce: Simplified Data Processing on Large Clusters” (published in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, Calif., December 2004), which is available at research.google.com/archive/mapreduce.html and which is hereby incorporated by reference in its entirety. The Apache Software Foundation developed and offers an open-source MapReduce implementation called Hadoop.
Today, several hosted MapReduce services exist as businesses. The most prominent is Amazon Web Service's Elastic MapReduce (Amazon EMR), which offers Hadoop within its hosted compute service. Information regarding Amazon EMR can be found at the following URL: aws.amazon.com/elasticmapreduce/.
Other hosting companies with large data centers, such as Google of Mountain View, Calif. and Microsoft of Redmond, Wash., are offering their own hosted MapReduce solutions.
MapR Technologies, Inc. of San Jose, Calif. offers a hosted Hadoop infrastructure on Google's compute engine (www.mapr.com/).
Hadapt, Inc. of Cambridge, Mass. offers a hosted Hadoop infrastructure and connects it to a relational database model (www.hadapt.com).
Datameer, Inc. of San Mateo, Calif. is another Hadoop integrator (www.datameer.com/).
There also exist some interpreted and/or in-browser MapReduce implementations, but these are mostly demonstration or teaching tools. Unlike the disclosed system, these are not backed by an actual scaled MapReduce infrastructure that can be applied to high-volume compute jobs or to managing or debugging those jobs. An example is mapreduce-js, described as an “educational framework” (code.google.com/p/mapreduce-js/).