Various systems exist to perform analysis on very large data sets (e.g., petabytes of data). One such example is a Map Reduce distributed computing system for large analytic jobs. In such a system, a master node manages the storage of data blocks in one or more data nodes. The master node and data nodes are server computers with local storage. When the master node receives a processing task, it partitions that task into smaller jobs and assigns those jobs to the different subordinate (data) nodes. This is the mapping part of Map Reduce, where the master node maps processing jobs to the subordinate nodes.
The subordinate nodes perform their assigned processing jobs and return their respective outputs to the master node. The master node then processes the different outputs to provide a result for the original processing task. This is the reducing part of Map Reduce, where the master node reduces the outputs from multiple subordinate nodes into a single result. Map Reduce is often used by search engines to parse through large amounts of data and return search results to a user quickly and efficiently. One example of a Map Reduce system is the Hadoop™ framework from the Apache Software Foundation, which stores its data in the Hadoop™ Distributed File System (HDFS).
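The mapping and reducing flow described above can be illustrated with a minimal single-machine sketch. Everything in it is an assumption made for illustration: the word-counting workload, the function names (`partition`, `subordinate_job`, `reduce_outputs`, `run_task`), and the use of a thread pool to stand in for separate subordinate nodes. It is not the Hadoop™ API, only a small model of the master partitioning a task, the subordinate nodes processing their jobs, and the master combining the outputs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(task_data, num_nodes):
    """Master node: split the task into one job per subordinate node."""
    return [task_data[i::num_nodes] for i in range(num_nodes)]


def subordinate_job(lines):
    """Subordinate node: count the words in its assigned lines (hypothetical workload)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_outputs(outputs):
    """Master node: combine the per-node outputs into one result."""
    total = Counter()
    for partial in outputs:
        total.update(partial)
    return total


def run_task(task_data, num_nodes=3):
    """Map jobs to the (simulated) subordinate nodes, then reduce their outputs."""
    jobs = partition(task_data, num_nodes)
    # Threads stand in for subordinate nodes; a real system would use separate servers.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        outputs = list(pool.map(subordinate_job, jobs))
    return reduce_outputs(outputs)


if __name__ == "__main__":
    documents = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    print(run_task(documents, num_nodes=2))
```

In this sketch the master performs the final combination itself, mirroring the description above; other Map Reduce designs may distribute the reduce step as well.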
The HDFS framework relies on data replication to provide increased reliability. For example, if one data node fails to operate, the data can be accessed from another data node. The master node commands that multiple copies of the data be made, and the data nodes comply by performing a server-to-server replication.
In one example server-to-server replication process, a data node that has a copy of the data sends the data over a network (e.g., a layer 2 connection, such as Ethernet) to another node that saves the data in its own local storage. However, the amount of data to be copied can be quite large, and copying it consumes network bandwidth. Additionally, a conventional von Neumann processor architecture passes the data through the processor, so large data transfers in systems with such processors can also consume large amounts of computer processing cycles, computer bus bandwidth, and computer memory. Thus, keeping additional copies of data may increase reliability, but it also has a cost in bandwidth and processing power. Conventional distributed processing systems often incur too much cost in providing data replication.
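The bandwidth cost of server-to-server replication can be estimated with simple arithmetic. The sketch below is illustrative only: the function names, the assumption that every additional copy crosses the network exactly once, and the example data size and link speed are all hypothetical and are not taken from any particular system.

```python
def replication_network_bytes(data_bytes, total_copies):
    """Bytes sent over the network to create the extra copies.

    Assumes one copy already exists and each additional copy is
    transferred across the network exactly once.
    """
    if total_copies < 1:
        raise ValueError("total_copies must be at least 1")
    return data_bytes * (total_copies - 1)


def transfer_seconds(num_bytes, link_bits_per_second):
    """Idealized transfer time, ignoring protocol overhead and contention."""
    return num_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    one_terabyte = 10**12          # hypothetical data set size, in bytes
    ten_gigabit = 10 * 10**9       # hypothetical link speed, in bits per second
    sent = replication_network_bytes(one_terabyte, total_copies=3)
    print(sent, "bytes sent;", transfer_seconds(sent, ten_gigabit), "seconds")
```

Under these assumptions the network traffic grows linearly with the number of extra copies, which is why replication of very large data sets can become costly.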