The disclosed embodiments relate generally to methods and apparatus for storing data from Hadoop to a separate storage system.
MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. The model is inspired by the map and reduce functions commonly used in functional programming. MapReduce is often applied to perform distributed computing on clusters of computers. One popular free implementation of MapReduce is provided via Apache Hadoop.
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (e.g., nodes). The nodes may be collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease transmission of data.
The MapReduce functionality is provided via two distinct steps: a map step and a reduce step. In the map step, a master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may in turn divide its sub-problem further and distribute the pieces to other worker nodes, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. In the reduce step, the master node collects the answers to all the sub-problems and combines them to form the output, i.e., the answer to the problem it was originally trying to solve.
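The map and reduce steps described above can be illustrated with a minimal single-process sketch. It is not Hadoop code and does not use any Hadoop API; the word-counting task and the names `split_input`, `map_step`, `reduce_step`, and `word_count` are hypothetical and chosen only for illustration. The workers are simulated by a simple loop, whereas in a cluster each sub-problem would be handled by a separate node.

```python
from collections import Counter
from functools import reduce


def split_input(words, num_workers):
    """Master node: divide the input into num_workers smaller sub-problems."""
    return [words[i::num_workers] for i in range(num_workers)]


def map_step(chunk):
    """Worker node: solve one sub-problem (count the words in its chunk)."""
    return Counter(chunk)


def reduce_step(partial_counts):
    """Master node: combine the answers to all sub-problems into one output."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())


def word_count(text, num_workers=3):
    words = text.lower().split()
    chunks = split_input(words, num_workers)
    # Simulated sequentially here; on a cluster each chunk goes to a worker.
    partial_counts = [map_step(chunk) for chunk in chunks]
    return dict(reduce_step(partial_counts))


if __name__ == "__main__":
    print(word_count("a b a c b a"))
```

Because the partial counts are combined by addition, the result does not depend on how the input is divided among the workers, which is what allows the sub-problems to be processed independently.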
Data that is generated via the map and reduce steps of Hadoop is typically stored in the Hadoop Distributed File System (HDFS). The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS stores large files across multiple machines. With the default replication factor of 3, data is stored on three data nodes: two on the same rack, and one on a different rack. Data nodes can communicate with each other to rebalance data, to move copies around, and to keep the replication of data high.
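The replica layout described above (three copies, two on one rack and one on a different rack) can be sketched as follows. This is only an illustration of that layout under simplifying assumptions, not the placement logic actually used by HDFS; the function `place_replicas` and the `nodes_by_rack` mapping are hypothetical names introduced here.

```python
import random


def place_replicas(nodes_by_rack, rng=None):
    """Choose three data nodes: two on one rack and one on a different rack.

    nodes_by_rack is a hypothetical mapping of rack name -> list of node names.
    Returns a list of (rack, node) pairs.
    """
    rng = rng or random.Random()
    # A rack can hold the pair of replicas only if it has at least two nodes.
    pair_candidates = [r for r, nodes in nodes_by_rack.items() if len(nodes) >= 2]
    if not pair_candidates:
        raise ValueError("need a rack with at least two data nodes")
    pair_rack = rng.choice(pair_candidates)
    # The third replica must live on a different, non-empty rack.
    other_racks = [r for r, nodes in nodes_by_rack.items() if r != pair_rack and nodes]
    if not other_racks:
        raise ValueError("need a second rack with at least one data node")
    single_rack = rng.choice(other_racks)
    first, second = rng.sample(nodes_by_rack[pair_rack], 2)
    third = rng.choice(nodes_by_rack[single_rack])
    return [(pair_rack, first), (pair_rack, second), (single_rack, third)]


if __name__ == "__main__":
    cluster = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
    print(place_replicas(cluster, random.Random(0)))
```

Spreading the copies over two racks means that the loss of a single rack still leaves at least one replica available, while keeping two copies on one rack limits the amount of data that has to cross rack boundaries.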