Many parallel data processing systems, such as Hadoop, follow the master-worker design pattern and comprise name nodes and data nodes. In this pattern, the name node assumes the role of master and coordinates all analytic processing sub-tasks among the data nodes. Each data node serves as a worker: it takes one sub-task and analyzes a subset of the data file. The results generated by the data nodes are then combined through a series of steps to produce a final result.
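The master-worker pattern described above can be illustrated with a minimal sketch. It is not Hadoop code: the function names, the word-count sub-task, and the use of a thread pool to stand in for data nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_count_words(lines):
    """Hypothetical sub-task: a data node counts the words in its subset."""
    return sum(len(line.split()) for line in lines)


def master_run(lines, num_workers):
    """Hypothetical name node: split the work, dispatch it, combine results."""
    # Assign every num_workers-th line to the same worker.
    subsets = [lines[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns the partial results in the order of the subsets.
        partials = list(pool.map(worker_count_words, subsets))
    # Combining step: here a simple sum of the partial counts.
    return partials, sum(partials)


data_file = ["a b c", "d e", "f", "g h i j"]
partials, total = master_run(data_file, num_workers=2)
```

In this sketch the combining step is a single sum; a real system may combine partial results through several intermediate stages.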
Even though each data node typically processes only a subset of the data file, the complete data file is deployed to each of the data nodes to achieve redundancy, parallelism, and reliability. Deployment starts by striping the data file into multiple data blocks. These data blocks are transmitted from the data file source to the first data node, which stores them in its storage. The first data node then propagates the data blocks to the next peer data node, which likewise stores them in its storage. This process is repeated in a pipeline fashion until the data has been deployed to all data nodes.
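The striping and pipelined propagation described above can be sketched as follows. This is a simplified in-memory model, not an actual deployment mechanism: the fixed `block_size`, the function names, and the use of Python lists to represent each node's storage are assumptions for illustration.

```python
def stripe(data, block_size):
    """Split the data file into fixed-size data blocks (last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def deploy_pipeline(data, num_nodes, block_size):
    """Copy every block to every data node, hop by hop, in pipeline order."""
    blocks = stripe(data, block_size)
    storage = [[] for _ in range(num_nodes)]  # one local store per data node
    for block in blocks:
        # The block enters at the first node, is stored there, and is then
        # forwarded to the next peer node until the last node has stored it.
        for node in range(num_nodes):
            storage[node].append(block)
    return storage


stores = deploy_pipeline(b"abcdefghij", num_nodes=3, block_size=4)
```

After the call, every entry of `stores` holds the same ordered list of blocks, which mirrors the statement that the complete data file ends up on each data node.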
Assuming that the data is deployed to N data nodes, the total cost of deploying the data is made up of the following steps. First, the first data node generates storage traffic by transferring data blocks from the data source. It then generates further storage traffic by transferring the data blocks through the storage fabric switches to the target storage device. The target storage device writes the received data blocks to storage and sends a response back through the storage fabric to the data node to indicate the status of the write. Finally, the first data node generates network traffic by sending the data blocks to the next data node over the network. The same steps are repeated at each subsequent data node until the data has been deployed to all N data nodes.
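One way to make this cost concrete is to count the operations generated per data block. The sketch below is an illustrative tally, not a formula taken from the text: it counts operations rather than bytes or latency, and it assumes that the last data node does not forward blocks any further, so only N - 1 nodes generate network traffic. The function and key names are hypothetical.

```python
def deployment_cost(num_nodes, num_blocks):
    """Count the operations needed to deploy num_blocks to num_nodes nodes."""
    if num_nodes < 1 or num_blocks < 0:
        raise ValueError("need at least one node and a non-negative block count")
    return {
        # The first data node pulls every block from the data source once.
        "source_transfers": num_blocks,
        # Each node pushes every block through the storage fabric.
        "fabric_transfers": num_nodes * num_blocks,
        # Each target storage device writes every block.
        "device_writes": num_nodes * num_blocks,
        # Each write is acknowledged back through the storage fabric.
        "write_responses": num_nodes * num_blocks,
        # Assumption: every node except the last forwards each block onward.
        "network_forwards": (num_nodes - 1) * num_blocks,
    }


cost = deployment_cost(num_nodes=3, num_blocks=4)
```

Under these assumptions, the storage-side counts grow linearly with the number of data nodes, which is the overhead the paragraph above describes.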