The landscape of data-intensive processing has evolved significantly. Such processing has become far more pervasive and is now accessible to a broader user population. Several factors are responsible for this development. First, the volume of available data has grown tremendously as a result of the proliferation of devices. Second, data storage costs have fallen dramatically, making it cost-effective for institutions and individuals to retain large volumes of data. Third, new programming paradigms, such as Map-Reduce and Pig, have emerged that enable efficient processing of large data sets on clusters of commodity hardware. Open-source implementations of these paradigms, such as Hadoop, have further promoted this trend.
Commodity computing is a key enabler in the development and success of large-scale data analytics in a Cloud environment. This paradigm addresses the scalability problem by “scaling out”, that is, by adding inexpensive computing nodes (machines). As a consequence, failures are frequent and have become the rule rather than the exception in typical Cloud environments. For example, in the context of data analytics, Google Inc. has reported at least one disk failure in every run of a 6-hour Map-Reduce job on a cluster of 4,000 machines. Not surprisingly, fault tolerance is considered a primary goal in the design and development of middleware and application software that processes data on such a large scale. Both the performance degradation resulting from failures and the cost of handling them depend on the nature of the application and its requirements.
Replication is one mechanism that has been widely used to improve data availability in data-intensive applications. The availability of intermediate data is important to the performance of dataflows, since lost intermediate data has to be regenerated before the dataflow can advance. Consequently, recovering from even a single failure may require re-executing multiple stages of the dataflow that had already completed.
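To make the cost of lost intermediate data concrete, the following is a minimal sketch under simplifying assumptions that are ours and not taken from any particular system: the dataflow is a linear chain of stages numbered from 1, stage k reads only the output of stage k-1, and the original input to stage 1 is always available. The function name `stages_to_rerun` and the `available` set are hypothetical and used only for illustration.

```python
def stages_to_rerun(next_stage, available):
    """Return the completed stages that must be re-executed before
    `next_stage` can run.

    Assumptions (illustrative only): a linear dataflow in which stage k
    reads only the output of stage k-1, and the original input to
    stage 1 is always available.

    `available` is the set of stage numbers whose intermediate output
    still exists, for example because a replica survived the failure.
    """
    rerun = []
    k = next_stage - 1
    # Walk backwards until a surviving intermediate output (or the
    # original input, represented by k == 0) is found.
    while k >= 1 and k not in available:
        rerun.append(k)
        k -= 1
    # Re-execution has to proceed in forward order.
    return list(reversed(rerun))


# Stages 1-3 have completed and stage 4 is about to start.
no_failure = stages_to_rerun(4, {1, 2, 3})   # every output survived
partial_loss = stages_to_rerun(4, {1})       # outputs of stages 2 and 3 lost
total_loss = stages_to_rerun(4, set())       # all intermediate outputs lost
```

In this sketch, losing the outputs of stages 2 and 3 means both stages must run again before stage 4 can start, whereas a surviving replica of the output of stage 3 would avoid any re-execution.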