Multi-stage or dataflow programming frameworks such as MapReduce and Pig have gained enormous popularity for large-scale parallel data processing in the last few years. Websites such as Facebook, The New York Times, Yahoo!, and many others use Hadoop, an open-source implementation of MapReduce, for a variety of data processing needs. Moreover, there is a growing need to add general-purpose data-analytic capabilities to non-traditional data-intensive applications. For example, several online applications, such as financial applications, demand the ability to collect and process large data sets in an ad-hoc, online manner, without having to submit batch processing jobs to a data warehouse.
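To make the programming model concrete, the following is a minimal sketch of the two-phase map/reduce computation that frameworks like Hadoop execute across a cluster; here the parallel shuffle and scheduling are simulated by a single in-process driver, and the word-count job is the standard illustrative example, not a specific application from this work.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit an intermediate (word, 1) pair for each word in an input line.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: aggregate all counts observed for one word.
    return (key, sum(values))

def run_job(records):
    # Shuffle: group intermediate pairs by key, as the framework would do
    # across the network between the map and reduce stages.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Because the user supplies only the map and reduce functions, the framework is free to parallelize each phase across many machines, which is what makes the model attractive for large-scale data processing.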
Two factors have played a catalyzing role in this trend: first, the low economic and technological entry barriers of open-source data processing tools such as Hadoop; and second, the decreasing price of storage capacity. As these trends continue and gain economic significance by helping service providers differentiate themselves from their competitors, the vision of data-analytic clouds that manage resources efficiently to execute dataflows while meeting SLA requirements becomes more realizable.