The analysis of massive amounts of data is becoming a routine activity in many commercial and academic organizations. Internet companies, by way of example only, collect large amounts of data such as content produced by web crawlers, service logs and click streams. Analyzing these data sets may require processing tens or hundreds of terabytes of data. Such data sets are commonly referred to as “big data.” A data set characterized as big data is so large that it is beyond the capabilities of commonly used software tools to manage or process the data, or at least to do so within a reasonable time frame. To perform the analysis tasks, researchers and practitioners have been developing a diverse array of Massively Distributed Computing Platforms (MDCP) running on large clusters of commodity machines (nodes). Examples of such platforms include MapReduce from Google™ and its open-source implementation Hadoop, Dryad from Microsoft™, MPP Database from Greenplum™ and Spark from the AMPLab at the University of California, Berkeley.
The node cluster where an MDCP resides represents a limited set of resource elements, such as central processing unit (CPU), memory, disk and network, which fuel the application programs running in the MDCP. Conceptually, an MDCP application includes one or more jobs, which in turn are composed of short tasks. Each task is typically executed at a dedicated node. In order to optimize the performance of MDCP applications, the resources of the cluster need to be managed both effectively and efficiently.
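By way of illustration only, the application, job, task and node hierarchy described above may be sketched as a small data model. The sketch below is not taken from any of the platforms named above. All class names, field names and the first-fit placement rule are hypothetical and chosen solely for illustration, and disk and network resources are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """A short unit of work; resource units are illustrative assumptions."""
    name: str
    cpu: int                    # CPU cores requested
    memory_gb: int              # memory requested, in gigabytes
    node: Optional[str] = None  # name of the node the task runs on, once placed


@dataclass
class Job:
    """A job is composed of short tasks."""
    name: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Application:
    """An MDCP application includes one or more jobs."""
    name: str
    jobs: List[Job] = field(default_factory=list)


@dataclass
class Node:
    """A cluster node holding a limited amount of each resource."""
    name: str
    cpu: int
    memory_gb: int

    def try_reserve(self, task: Task) -> bool:
        """Reserve resources for the task if this node still has enough."""
        if task.cpu <= self.cpu and task.memory_gb <= self.memory_gb:
            self.cpu -= task.cpu
            self.memory_gb -= task.memory_gb
            task.node = self.name
            return True
        return False


def place_first_fit(app: Application, nodes: List[Node]) -> List[Task]:
    """Assign each task to the first node with sufficient free resources.

    First-fit is a hypothetical stand-in policy, not a method described
    in the text. Returns the tasks that could not be placed.
    """
    unplaced = []
    for job in app.jobs:
        for task in job.tasks:
            # any() stops at the first node that accepts the task.
            if not any(node.try_reserve(task) for node in nodes):
                unplaced.append(task)
    return unplaced
```

In this sketch, each placed task records the single node on which it executes, and a task that fits on no node is returned to the caller. The latter case reflects the limited nature of the cluster resources noted above.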