Collecting and analyzing increasingly large amounts of data is integral to the efficient operation of modern-day enterprises. Data-centric programming models like Apache Hadoop MapReduce or Apache Spark are commonly used for such data analysis tasks. The Apache Hadoop project (hereinafter “Hadoop”) is an open-source software framework for developing software for reliable, scalable, and distributed processing of large data sets across clusters of commodity machines. Hadoop includes a distributed file system, known as the Hadoop Distributed File System (HDFS). HDFS links together the file systems on local nodes to form a unified file system that spans an entire Hadoop cluster. Hadoop can also be supplemented by other Apache projects, including Apache Hive (hereinafter “Hive”) and Apache HBase (hereinafter “HBase”). Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying. HBase is a scalable, distributed NoSQL (No Structured Query Language) database or data store that supports structured data storage for large tables.
MapReduce and Spark jobs typically include multiple tasks, each processing a partition of the overall input for the job. A cluster scheduler, like Apache Hadoop YARN or Apache Mesos, allows cluster computing resources to be shared among several jobs, potentially from multiple users. Existing cluster schedulers (e.g., YARN) support a scheduling model based on resource requests. In other words, a job submitted by a user can include a request for the resources (e.g., CPU, memory, etc.) needed to process the job. In turn, a cluster scheduler can allocate those resources at nodes in a computer cluster when they become available. Such resource allocations are generally referred to as containers. The computing resources allocated to a given container are reserved exclusively for use within that container and cannot be used by other containers, even if the allocated resources are not currently being utilized.
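The request-based model described above can be illustrated with a minimal sketch. This is not the YARN or Mesos API; the `Node` and `Container` classes, their fields, and the capacities used are hypothetical names and values chosen only to show how exclusive reservation behaves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Container:
    """A resource allocation reserved exclusively for one task (hypothetical model)."""
    task_id: str
    cpu: int        # requested virtual cores
    memory_mb: int  # requested memory in MB


@dataclass
class Node:
    """A cluster node with a fixed capacity (hypothetical model)."""
    cpu: int
    memory_mb: int
    containers: List[Container] = field(default_factory=list)

    def free_cpu(self) -> int:
        # Capacity minus everything that has been reserved, used or not.
        return self.cpu - sum(c.cpu for c in self.containers)

    def free_memory_mb(self) -> int:
        return self.memory_mb - sum(c.memory_mb for c in self.containers)

    def allocate(self, task_id: str, cpu: int, memory_mb: int) -> Optional[Container]:
        """Grant a container only if the requested amounts still fit.

        Actual usage inside existing containers is never consulted, so
        reserved-but-idle resources stay unavailable to new requests.
        """
        if cpu <= self.free_cpu() and memory_mb <= self.free_memory_mb():
            container = Container(task_id, cpu, memory_mb)
            self.containers.append(container)
            return container
        return None


# Assumed example capacities: 8 cores and 16 GB on one node.
node = Node(cpu=8, memory_mb=16384)
first = node.allocate("task-1", cpu=4, memory_mb=8192)
second = node.allocate("task-2", cpu=4, memory_mb=8192)
# The node is now fully reserved, so even a small request is refused,
# regardless of how much of the reserved capacity is actually in use.
third = node.allocate("task-3", cpu=1, memory_mb=1024)
```

In this sketch the third request is refused purely on the basis of reservations, which is the behavior the paragraph above attributes to container-based allocation.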
The amount of computing resources required to process a given task can be difficult to predict. Accurately estimating the resource requirements of a job or its constituent tasks is hard because: (i) the resource usage of a task varies over time, and (ii) resource usage can vary across tasks of the same job based on the input they process. Users are expected to estimate and request the peak usage across all tasks to ensure job completion. This problem is further exacerbated by the fact that end-users can use convenience wrapper libraries like Apache Hive to create the majority of these jobs, and are consequently unaware of their characteristics. For these reasons, in practice, users end up using defaults, picking very conservative estimates of peak utilization (e.g., based on historical usage), or copying resource requirements from other workflows that are known to work. The resulting over-allocation of resources to jobs and tasks leads to resource fragmentation and severe under-utilization of the computing resources in the cluster.
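The effect of requesting the job-wide peak for every task can be shown with a small arithmetic sketch. The per-task figures below are invented for illustration and are not measurements from any real cluster; `reserved_fraction_used` is a hypothetical helper.

```python
def reserved_fraction_used(requested_mb, used_mb):
    """Return the fraction of reserved memory that tasks actually use."""
    return sum(used_mb) / sum(requested_mb)


# Hypothetical peak memory use of four tasks of the same job, in MB.
peak_used_mb = [1024, 2048, 512, 4096]

# Each task requests the peak across all tasks so that no task fails.
request_per_task_mb = max(peak_used_mb)
requested_mb = [request_per_task_mb] * len(peak_used_mb)

fraction = reserved_fraction_used(requested_mb, peak_used_mb)
idle_mb = sum(requested_mb) - sum(peak_used_mb)

print(f"reserved: {sum(requested_mb)} MB, idle at peak: {idle_mb} MB")
print(f"fraction of reserved memory used at peak: {fraction:.3f}")
```

Under these assumed numbers, more than half of the reserved memory stays idle even when every task is at its own peak, and because usage also varies over time, the idle share at any given moment would be larger still.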