Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Accordingly, large data centers acting as clouds now power many of the high-traffic services in use today, including large search engines, social networks, and e-commerce websites.
Distributed resource management plays a key role in managing the applications (or “frameworks”) that enable services in such data centers. Examples of such applications include centralized web servers (e.g., the Apache HyperText Transfer Protocol (HTTP) Server), databases (e.g., MySQL), and distributed data processing systems such as Apache Spark, Hadoop, and Flink.
The execution of multiple applications can occur on a cluster provisioned on a private infrastructure or the infrastructure of a public cloud provider, such as Amazon Web Services (AWS). Such clusters can be statically or dynamically partitioned between the applications. For example, using static partitioning, an Apache Hadoop cluster could be deployed on five physical server computing devices while another set of server computing devices could be used to deploy Apache Spark.
An alternative approach is to dynamically partition the available servers/cluster by exposing a shared resource pool to the data processing applications via an application programming interface (API). In this approach, it is up to the application to determine which resources it will use to schedule tasks. This approach can greatly simplify the design of applications, as an application does not need to be concerned with the management and allocation of the underlying distributed resources (e.g., allocation and provisioning of Virtual Machines (VMs) or containers). Moreover, such approaches can increase utilization of the physical resources, as these resources can now be dynamically shared across multiple applications. For example, Apache Spark map/reduce tasks can run alongside Apache Hadoop map/reduce tasks and web request processing tasks. In recent years, the cloud and big data communities have developed various resource management systems that follow such a shared resource pool model.
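The utilization argument above can be illustrated with a small calculation. The following is a minimal sketch only: the function names, the cluster size of ten servers, and the per-application demand figures are hypothetical and are chosen purely to contrast a static split with a shared pool.

```python
def static_usage(partitions, demand):
    """Servers in use when each application is confined to its own partition."""
    return sum(min(partitions[app], demand.get(app, 0)) for app in partitions)


def shared_pool_usage(total_servers, demand):
    """Servers in use when all applications draw from one shared pool."""
    return min(total_servers, sum(demand.values()))


# Hypothetical cluster of 10 servers, split 5/5 under static partitioning.
partitions = {"hadoop": 5, "spark": 5}
# Hypothetical demand (in servers) at one point in time.
demand = {"hadoop": 8, "spark": 1}

# Static: Hadoop is capped at its 5 servers, Spark uses 1 of its 5.
static_busy = static_usage(partitions, demand)
# Shared: the combined demand of 9 servers fits into the pool of 10.
shared_busy = shared_pool_usage(10, demand)

print(f"static partitioning: {static_busy}/10 servers busy")
print(f"shared pool:         {shared_busy}/10 servers busy")
```

Under these assumed numbers, the static split leaves four Spark servers idle while Hadoop is short of capacity, whereas the shared pool can serve both demands.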
One such system is Apache Mesos. Mesos abstracts machine resources such as the Central Processing Unit (CPU), memory, and storage away from the underlying physical machines or VMs. The system enables dynamic sharing of the underlying resources and multiplexes them among multiple applications. Applications running on Mesos “see” resources as one shared resource pool. Thus, a particular application does not own resources; instead, the resources are managed by Mesos and offered to different applications according to a resource allocation policy. In Mesos, the resource allocation policy follows a “pessimistic” approach. Pessimistic resource allocation works by offering all resources to only one application at a time. The application then decides on which of the offered resources it will launch tasks (e.g., a map/reduce task or a web server). While the application is utilizing these resources, they are locked against use by other applications. Once the application is done with its work, the resources are returned to the shared resource pool and offered to another application. The term “pessimistic” aptly describes this approach because its design fundamentally assumes that conflicts between competing applications will happen frequently, and that they should therefore be avoided by locking out other applications from attempting to launch their own tasks on the utilized resources.
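The pessimistic policy described above can be sketched in a few lines. This is a simplified illustration of the description in this section, not the Mesos API: the class `PessimisticAllocator`, its methods, and the resource names are all hypothetical.

```python
class PessimisticAllocator:
    """Offers the entire free pool to one application at a time (hypothetical sketch)."""

    def __init__(self, resources):
        self.free = set(resources)
        self.holder = None       # application currently holding the offer
        self.offered = set()     # resources locked by that application

    def offer(self, app):
        """Offer every free resource to `app`; return an empty set if the pool is locked."""
        if self.holder is not None:
            return set()         # another application holds the lock
        self.holder = app
        self.offered = set(self.free)
        self.free.clear()
        return set(self.offered)

    def release(self, app):
        """Return the held resources to the shared pool once `app` is done."""
        if self.holder != app:
            raise ValueError(f"{app} does not hold the current offer")
        self.free |= self.offered
        self.offered = set()
        self.holder = None


allocator = PessimisticAllocator(["cpu0", "cpu1", "mem0"])
spark_offer = allocator.offer("spark")      # spark is offered all three resources
hadoop_offer = allocator.offer("hadoop")    # empty: the pool is locked by spark
allocator.release("spark")                  # spark finishes; resources are reclaimed
hadoop_retry = allocator.offer("hadoop")    # hadoop is now offered all three
```

Because only one application holds an offer at any time, two applications can never launch tasks on the same resource, at the cost of making the second application wait.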
Another system based on a shared pool model is Omega. In contrast to Mesos (which uses a pessimistic resource allocation policy), Omega implements an “optimistic” resource allocation policy. In Omega, every application sees the entire state (i.e., the shared pool) of the cluster, and no per-application locking of resources is performed. Instead, applications compete for the resources by attempting to launch tasks on them. Omega performs conflict resolution in the event that two applications attempt to launch tasks requiring the same resources at the same time. The term “optimistic” is used because of the fundamental assumption that conflicts between competing applications will rarely occur.
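The optimistic policy can be sketched in the same spirit. The following is a hypothetical illustration and does not reproduce Omega's actual interfaces: the class `SharedClusterState`, its methods, and the conflict rule (the first committer wins and a conflicting claim is rejected as a whole) are assumptions made for the example.

```python
class SharedClusterState:
    """Cluster state visible to every application, with no per-application locks."""

    def __init__(self, resources):
        self.owner = {r: None for r in resources}   # None means the resource is free

    def snapshot(self):
        """Give an application its own copy of the full cluster state."""
        return dict(self.owner)

    def try_commit(self, app, wanted):
        """Claim all of `wanted` for `app`, or reject the whole claim on any conflict."""
        if any(self.owner[r] is not None for r in wanted):
            return False                             # another application got there first
        for r in wanted:
            self.owner[r] = app
        return True


state = SharedClusterState(["cpu0", "cpu1", "mem0"])

# Both applications see the whole cluster as free and plan concurrently.
spark_view = state.snapshot()
hadoop_view = state.snapshot()

spark_ok = state.try_commit("spark", ["cpu0", "cpu1"])
hadoop_ok = state.try_commit("hadoop", ["cpu0", "mem0"])   # conflict on cpu0

# After a rejected claim, hadoop refreshes its view and retries with what is free.
free_now = [r for r, owner in state.snapshot().items() if owner is None]
hadoop_retry_ok = state.try_commit("hadoop", free_now)
```

Conflicts are detected only at commit time, so applications never wait for a lock; the price is an occasional rejected claim that must be retried.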
The approaches used by Mesos and Omega can be seen as techniques for enforcing concurrency control in distributed systems. In the former case (i.e., the pessimistic resource allocation policy), the cluster state can be manipulated by only one application at a time. In the latter case (i.e., the optimistic resource allocation policy), the cluster state can be manipulated by multiple applications concurrently.