Cloud computing infrastructure has boosted the execution of mixed workloads on the same server, thus promoting better resource utilization and efficiency. However, it is still a challenge to fit heterogeneous workloads in the same cluster, e.g., deciding when to schedule them, where to place them, and how to allocate their resources. Additionally, some workloads increase their resource usage on demand, making the problem even harder to solve.
Current solutions for cluster management can be classified into three main scheduler architectures: (1) Monolithic, (2) Two-level, and (3) Shared-state. A Monolithic scheduler, such as a High Performance Computing (HPC) scheduler, uses a single centralized scheduling algorithm for all jobs. A Two-level scheduler is composed of a single active resource manager that offers resources to multiple, parallel, independent scheduler frameworks that accept or decline the offered resources. See, for example, Hindman et al., “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI '11), pgs. 295-308 (March/April 2011). Shared-state architectures are composed of multiple, parallel schedulers that concurrently access the cluster resources as transactions. See, for example, M. Schwarzkopf et al., “Omega: flexible, scalable schedulers for large compute clusters,” Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13), pgs. 351-364 (April, 2013) (hereinafter “Schwarzkopf”). There are notable trade-offs between these scheduler architectures. For instance, a Monolithic scheduler has total control of the cluster and might plan the best global placement, but it is hard to diversify by adding new policies and specialized implementations, and it may not scale up to the cluster size. Two-level schedulers are more flexible than Monolithic schedulers because they provide scheduler parallelism, but they reduce resource visibility and make it harder to reach decisions that depend on a holistic view of the cluster resources (such as planning local placement for workloads, or performing workload migrations based on performance, energy efficiency and security). Moreover, when resources are pre-allocated for the schedulers, unused resources can be wasted. A Shared-state scheduler has a global view of the cluster resources with which to make all of the holistic decisions, and is very scalable.
However, a Shared-state scheduler is very susceptible to resource racing, and there is no way to guarantee resource allocation for some workloads.
Currently, no techniques are known that combine all of the advantages of the previous architectures while relieving the above-described drawbacks. In current approaches, the schedulers either have all of the resources pre-allocated, as in the Monolithic and Two-level approaches, or only compete for resources, as in the Shared-state approach.
Further, complex schedulers bring about problems with scalability, maintainability and flexible adaptation. To address these problems, the Shared-state approach (see, for example, Schwarzkopf) splits the complex scheduler into many smaller schedulers rather than having only one monolithic scheduler with all of the workloads' logic. On one hand, this Shared-state approach might be beneficial because it provides a global view of the cluster for all workload deployments, so that the best global deployment can potentially be performed. It might also increase scalability by responding to workload requests in parallel, and it might reduce the response time (due to reduced complexity). On the other hand, if multiple schedulers attempt to claim the same resource simultaneously, only one will be successful and the others will have to restart the scheduling process. Hence, if conflicts occur too often, then the scheduling time and the response time will increase.
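By way of illustration only, the conflict behavior described above can be sketched as follows. This is a minimal, single-process sketch rather than the implementation of Schwarzkopf or of any particular scheduler: the names CellState, Machine, schedule and demo_conflict are hypothetical, and it assumes coarse-grained conflict detection in which each machine carries a version counter and a claim fails whenever the machine has changed since the scheduler took its snapshot.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    free_cpus: int
    version: int = 0  # incremented on every successful commit (assumed scheme)


class CellState:
    """Hypothetical shared cluster state visible to every scheduler."""

    def __init__(self, machines):
        self.machines = dict(machines)

    def snapshot(self):
        # Each scheduler plans against a private copy of (free_cpus, version).
        return {n: (m.free_cpus, m.version) for n, m in self.machines.items()}

    def commit(self, name, cpus, seen_version):
        # Optimistic claim: reject it if the machine changed since the
        # snapshot was taken, or if it no longer has enough free CPUs.
        m = self.machines[name]
        if m.version != seen_version or m.free_cpus < cpus:
            return False
        m.free_cpus -= cpus
        m.version += 1
        return True


def schedule(state, cpus, max_attempts=3):
    """Try to place a job; return (machine_name_or_None, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        view = state.snapshot()
        candidates = [n for n, (free, _) in view.items() if free >= cpus]
        if not candidates:
            return None, attempt
        # Illustrative policy: pick the machine with the most free CPUs.
        name = max(candidates, key=lambda n: view[n][0])
        if state.commit(name, cpus, view[name][1]):
            return name, attempt
        # Conflict: another scheduler won the race, so plan again.
    return None, max_attempts


def demo_conflict():
    state = CellState({"m1": Machine(free_cpus=4), "m2": Machine(free_cpus=2)})
    view_a = state.snapshot()  # scheduler A plans
    view_b = state.snapshot()  # scheduler B plans from the same state
    first = state.commit("m1", 2, view_a["m1"][1])   # A claims m1
    second = state.commit("m1", 2, view_b["m1"][1])  # B's view is now stale
    placed, attempts = schedule(state, 2)            # B restarts scheduling
    return first, second, placed, attempts
```

In this sketch the second claim is rejected even though the machine still has enough free CPUs, because the assumed version check treats any intervening change as a conflict. Every such rejection costs the losing scheduler a full planning pass, which is why frequent conflicts increase the scheduling time and the response time.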
Certain techniques classify workloads, detect conflicts, or select general policies. See, for example, J. Bobba et al., “Performance Pathologies in Hardware Transactional Memory,” Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07), pgs. 81-91 (June 2007); M. F. Spear et al., “Conflict Detection and Validation Strategies for Software Transactional Memory,” Proceedings of the 20th international conference on Distributed Computing (DISC '06), pgs. 179-193 (2006); and C. Delimitrou et al., “Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters,” Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (March 2013), respectively. However, no solution currently exists that takes all of these factors into consideration collectively to improve and automate workload conflict resolution based on the infrastructure state. Moreover, the current solutions do not consider the specific conditions that emerge from workload collisions in a shared-state cluster, such as application interference, server overloading, etc.
Further, most current solutions for conflict detection and policy selection use linear programming, which requires a large dataset for predictions, or machine learning, which relies on enormous amounts of data for the test phase, neither of which is practical for real-world applications. Also, most current conflict detection and policy selection techniques do not perform well with missing entries. In that regard, some approaches exist that combine matrix factorization algorithms for classification with a log-likelihood function to fill in predictions for missing entries. These approaches do not, however, select a conflict resolution policy based on workload classification by resource usage intensiveness (central processing unit (CPU), memory, network, and disk), workload execution time, and detection of possible non-deterministic and deterministic conflicts.
Accordingly, improved techniques for datacenter cluster workload management, including improved techniques for conflict detection and policy selection, would be desirable.