Field of the Disclosure
This disclosure relates generally to parallel computing, and more particularly to systems and methods for adaptive contention-aware co-scheduling of hardware contexts for parallel runtime systems on high-utilization shared machines.
Description of the Related Art
Traditionally, parallelism has been exploited in high performance computing (HPC) and multi-threaded servers in which jobs are often run on dedicated machines, or on fixed sets of cores (or hardware contexts) in a shared machine. Traditional HPC jobs have long, stable CPU-bound phases with fixed resource requirements. Traditional servers exploit the ability to process independent requests in parallel. There is often little parallelism within each request. This style of synchronization lets traditional servers run well on current operating systems.
As parallelism is becoming more ubiquitous, there is less programmer effort put into tuning software to run on a particular parallel machine, since there are more different types of machines capable of executing parallel workloads, and the differences between them make it difficult (if not impossible) to tune applications for each one. In addition, many emerging parallel workloads exhibit CPU demands that vary over time. For example, in graph analytic jobs, the degree of parallelism can both vary over time and depend on the structure of the input graph. Other examples include cases in which parallelism is used to accelerate parts of an interactive application (occurring in bursts in response to user input). Current operating systems and runtime systems do not perform well for these types of workloads (e.g., those with variable CPU demands and frequent synchronization between parallel threads). Typical solutions attempt to avoid interference between jobs either by over provisioning machines, or by manually pinning different jobs to different cores/contexts.
Software is increasingly written to run on multi-processor machines (e.g., those with multiple single-core processors and/or those with one or more multi-core processors). In order to make good use of the underlying hardware, customers want to run multiple workloads on the same machine at the same time (i.e. on the same hardware), rather than dedicating a single machine to a respective single workload. In addition, many parallel workloads are now large enough that a single workload can individually scale to use an entire machine; malleable (meaning, for example, that workloads can run over a varying number of hardware contexts, using abstractions such as multi-processing APIs to dynamically schedule loops rather than explicitly creating threads themselves); and/or “bursty” (meaning, for example, that their CPU demand can vary within a single execution, such as with a mix of memory-intensive and/or CPU-intensive phases, and other less resource-intensive phases). Much of the current work in thread placement involves single-threaded programs, and many of the solutions require modified hardware.