1. Field of the Disclosure
This disclosure relates generally to parallel computing, and more particularly to systems and methods for performing fine-grained scheduling of work in runtime systems.
2. Description of the Related Art
Traditionally, parallelism has been exploited in high performance computing (HPC) and multi-threaded servers in which jobs are often run on dedicated machines, or on fixed sets of cores (or hardware execution contexts, also referred to as hardware contexts) in a shared machine. Traditional HPC jobs have long, stable CPU-bound phases with fixed resource requirements. Traditional servers exploit the ability to process independent requests in parallel, and there is often little parallelism within each request. Because independent requests rarely need to synchronize with one another, traditional servers run well on current operating systems.
As parallelism becomes ubiquitous, less programmer effort is put into tuning software to run on a particular parallel machine, since there are more types of machines capable of executing parallel workloads, and the differences between them make it difficult (if not impossible) to tune applications for each one. In addition, many emerging parallel workloads exhibit CPU demands that vary over time. For example, in graph analytic jobs, the degree of parallelism can both vary over time and depend on the structure of the input graph. Other examples include cases in which parallelism is used to accelerate parts of an interactive application (occurring in bursts in response to user input). Current operating systems and runtime systems do not perform well for these types of workloads (e.g., those with variable CPU demands and frequent synchronization between parallel threads). Typical solutions attempt to avoid interference between jobs either by over-provisioning machines, or by manually pinning different jobs to different cores/contexts.
Software is increasingly written to run on multi-processor machines (e.g., those with multiple single-core processors and/or those with one or more multi-core processors). In order to make good use of the underlying hardware, customers want to run multiple workloads on the same machine at the same time (i.e., on the same hardware), rather than dedicating a single machine to a respective single workload. In addition, many parallel workloads are now large enough that a single workload can individually scale to use an entire machine; malleable (meaning, for example, that workloads can run over a varying number of hardware contexts, using abstractions such as multi-processing APIs to dynamically schedule loops rather than explicitly creating threads themselves); and/or “bursty” (meaning, for example, that their CPU demand can vary within a single execution, such as with a mix of memory-intensive and/or CPU-intensive phases and other, less resource-intensive phases).
Parallel runtime systems are often based on distributing the iterations of a loop in parallel across multiple threads in a machine. One issue is how to decide which thread should execute which iterations. If this is done poorly, then either (1) load imbalance may occur, with some threads left idle without work while other threads are “hoarding” work, or (2) excessive overheads may be incurred, with the cost of scheduling work outweighing the speed-ups achieved by parallelism. To address this, programmers often need to tune workloads to indicate the granularity at which work should be distributed between threads. Doing this tuning well depends on both the machine being used and the workload's input data. However, parallelism is increasingly used in settings where manual tuning is not possible, e.g., software may need to run across a wide range of hardware, or a wide range of different inputs.
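The granularity trade-off described above can be illustrated with a minimal sketch of a dynamically scheduled parallel loop. This is not the scheduling technique of this disclosure; it is a generic chunked work-distribution loop, and the names `parallel_for`, `body`, `n_threads`, and `chunk` are hypothetical names chosen for illustration. The `chunk` parameter stands in for the manually tuned granularity: each thread repeatedly claims the next `chunk` iterations from a shared counter.

```python
import threading


def parallel_for(n_iters, body, n_threads=4, chunk=64):
    """Run body(i) for every i in range(n_iters) across n_threads threads.

    Iterations are handed out `chunk` at a time from a shared counter.
    Returns a list with the number of chunks each thread claimed.
    (Hypothetical helper for illustration only.)
    """
    if chunk < 1:
        raise ValueError("chunk must be at least 1")

    next_start = 0                # first iteration not yet claimed
    lock = threading.Lock()       # protects next_start and claims
    claims = [0] * n_threads      # scheduling operations per thread

    def worker(tid):
        nonlocal next_start
        while True:
            # Scheduling step: claim the next block of iterations.
            with lock:
                start = next_start
                if start >= n_iters:
                    return        # no work left
                next_start = min(start + chunk, n_iters)
                end = next_start
                claims[tid] += 1
            # Work step: execute the claimed iterations.
            for i in range(start, end):
                body(i)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claims


if __name__ == "__main__":
    squares = [0] * 1000

    def store_square(i):
        squares[i] = i * i

    coarse = parallel_for(1000, store_square, chunk=250)
    fine = parallel_for(1000, store_square, chunk=1)
    print("scheduling operations (chunk=250):", sum(coarse))
    print("scheduling operations (chunk=1):  ", sum(fine))
```

With a large `chunk`, few scheduling operations are needed, but a thread that claims an expensive block may finish long after the others (load imbalance). With a small `chunk`, work is spread more evenly, but every block requires a trip through the shared counter (scheduling overhead). Note that in standard CPython builds with the global interpreter lock, threads do not speed up CPU-bound loop bodies, so the sketch shows only how iterations are distributed, not the resulting performance.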