The scheduling of parallel jobs has long been an active area of research. It is a challenging problem because the performance and applicability of parallel scheduling algorithms are highly dependent upon factors at different levels: the workload, the parallel programming language, the operating system (OS), and the machine architecture. The importance of job scheduling strategies stems from the impact that they can have on the resource utilization and the response time of the system.
Time-sharing scheduling algorithms are particularly attractive because they can provide good response time without migration or predictions of the execution time of the parallel jobs. However, time-sharing has the drawback that communicating processes must be scheduled simultaneously to achieve good performance. This is a critical performance problem because the software communication overhead and the scheduling overhead to wake up a sleeping process dominate the communication time on most parallel machines.
Over the years, researchers have developed parallel scheduling algorithms that can be loosely organized into three main classes, according to the degree of coordination between processors: gang scheduling (GS), local scheduling (LS) and implicit or dynamic coscheduling (DCS).
At one end of the spectrum, GS coordinates the scheduling of communicating jobs by constructing a static global list of the order in which jobs should be scheduled. A simultaneous context switch is then required across all processors. Unfortunately, straightforward implementations of this approach are neither scalable nor reliable. Furthermore, GS requires that the schedule of communicating processes be precomputed, which complicates the coscheduling of client-server applications and requires pessimistic assumptions about which processes communicate with one another. Finally, explicit coscheduling of parallel jobs interacts poorly with interactive jobs and jobs performing data input and output (I/O).
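To make the idea of a static global schedule concrete, the following is a minimal sketch of a gang-scheduling table in which each row is a time slot and each column is a processor. It is an illustration only, not an implementation taken from any particular system: the function names `build_gang_schedule` and `process_for`, the first-fit packing rule, and the job widths are all assumptions introduced here.

```python
from typing import Dict, List, Optional

Schedule = List[List[Optional[str]]]


def build_gang_schedule(jobs: Dict[str, int], num_procs: int) -> Schedule:
    """Pack jobs into time slots using first-fit (an assumed policy).

    jobs maps a hypothetical job name to the number of processors it needs.
    Each returned row is one time slot; entry p names the job that
    processor p runs in that slot, or None if the processor is idle.
    """
    slots: Schedule = []
    free: List[int] = []  # unused processors remaining in each slot
    for name, width in jobs.items():
        if width > num_procs:
            raise ValueError(f"job {name} needs more than {num_procs} processors")
        for i, row in enumerate(slots):
            if free[i] >= width:
                # Rows are filled left to right, so the free part is a suffix.
                start = num_procs - free[i]
                for p in range(start, start + width):
                    row[p] = name
                free[i] -= width
                break
        else:
            # No existing slot has room: open a new time slot.
            row = [None] * num_procs
            for p in range(width):
                row[p] = name
            slots.append(row)
            free.append(num_procs - width)
    return slots


def process_for(slots: Schedule, tick: int, proc: int) -> Optional[str]:
    """Job that processor proc runs at global tick, cycling over the slots.

    Every processor indexes the same row at the same tick, which models
    the simultaneous context switch across all processors.
    """
    return slots[tick % len(slots)][proc]
```

With the assumed jobs A (4 processors), B (2), C (2) and D (3) on a 4-processor machine, this sketch produces three time slots: A alone, B and C side by side, and D with one processor left idle. The idle entry in the last slot hints at the fragmentation that a precomputed schedule can cause.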
At the other end of the spectrum is LS, where each processor independently schedules its processes. This is an attractive time-sharing option due to its ease of construction. However, the performance of fine-grained communication jobs can be orders of magnitude worse than with GS because the scheduling is not coordinated across processors.
An intermediate approach developed at UC Berkeley and MIT is DCS. With DCS, each local scheduler makes independent decisions that dynamically coordinate the scheduling actions of cooperating processes across processors. These actions are based on local events that occur naturally within communicating applications. For example, on message arrival, a processor speculatively assumes that the sender is active and will probably send more messages in the near future. The main drawbacks of dynamic coscheduling include the high overhead of generating interrupts upon message arrival and the limited view of system status, which is based on speculative information. Some aspects of these limitations are addressed with a technique called Periodic Boost. Rather than sending an interrupt for each incoming message, the kernel periodically examines the status of the network interface, thus reducing the overhead for communication-intensive workloads.
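The difference between boosting on every message arrival and boosting periodically can be sketched with a toy local scheduler. This is a simplified model under stated assumptions, not the actual DCS or Periodic Boost kernel code: the class `LocalScheduler`, its method names, and the fixed boost value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LocalScheduler:
    """Toy per-processor scheduler (all names here are illustrative)."""
    base_priority: int = 0
    boost: int = 10
    priority: Dict[str, int] = field(default_factory=dict)
    pending: Dict[str, int] = field(default_factory=dict)
    interrupts: int = 0  # counts interrupts taken, a proxy for overhead

    def add_process(self, name: str) -> None:
        self.priority[name] = self.base_priority
        self.pending[name] = 0

    def on_message_interrupt(self, dest: str) -> None:
        """Interrupt-driven variant: every arrival interrupts and boosts."""
        self.interrupts += 1
        self.priority[dest] = self.base_priority + self.boost

    def enqueue_message(self, dest: str) -> None:
        """Periodic variant: the network interface only buffers the message."""
        self.pending[dest] += 1

    def periodic_poll(self) -> None:
        """Periodic variant: one pass boosts every process with buffered messages."""
        for name, count in self.pending.items():
            if count > 0:
                self.priority[name] = self.base_priority + self.boost

    def pick_next(self) -> str:
        """Run the highest-priority process, then clear its boost and queue."""
        name = max(self.priority, key=self.priority.get)
        self.priority[name] = self.base_priority
        self.pending[name] = 0
        return name
```

In this model, three messages delivered through `on_message_interrupt` cost three interrupts, whereas three messages buffered with `enqueue_message` are handled by a single `periodic_poll` call. The trade-off is that the boost is delayed until the next poll.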
FIGS. 1A and 1B depict global processor and network utilization (i.e., the number of active processors and the fraction of active links) during the execution of a transpose FFT algorithm on a parallel machine with 256 processors. These processors are connected with an indirect interconnection network using state-of-the-art routers. These figures show an uneven and inefficient use of system resources. During the two computational phases of the transpose, the network is idle. Conversely, when the network is actively transmitting messages, the processors are not doing any useful work. Many SPMD programs, including Accelerated Strategic Computing Initiative (ASCI) application codes such as Sweep3D, share these characteristics. Hence, there is tremendous potential for increasing resource utilization in a parallel machine.
Another important characteristic shared by many scientific parallel programs is their access pattern to the network. The vast majority of scientific applications display bursty communication patterns, in which spikes of intense communication alternate with periods of inactivity.
FIGS. 2A–D depict network utilization while running four distinct scientific applications on a parallel machine with 256 processors. In all four cases, there are clear communication holes, i.e., periods of network inactivity. Therefore, there exists a significant amount of communication bandwidth that can be made available for other purposes.
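As an illustration of how communication holes might be located in a sampled utilization trace, the sketch below scans a list of per-interval utilization values and reports the runs that stay at or below a threshold. The function names, the threshold, the minimum run length, and the sample trace are assumptions chosen for the example; they are not measurements from the figures.

```python
from typing import List, Sequence, Tuple


def find_communication_holes(
    utilization: Sequence[float],
    threshold: float = 0.05,
    min_length: int = 1,
) -> List[Tuple[int, int]]:
    """Return half-open index ranges [start, end) where the network is idle.

    A sample counts as idle when its utilization is at or below threshold
    (an assumed cutoff). Runs shorter than min_length samples are ignored.
    """
    holes: List[Tuple[int, int]] = []
    start = None
    for i, u in enumerate(utilization):
        if u <= threshold:
            if start is None:
                start = i  # a new idle run begins here
        elif start is not None:
            if i - start >= min_length:
                holes.append((start, i))
            start = None
    # Close a run that extends to the end of the trace.
    if start is not None and len(utilization) - start >= min_length:
        holes.append((start, len(utilization)))
    return holes


def idle_fraction(utilization: Sequence[float], holes: List[Tuple[int, int]]) -> float:
    """Fraction of samples that fall inside the reported holes."""
    if not utilization:
        return 0.0
    return sum(end - start for start, end in holes) / len(utilization)
```

For a hypothetical trace `[0.0, 0.0, 0.8, 0.9, 0.0, 0.0, 0.0, 0.7, 0.0]` with `min_length=2`, the sketch reports the holes `(0, 2)` and `(4, 7)`, which together cover five of the nine samples. The final single idle sample is too short to count.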