In a parallel computing environment (for example, the IBM Parallel Environment (PE) running on the operating system sold under the trademark AIX® by IBM Corporation), “synchronizing collective operations” are operations in which a set of processes (typically every process) participates and no single process can continue until every process has participated. Illustrative examples of synchronizing collective operations from the MPI interface of the PE include MPI_Barrier, MPI_Allreduce, and MPI_Allgather. While required for a large class of parallel algorithms and therefore quite common, synchronizing collective operations can pose serious challenges to the performance and scalability of a parallel job/application. In particular, such synchronizing collectives are vulnerable to interference from random occurrences of routine system and/or daemon activity (e.g., timer decrement interrupt processing, daemons associated with file system activity, daemons associated with membership services, monitoring daemons, cron jobs, etc.), since a single laggard process will impede or block the progress of every other process. This can result in a cascading effect, causing serialization and degrading the performance and scalability of the parallel application.
The cascading effect is especially detrimental in the high performance computing (HPC) context of large-scale parallel environments, such as those of interest at national laboratories and supercomputing centers (hereinafter “HPC centers”), due to the large number of processors (CPUs) involved and the percentage of each CPU's time consumed by system and daemon activity. Experiments conducted by Applicants at the Lawrence Livermore National Laboratory (LLNL) have shown that typical operating system and daemon activity consumes about 0.2% to 1.1% of each CPU for large dedicated systems, such as for example the system sold under the trademark RS/6000® by IBM Corporation having 16 processors per node. As such, even minimal occurrences of random interfering system operations/activities/events can have a compounded effect when taken across all processors and detrimentally impact synchronizing collective operations of a parallel job/application, especially during synchronization or fine-grain parallelism. It is notable that these large-scale parallel environments typically perform a single parallel job consisting of thousands of cooperating processes occupying multiple machines/nodes dedicated to the parallel job (i.e. dedicated job co-scheduling). Since the machines are usually symmetric multiprocessing (SMP) nodes (i.e. each having two or more similar processors connected via a high-bandwidth link and managed by one operating system where each processor has equal access to I/O devices), a node is assigned as many processes as there are processors on the node, with each process acting as if it has exclusive use of its processor. In this environment, the fair share CPU scheduling and demand-based co-scheduling required for networks of workstations (NOWs) are not necessary or applicable. The typical time quanta involved in this “dedicated job co-scheduling” context are on the scale of operating system timer-decrement and/or communication interrupts.
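The compounding effect described above can be illustrated with a rough calculation. The following is a minimal sketch only, and it rests on an assumption that is not part of the measurements reported here: that system and daemon activity strikes each CPU independently and at random. Under that assumption, the chance that every CPU is free of interference at a given instant shrinks geometrically with the processor count. The function name prob_all_quiet is hypothetical.

```python
def prob_all_quiet(noise_fraction: float, n_cpus: int) -> float:
    """Probability that no CPU is busy with system/daemon activity at a
    random instant, ASSUMING independent noise on every CPU (an
    illustrative simplification, not a measured property)."""
    return (1.0 - noise_fraction) ** n_cpus


if __name__ == "__main__":
    # 0.2% and 1.1% are the per-CPU bounds quoted in the text;
    # 16 is one node, 1728 and 4096 are processor counts cited later.
    for frac in (0.002, 0.011):
        for n in (16, 1728, 4096):
            p = prob_all_quiet(frac, n)
            print(f"noise={frac:.1%}  cpus={n:5d}  all-quiet probability={p:.6f}")
```

Even at the low end of the measured range, the all-quiet probability under this simplified model falls to a few percent at 1728 processors, which is consistent with the observation that small per-CPU overheads become significant at scale.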
As illustrated in FIG. 1, the impact of random interfering events on synchronizing collectives is determined in large measure by the degree/extent of overlap of the random events (as well as the synchronizing collectives) between processors. In particular, FIG. 1 shows two separate runs, indicated at reference characters 10 and 11, of the same eight-way parallel application on two nodes: Node 1 indicated at reference character 12, and Node 2 indicated at reference character 13, having four processors each. In the first run 10, system activity indicated by the pattern 15 occurs at purely random times in each of the eight processors. Periods utilized by the parallel application are represented by pattern 14. As a result, operations that require every processor can make progress only when the pattern 14 is present across all eight processors. The pattern indicated by reference character 16 represents those overlapping periods in time when the application is running across all eight processors. In the second run 11, the same amount of system activity occurs (i.e. there is the same total amount of the pattern 15) as in the first run 10. In the second run 11, however, these periods of system activity 15 are largely overlapped between processors. In this manner, much more time is available for parallel application activities that require all processors, as shown by the longer spans of the pattern 16. For clusters comprised of SMP nodes, both inter-node and intra-node overlap is an issue, and it is desirable to ensure overlap between nodes as well as on-node. For example, while the second run 11 shows very good on-node overlap of operating system interference, there is little cross-node overlap of operating system interference.
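The comparison between the two runs of FIG. 1 can be mimicked with a small simulation. This is an illustrative sketch only: time is divided into discrete slots, each processor loses the same number of slots to system activity in both runs, and the helper names (all_running_fraction, random_run, aligned_run) are hypothetical. The quantity computed corresponds to the share of time covered by pattern 16, i.e. the slots in which the application runs on every processor.

```python
import random


def all_running_fraction(timelines):
    """Fraction of time slots in which EVERY processor runs the application.
    timelines[p][s] is True when processor p runs the application in slot s
    and False when it is occupied by system activity."""
    n_slots = len(timelines[0])
    usable = sum(all(t[s] for t in timelines) for s in range(n_slots))
    return usable / n_slots


def random_run(n_cpus, n_slots, n_noise, rng):
    """First-run style: each CPU loses n_noise slots at independent random times."""
    run = []
    for _ in range(n_cpus):
        noisy = set(rng.sample(range(n_slots), n_noise))
        run.append([s not in noisy for s in range(n_slots)])
    return run


def aligned_run(n_cpus, n_slots, n_noise, rng):
    """Fully overlapped case: every CPU loses the SAME n_noise slots."""
    noisy = set(rng.sample(range(n_slots), n_noise))
    return [[s not in noisy for s in range(n_slots)] for _ in range(n_cpus)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("random :", all_running_fraction(random_run(8, 1000, 50, rng)))
    print("aligned:", all_running_fraction(aligned_run(8, 1000, 50, rng)))
```

Both runs spend the same total time on system activity; only the alignment differs. The aligned case models complete overlap both on-node and across nodes, which is stronger than the second run of FIG. 1, where overlap is good on-node but poor across nodes.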
Parallel applications are most susceptible to operating system interference during synchronization or fine-grain parallel operations such as ring communication patterns, barriers, reductions, etc. For example, FIG. 2 shows a bulk-synchronous SPMD model of a parallel application, with each cycle containing one or more such fine-grain operations. Each process of a parallel job executes on a separate processor and alternates between computation 17 and communication 19 phases. The importance of these collective synchronizing operations depends on the duration of the computation and communication periods. Barrier or reduction phases are indicated by pattern 18, and waiting periods are indicated by pattern 20. Typical cycles can last from a few milliseconds to several seconds.
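The waiting periods in such a bulk-synchronous cycle can be sketched as follows. This is a simplified model under stated assumptions: every process has the same nominal computation time, interference simply adds a per-process delay, and the barrier itself costs nothing. The function names cycle_time and wait_times are hypothetical.

```python
def cycle_time(compute_ms, delays_ms):
    """A cycle ends only when the slowest process reaches the barrier."""
    return max(compute_ms + d for d in delays_ms)


def wait_times(compute_ms, delays_ms):
    """Time each process spends idle at the barrier waiting for the laggard."""
    end = cycle_time(compute_ms, delays_ms)
    return [end - (compute_ms + d) for d in delays_ms]


if __name__ == "__main__":
    # Four processes, 10 ms of computation each; one is delayed by 2.5 ms.
    delays = [0.0, 0.0, 2.5, 0.0]
    print("cycle length (ms):", cycle_time(10.0, delays))
    print("per-process wait (ms):", wait_times(10.0, delays))
```

In this model a single delayed process lengthens the cycle for all processes, and the shorter the computation phase, the larger the relative cost of the same delay.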
The ability of a large processor count cluster to perform parallel applications with synchronizing collectives will therefore depend heavily upon the degree of interference introduced by the operating system. Taking for example MPI_Allreduce (hereinafter “Allreduce”) from the MPI interface for the AIX® system, experimental measurements taken from jobs run on the “ASCI White” and “ASCI Q” systems at LLNL and Los Alamos National Laboratory (LANL) indicate that Allreduce consumes more than 50% of total time at 1728 processors. A second study conducted by different researchers on ASCI Q measured Allreduce to consume around 50% of total time at 1728 processors, and over 70% of total time at 4096 processors. Interference with these operations would therefore have a significant impact on the overall application. Moreover, the performance of Allreduce also illustrates the poor scaling of synchronizing collective operations due to interfering operations, as discussed in the “Performance Results” section of the Detailed Description of experiments conducted by the Applicants. Developers and users of parallel applications have learned to deal with poor Allreduce performance by leaving one CPU idle on a multi-CPU (MP) node. This approach leaves a reserve CPU for processing daemons which would otherwise interfere with fine-grain activities. However, the approach is undesirable since it enforces a ceiling on machine efficiency. In addition, the approach does not handle the occasional event of two concurrent interfering daemons. It also artificially limits the maximum scalability of the machine, as one CPU is forfeited for every node on the machine.
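The efficiency ceiling imposed by the reserve-CPU approach follows from simple arithmetic, sketched below. The function name efficiency_ceiling is hypothetical, and the calculation assumes that the reserved CPU contributes nothing to the parallel job.

```python
def efficiency_ceiling(cpus_per_node: int, reserved: int = 1) -> float:
    """Upper bound on the fraction of a node's CPUs available to the
    parallel job when `reserved` CPUs are left idle for daemons."""
    if not 0 <= reserved <= cpus_per_node:
        raise ValueError("reserved must be between 0 and cpus_per_node")
    return (cpus_per_node - reserved) / cpus_per_node


if __name__ == "__main__":
    # 16 processors per node is the configuration mentioned earlier in the text.
    for n in (2, 4, 8, 16):
        print(f"{n:2d} CPUs/node -> at most {efficiency_ceiling(n):.2%} usable")
```

For a node with 16 processors, the ceiling is 15/16 of the machine, and the loss is proportionally larger on nodes with fewer processors.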
It is notable that problematic interference such as timer decrement interrupt processing and daemon activity is inherent in UNIX® derivatives and is not specific to AIX®, and that it results in large variability and reduced synchronous collective performance in large Unix-based systems. This is because operating systems based on UNIX® and its variants (including AIX® and Linux®) were originally developed without consideration of the types of issues arising in parallel applications spanning multiple computers and operating system instances, instead treating such applications as thousands of independent processes.
For example, while the AIX® operating system is able to run work simultaneously on multiple processors, it is not designed to start work simultaneously on multiple processors. There is no issue when processors are idle: if two threads are readied almost simultaneously, two idle processors will begin running them essentially immediately. AIX® handles the busy processor case differently. When work is made ready in the face of busy processors, it must wait for the processor to which it is queued. Should another processor become idle, it may beneficially “steal” the thread, but this is atypical when running large parallel applications. If the newly ready thread has a better execution priority than the currently running thread on its assigned processor, the newly ready thread pre-empts the running thread. If the processor involved is the one on which the readying operation occurred, the pre-emption can be immediate. If not, the other, busy, processor must notice that a pre-emption has been requested. This happens whenever its running thread (1) enables for interrupts in the kernel, as during a system call; (2) takes an interrupt, as when an I/O completes or a timer goes off; or (3) blocks, as when waiting for I/O completion such as for a page fault. The problem is that this can represent a significant delay, which can be up to 10 msec until the next routinely scheduled timer interrupt gives the busy processor's kernel the opportunity to notice and accomplish the pre-emption. The AIX® kernel is known to already contain a capability called the “real time scheduling” option, which solves a part of this problem. When this option is invoked, the processor causing a pre-emption will force a hardware interrupt to be generated for the processor on which the pre-emption should occur. While this is not immediate, the pre-emption can typically be accomplished in tenths of a millisecond, as opposed to several milliseconds without this option.
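The pre-emption cases described above can be summarized in a rough latency model. This is an illustrative sketch and not the AIX® kernel's logic: the function name, its parameters, and the 0.2 msec figure standing in for “tenths of a millisecond” are all assumptions, and the model ignores the possibility that the running thread makes a system call or blocks before the next timer interrupt.

```python
def preemption_delay_ms(same_cpu: bool,
                        ms_until_next_tick: float,
                        realtime_option: bool,
                        forced_interrupt_ms: float = 0.2) -> float:
    """Approximate delay before a better-priority thread displaces the
    thread running on its assigned processor.

    same_cpu            -- the readying operation occurred on that processor
    ms_until_next_tick  -- time until the next routine timer interrupt
                           (up to about 10 msec per the text)
    realtime_option     -- whether a forced hardware interrupt is generated
    forced_interrupt_ms -- HYPOTHETICAL latency of the forced interrupt
    """
    if same_cpu:
        return 0.0  # pre-emption can be immediate
    if realtime_option:
        # The forced interrupt usually arrives before the next timer tick.
        return min(forced_interrupt_ms, ms_until_next_tick)
    return ms_until_next_tick  # wait for the busy processor to notice


if __name__ == "__main__":
    print(preemption_delay_ms(True, 7.0, False))   # same processor
    print(preemption_delay_ms(False, 7.0, False))  # remote, no option
    print(preemption_delay_ms(False, 7.0, True))   # remote, option enabled
```

The model captures only the ordering of the three cases: immediate on the same processor, bounded by the timer interval on a remote busy processor, and much shorter when a hardware interrupt is forced.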
The existing “real time scheduling” option, however, only forces an interrupt when a better priority thread becomes runnable. It does not force an interrupt for a “reverse pre-emption,” which occurs when the priority of a running thread is lowered below that of a runnable, waiting thread. Additionally, the real time scheduling option forces an interrupt to only one processor at a time. Once such an interrupt is “in flight,” it does not generate further interrupts if the processor involved would be eligible to run the thread on whose behalf the previous interrupt had been generated.
There is therefore a need to improve the scalability of large processor count parallel applications by improving kernel scheduling, and in particular by providing collaborative dedicated job co-scheduling of the processes both within a node and across nodes using scheduling policies that include a global perspective of the application's process working set. Since collective operations such as “barriers” and “reductions” are known to be extremely sensitive to even usually harmless events such as context switches among members of a working set, such co-scheduling techniques would greatly diminish the impact on fine grain synchronization even when the interference present in full-featured operating systems, such as daemons and interrupts, cannot be removed. Such fine grain synchronizing activities can proceed without having to experience the overhead of making scheduling requests, and thereby mitigate the effects of system software interference without the drawbacks of underutilized MP nodes.