Massively parallel processing (MPP) systems are computing systems composed of hundreds or thousands of processing elements (PEs). While the power of a multiple instruction-multiple data (MIMD) computer system lies in its ability to execute independent threads of code simultaneously, the inherently asynchronous states of the PEs (with respect to each other) make it difficult in such a system to enforce a deterministic order of events when necessary. Program sequences involving interaction between multiple PEs, such as coordinated communication, sequential access to shared resources, controlled transitions between parallel regions, etc., may require synchronization of the PEs in order to ensure proper execution.
An important synchronization capability in any programming model is the barrier. Barriers are points placed in the code beyond which no processor participating in a computation may proceed before all processors have arrived. Since processors wait at a barrier until alerted that all PEs have arrived, the latency of the barrier mechanism can be very important. The latency of a barrier mechanism is the propagation time between the arrival of the last processor at the barrier and the moment when all processors have been notified that the barrier has been satisfied. During this period of time, all PEs may be idle, waiting for the barrier to be satisfied. Hence a barrier is a serialization point in parallel code.
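The barrier semantics and the latency defined above can be illustrated with a minimal sketch. It assumes Python threads standing in for PEs and uses the standard library's threading.Barrier; the function name run_barrier_demo and the staggered sleep are hypothetical choices made only for illustration, not part of any system described here.

```python
import threading
import time


def run_barrier_demo(num_pes=4):
    """Run one barrier episode; return per-PE arrival and departure times."""
    barrier = threading.Barrier(num_pes)
    arrivals = [0.0] * num_pes
    departures = [0.0] * num_pes

    def pe(rank):
        # Assumption: stagger the PEs so they reach the barrier at different times.
        time.sleep(0.01 * rank)
        arrivals[rank] = time.perf_counter()
        barrier.wait()  # no PE proceeds until all num_pes have arrived
        departures[rank] = time.perf_counter()

    threads = [threading.Thread(target=pe, args=(r,)) for r in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrivals, departures


if __name__ == "__main__":
    arrivals, departures = run_barrier_demo()
    # Latency as defined in the text: last arrival to last notification.
    latency = max(departures) - max(arrivals)
    print(f"barrier latency: {latency * 1e6:.1f} microseconds")
```

In this sketch every departure timestamp is taken after the last arrival timestamp, which is the ordering guarantee a barrier provides. The measured latency is the interval during which PEs that arrived early remain idle.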
Barriers can be implemented entirely by software means, but software schemes are typically encumbered by long latencies and/or by limited parallelism, which restricts how many processors can arrive at the barrier simultaneously without artificial serialization. Because a barrier defines a serialization point in a program, it is important to keep the latency as low as possible.
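One common software scheme, shown here only as an illustrative sketch and not as the mechanism of any particular system, is a centralized counting barrier with sense reversal. The class name CentralBarrier and the helper run_phases are hypothetical. The sketch makes the serialization visible: every arriving PE must acquire the same lock to update the shared counter, so simultaneous arrivals are processed one at a time.

```python
import threading
import time


class CentralBarrier:
    """Software barrier built on one shared counter (illustrative sketch)."""

    def __init__(self, num_pes):
        self.num_pes = num_pes
        self.count = 0
        self.sense = False            # flips each time the barrier completes
        self.lock = threading.Lock()  # every arrival serializes on this lock

    def wait(self):
        with self.lock:
            my_sense = not self.sense  # value that signals "this episode done"
            self.count += 1
            if self.count == self.num_pes:
                self.count = 0         # reset for the next episode
                self.sense = my_sense  # release all waiting PEs
        while self.sense != my_sense:
            time.sleep(0)              # spin, yielding to other threads


def run_phases(num_pes=4, num_phases=3):
    """Each PE finishes a phase, waits at the barrier, then checks the count."""
    barrier = CentralBarrier(num_pes)
    done = [0] * num_phases  # number of PEs that finished each phase
    done_lock = threading.Lock()
    seen = []                # done[phase] as observed after each barrier

    def pe():
        for phase in range(num_phases):
            with done_lock:
                done[phase] += 1
            barrier.wait()
            with done_lock:
                seen.append(done[phase])

    threads = [threading.Thread(target=pe) for _ in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

After each barrier, every PE observes that all num_pes PEs completed the phase. The cost is that arrivals contend for a single lock and waiters spin on a shared flag, which is the kind of latency and serialization a software scheme tends to incur.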
Hardware support for barriers, while addressing the latency problems associated with barriers implemented by purely software means, can have other shortcomings that limit the utility of the mechanism in a production computing system. Production computing systems demand that the barrier resource (like all resources) be partitionable among multiple users while maintaining protective boundaries between users. In addition, the barrier resource must be rich enough to allow division between the operating system and the user executing within the same partition. Provision must also be made for fault tolerance to ensure the robust nature of the barrier mechanism.
Hardware mechanisms may also suffer from an inability to operate synchronously. This inability may require that a PE, upon discovering that a barrier has been satisfied (all PEs have arrived at that point in the program), wait until all PEs have discovered that the barrier has been reached before it may proceed through the next section of program code. The ability to operate synchronously enables the barrier mechanism to be immediately reusable without fear of race conditions.
Hardware mechanisms may also require that a PE explicitly test a barrier flag to discover when the barrier condition has been satisfied. This can prevent a PE from accomplishing other useful work while the barrier remains unsatisfied, or force the programmer to include periodic tests of the barrier in the program in order to accomplish other useful work while a barrier is pending. This can limit the usefulness of a barrier mechanism when used as a means of terminating speculative parallel work (e.g., a database search) when the work has been completed (e.g., the searched-for item has been found).
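The burden of explicit testing can be sketched as follows. This is a hypothetical illustration in which Python threads stand in for PEs and a threading.Event stands in for the barrier flag; the function name speculative_search and the parameter poll_every are assumptions introduced for the example. Each PE must interleave periodic tests of the flag with its search work in order to learn that another PE has already found the item.

```python
import threading


def speculative_search(data, target, num_pes=4, poll_every=64):
    """Search data in parallel; return (index or None, items examined per PE)."""
    found = threading.Event()  # stands in for the barrier flag
    result = {}
    examined = [0] * num_pes

    def pe(rank):
        # Each PE scans a strided slice of the data.
        for n, i in enumerate(range(rank, len(data), num_pes)):
            # Explicit periodic test that the programmer must insert.
            if n % poll_every == 0 and found.is_set():
                return  # another PE finished; abandon speculative work
            examined[rank] += 1
            if data[i] == target:
                result["index"] = i
                found.set()  # notify the other PEs
                return

    threads = [threading.Thread(target=pe, args=(r,)) for r in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get("index"), examined
```

The choice of poll_every shows the trade-off: testing frequently slows the search loop, while testing rarely lets PEs continue useless work long after the item has been found.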