Massively parallel processing (MPP) systems are computing systems comprised of hundreds or thousands of processing elements (PEs). While the power of a multiple-instruction multiple-data (MIMD) MPP computer system lies in its ability to execute independent threads of code simultaneously, the inherently asynchronous states of the PEs (with respect to each other) makes it difficult in such a system to enforce a deterministic order of events when necessary. Program sequences involving interaction between multiple PEs such as coordinated communication, sequential access to shared resources, controlled transitions between parallel regions, etc., may require synchronization of the PEs in order to assure proper execution.
Very fast synchronization is required for high-performance highly parallel processing with asynchronous processors (i.e., processors which execute programs independently of each other). This synchronization involves communication among the computational entities which work together on an overall application program. An important synchronization capability in any programming model is the "barrier". Barriers are points placed in the code beyond which no processor participating in a particular computation is to be allowed to proceed until all processors have arrived at their respective barrier points. Since PEs may wait at a barrier until alerted that all PEs have arrived at the barrier, it is very important to minimize the latency of the barrier mechanism. The latency of a barrier mechanism is defined as the period of time between when the last processor arrives at a barrier, and when all processors have been notified that the requirements of the barrier have been satisfied (the term for meeting these requirements is also called "satisfying the barrier"). During this period of time, all PEs may be idle waiting for the barrier to be satisfied. Hence barriers can be used as a serialization point in a parallel code (a point from which all PEs can proceed ahead, having waited for other PEs to be synchronized). Because a barrier defines a serialization point in a program, it is important to keep the barrier latency as low as possible, in order that the work to be done after the barrier can be started as soon as possible.
Another important synchronization capability in any programming model is the "eureka". Eureka are points placed in the code which notify all other processors of a particular group of processors participating in a computation that one processor has reached that eureka point in its computation. Typically, the eureka point is a notification that the processor has reached a solution to the computation (e.g., found the item being searched for by the group). Since processors will typically abandon the work they are doing upon being notified that another PE has arrived at its eureka point, it is also very important to minimize the latency of the eureka mechanism. The latency of a eureka mechanism is defined as the period of time between when the first processor arrives at a eureka, and when all processors have been notified that the eureka has been satisfied. During this period of time, all other PEs may be doing fruitless computation even though the eureka has been satisfied. As soon as the other PEs have been notified that the eureka has been satisfied, they can proceed to the next stage of their computation. Hence eurekas can also be used as a serialization point in a parallel code. Because a eureka defines a serialization point in a program, it is important to keep the eureka latency as low as possible.
Barriers and eurekas can be implemented entirely by software means, but software schemes are typically encumbered by long latencies and/or limited parallelism restricting how many processors can arrive at the barrier simultaneously without artificial serialization (e.g., atomic test-and-set or read-modify-write type operations which impose serialization by software convention or style).
Hardware support for barriers and eurekas, while reducing the latency problems associated with barriers and eurekas implemented by purely software means, can have other shortcomings that limit the utility of the mechanism in a production computing system. (Production computing systems, which support many users and often have less controlled environments than do more theoretical test-bed computing systems, demand that the barrier resource and eureka resource--like all resources--be partitionable among multiple users while maintaining protective boundaries between users.) In addition, the barrier resource and eureka resource must have functions which are rich enough to allow isolation between the operating system and the user executing within the same partition. Provision must also be made for fault tolerance to insure the robust nature of the barrier and eureka mechanism.
Hardware mechanisms may also suffer from an inability to operate synchronously. This inability may require that a PE, upon discovering that a barrier has been satisfied (all PEs have arrived at that barrier point in the program) or that a eureka has been satisfied (one PE has arrived at that eureka point in the program), wait until all PEs have discovered that the barrier or eureka has been reached before it may proceed through the next section of program code. The ability to operate synchronously enables the barrier mechanism and eureka mechanism to be immediately reusable without fear of race conditions (e.g., where, due to timing variances, some PEs cannot determine whether the barrier is being "armed" (initialized), has been armed, or has been satisfied).
Hardware mechanisms may also require that a PE explicitly test a barrier flag or eureka flag to discover when the respective condition has been satisfied. This can prevent a PE from accomplishing other useful work while the barrier remains unsatisfied, or force the programmer to include periodic tests of the barrier and eureka into the program in order to accomplish other useful work while a barrier or eureka is pending. This can limit the usefulness of a eureka mechanism when used as a means of terminating speculative parallel work (e.g., a database search) when the work has been completed (e.g. the searched-for item has been found).
Hardware mechanisms may require that the overall barrier tree be hardwired in a rigid configuration. While sub-portions of the barrier tree may be used for separate barrier networks, there is often little flexibility to configure various barrier trees.
It is often desirable to allocate some, but not all, of the processors in a parallel processing network to some particular application (software) program, and to use the other remaining processors for some other program. This allocation can change over time. It is thus desirable to have a subset of all processors in a parallel-processing network participating in a barrier and/or eureka synchronization, and to be able to dynamically change the configuration over time.