In large-scale parallel machines, a common operational requirement is barrier synchronization: stopping all cores participating in the barrier until every participating core has reached it, and then releasing the barrier so that all cores may proceed. Other collective network operations of this type include multicast/broadcast support and reduction operations, in situ or otherwise.
Fundamentally, one approach to the barrier/multicast/reduction (BMR) problem is to create BMR libraries from software constructs (trees, global variables, etc.) without dedicated hardware support. This approach typically carries significant overhead, both in design complexity and in the execution sequences needed for atomic support, cache bounces, etc. Alternatively, some machines provide dedicated hardware inside existing on-die and off-die interconnects, cores, or both, typically represented as a special type of protocol inside existing networks.
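By way of illustration only, the software approach can be sketched as a centralized, sense-reversing barrier built on a single shared counter. The names `CentralBarrier` and `run_demo` are hypothetical and do not come from any particular library; the sketch assumes Python threads standing in for cores, with a condition variable standing in for the atomic updates that would contend for one shared location on a real machine.

```python
import threading


class CentralBarrier:
    """Sense-reversing barrier on one shared counter (illustrative name)."""

    def __init__(self, n_participants):
        self.n = n_participants
        self.count = 0          # arrivals in the current phase
        self.sense = False      # flips each time the barrier releases
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense   # value that signals "released"
            self.count += 1
            if self.count == self.n:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.sense = my_sense
                self.cond.notify_all()
            else:
                # Earlier arrivals block until the sense flips.
                while self.sense != my_sense:
                    self.cond.wait()


def run_demo(n_workers, n_phases):
    """Each worker logs (phase, worker_id), then waits at the barrier."""
    barrier = CentralBarrier(n_workers)
    log = []

    def worker(wid):
        for phase in range(n_phases):
            log.append((phase, wid))
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no worker leaves `wait()` until all have arrived, every log entry for one phase precedes every entry for the next. Every arrival updates the same shared state, which is where the atomic-operation and cache-traffic overhead of a centralized software barrier would be expected to concentrate.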
The advantage of the hardware solution to BMR sequences is greatly reduced memory traffic and lower latency, both of which can lead to significant energy savings in large-scale machines. The advantage of the software solution is that it introduces no additional legacy to the machines and continues to work on machines that do not provide some or all hardware BMR features.
Either approach involves substantial complexity in supporting flexible BMR systems, whether through software implementation and validation or through reconfigurable protocol designs inside existing physical networks. The need for configurability stems from the division of work across large machines, such that only a fraction of the distributed agents is likely to participate in any given barrier synchronization.
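The configurability requirement can be illustrated with a minimal software sketch of a tree reduction over a chosen subset of agents. The function name `tree_reduce` and its parameters are hypothetical, and the pairwise combining order is only one of many possible tree shapes; the sketch assumes each agent holds one value, indexed by its agent number.

```python
from operator import add


def tree_reduce(values, participants, op=add):
    """Combine values[i] for i in participants pairwise, level by level.

    Returns (result, rounds), where rounds is the number of combining
    levels the tree needed for this particular participant set.
    """
    level = [values[i] for i in sorted(participants)]
    if not level:
        raise ValueError("at least one participant is required")
    rounds = 0
    while len(level) > 1:
        # Combine neighbours two at a time; an odd element is carried up.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds
```

For example, with `values = list(range(16))`, calling `tree_reduce(values, {1, 4, 6, 9, 15})` returns `(35, 3)`, while reducing over all sixteen agents returns `(120, 4)`. The tree shape and depth change with the participant set, which is the kind of per-operation reconfiguration that either a software library or a hardware protocol has to accommodate.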
Thus, there is a continuing need for a new scheme for implementing barrier, multicast, and reduction operations.