The present disclosure relates generally to a dedicated memory structure (that is, hardware device) holding data for detecting available worker thread(s) and informing available worker thread(s) of task(s) to execute.
When creating a parallel region in code, critical data is passed from a master thread to one or more worker threads. When such data is kept in memory, such as a main memory, this forces the master thread to write the data to the main memory. Then, the one or more worker threads read the data from the main memory. Given the fact that there are typically many items that must be written and read, these operations often result in very long latencies that are on the critical path (since the worker thread(s) must first be told of where to find their work prior to them starting to do the work).
In a system with data caches, this latency can be shortened somewhat if the data is cached. However, since the data is written by one thread (such as the master thread) and read by one or more worker threads, communication misses will typically occur when the master thread and the worker thread(s) do not share the same caches in the hierarchy. For example, if two threads do not share an L1 cache but share an L2 cache, then the communication cost will be represented by one or more L1 misses, depending on the size of the data structure being communicated. If among all worker threads, one or more do not share an L1 cache or L2 cache, then the critical path cost will be dominated by the thread having the longest latency, i.e. the one missing in the L3 in this example.
A further problem with caches is that the data being communicated in cache may not even initially reside in the cache. Since most modern caches read on cache line allocation, then the sequence will start by reading the entire cache line from main memory to the caches prior to issue of the first write into that data structure. Depending on the size of the cache line, this reading of the line may become a substantial cost, possibly even the dominant cost of the entire communication from the master thread to the worker thread.
A similar cost occurs with storing the bitvectors that represent which thread(s) are busy and which thread(s) are free to work in a given parallel region in code. These bits are read and written by a master thread attempting to acquire other thread(s) as worker thread(s). When these bitvectors are in main memory or caches, they result in long latencies when accessed.
One reason why it is important to reduce such overhead and critical path is that the efficiency of parallelism in general, and OpenMP (an industry wide standard to manage thread level parallelism) in particular, is critically dependent on Amdahl's law, e.g. on reducing the sequential part of a program so as to achieve the higher speedup expected on a platform with a large number of threads. Without reducing the sequential part (here, the time that it takes to invoke the parallel threads and have them start the useful work), there is less of an incentive to use parallelism, and thus less of an incentive to use machines such as Blue Gene/Q (BGQ) that rely on a large number of power efficient processors as building blocks for its supercomputing power.