Some algorithms have an inherently parallelizable structure because they operate on non-overlapping data. For example, matrix multiplication requires each row of one matrix to be multiplied by each column of a second matrix (the inner product). This matrix product can be computed by multiple threads concurrently in a number of different ways. For example, one thread could multiply row a1 by each column of the second matrix (b1, b2, and so on) while another n−1 threads handle the remaining rows. The outputs do not overlap because each row-column pair generates one distinct scalar in the product matrix.
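The row-wise partitioning described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `parallel_matmul` and the one-thread-per-row split are assumptions made for the example, and in CPython the global interpreter lock typically serializes pure-Python bytecode, so the sketch shows the disjoint partitioning rather than a speedup.

```python
import threading


def parallel_matmul(a, b):
    """Multiply a (n x k) by b (k x m), using one thread per row of a."""
    n, k, m = len(a), len(b), len(b[0])
    product = [[0] * m for _ in range(n)]

    def compute_row(i):
        # Thread i writes only product[i], so no two threads share an output cell.
        for j in range(m):
            product[i][j] = sum(a[i][p] * b[p][j] for p in range(k))

    threads = [threading.Thread(target=compute_row, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every row is complete before reading the product
    return product


result = parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Because each thread owns one output row, no locking is needed on the product matrix; the only synchronization is the final join.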
This notion gives rise to the idea of a primitive that contracts out disjoint work items to threads and knows when the threads have completed the associated work, and hence when it is safe to process the generated output. If the work has multiple steps, this generalizes to a primitive that handles multiple stages, where at each stage every thread waits for all other threads to complete before any thread moves to the next stage. Each stage may safely read the data from previous stages because every thread is guaranteed to have completed those stages before continuing. In the literature this construct is known as a barrier.
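A two-stage computation of this kind might look like the sketch below, which uses the `threading.Barrier` class from the Python standard library. The function name `run_stages` and the particular work done in each stage are hypothetical, chosen only to show stage 2 reading data that other threads wrote in stage 1.

```python
import threading


def run_stages(values):
    """Stage 1 squares each value; stage 2 adds the next thread's square."""
    n = len(values)
    squares = [0] * n
    sums = [0] * n
    barrier = threading.Barrier(n)  # releases only when all n threads arrive

    def worker(i):
        squares[i] = values[i] ** 2  # stage 1: write only slot i
        barrier.wait()               # no thread proceeds until all have written
        # Stage 2: safe to read a slot written by another thread in stage 1.
        sums[i] = squares[i] + squares[(i + 1) % n]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums


stage_output = run_stages([1, 2, 3, 4])
print(stage_output)
```

Without the `barrier.wait()` call, a fast thread could read a neighbouring slot before its owner had written it.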
One conventional algorithm for barrier synchronization is the sense-reversing barrier, which uses a thread-local Boolean flag to maintain the parity of the work stage and counters to track the threads joining the barrier. In earlier work, the sense-reversing barrier was enhanced to eliminate the thread-local storage and to provide a barrier that can spin and then block when threads arrive late. However, this produces a structure with a large footprint (e.g., in memory for both user mode and kernel mode). Moreover, memory contention issues have been addressed with constructs such as the combining tree barrier, but at the expense of additional storage.
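As a rough illustration of the conventional sense-reversing scheme, the sketch below keeps a per-thread sense flag in thread-local storage and a single shared counter. It is a simplified, spin-only version written for this example: the class name `SenseReversingBarrier` and the helper `demo` are hypothetical, a lock stands in for the atomic decrement a native implementation would use, and none of the later enhancements mentioned above are modeled.

```python
import threading
import time


class SenseReversingBarrier:
    def __init__(self, parties):
        self.parties = parties
        self.count = parties              # threads still expected in this stage
        self.sense = False                # shared flag, flipped once per stage
        self._lock = threading.Lock()     # stands in for an atomic decrement
        self._local = threading.local()   # holds each thread's private sense

    def wait(self):
        # Each thread reverses its private sense on every entry.
        my_sense = not getattr(self._local, "sense", False)
        self._local.sense = my_sense
        with self._lock:
            self.count -= 1
            last = self.count == 0
            if last:
                self.count = self.parties  # reset for the next stage first
                self.sense = my_sense      # then release the waiting threads
        if not last:
            while self.sense != my_sense:
                time.sleep(0)              # spin, yielding to other threads


def demo(parties=3, stages=4):
    """Return True if every thread saw all stage-s writes after the barrier."""
    barrier = SenseReversingBarrier(parties)
    done = [[False] * parties for _ in range(stages)]
    ok = [True] * parties

    def worker(i):
        for s in range(stages):
            done[s][i] = True
            barrier.wait()
            if not all(done[s]):
                ok[i] = False

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(ok)


all_stages_ordered = demo()
print(all_stages_ordered)
```

Reversing the sense each stage lets the same counter and flag be reused without a separate reset phase, since a thread that races ahead into the next stage waits for the opposite flag value.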
If a thread is put to sleep to wait for some event to occur, there is a fixed cost associated with this process: the cost of entering the operating system, going to sleep, and having the scheduler select another thread. This cost is the swap time. If the time the awaited work actually takes to complete is small relative to the swap time, the overhead of going to sleep and waking up again can be avoided by having the thread spin briefly rather than block.
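The trade-off between spinning and sleeping can be sketched with a wait that polls for a bounded time and only then blocks. The function name `spin_then_block` and the `spin_seconds` budget are assumptions made for this illustration; a real implementation would choose the budget relative to the measured swap time.

```python
import threading
import time


def spin_then_block(event, spin_seconds):
    """Return "spin" if the event fired while polling, otherwise "block"."""
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        if event.is_set():
            return "spin"  # cheap path: no sleep and wake-up transition
    event.wait()           # expensive path: sleep until the event is signaled
    return "block"


# The event is already signaled, so the short spin phase observes it.
ready = threading.Event()
ready.set()
fast_path = spin_then_block(ready, spin_seconds=1.0)

# With no spin budget the thread blocks until a timer signals the event.
late = threading.Event()
threading.Timer(0.05, late.set).start()
slow_path = spin_then_block(late, spin_seconds=0.0)

print(fast_path, slow_path)
```

The spin phase burns processor time, so it only pays off when the expected wait is shorter than the cost of a sleep and wake-up cycle.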