The desire to increase the execution speed of computer instructions has led to the implementation of parallel processing systems. Parallel processing systems include multiple processing units and/or multiple cores on each processing unit. The processing units and cores can execute computer instructions simultaneously with one another. In addition, processes have been divided into multiple threads so that the threads can be executed simultaneously by separate processing units and/or cores.
Data parallelism refers to the performance of simultaneous operations (e.g., executing multiple threads simultaneously) across large sets of data (e.g., arrays, matrices, vectors, sets, or trees). Example data parallel operations include element-wise operations, prefix-sum operations, reduction operations, and permutation operations.
Some data parallel operations require that data be operated on in a specific order. For example, when a second operation uses the result of a first operation, the first operation must be completed before the second operation begins so that its result is available to the second operation. The ordering of data parallel operations has typically been handled using barriers. In such an arrangement, when an executing thread reaches a barrier instruction, the thread stops and waits until all other threads have also reached the barrier before proceeding with its execution. One disadvantage of barriers is that even if an executing thread does not require the results of all other threads to continue executing (i.e., it is not dependent on the other threads), it must still wait at the barrier until all other threads have arrived.
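The barrier arrangement described above can be sketched with Python's standard threading.Barrier. This is a minimal illustration, not a description of any particular system: the function name run_two_phase, the squaring step, and the neighbour-sum step are all hypothetical choices made for the example. Each worker performs a first operation, waits at the barrier, and then performs a second operation that reads one other worker's result.

```python
import threading


def run_two_phase(values):
    """Phase 1 squares each element; phase 2 adds each square to its right neighbour's square."""
    n = len(values)
    squares = [None] * n
    sums = [None] * n
    # All n workers must call wait() before any of them is released.
    barrier = threading.Barrier(n)

    def worker(i):
        # First operation: independent per-element work.
        squares[i] = values[i] * values[i]
        # Stop here until every worker has finished its first operation.
        barrier.wait()
        # Second operation: depends only on worker (i + 1) % n,
        # yet this worker had to wait for all n workers at the barrier.
        sums[i] = squares[i] + squares[(i + 1) % n]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums


result = run_two_phase([1, 2, 3, 4])
```

The sketch also shows the disadvantage noted above: worker i needs only the result of one neighbour, but the barrier holds it until the slowest of all the workers has arrived.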