1. Field of the Invention
Embodiments of the invention relate generally to parallel thread program execution, and more specifically to coalescing memory barrier operations across multiple parallel threads.
2. Description of the Related Art
Conventional parallel processing architectures support execution of multiple threads. A memory transaction is considered “performed” when it has been committed to memory order and is visible to any thread, processing unit, or device that may access the memory, e.g., a store or write operation has been “committed” to memory and subsequent load or read operations will see the stored data. Memory barrier instructions (or fence instructions) are used to order the performance of memory transactions. From the standpoint of one thread, processing unit, or device, when it executes a memory barrier instruction, it waits until all its prior memory transactions have committed to memory before executing any subsequent memory transactions. Within that thread, memory transactions that occur after the memory barrier instruction in program order are delayed until all of the thread's memory transactions that occur prior to the memory barrier instruction in program order are committed to memory. The results of committed memory transactions may be visible to other threads, and the memory barrier instruction delays the requesting thread until all its prior memory transactions are visible to other threads. After waiting for a memory barrier, the requesting thread may then synchronize or communicate with other threads knowing that they can access the results of its prior memory transactions. Parallel processors that support large numbers of parallel threads that cooperate or communicate, such as multi-threaded processors that execute thousands of parallel threads, frequently need to execute memory barrier instructions to ensure proper ordering and visibility of memory transactions. A conventional memory barrier instruction waits until the request travels to the system memory commit point where results are visible to all threads, processing units, and devices in the system, and then waits until an acknowledgement returns to the requesting thread.
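As an illustration only, the ordering role of a memory barrier described above can be sketched on a general-purpose CPU using standard C++ fences. This is a minimal sketch under stated assumptions, not the mechanism of any particular parallel processor: the function name publish_and_read and the variables payload, ready, and observed are hypothetical, and std::atomic_thread_fence stands in for a hardware memory barrier instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from the source text): one thread writes data,
// executes a fence, and then signals; a second thread waits for the signal,
// executes a fence, and then reads the data.
int publish_and_read(int value) {
    int payload = 0;                 // ordinary (non-atomic) data
    std::atomic<bool> ready{false};  // flag used to signal the consumer
    int observed = -1;

    std::thread producer([&] {
        payload = value;                                      // prior memory transaction
        std::atomic_thread_fence(std::memory_order_seq_cst);  // memory barrier (fence)
        ready.store(true, std::memory_order_relaxed);         // subsequent memory transaction
    });

    std::thread consumer([&] {
        // Spin until the producer's signal becomes visible.
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Fence on the reading side pairs with the producer's fence, so the
        // earlier write to payload is guaranteed to be visible here.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        observed = payload;
    });

    producer.join();
    consumer.join();
    assert(observed == value);  // sanity check on the hand-off
    return observed;
}
```

Calling publish_and_read(42) returns the value written by the producer. Without the two fences, the relaxed flag operations alone would not order the write to payload before the read, and the unsynchronized access to payload would be a data race.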
Round-trip latency to the system memory commit point can be very long, e.g., hundreds of cycles. Therefore, execution of memory barrier instructions can reduce the instruction processing throughput of a conventional parallel processing architecture, because the requesting threads are idle during execution of a long-latency memory barrier (waiting for memory transactions to be committed to memory and for results to become visible to all other threads).
More recent parallel processing architectures allow sets of parallel threads to execute cooperatively at different thread grouping levels. For example, a set of parallel threads comprising a cooperative thread array (CTA) can execute together within a multi-threaded processor. Multiple CTAs can execute concurrently and cooperate within a processor or among several processors, and also cooperate with other threads, processors, and devices in large systems. A CTA program may need to order memory transactions among the set of threads comprising the CTA, or among the threads executing in the same processor, or among different CTAs in different processors, or among the threads, processors, and devices of the whole system. Therefore, execution of memory barrier instructions can further reduce the instruction processing throughput of a parallel processing architecture when threads are cooperating and interacting at multiple levels of cooperation across a parallel system having many parallel threads and processors.
Accordingly, what is needed in the art is an improved technique for performing a memory barrier operation across multiple parallel threads that are cooperating at multiple levels in a parallel system.