In single instruction, multiple data (SIMD) processing, it is common to execute an instruction as multiple logical threads within a single execution unit (EU) thread. Multiple EU threads may be organized in a thread group. Some developers refer to an EU thread as a warp. SIMD operations are often used in graphics processing, e.g., in shaders, and in other applications.
A graphics processing unit (GPU) may consist of a number of execution units (EUs), each of which is a processing core. In general, every EU runs some number of EU threads such that these EU threads appear to be running in parallel. In fact, only a limited number of EU threads are running on an EU at any given time, and sometimes just one. An EU thread running on an EU may be pre-empted by another EU thread for the sake of efficiency. For example, an EU thread may access memory and wait for the operation to complete. In such a case, another EU thread assigned to this EU may be taken from the queue and continue its execution. Because of this, the GPU may be utilized more efficiently than a system in which each EU thread runs sequentially from beginning to end on a given EU. EU threads may be given access to an EU in a round-robin mode, for example.
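For illustration only, the switching behavior described above may be sketched as a small simulation. The sketch assumes a simplified model that is not taken from any particular hardware: each EU thread is a Python generator, the operations `load` and `store` are assumed to stall on memory, and a stalled EU thread is placed at the back of a round-robin queue. The names `MEMORY_OPS`, `eu_thread`, and `simulate` are hypothetical.

```python
from collections import deque

# Assumption: these operations stall the issuing EU thread on memory.
MEMORY_OPS = {"load", "store"}


def eu_thread(name, program, trace):
    """Hypothetical EU thread: runs until it issues a memory operation."""
    for op in program:
        trace.append((name, op))
        if op in MEMORY_OPS:
            # Give up the EU while the memory operation is outstanding.
            yield


def simulate(programs):
    """Run the EU threads on one EU, switching in round-robin order."""
    trace = []
    queue = deque(eu_thread(name, ops, trace) for name, ops in programs.items())
    while queue:
        thread = queue.popleft()
        try:
            next(thread)          # run until the next memory stall
        except StopIteration:
            continue              # this EU thread has finished
        queue.append(thread)      # stalled: wait at the back of the queue
    return trace


if __name__ == "__main__":
    program = ["load", "mul", "store"]
    for name, op in simulate({"A": program, "B": program}):
        print(name, op)
```

In this model, EU thread B issues its load while A is still waiting on memory, so the EU is not left idle during A's stall.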
An EU thread represents single instruction, multiple data (SIMD) execution. A single EU thread may run a number of logical threads (e.g., 16, 32, or 64, depending on the processor or system vendor). For example, consider a program that performs the multiplication x*x. The program has three instructions:
    Load x
    Multiply y=x*x
    Store y
The first logical thread would consist of these three instructions, where x=1. Likewise, the second logical thread has x=2, the third has x=3, and so on. Assume that the EU thread works in “SIMD 8” mode (executing eight logical threads in parallel). Then the EU thread performs only these three instructions to calculate multiplications for eight numbers. In operation, the following actions take place:
    Load 1, 2, 3, 4, 5, 6, 7, 8 (in the eight respective logical threads, in parallel)
    Multiply 1*1, 2*2, 3*3, 4*4, 5*5, 6*6, 7*7, 8*8 (respectively, in parallel)
    Store 1, 4, 9, 16, 25, 36, 49, 64 (respectively, in parallel)
The multiplication is performed on all eight input numbers at once (e.g., in one clock cycle). The hardware performs the same arithmetic operation on multiple data elements in parallel.
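The SIMD 8 example may be sketched in software as follows. This is a minimal model, not a description of actual hardware: each list element stands for one logical thread (lane), the lanes are processed in an ordinary loop rather than truly in parallel, and the names `SIMD_WIDTH`, `simd_load`, `simd_multiply`, `simd_store`, and `run_eu_thread` are hypothetical.

```python
SIMD_WIDTH = 8  # "SIMD 8" mode: eight logical threads per EU thread


def simd_load(memory, base):
    """Load: each lane reads its own element starting at address base."""
    return [memory[base + lane] for lane in range(SIMD_WIDTH)]


def simd_multiply(a, b):
    """Multiply: one instruction applied to every lane."""
    return [x * y for x, y in zip(a, b)]


def simd_store(memory, base, values):
    """Store: each lane writes its own result starting at address base."""
    for lane, value in enumerate(values):
        memory[base + lane] = value


def run_eu_thread(inputs):
    """Execute the three-instruction program once for all eight lanes."""
    if len(inputs) != SIMD_WIDTH:
        raise ValueError("expected one input per logical thread")
    # Inputs occupy the first eight cells; results go in the next eight.
    memory = list(inputs) + [0] * SIMD_WIDTH
    x = simd_load(memory, 0)           # Load x
    y = simd_multiply(x, x)            # Multiply y=x*x
    simd_store(memory, SIMD_WIDTH, y)  # Store y
    return memory[SIMD_WIDTH:]


if __name__ == "__main__":
    print(run_eu_thread([1, 2, 3, 4, 5, 6, 7, 8]))
    # prints [1, 4, 9, 16, 25, 36, 49, 64]
```

Only three instruction-level calls are made regardless of the lane count, which mirrors the point that the EU thread issues three instructions for eight numbers.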
Synchronization may be a problem in some situations, however, given that different EU threads may take different amounts of time to complete more complex operations. This issue is often addressed through the use of a synchronization barrier. Here, each EU thread in a thread group halts when it reaches a certain point. Once all EU threads have reached this point, the thread group may continue. This prevents error conditions that would occur in situations where, for example, one or more EU threads require the results of another EU thread.
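The barrier behavior may be illustrated with ordinary operating system threads standing in for EU threads. The sketch below uses `threading.Barrier` from the Python standard library; the function `run_thread_group` and the two-phase workload are assumptions chosen for illustration. In the first phase each thread produces a value, and in the second phase each thread reads the values produced by all the others, which is safe only because of the barrier between the phases.

```python
import threading


def run_thread_group(num_threads):
    """Hypothetical thread group with one synchronization barrier."""
    barrier = threading.Barrier(num_threads)
    partial = [0] * num_threads  # phase 1 results, one slot per thread
    totals = [0] * num_threads   # phase 2 results, one slot per thread

    def eu_thread(index):
        # Phase 1: produce a result that other threads will need.
        partial[index] = (index + 1) ** 2
        # Halt here until every thread in the group has arrived.
        barrier.wait()
        # Phase 2: all phase 1 results are now guaranteed to be written.
        totals[index] = sum(partial)

    workers = [
        threading.Thread(target=eu_thread, args=(i,))
        for i in range(num_threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return totals


if __name__ == "__main__":
    print(run_thread_group(4))  # prints [30, 30, 30, 30]
```

Without the `barrier.wait()` call, a thread could compute its total before another thread had written its phase 1 result, which is the kind of error condition the barrier is meant to prevent.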
The use of synchronization barriers may, however, cause efficiency problems. In some situations, several EU threads may have to wait at a synchronization barrier because they have reached the predefined stopping point while another EU thread continues to execute. The still-executing EU thread may be processing a loop, for example, that requires an extended time to complete. The waiting EU threads have to remain idle, still occupying their respective execution units, and cannot proceed until the executing EU thread reaches the barrier. This may represent inefficient use of computing resources and may slow down an application.
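The cost of waiting at a barrier can be estimated with simple arithmetic. The sketch below assumes that each EU thread's arrival time at the barrier is known in some arbitrary unit (for example, cycles) and that every thread occupies its execution resources until the last one arrives. The functions `barrier_idle_time` and `utilization`, and the example figures, are hypothetical.

```python
def barrier_idle_time(arrival_times):
    """Idle time of each EU thread while it waits at the barrier."""
    release = max(arrival_times)  # the barrier opens when the last one arrives
    return [release - t for t in arrival_times]


def utilization(arrival_times):
    """Fraction of occupied time spent doing useful work before the barrier."""
    release = max(arrival_times)
    if release == 0:
        return 1.0  # nothing to wait for
    busy = sum(arrival_times)
    return busy / (release * len(arrival_times))


if __name__ == "__main__":
    # Three EU threads arrive after 10 units; one runs a long loop (100 units).
    times = [10, 10, 10, 100]
    print(barrier_idle_time(times))  # prints [90, 90, 90, 0]
    print(utilization(times))        # prints 0.325
```

In this example, three of the four EU threads are idle for 90 percent of the interval, so roughly two thirds of the occupied time is wasted.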
In the drawings, the leftmost digit(s) of a reference number identifies the drawing in which the reference number first appears.