1. Technical Field
The present invention relates generally to load synchronization and, in particular, to load synchronization with streaming thread cohorts.
2. Description of the Related Art
Consider multiple thread cohort members (i.e., threads in lockstep or Single Instruction Multiple Data (SIMD) slots) that work in lockstep fashion (e.g., SIMD fashion) on independent data (e.g., independent network streams).
Particularly if threads in lockstep or SIMD slots work on independent data and have separate control flow, the threads are confronted with the problem of more or less random stalls, because whenever one of the thread cohort members (i.e., threads or SIMD slots) faces a non-negligible load latency, all of its siblings are forced to wait until the necessary data is available. Good examples of such a problem are finite state machines or scanners which, based on the input, can go to states that introduce a variation or condition within the control flow of different thread cohort members. This means that one thread cohort member can outpace another one.
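The stall behavior described above can be illustrated with a minimal sketch. It is a simplified model under stated assumptions, not part of the prior-art systems themselves: the function name simulate_lockstep, the parameters block_size and load_latency, and the rule that loads issued in the same step overlap into a single stall window are all hypothetical choices made for illustration.

```python
def simulate_lockstep(advances, block_size, load_latency):
    """Count useful and stalled cycles for a lockstep thread cohort.

    advances[m][t] is 1 if cohort member m consumes one input unit at
    lockstep step t, and 0 if its control flow does not advance the input.
    """
    members = len(advances)
    steps = len(advances[0])
    remaining = [block_size] * members  # unread units left in each member's block
    useful = 0
    stalled = 0
    for t in range(steps):
        # Members that have exhausted their block must reload before this step.
        reloading = [m for m in range(members) if remaining[m] == 0]
        if reloading:
            # Assumption: loads issued in the same step overlap, so the
            # cohort pays the latency once per step, not once per member.
            stalled += load_latency
            for m in reloading:
                remaining[m] = block_size
        # The shared instruction then executes for every member in one cycle.
        useful += 1
        for m in range(members):
            remaining[m] -= advances[m][t]
    return useful, stalled


# Member 0 consumes input every step; member 1 only every other step,
# so member 0 outpaces member 1 and their reloads fall on different steps.
skewed = simulate_lockstep([[1] * 8, [1, 0] * 4], block_size=4, load_latency=10)

# Both members consume input at the same pace, so their reloads coincide.
in_step = simulate_lockstep([[1] * 8, [1] * 8], block_size=4, load_latency=10)
```

In this model both runs perform 8 useful cycles, but the skewed cohort stalls for 20 cycles while the in-step cohort stalls for 10, because in the skewed case each member's reload occurs at a different step and every reload holds up the whole cohort.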
FIG. 1 shows two Single Instruction Multiple Threads (SIMT) threads 101 and 102 without synchronization, in accordance with the prior art. The two threads load independent data from memory 110 into, e.g., a register at different points in time. While one thread 101 loads a new block 120, the thread's SIMT sibling 102 has to nullify the instruction and wait for the load to complete before thread 102 can process the data indicated by its current input pointer.
FIG. 2 shows four SIMT threads 201, 202, 203, and 204 without synchronization for a multi-buffer case, in accordance with the prior art. The multiple threads 201, 202, 203, and 204 load independent data from memory 210 into one of multiple buffers in, e.g., one or more of the registers at different points in time. While one thread 201 loads multiple new blocks (e.g., through direct-memory-access (DMA) or instructions) into its buffer, the thread's SIMT siblings 202, 203, and 204 have to nullify the instruction and wait.