Field of the Invention
The present invention generally relates to computer architectures and, more specifically, to batched replays of divergent operations.
Description of the Related Art
A common practice in parallel processing systems is to design a processor that executes some number of threads simultaneously. Each thread may execute in a separate execution pipeline within the processor. When such threads all need to execute the same instruction sequence (typically with different data for each thread), there are tangible benefits to sharing the control structures of the threads. For example, only one instruction has to be fetched, and all threads then execute that same instruction. This type of operation may be found on single instruction multi-thread (SIMT) processors and single instruction multi-data (SIMD) processors.
As execution proceeds in parallel, different threads may access a common resource, such as a shared memory, in a manner that may cause the threads to encounter a resource conflict. For example, the threads may execute a memory access operation, such as a load instruction, where the set of memory locations spans two or more cache lines. Such a load instruction may be called a “divergent” operation because the memory locations required by the various threads are on divergent cache lines. In such situations, the pipeline transfers one of the cache lines that some of the threads are accessing, and those threads are able to complete the memory access operation. However, the other threads pointing to locations within a different cache line are not able to complete the memory access operation and remain unserviced. Thus, with a single pass through the pipeline, some threads are able to complete the memory access operation while other threads are not. Absent a means to process multiple execution cycles, the operation is not able to complete successfully.
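The notion of a divergent load described above can be illustrated with a minimal sketch. The 128-byte cache line size, the function names, and the example addresses are assumptions chosen for illustration only and are not taken from any particular processor.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 128  # assumed line size, for illustration only


def group_threads_by_cache_line(addresses, line_bytes=CACHE_LINE_BYTES):
    """Map each cache-line index to the thread ids whose address falls in it."""
    groups = defaultdict(list)
    for thread_id, address in enumerate(addresses):
        groups[address // line_bytes].append(thread_id)
    return dict(groups)


def is_divergent(addresses, line_bytes=CACHE_LINE_BYTES):
    """A load is divergent when its target addresses span two or more lines."""
    return len(group_threads_by_cache_line(addresses, line_bytes)) >= 2


# Hypothetical example: threads 0-2 target line 0, thread 3 targets line 1.
example_addresses = [0, 4, 8, 128]
example_groups = group_threads_by_cache_line(example_addresses)
```

In this example, a single pass that transfers cache line 0 would service threads 0 through 2, while thread 3 would remain unserviced.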
One approach to implementing multiple execution cycles is to reinsert the instruction into a prior stage of the processing pipeline and execute the load instruction again for the threads that were not able to access data from their target memory address locations. Such a technique is called a “replay” operation. In essence, where a stage in the pipeline performs an operation that cannot be completed in the current cycle, the pipeline “replays” the load instruction once for each cache line that contains at least one target address until each thread has performed the relevant memory access operation. During this process, a portion of the pipeline is utilized to complete the replay operations. Therefore, the pipeline is stalled to prevent new instructions from entering the pipeline until all replay operations have been completed. One drawback of this approach is that upstream instructions may not advance in the pipeline until the pipeline stall is released, which reduces overall system performance. An additional drawback is that the parallel processing system may not be able to stall all pipeline stages within one pipeline stage delay. If the parallel processing system cannot stall the pipeline in time, one or more new instructions entering the pipeline may be improperly discarded, or the replay operation may be discarded instead. In either case, the new instruction or the replay operation does not properly complete.
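The replay behavior described above may be sketched as a simple loop in which each pass services exactly one cache line. This is only an illustrative model: the function name, the 128-byte line size, and the policy of servicing the line of the lowest-numbered pending thread first are all assumptions, since the selection order is not specified here.

```python
def run_with_replays(addresses, line_bytes=128):
    """Return the list of passes as (cache_line_index, serviced_thread_ids).

    The first pass is the original execution; every later pass is a replay.
    """
    pending = set(range(len(addresses)))
    passes = []
    while pending:
        # Assumed policy: service the line targeted by the lowest pending thread.
        lead_thread = min(pending)
        line = addresses[lead_thread] // line_bytes
        serviced = sorted(
            t for t in pending if addresses[t] // line_bytes == line
        )
        passes.append((line, serviced))
        pending.difference_update(serviced)
    return passes


# Hypothetical example: six threads whose addresses span three cache lines.
passes = run_with_replays([0, 4, 256, 8, 260, 512])
replays = len(passes) - 1  # passes beyond the first are replay operations
```

With three distinct cache lines, the model takes three passes, that is, one original execution followed by two replay operations, during which the pipeline would remain stalled under this approach.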
Another approach to implementing multiple execution cycles is to reinsert the instruction farther back in the pipeline. With this approach, the instructions being “replayed” are interleaved in the pipeline along with new instructions, which reduces the frequency of pipeline stalls, thereby increasing pipeline performance. One drawback of this approach, however, is increased latency of the instructions being replayed. For example, consider a processor with 32 simultaneously executing threads. In a divergent load operation, the target address locations for the 32 threads may diverge across 32 different cache lines. If a replay is inserted one pipeline stage back, then the pipeline may stall for 31 pipeline cycles while the 31 replay operations execute. Inserting the replay six pipeline stages back increases the latency for the instructions being replayed to six pipeline stages multiplied by 31 replay operations, or 186 pipeline cycles.
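The latency figures in the example above follow from a simple product, which may be checked with the short sketch below. The function names are hypothetical, and the model assumes that each replay costs exactly as many cycles as the number of stages the instruction is reinserted back.

```python
def replay_count(num_distinct_lines):
    """One pass services one cache line, so the remaining lines need replays."""
    return max(num_distinct_lines - 1, 0)


def replay_latency_cycles(num_distinct_lines, stages_back):
    """Latency of the replays: stages reinserted back times number of replays."""
    return stages_back * replay_count(num_distinct_lines)


# 32 threads diverging across 32 cache lines, as in the example above.
one_stage_back = replay_latency_cycles(32, 1)   # 1 * 31
six_stages_back = replay_latency_cycles(32, 6)  # 6 * 31
```

Under these assumptions, the two values correspond to the 31 and 186 pipeline cycles stated in the example.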
As the foregoing illustrates, what is needed in the art is a more efficient way to execute replay operations for divergent operations.