Field of the Disclosure
The present disclosure relates generally to processing devices and, more particularly, scheduling pipelined instructions in processing devices.
Description of the Related Art
Processing devices such as central processing units (CPUs), graphics processing units (GPUs), or accelerated processing units (APUs) implement pipelines so that operations associated with multiple threads can be concurrently executed in the pipeline. Contexts for each of the threads that are available for execution by the pipeline are stored in a set of registers that are implemented in the processing device. The processing device can hide the latency of individual threads by switching between the threads that are available for execution. For example, a memory (such as a RAM) may take several cycles to return information from the memory to a first thread. The processing device may therefore launch one or more other threads or perform operations associated with previously launched threads while waiting for the information requested by the first thread to be returned to the processing device. However, space and cost considerations limit the number of registers in the processing device, which in turn limits the number of contexts that can be stored for pending threads, and therefore ultimately limits the number of threads that are available for execution. At least in part because of these constraints, the pipelines in processing devices frequently stall (i.e., temporarily stop executing any instructions) because all of the threads that are available for execution are waiting for previous operations to complete. Some applications stall for more than 90% of the time for reasons including limits on the number of available threads, limits on the number of thread groupings that include multiple threads that run the same instruction at the same time, high memory access latencies, limited memory bandwidth, long execution times for operations such as transcendental floating-point operations, and the like.