Field of the Invention
The present invention generally relates to multithreaded programming and, more specifically, to hierarchical staging areas for scheduling threads for execution.
Description of the Related Art
A conventional multithreaded processor supports the concurrent execution of multiple threads. For example, single-instruction, multiple-data (SIMD) instruction issue techniques could be used to support parallel execution of a large number of threads without providing multiple independent instruction units. Alternatively, single-instruction, multiple-thread (SIMT) techniques could be used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines.
Conventional multithreaded processors include a scheduler that is configured to allocate hardware resources to each thread or group of threads on the processor, and then to select threads to be issued for execution. The hardware resources could include, for example, space in random access memory (RAM) for instructions, execution bandwidth within an execution pipeline, and other hardware resources needed for execution.
Once the hardware resources have been allocated to each thread or thread group, the scheduler arbitrates between those entities and selects one or more to be issued for execution. The scheduler performs this process iteratively in order to issue multiple threads or groups of threads for execution. In general, the scheduler has a limited time window within which to arbitrate between the threads and select threads or thread groups to be issued. Upon execution, the threads rely on the allocated hardware resources to perform various processing tasks.
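The allocate-then-arbitrate flow described above can be illustrated with a minimal software sketch. The sketch is not drawn from any particular processor: the names ThreadGroup, allocate_resources, and schedule are hypothetical, the resource model is reduced to a single RAM slot per thread group, and round-robin is assumed as the arbitration policy only because the text does not name one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThreadGroup:
    """Hypothetical stand-in for a thread group competing for issue."""
    group_id: int
    pending_instructions: int        # work left to issue (illustrative only)
    ram_slot: Optional[int] = None   # instruction RAM slot, set at allocation


def allocate_resources(groups: List[ThreadGroup]) -> None:
    """Give every thread group its own RAM slot before any arbitration."""
    for slot, group in enumerate(groups):
        group.ram_slot = slot


def schedule(groups: List[ThreadGroup]) -> List[int]:
    """Iteratively arbitrate across all groups and return the issue order.

    Round-robin arbitration is an assumption made for this sketch.
    """
    allocate_resources(groups)
    issue_order: List[int] = []
    cursor = 0  # position just after the most recent winner
    while any(g.pending_instructions > 0 for g in groups):
        # One arbitration pass: scan the groups starting at the cursor and
        # issue the first one that still has work.
        for offset in range(len(groups)):
            candidate = groups[(cursor + offset) % len(groups)]
            if candidate.pending_instructions > 0:
                candidate.pending_instructions -= 1
                issue_order.append(candidate.group_id)
                cursor = (cursor + offset + 1) % len(groups)
                break
    return issue_order
```

In this sketch, every arbitration pass may have to examine every thread group, and every thread group holds a resource slot whether or not it is currently issuing. Both costs therefore depend directly on the number of thread groups.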
The above approach provides reasonable efficiency when the number of thread groups is low, e.g., 16, as is common with conventional multithreaded processors. However, more advanced multithreaded processors may support a much larger number of thread groups, e.g., 64. With that many thread groups, two main problems arise.
First, the amount of hardware resources allocated to the thread groups must be scaled in proportion to the number of thread groups. For example, the amount of RAM space allocated across thread groups could grow linearly with the number of groups, meaning that adding thread groups would necessitate a corresponding increase in RAM space. Second, the size of the scheduler must be scaled in proportion to the number of thread groups. For example, for a fixed thread group size, the transistor budget for the scheduler could grow linearly with the number of thread groups (or other entities competing for scheduling) on the multithreaded processor, meaning that scheduling a larger pool of thread groups would require a corresponding increase in transistor cost.
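The linear scaling described above can be made concrete with a short calculation. The per-group figures below are placeholders chosen purely for illustration and do not describe any actual processor. Only the thread group counts, 16 and 64, come from the text.

```python
# Hypothetical per-group costs, assumed only for this illustration.
RAM_BYTES_PER_GROUP = 1024
SCHEDULER_TRANSISTORS_PER_GROUP = 5000


def linear_cost(num_groups: int, cost_per_group: int) -> int:
    """Total cost when each additional thread group adds a fixed amount."""
    return num_groups * cost_per_group


conventional_groups = 16  # thread group count cited for conventional designs
advanced_groups = 64      # thread group count cited for more advanced designs

ram_growth = (linear_cost(advanced_groups, RAM_BYTES_PER_GROUP)
              / linear_cost(conventional_groups, RAM_BYTES_PER_GROUP))
scheduler_growth = (
    linear_cost(advanced_groups, SCHEDULER_TRANSISTORS_PER_GROUP)
    / linear_cost(conventional_groups, SCHEDULER_TRANSISTORS_PER_GROUP))
```

Under the linear model, the per-group constants cancel, so moving from 16 to 64 thread groups multiplies both the RAM space and the scheduler transistor budget by four regardless of the placeholder values chosen.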
As the foregoing illustrates, what is needed in the art is a more effective technique for scheduling larger numbers of thread groups for execution on a multithreaded processor.