1. Field of the Invention
The present invention generally relates to parallel processing and more specifically to a thread group scheduler for computing on a parallel thread processor.
2. Description of the Related Art
A “thread group” is a set of parallel threads that execute the same instruction together in a single-instruction multiple-thread (SIMT) or single-instruction multiple-data (SIMD) fashion. A typical multithreaded streaming multiprocessor (SMP) schedules two sets of 24 thread groups, where each thread group has 32 parallel threads. The SMP schedules thread groups that are ready to execute an instruction and dispatches and executes each thread group instruction. The SMP can schedule two different thread groups for each SMP cycle.
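By way of an illustrative sketch (not part of any claimed invention), the SIMT execution style described above can be seen in a minimal CUDA kernel, where every thread of a thread group executes the same instruction on its own data element. The kernel name and parameters below are hypothetical.

```cuda
// Hypothetical SIMT illustration: the 32 threads of each thread group
// (warp) execute the same add instruction together, each operating on
// a different element selected by its per-thread index.
__global__ void scaleAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread index
    if (i < n)
        c[i] = a[i] + 2.0f * b[i];  // same instruction, different data
}
```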
Compute Unified Device Architecture (CUDA), Open Computing Language (OpenCL), and DirectX 11 (DX11) are parallel programming models that execute parallel threads on an SMP in groups of related threads known as “cooperative thread arrays” (CTAs). A CTA is a set of concurrently executing threads that can cooperate, synchronize, communicate, and share memory. The SMP implements a CTA as one or more thread groups and can schedule and execute multiple CTAs concurrently. When executing multiple CTAs, the threads of each CTA must synchronize at specific “barrier” points and at CTA completion. When a given CTA completes, the resources allocated to that CTA are freed, and the SMP may then launch additional CTAs.
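The cooperation and barrier synchronization within a CTA can be sketched, purely for illustration, with a hypothetical CUDA kernel in which the threads of one CTA (a thread block in CUDA terms) stage data in shared memory and synchronize before any thread reads a neighbor's value:

```cuda
// Hypothetical sketch of CTA cooperation: the threads of one CTA share
// memory and synchronize at a barrier (__syncthreads) so that every
// write to the shared tile is visible before any neighbor read.
__global__ void neighborSum(const float *in, float *out, int n)
{
    extern __shared__ float tile[];              // memory shared by the CTA
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage one element
    __syncthreads();                             // CTA-wide barrier point

    if (i < n) {
        float right = (threadIdx.x + 1 < blockDim.x)
                          ? tile[threadIdx.x + 1] : 0.0f;
        out[i] = tile[threadIdx.x] + right;      // safe after the barrier
    }
}
```

Note that every thread of the CTA executes the barrier, including threads whose index falls outside the data range; an early return before `__syncthreads()` would leave the remaining threads waiting indefinitely.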
Conventional SMPs attempt to schedule thread groups “fairly” so that each thread group makes equal progress compared to the other thread groups. Prior techniques may be effective when the SMP executes just one CTA. However, various problems arise when the SMP executes multiple CTAs.
First, if one thread of an executing CTA requires more time to reach a barrier point than the other threads within that CTA, then those other threads must wait for the one thread to complete. In this situation, the other CTAs are ineligible to execute additional instructions and cannot help hide the execution latency of the one remaining thread. Further, the SMP may be unable to launch a new CTA until the last thread of the executing CTA finishes, even though the executing CTA may leave sufficient per-thread resources at least partially unused to accommodate a new CTA.
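This first problem can be illustrated with a hypothetical kernel (again, a sketch rather than any claimed implementation) in which a single thread performs far more work before the barrier than its peers, so every other thread of the CTA stalls at the barrier until that one thread arrives:

```cuda
// Hypothetical sketch of load imbalance within a CTA: thread 0 alone
// iterates many times before the barrier, so the remaining threads of
// the CTA finish quickly and then stall at __syncthreads() waiting.
__global__ void unevenWork(float *data, int heavyIters)
{
    float v = data[threadIdx.x];

    // Only thread 0 performs the long-running computation.
    int iters = (threadIdx.x == 0) ? heavyIters : 1;
    for (int k = 0; k < iters; ++k)
        v = v * 0.999f + 1.0f;

    data[threadIdx.x] = v;
    __syncthreads();   // all other threads of the CTA wait here
    // ... a subsequent phase that depends on every thread's result ...
}
```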
Second, thread groups within an executing CTA often perform similar processing operations at similar times, so those thread groups often require the same resources almost simultaneously, which may result in resource conflicts among the thread groups. For example, if every thread group within the executing CTA needs to perform a math operation at the same time, those thread groups may become bottlenecked on the math resource.
Third, when multiple CTAs are launched simultaneously within an SMP, those CTAs often exit nearly simultaneously as well, which may leave the SMP idle until the CTA resources are reclaimed and new CTAs are launched. In practice, reclaiming resources and launching new CTAs may take many tens of cycles, depending on the size of the CTA, and the SMP may sit idle during those cycles.
Finally, a conventional SMP may repeatedly allocate all execution resources to a single high-priority CTA while excluding other low-priority CTAs from being scheduled, thereby preventing the low-priority CTAs from completing. In situations where the high-priority CTA depends on the completion of the low-priority CTAs, deadlock may occur because the low-priority CTAs cannot complete without the resources held by the high-priority CTA.
Accordingly, what is needed in the art is an improved technique for scheduling thread groups for execution.