Field of the Invention
The present invention generally relates to multithreaded processing and, more specifically, to low overhead thread synchronization using hardware accelerated bounded circular queues.
Description of the Related Art
A conventional central processing unit (CPU) typically supports multithreaded processing and often provides various mechanisms for synchronizing concurrently executing threads, including mutexes and semaphores. However, a conventional parallel processing unit (PPU), such as a graphics processing unit (GPU), may not provide synchronization mechanisms similar to those commonly provided by a CPU. A conventional GPU implements a hardware scheduler that schedules threads for execution, but the hardware scheduler typically cannot cause threads to synchronize without rescheduling those threads. Although GPU hardware does support coarse synchronization mechanisms, such as thread group-wide synchronization barriers, such mechanisms are not capable of synchronizing individual threads. Consequently, developers of multithreaded programs designed for execution on PPUs oftentimes rely on various workarounds in order to provide basic synchronization mechanisms.
One common workaround is to program a thread executing on a PPU to poll a conditional value in order to synchronize with another thread configured to update that conditional value. For example, a given thread that must wait for some other thread to exit before resuming processing could be programmed to suspend its own progress until that other thread modifies a particular register value. The given thread would repeatedly poll the register value and, upon detecting that the register value has been modified, resume processing. With this approach, the given thread and the other thread may synchronize their operations.
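The polling workaround described above can be illustrated with a minimal host-side sketch. This is a CPU analogue written in standard C++ rather than PPU code, and every name in it (`PollResult`, `poll_until_signaled`, `flag`, `payload`) is hypothetical. An atomic flag stands in for the polled register value, and the loop counter records how many polling iterations performed no useful work.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical illustration only: a CPU-side analogue of the polling
// workaround, not code for any particular PPU.
struct PollResult {
    int value;            // data published by the other thread
    unsigned long spins;  // polling iterations that did no useful work
};

PollResult poll_until_signaled() {
    std::atomic<bool> flag{false};  // stands in for the polled register value
    int payload = 0;
    unsigned long spins = 0;

    // The "other thread": finishes its work, then modifies the conditional value.
    std::thread other([&] {
        payload = 42;
        flag.store(true, std::memory_order_release);
    });

    // The "given thread": executes the same load repeatedly until the
    // conditional value changes.
    while (!flag.load(std::memory_order_acquire)) {
        ++spins;
    }

    other.join();
    // The release/acquire pair guarantees the payload write is visible here.
    assert(flag.load(std::memory_order_relaxed));
    return PollResult{payload, spins};
}
```

Calling `poll_until_signaled()` yields a `value` of 42 once the flag is observed. The `spins` count varies from run to run and depends on how long the waiting thread had to poll.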
However, this solution suffers from two drawbacks. First, causing a thread to poll a register is typically power inefficient because the thread executes the same portion of code repeatedly to implement polling without accomplishing any useful work. Second, while continually polling, the thread retains control over various resources allocated to that thread, including arithmetic logic units (ALUs) and load-store units (LSUs), thereby preventing other threads from using those resources to perform useful work.
A possible optimization for the polling-based thread synchronization approach described above, in the context of PPU-based multithreaded processing, is to implement priority-based scheduling. With priority-based scheduling, a low-priority thread may be scheduled to “wake up” and poll a corresponding conditional value less frequently than other, higher-priority threads. Such an approach may be slightly more power efficient than the basic polling procedure described above, but problems may arise if low-priority threads are continuously rescheduled and never allowed to complete because higher-priority threads continue to emerge. In these situations, the low-priority threads are starved and system deadlock may occur.
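The reduced polling frequency described above can be sketched in the same host-side style. This is an assumption-laden analogue: a sleep interval stands in for the scheduler waking a low-priority thread less often, the names `poll_every` and `low_priority_wait_demo` are hypothetical, and the sketch does not model a hardware scheduler or the starvation scenario itself.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper: checks `flag` once per `interval` and returns the
// total number of polls made, including the final successful one.
unsigned long poll_every(const std::atomic<bool>& flag,
                         std::chrono::microseconds interval) {
    unsigned long polls = 1;  // counts the load performed in the loop condition
    while (!flag.load(std::memory_order_acquire)) {
        // Stand-in for a low-priority thread being descheduled between polls.
        std::this_thread::sleep_for(interval);
        ++polls;
    }
    return polls;
}

// Hypothetical demo: another thread updates the conditional value after a
// delay while the waiting thread polls only every 5 ms.
unsigned long low_priority_wait_demo() {
    std::atomic<bool> flag{false};
    std::thread updater([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        flag.store(true, std::memory_order_release);
    });
    unsigned long polls = poll_every(flag, std::chrono::milliseconds(5));
    updater.join();
    return polls;
}
```

A longer interval lowers the number of wasted polls but delays the moment at which the waiting thread notices the update. If the waiting thread were never given a chance to run at all, it would never observe the update, which corresponds to the starvation risk noted above.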
As the foregoing illustrates, what is needed in the art is an improved technique for synchronizing threads executing on a PPU.