1. Field of the Invention
The present invention generally relates to graphics processing unit (GPU) architectures, and, more particularly, to an approach for context switching of lock-bit protected memory.
2. Description of the Related Art
A common practice in graphics processing units (GPUs) is to include one or more shared memories that support atomic operations. An atomic operation is one where a first processor reads from a memory location and subsequently writes a new value to the same memory location while other processors and input/output (I/O) devices are prevented from accessing the same memory location until the first processor completes the read and the subsequent write to the shared memory. Such an atomic operation may be called a read-modify-write operation. Atomic operations ensure that a processor can perform a read-modify-write as an undivided operation.
One approach to implementing atomic operations is to provide a load-lock and a store-unlock instruction pair. When a processor performs an atomic operation that is associated with a certain memory location, the processor first executes a load-lock instruction directed to the memory location. The load-lock instruction reads the memory location, and simultaneously secures a lock on the memory location. The lock prevents other processors and I/O devices from accessing the memory location. The processor then modifies the value read from the memory location, as desired, and executes a store-unlock instruction. The store-unlock instruction writes the modified value to the memory location, and releases the lock. Once the lock is released, other processors and I/O devices may access the memory location. If a processor is denied access to a memory location due to a lock, then the processor continues to attempt to access the memory location until the lock is released and the access is successful. This type of approach ensures that processors and I/O devices successfully perform atomic operations, because a locked memory location may not be modified by other processors until the read-modify-write operation completes, and the memory lock is released.
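By way of illustration only, the load-lock and store-unlock pair described above may be modeled in software as follows. This is a minimal sketch under stated assumptions, not a description of any particular hardware: the names LockedWord, load_lock, store_unlock, atomic_increment, and run_increments are hypothetical, and the lock is modeled as a single atomic flag associated with one memory location.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical software model of one memory location guarded by a lock bit.
struct LockedWord {
    std::atomic<bool> lock_bit{false};  // true while some thread holds the lock
    int value = 0;                      // the protected data
};

// Load-lock: acquire the lock, then read the location.
// A requester that is denied access keeps retrying until the lock is released.
int load_lock(LockedWord& w) {
    while (w.lock_bit.exchange(true, std::memory_order_acquire)) {
        std::this_thread::yield();  // lock held by another thread: try again
    }
    return w.value;
}

// Store-unlock: write the modified value, then release the lock.
void store_unlock(LockedWord& w, int new_value) {
    assert(w.lock_bit.load(std::memory_order_relaxed));  // caller must hold the lock
    w.value = new_value;
    w.lock_bit.store(false, std::memory_order_release);
}

// A read-modify-write built from the instruction pair.
void atomic_increment(LockedWord& w) {
    int v = load_lock(w);    // read and lock
    store_unlock(w, v + 1);  // modify, write, and unlock
}

// Runs `threads` threads that each perform `iters` increments on one shared
// location and returns the final value of that location.
int run_increments(int threads, int iters) {
    LockedWord w;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&w, iters] {
            for (int i = 0; i < iters; ++i) atomic_increment(w);
        });
    }
    for (auto& th : pool) th.join();
    return w.value;
}
```

In this sketch, a call such as run_increments(4, 1000) is expected to return 4000, because no increment can interleave with another while the lock is held.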
One drawback to the above approach is that an instance of a process, known as a thread, may be preempted while the thread is in the middle of an atomic operation. In other words, a first thread may execute a load-lock instruction, securing a lock on a shared memory location, and then be preempted by a second thread prior to executing the store-unlock instruction. In such cases, when the first thread is preempted, the processor stores the context of the first thread and begins executing the second thread. Eventually, the context for the first thread is restored, and the first thread completes the atomic operation. During the time of preemption, however, the first thread retains the lock on the shared memory. Thus, other threads are prevented from accessing the locked memory location for an indeterminately long period of time, resulting in a loss of performance. Further, one memory lock may be shared between thread groups accessing different memory locations. If the first thread cannot resume execution until the second thread secures the lock held by the first thread, then a deadlock may occur. With deadlock, the second thread continues, unsuccessfully, to attempt to acquire the lock retained by the first preempted thread, and the first thread never resumes execution. As is well-appreciated, deadlock negatively impacts overall performance.
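By way of illustration only, the preemption scenario described above may be simulated in a single thread of control as follows. This is a sketch under stated assumptions: the names LockedWord, try_load_lock, store_unlock, and preempted_holder_blocks_others are hypothetical, preemption is modeled simply by the first thread not yet executing its store-unlock, and the retry count is bounded so that the simulation terminates, whereas the approach described above retries without limit.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical software model of one memory location guarded by a lock bit.
struct LockedWord {
    std::atomic<bool> lock_bit{false};  // true while some thread holds the lock
    int value = 0;                      // the protected data
};

// Bounded load-lock (an assumption made for this simulation only).
// Tries up to max_attempts times; on success stores the value read in `out`
// and returns true, otherwise returns false without touching `out`.
bool try_load_lock(LockedWord& w, int max_attempts, int& out) {
    for (int i = 0; i < max_attempts; ++i) {
        if (!w.lock_bit.exchange(true, std::memory_order_acquire)) {
            out = w.value;
            return true;
        }
    }
    return false;
}

// Store-unlock: write the modified value, then release the lock.
void store_unlock(LockedWord& w, int new_value) {
    assert(w.lock_bit.load(std::memory_order_relaxed));  // caller must hold the lock
    w.value = new_value;
    w.lock_bit.store(false, std::memory_order_release);
}

// Plays out the scenario and returns true if it unfolds as described.
bool preempted_holder_blocks_others() {
    LockedWord w;

    // First thread executes load-lock, then is preempted before store-unlock.
    int first_value = 0;
    bool first_locked = try_load_lock(w, 1, first_value);

    // Second thread runs while the first is switched out: every attempt fails,
    // because the saved context of the first thread still owns the lock.
    int second_value = 0;
    bool second_blocked = !try_load_lock(w, 1000, second_value);

    // The context of the first thread is restored; it completes the operation.
    store_unlock(w, first_value + 1);

    // Only now can the second thread acquire the lock and proceed.
    bool second_ok = try_load_lock(w, 1, second_value) && second_value == 1;
    if (second_ok) store_unlock(w, second_value + 1);

    return first_locked && second_blocked && second_ok && w.value == 2;
}
```

If the restoration of the first thread were itself made to depend on the second thread acquiring the lock, the second thread's attempts in this sketch would never succeed, which corresponds to the deadlock case described above.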
As the foregoing illustrates, what is needed is a more effective way to perform atomic operations in a multi-threaded processing architecture.