Field of the Invention
The present invention generally relates to computer science and, more specifically, to techniques for CPU-to-GPU and GPU-to-GPU atomic operations.
Description of the Related Art
A typical computer system includes a central processing unit (CPU) and one or more parallel processing units, such as graphics processing units (GPUs). Some advanced computer systems implement a unified virtual memory architecture common to both the CPU and the GPUs. Among other things, the architecture enables the CPU and the GPUs to access a physical memory location using a common (e.g., the same) virtual memory address, regardless of whether the physical memory location is within system memory or memory local to a GPU.
During execution of a software application, memory pages that are shared between processing units may receive simultaneous access requests. To ensure deterministic results when operating on data that is shared between processes, many software algorithms employ atomic operations. If atomic operations are supported, then the computer system ensures that an atomic operation to a memory location is not interrupted by other operations to the memory location. For instance, suppose that a read-modify-write atomic operation to a memory location were to be performed. In such a scenario, the computer system would typically prevent other accesses to the memory location while the process executing the atomic operation reads a value from the memory location, modifies the value, and writes the modified value back to the memory location.
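The read-modify-write sequence described above can be sketched in C++ with `std::atomic`. This is a minimal illustration under stated assumptions, not the mechanism of any particular CPU or GPU: the helper names `atomic_increment` and `demo_atomic_increment` are hypothetical, and a compare-and-swap retry loop stands in for whatever hardware support a real system provides.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically adds 1 to `location` and returns the
// value that was stored there immediately before the update.
int atomic_increment(std::atomic<int>& location) {
    int observed = location.load();  // read
    // compare_exchange_weak writes observed + 1 only if `location` still
    // holds `observed`. On failure it refreshes `observed` with the current
    // value, so the loop retries the modify-and-write with fresh data.
    while (!location.compare_exchange_weak(observed, observed + 1)) {
    }
    return observed;
}

// Usage sketch: two increments starting from 10 leave 12 in the location.
void demo_atomic_increment() {
    std::atomic<int> location{10};
    atomic_increment(location);
    atomic_increment(location);
    assert(location.load() == 12);
}
```

Because the write succeeds only when no other access has changed the location since the read, the three steps behave as one indivisible operation from the point of view of any other thread that uses the same atomic object.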
If not handled appropriately, atomic operations to a shared memory page may produce unexpected or unintended results. For instance, suppose that a particular memory location were to store a value of 10. Further, suppose that a CPU and a GPU were to perform an increment atomic operation on the memory location in parallel. If the GPU and the CPU are permitted to simultaneously access the memory page, then both could read the value 10 before either writes, and each would then write 11. The memory location would thus end with a value of 11 instead of the desired value of 12.
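The lost update in this example can be reproduced deterministically by writing out each increment as separate read and write steps. The following C++ sketch is a single-threaded model only; `MemoryLocation`, `interleaved_increments`, and `serialized_increments` are hypothetical names introduced for illustration, and the "CPU" and "GPU" are represented simply by the order of the statements.

```cpp
#include <cassert>

// Hypothetical model of a single shared memory location.
struct MemoryLocation {
    int value;
};

// Both units read before either writes, so one increment is lost.
int interleaved_increments(int initial) {
    MemoryLocation loc{initial};
    int cpu_copy = loc.value;  // CPU reads the current value
    int gpu_copy = loc.value;  // GPU reads the same value before the CPU writes
    loc.value = cpu_copy + 1;  // CPU writes its result
    loc.value = gpu_copy + 1;  // GPU overwrites it with the same result
    return loc.value;
}

// Each read-modify-write completes before the next begins.
int serialized_increments(int initial) {
    MemoryLocation loc{initial};
    int cpu_copy = loc.value;  // CPU reads
    loc.value = cpu_copy + 1;  // CPU writes
    int gpu_copy = loc.value;  // GPU reads the updated value
    loc.value = gpu_copy + 1;  // GPU writes
    return loc.value;
}

// Usage sketch with the starting value from the text.
void demo_lost_update() {
    assert(interleaved_increments(10) == 11);  // one increment lost
    assert(serialized_increments(10) == 12);   // desired result
}
```

The only difference between the two functions is the ordering of the reads and writes, which is exactly what system-wide atomicity has to control.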
In one approach to supporting atomic operations in a computer system, atomic operations are handled on a localized basis. Typically, multiple cores in the CPU may execute atomic operations, and specialized hardware included in the CPU ensures the integrity of the results. Further, each GPU includes functionality to mediate atomic operations within that particular GPU. One limitation to this approach is that the indivisibility of such ostensibly atomic operations is not assured across the computer system. Notably, the atomicity of an “atomic” operation initiated by a particular GPU is preserved with respect to the particular GPU, but is not ensured with respect to the CPU or any other GPUs. Consequently, atomic operations such as read-modify-write may not execute correctly in computer systems where data may be accessed by multiple processing units.
In another approach to the above problem, if a particular processing unit attempts to perform an atomic operation on a memory page, then the computer system asserts a lock signal on a bus, such as the PCIe bus. This lock ensures that other processing units do not have access to the memory page during the atomic operation. One limitation to this approach is that such a lock on the bus prevents other bus transactions from occurring while the particular processing unit performs the atomic operation. This leads to inefficiencies in system operation and therefore undermines overall system performance.
As the foregoing illustrates, what is needed in the art is a more effective approach to managing atomic operations in a system that implements a unified memory architecture.