1. Field of the Invention
The present invention relates generally to the field of data processing and, more specifically, to cache-based control of atomic operations in conjunction with an external ALU block.
2. Description of the Related Art
One element of a memory subsystem within certain processing units is a Level 2 cache memory (referred to herein as “L2 cache”). The L2 cache is a large on-chip memory that serves as an intermediate point between an external memory (e.g., frame buffer memory) and the internal clients of the memory subsystem (referred to herein as the “clients”). In a parallel processing architecture, the clients transmit many downstream operations to the L2 cache. Some of these operations are simple load/store operations, where the L2 cache temporarily stores data that the clients are reading from or writing to the external memory (usually a DRAM store). Other operations transmitted from the clients involve further computation on the data coming from the clients, and many of these operations are atomic in nature. The results of these operations are stored in the L2 cache before being written to the external memory and may also be returned to the clients.
One consequence of the many downstream operations generated by the different clients is that processing the atomic operations efficiently becomes complicated. For example, because atomicity must be maintained while an atomic operation is being processed, the space reserved for the atomic operation in the L2 cache cannot be read from or written to until the atomic operation has been completely processed. Further, operations (atomic or otherwise) that depend on accessing that reserved memory space need to be halted until the atomic operation has been processed and the memory space is again ready to be used. The L2 cache therefore needs to distinguish these dependent operations from operations that are independent of the reserved memory space, and the independent operations should be processed by the L2 cache normally.
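By way of illustration only, the distinction between dependent and independent operations may be sketched as a small software model. The sketch is not a description of any particular hardware. The class name, method names, operation names, and the use of an add as the atomic operation are all hypothetical and are assumed here solely to make the behavior concrete: an operation addressed to reserved space is held, an operation addressed elsewhere proceeds normally, and held operations are replayed in arrival order once the atomic operation completes.

```python
from collections import deque


class L2CacheModel:
    """Toy software model of reserved space in a cache (all names hypothetical)."""

    def __init__(self):
        self.data = {}          # address -> stored value
        self.reserved = {}      # address -> operand of the in-flight atomic add
        self.stalled = deque()  # held operations, kept in arrival order

    def submit(self, op, addr, value=None):
        """Run the operation, or hold it if its address is reserved."""
        if addr in self.reserved:
            # Dependent operation: must wait for the atomic to finish.
            self.stalled.append((op, addr, value))
            return ("stalled", None)
        return self._execute(op, addr, value)

    def _execute(self, op, addr, value):
        if op == "load":
            return ("done", self.data.get(addr, 0))
        if op == "store":
            self.data[addr] = value
            return ("done", value)
        if op == "atomic_add":
            # Reserve the space until complete_atomic() is called.
            self.reserved[addr] = value
            return ("pending", None)
        raise ValueError(f"unknown operation: {op}")

    def complete_atomic(self, addr):
        """Finish the atomic add on addr, then replay the held operations."""
        operand = self.reserved.pop(addr)
        self.data[addr] = self.data.get(addr, 0) + operand
        waiting, self.stalled = self.stalled, deque()
        # Replayed operations that still hit reserved space are held again.
        return [self.submit(op, a, v) for op, a, v in waiting]


cache = L2CacheModel()
cache.submit("store", 0x10, 5)
first = cache.submit("atomic_add", 0x10, 3)  # reserves address 0x10
dependent = cache.submit("load", 0x10)       # held: touches reserved space
independent = cache.submit("load", 0x20)     # proceeds normally
replayed = cache.complete_atomic(0x10)       # releases 0x10, replays the load
```

In this sketch the held load observes the value produced by the atomic operation, while the load to the unrelated address is never delayed.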
As the foregoing illustrates, what is needed in the art is a mechanism for efficiently processing downstream atomic operations in a parallel processing architecture.