Field of the Invention
This invention relates generally to the field of computer processors. More particularly, the invention relates to an apparatus and method for implementing a memory-hierarchy aware producer-consumer instruction.
Description of the Related Art
In a model where a CPU 101 and GPU 102 work in a producer-consumer mode with the CPU as the producer and GPU as the consumer, the data transfer between them is performed as illustrated in FIG. 1. The CPU in the illustrated example includes a multi-level cache hierarchy including a level 1 (L1) cache 110 (sometimes referred to as an Upper Level Cache (ULC)); a level 2 (L2) cache 111 (sometimes referred to as a Mid-Level Cache (MLC)); and a level 3 (L3) cache 112 (sometimes referred to as a Lower Level Cache (LLC)). Both the GPU 102 and the CPU 101 are coupled to the L3 cache and a main memory 100.
To provide data to the GPU, the CPU performs a non-temporal store to main memory. A non-temporal store in this context is a store of data that is not anticipated to be needed by the CPU in the near future. Consequently, the store is directed to main memory rather than to one of the caches in the hierarchy. Non-temporal stores may be implemented using, for example, the Uncacheable Speculative Write Combining (USWC) memory type or non-temporal store instructions (e.g., MovNT store instructions). Using a USWC operation, the data is not cached, but the CPU may combine data in internal Write Combining (WC) buffers in the CPU before transferring the data all the way out to main memory. USWC operations also allow reading of data from memory in an out-of-order manner.
Non-temporal stores are by nature weakly ordered, meaning that data may be accessed in an order deviating from the order specified in program execution. For example, the program may specify the operation order “store A and then store B,” but in operation the CPU may store B and then store A. Because of this characteristic of non-temporal stores, a Fence instruction is needed to force all stores to be ordered as per program execution. The Fence instruction enforces an ordering constraint on memory operations issued before and after the Fence instruction, thereby ordering all the weakly ordered instructions from the CPU.
After the data has been successfully written to main memory and ordered using a Fence instruction, the producer writes to a flag notifying the consumer (the GPU in the example) that the data is ready. The consumer observes that the flag has been written, either by polling or by other techniques such as an interrupt, and generates unordered data fetch transactions (reads) to read the data.
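By way of illustration only, the store, fence, and flag sequence described above may be sketched in C as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the names shared_buf, data_ready, produce, consume, and BUF_WORDS are hypothetical, the consumer is modeled as ordinary CPU code standing in for the GPU, and the SSE2 intrinsics _mm_stream_si32 and _mm_sfence are used as one example of a non-temporal store and a Fence instruction. On targets without SSE2, the sketch falls back to plain stores and a C11 release fence, which approximates the ordering but does not bypass the caches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32 and _mm_sfence */
#endif

#define BUF_WORDS 1024   /* hypothetical size of the shared buffer */

static int shared_buf[BUF_WORDS];   /* stands in for the buffer in main memory */
static atomic_int data_ready;       /* flag observed by the consumer; starts at 0 */

/* Producer: weakly ordered stores, then a fence, then the flag write. */
static void produce(const int *src, size_t n)
{
    assert(n <= BUF_WORDS);
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        /* Non-temporal store: written with a hint to bypass the caches. */
        _mm_stream_si32(&shared_buf[i], src[i]);
#else
        /* Assumption: portable stand-in, an ordinary cached store. */
        shared_buf[i] = src[i];
#endif
    }
#if defined(__SSE2__)
    /* Fence: earlier stores become globally visible before later stores. */
    _mm_sfence();
#else
    atomic_thread_fence(memory_order_release);
#endif
    /* Only after the fence does the producer notify the consumer. */
    atomic_store_explicit(&data_ready, 1, memory_order_release);
}

/* Consumer: poll the flag, then read the data (here, sum n words). */
static long consume(size_t n)
{
    assert(n <= BUF_WORDS);
    while (!atomic_load_explicit(&data_ready, memory_order_acquire)) {
        /* busy-wait; an interrupt could be used instead of polling */
    }
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += shared_buf[i];
    }
    /* Clear the flag so that the buffer may be reused. */
    atomic_store_explicit(&data_ready, 0, memory_order_relaxed);
    return sum;
}
```

In this sketch, a producer would call produce(src, n) and a consumer running concurrently would call consume(n). If the fence were omitted, the flag write could become visible before some of the data stores, and the consumer could read stale data.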
The foregoing approach suffers from high latency and low bandwidth because the store operations by the CPU and the read operations by the GPU must go all the way out to main memory 100. Consequently, a more efficient mechanism is needed for transferring data between a CPU and a GPU.