1. Field
The described embodiments relate to computing devices. More specifically, the described embodiments relate to using predictions for store-to-load forwarding in a computing device.
2. Related Art
Many modern computing devices include a core (e.g., a central processing unit (CPU) core, a graphics processing unit (GPU) core, an accelerated processing unit (APU) core, etc.) with a store buffer that is used to conceal, from processing circuits in the core, the latency associated with writing data back to a memory hierarchy connected to the core (where the memory hierarchy includes one or more caches and/or memories). In these cores, as a store is retired from the processing circuits and data for the store is ready to be written to the memory hierarchy, the processing circuits write the store data to an entry in the store buffer. The processing circuits then proceed with subsequent computational operations as if the store data had been written back to the memory hierarchy. However, the store data remains buffered in the store buffer until the memory hierarchy is available to accept the store data (e.g., until a cache is not busy), thereby concealing the latency of the memory hierarchy from the processing circuits in the core.
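The buffering behavior described above can be illustrated with a minimal software sketch. It is a toy model under stated assumptions, not a description of any particular core: the class name StoreBuffer, the methods retire_store and drain_one, and the use of a dictionary to stand in for the memory hierarchy are all hypothetical names chosen for illustration.

```python
from collections import deque


class StoreBuffer:
    """Toy FIFO store buffer between a core and a memory hierarchy (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory    # dict standing in for the memory hierarchy (assumption)
        self.entries = deque()  # buffered (address, data) pairs, oldest first

    def retire_store(self, address, data):
        # The core writes the store data to a buffer entry and continues
        # immediately, without waiting for the memory hierarchy.
        self.entries.append((address, data))

    def drain_one(self):
        # Called when the memory hierarchy is available (e.g., a cache is not busy).
        # Writes the oldest buffered store back; returns False if nothing was buffered.
        if not self.entries:
            return False
        address, data = self.entries.popleft()
        self.memory[address] = data
        return True


memory = {0x100: 1}
sb = StoreBuffer(memory)

sb.retire_store(0x100, 42)     # store retires; data is only in the buffer
pending_value = memory[0x100]  # memory hierarchy still holds the old value

sb.drain_one()                 # memory hierarchy becomes available
drained_value = memory[0x100]  # memory hierarchy now holds the store data
```

In this sketch, the interval between retire_store and drain_one is the window during which the memory hierarchy holds stale data, which is why a load to the same address must be able to obtain the data from the store buffer.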
In such cores, while the data is buffered in the store buffer, the data can be forwarded from the store buffer to the processing circuits. This forwarding ensures that the processing circuits receive the most recent, and therefore correct, version of the data. In some cores, forwarding is achieved by, when data is to be loaded (e.g., in response to a load instruction), simultaneously (in parallel) sending requests for the data for satisfying the load to both the store buffer and the memory hierarchy. Then, if data is returned (i.e., forwarded) from the store buffer, the data returned from the store buffer is used to satisfy the load. Otherwise, if no data is returned from the store buffer, data returned from the memory hierarchy (e.g., from a cache) is used to satisfy the load. However, sending requests to both the store buffer and the memory hierarchy as described is inefficient (in terms of power usage, computational effort, and communication bandwidth) when the data is available in the store buffer, because the request to the memory hierarchy is then unnecessary.
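The parallel-request scheme and its cost can be sketched as follows. This is a simplified model under stated assumptions: the function name load, the stats counters, and the representation of the store buffer as a list of (address, data) pairs ordered oldest first are hypothetical, and the two requests that hardware would issue in parallel are modelled sequentially.

```python
_MISS = object()  # sentinel meaning "the store buffer returned no data"


def load(address, store_buffer, memory, stats):
    """Satisfy a load by querying both the store buffer and the memory hierarchy."""
    # Both requests are issued for every load, regardless of where the data is.
    stats["store_buffer_lookups"] += 1
    stats["memory_lookups"] += 1

    # Store buffer request: search youngest-first so the most recent store wins.
    forwarded = _MISS
    for entry_address, data in reversed(store_buffer):
        if entry_address == address:
            forwarded = data
            break

    # Memory hierarchy request: dict stands in for caches/memories (assumption).
    from_memory = memory.get(address)

    if forwarded is not _MISS:
        # Forwarded data satisfies the load; the memory request was not needed.
        stats["wasted_memory_lookups"] += 1
        return forwarded
    return from_memory


memory = {0x100: 1, 0x200: 7}
store_buffer = [(0x100, 41), (0x100, 42)]  # two buffered stores, oldest first
stats = {"store_buffer_lookups": 0, "memory_lookups": 0, "wasted_memory_lookups": 0}

hit_value = load(0x100, store_buffer, memory, stats)   # forwarded from the store buffer
miss_value = load(0x200, store_buffer, memory, stats)  # satisfied from the memory hierarchy
```

In this model, the wasted_memory_lookups counter captures the inefficiency noted above: each load that is satisfied by forwarding still pays for a memory hierarchy request whose result is discarded.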