The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to embodiments of the claimed subject matter.
Generally speaking, memory closer to a CPU may be accessed faster than memory farther away. Memory within a CPU may be referred to as cache, and may be accessible at different hierarchical levels, such as Level 1 cache (L1 cache) and Level 2 cache (L2 cache). System memory, such as memory modules coupled with a motherboard, may also be available. Such externally available memory, which is separate from the CPU but accessible to the CPU, may be referred to as, for example, off-chip cache or Level 3 cache (L3 cache), and so on. This naming is not always consistent, however, as a third hierarchical level of cache (e.g., L3 cache) may be on-chip or “on-die” and thus be internal to the CPU.
CPU cache, such as L1 cache, is used by the central processing unit of a computer to reduce the average time to access memory. The L1 cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. L2 cache may be larger, but slower to access. Additional memory used as cache, whether on-chip or externally available system memory, may be larger still, but slower to access than smaller and closer CPU cache levels. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
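By way of illustration only, the relationship between hit rate and average latency may be sketched as a weighted average. The function name `average_access_latency` and the cycle counts used below are hypothetical and chosen purely for the example. The model is simplified in that it assumes a single cache level and charges a miss only the main memory latency.

```python
def average_access_latency(hit_rate, cache_latency, memory_latency):
    """Expected latency when a fraction `hit_rate` of accesses hit the cache.

    Simplified single-level model: a hit costs `cache_latency`,
    a miss costs `memory_latency`.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_latency + (1.0 - hit_rate) * memory_latency


# Illustrative numbers only (assumed, not measured):
# a 4-cycle cache in front of a 200-cycle main memory.
CACHE_CYCLES = 4
MEMORY_CYCLES = 200

for rate in (0.50, 0.90, 0.99):
    cycles = average_access_latency(rate, CACHE_CYCLES, MEMORY_CYCLES)
    print(f"hit rate {rate:.2f}: about {cycles:.2f} cycles on average")
```

Under these assumed numbers, the average moves toward the cache latency as the hit rate rises, and lies nearer to the cache latency than to the main memory latency once more than half of the accesses hit.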
Conventional caches utilize a store buffer to reduce cache latency and also to enable the reading of data from store instructions that have not yet written that data into cache. As stores proceed down a pipeline, they place their data in a store buffer, where the data persists until the store is retired from the pipeline, at which point the store writes the data to cache.
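The store buffer behavior described above may be modeled, in a highly simplified form, as follows. This is a software sketch under stated assumptions rather than a description of any particular hardware: the class `StoreBuffer` and its methods are hypothetical names, addresses are treated as whole locations with no partial overlap, and retirement is strictly in program order.

```python
from collections import deque


class StoreBuffer:
    """Toy model: stores wait in a buffer and reach the cache only at retirement."""

    def __init__(self):
        self.cache = {}          # address -> value already written to cache
        self.pending = deque()   # in-flight stores, oldest first: (address, value)

    def store(self, address, value):
        # An executed store places its data in the buffer, not in the cache.
        self.pending.append((address, value))

    def load(self, address):
        # A load may read buffered data that has not yet been written to cache.
        # The youngest matching store is searched first.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.cache.get(address)

    def retire_oldest(self):
        # At retirement the oldest store writes its data to the cache.
        address, value = self.pending.popleft()
        self.cache[address] = value


buffer = StoreBuffer()
buffer.cache[0x10] = 1
buffer.store(0x10, 2)
print(buffer.load(0x10), buffer.cache[0x10])  # load sees buffered data; cache is unchanged
buffer.retire_oldest()
print(buffer.cache[0x10])                     # cache is updated only after retirement
```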
Conventional cache mechanisms require that store instructions be instituted through a series of operations which are executed in serial steps. Instructions are decoded and forwarded to an address generation unit, an address is calculated and then sent to the cache, and the cache must maintain the order of the instructions serially to carry out the store.
Moreover, the stored data cannot be made available to other entities until it is absolutely certain that the store will persist until retirement. At that point the store instruction “retires” from the pipeline, allowing the stored data to be written from a store buffer to the cache location, and it is at this post-retirement stage of operation that the data is considered valid.
Because an out-of-order machine or processor executes instructions “out of order,” it cannot be known with certainty whether any given instruction will be part of a validly executed path. For example, where an instruction is executed ahead of a branch, there is a risk that such an instruction will never be used, should a branch misprediction occur. Thus, problems arise with data integrity if a store writes data into a cache before it is known with certainty whether or not the store instruction will retire from the pipeline, and thus correspond to valid data. A store instruction which never retires, yet writes to the cache, causes invalid data to be written to the cache and thus creates a data integrity problem. Unfortunately, retirement occurs at a late stage and thus induces cache latency for such store instructions.
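The data integrity concern may be illustrated with a small sketch that contrasts two write policies for the same sequence of stores. All names here (`write_at_execute`, `write_at_retirement`, `first_wrong_path_index`) are hypothetical and introduced only for this example. The sketch assumes that a branch misprediction is detected after every listed store has executed, and that stores at or beyond `first_wrong_path_index` lie on the mispredicted path and therefore never retire.

```python
def write_at_execute(cache, stores, first_wrong_path_index):
    """Each store writes the cache as soon as it executes, before the branch resolves."""
    result = dict(cache)
    for address, value in stores:
        result[address] = value
    # The misprediction is discovered too late: wrong-path data is already in the cache.
    return result


def write_at_retirement(cache, stores, first_wrong_path_index):
    """Each store waits in a buffer and writes the cache only when it retires."""
    result = dict(cache)
    # Stores on the mispredicted path are discarded and never retire.
    retiring_stores = stores[:first_wrong_path_index]
    for address, value in retiring_stores:
        result[address] = value
    return result


initial_cache = {0x20: 7}
# Program order: one valid store, then one store on a mispredicted path.
executed_stores = [(0x20, 8), (0x24, 99)]

print(write_at_execute(initial_cache, executed_stores, 1))     # contains the invalid 0x24 entry
print(write_at_retirement(initial_cache, executed_stores, 1))  # contains only retired data
```

In this sketch, waiting for retirement keeps the cache contents valid, at the cost of delaying every store until the late retirement stage.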
Improvements to cache latency (e.g., reductions in cache latency) provide direct and immediate benefits to computational efficiency for an integrated circuit utilizing such a cache. Lower latency means that data required by, for example, a CPU pipeline is available sooner without having to expend cycles waiting for unavailable data.
The present state of the art may therefore benefit from systems and methods for cutting senior store latency using store prefetching as described herein.