A processor or microprocessor (popularly and conveniently referred to as a Central Processing Unit or “CPU”) may have a Load Store (LS) unit with an associated LS scheduler that picks memory instructions to execute. To reduce instruction execution time, modern CPUs store copies of frequently-used data in smaller, faster memories so as to avoid the delays associated with accessing slower system memory (e.g., a Random Access Memory or “RAM”) for data. These faster memories are referred to as caches and may co-exist with a processor's processing core on the same chip, thereby significantly reducing data access time. Different independent caches may be organized as a hierarchy of cache levels, i.e., a Level 1 (L1) cache, a Level 2 (L2) cache, a Level 3 (L3) cache, and so on, with the lowest-level cache (i.e., the L1 cache) being accessed first before moving on to the next level of cache. If there is an L1 cache “hit” for a memory instruction, the associated data is returned to the execution units. When the memory instruction “misses” in the L1 cache, a miss request is allocated into a Fill Buffer (FB) and a Replay Queue (RQ), and the miss request is then sent to the next (higher) level cache L2 or to the system bus (e.g., to access the system memory). The data returned from the L2 cache (or the system bus) for the miss request is written back into the Fill Buffer and queued up for subsequent filling into the L1 cache.
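The L1 miss path described above can be illustrated with a minimal Python sketch. The names here (`FillBuffer`, `ReplayQueue`, `access`, the dict-based L1 model, and the `Load` tuple) are hypothetical simplifications for illustration, not an actual hardware design:

```python
from collections import namedtuple

Load = namedtuple("Load", ["addr"])  # hypothetical stand-in for a load instruction


class FillBuffer:
    """Tracks outstanding miss requests and buffers returning fill data."""

    def __init__(self, num_entries=8):
        self.entries = {}          # address -> fill data (None until data returns)
        self.num_entries = num_entries

    def allocate(self, addr):
        assert len(self.entries) < self.num_entries, "FB full"
        self.entries[addr] = None  # data not yet returned from L2/bus


class ReplayQueue:
    """Holds load instructions waiting on an outstanding miss."""

    def __init__(self):
        self.waiting = []

    def enqueue(self, load):
        self.waiting.append(load)


def access(l1_cache, fb, rq, load):
    """Model one load: an L1 hit returns data; a miss allocates FB/RQ entries."""
    if load.addr in l1_cache:
        return ("hit", l1_cache[load.addr])   # data returned to execution units
    fb.allocate(load.addr)   # miss request allocated into the Fill Buffer
    rq.enqueue(load)         # load parked in the Replay Queue
    return ("miss", None)    # miss request is then sent to L2 / system bus
```

In this toy model the L1 cache is just a dictionary; the point is only the allocation of the miss request into both the FB and the RQ before the request goes out to the next level.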
When the data is returned by the next-level L2 cache or the system bus, there are two choices for handling the Load (Ld) instruction sitting in the Replay Queue:
(1) Stall the pick (of the Ld instruction that created the miss request) from the Load Store scheduler or RQ so that the data coming from the L2 cache or the bus can first be written into the FB and then into the L1 cache. Here, the Ld is held back in the LS scheduler/RQ until the data in the FB has been written into the L1 cache. The Ld instruction that caused the miss is then “woken up” from the RQ/LS scheduler and gets its data from the L1 cache. This approach leads to sub-optimal performance.
(2) Capture the data into the Fill Buffer and then forward the data from the Fill Buffer. Here, the Ld instruction in the RQ/LS scheduler is “woken up” and starts forwarding the data from the FB while the data from the L2 cache/bus is being written/captured into the FB (and not into the L1 cache). Thus, the Ld instruction gets its data from the FB and completes its execution. At some later point in time, when the L1 cache is idle, the FB data is transferred to or written into the L1 cache. This leads to higher performance because the L1 cache is not interrupted by writes of data from the L2 cache/bus (through the FB); it remains free to service Load (Ld)/Store (St) instructions from the LS scheduler or RQ.
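The two choices above can be contrasted in a minimal Python sketch. The function names, the dict-based L1/FB model, and the `Load` tuple are illustrative assumptions, not an actual design:

```python
from collections import namedtuple

Load = namedtuple("Load", ["addr"])  # hypothetical stand-in for a load instruction


def handle_fill_stalled(l1_cache, fb_entries, rq_waiting, addr, fill_data):
    """Option (1): stall picks so the returning data is written FB -> L1 first;
    the woken load then replays and gets its data from the L1 cache."""
    fb_entries[addr] = fill_data            # capture returning data in the FB
    l1_cache[addr] = fb_entries.pop(addr)   # FB -> L1 fill while picks are stalled
    woken = [ld for ld in rq_waiting if ld.addr == addr]
    return [(ld, l1_cache[ld.addr]) for ld in woken]  # data comes from the L1


def handle_fill_forwarding(l1_cache, fb_entries, rq_waiting, addr, fill_data):
    """Option (2): capture the data in the FB and forward it directly to the
    woken load; the L1 cache is not written at this point."""
    fb_entries[addr] = fill_data            # data captured into the FB only
    woken = [ld for ld in rq_waiting if ld.addr == addr]
    return [(ld, fb_entries[ld.addr]) for ld in woken]  # forwarded from the FB


def drain_fb_to_l1(l1_cache, fb_entries, addr):
    """Later, when the L1 cache is idle, transfer the FB data into the L1."""
    l1_cache[addr] = fb_entries.pop(addr)
```

The key difference is visible in the state after the fill returns: under option (1) the data sits in the L1 cache before the load completes, while under option (2) the load completes out of the FB and the L1 write is deferred to `drain_fb_to_l1`.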
Under option (2) above, subsequent load instructions that miss in the L1 cache (e.g., because the FB data has not yet been transferred to the L1 cache), but hit in the FB, can forward the associated data from the FB. Hence, option (2) above may be referred to as “Fill Buffer forwarding” or “FB forwarding.”
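Under the same simplified model, a subsequent load that misses in the L1 cache but finds returned data in an FB entry can be serviced out of the FB. Again, the names here are hypothetical simplifications, not an actual design:

```python
from collections import namedtuple

Load = namedtuple("Load", ["addr"])  # hypothetical stand-in for a load instruction


def access_with_fb_forwarding(l1_cache, fb_entries, load):
    """Look up the L1 first; on an L1 miss, check the FB for returned data."""
    if load.addr in l1_cache:
        return ("l1_hit", l1_cache[load.addr])
    data = fb_entries.get(load.addr)
    if data is not None:              # data returned from L2/bus, not yet in L1
        return ("fb_hit", data)       # "FB forwarding"
    return ("miss", None)             # would allocate a new miss request
```

Note the middle case: the address misses in the L1 dictionary but its FB entry already holds fill data, so the load forwards from the FB instead of waiting for the deferred FB-to-L1 transfer.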