A CPU executes various kinds of instructions. One of the most common is a memory load (LD) instruction. The operations associated with an LD instruction are executed in a load-store unit (LSU) of the CPU that interfaces directly with a level 1 data (L1D) cache. Reducing the latency of LD instructions is critical for achieving high-performance CPU execution. The latency of an LD instruction in most CPUs ranges from 3 to 5 cycles. Typically, such multi-cycle latency involves various complex operations, including an address lookup in a translation lookaside buffer (TLB), a tag index lookup in the L1D cache, a comparison of a tag physical address, a data read of the L1D cache, and an alignment update of the data value that has been read from the L1D cache.
A CPU may execute an LD instruction that drives, or causes, an address generation unit (AGU) to generate an address for an immediately subsequent LD instruction. That is, the address of the subsequent LD instruction (referred to herein as a consumer LD instruction) is dependent on the previous memory load operation (referred to herein as a producer LD instruction). For example, consider the following two LD instructions: LDR r0, [r1] and LDR r2, [r0]. In this example, the second LD instruction is immediately subsequent to the first LD instruction. Although the two instructions appear as two separate operations, in this case the first (producer) LD instruction performs a first LD operation and generates (produces) the memory address for the second (consumer) LD operation.
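As an illustration only, the producer/consumer pair above corresponds to a double pointer dereference in a high-level language. The following C sketch uses a hypothetical function name, load_dependent, and assumes a target on which a pointer fits in one register; the exact instructions a compiler emits may differ from the LDR pair shown in the comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper illustrating a load-to-load dependency.
 * The first load reads a pointer from memory (producer); the second
 * load uses that pointer as its address (consumer), so it cannot
 * begin address generation until the first load returns its data. */
int32_t load_dependent(int32_t *const *slot)
{
    assert(slot != NULL);
    int32_t *addr = *slot;  /* producer, comparable to: LDR r0, [r1] */
    return *addr;           /* consumer, comparable to: LDR r2, [r0] */
}
```

Linked-list traversal and other pointer-chasing code produce long chains of this pattern, where every load waits on the one before it.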
If an LD instruction drives the address generation for an immediately subsequent dependent LD instruction, the latencies of the two LD instructions add sequentially to give the total latency of the pair. Thus, the latency of dependent memory load operations is critical to the performance of a CPU.
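The additive effect can be made concrete with a simplified model. The sketch below assumes that every load in a dependent chain has the same latency and that no part of one load overlaps the next; the function name chain_latency_cycles and its parameters are hypothetical and are introduced only for illustration.

```c
#include <assert.h>

/* Simplified model (an assumption, not a description of any specific CPU):
 * each of n_loads dependent loads takes per_load_cycles cycles, and a
 * consumer load starts only after its producer load has completed. */
unsigned chain_latency_cycles(unsigned n_loads, unsigned per_load_cycles)
{
    assert(per_load_cycles > 0);
    return n_loads * per_load_cycles;
}
```

Under this model, a producer/consumer pair with a per-load latency of 3 to 5 cycles takes chain_latency_cycles(2, 3) = 6 to chain_latency_cycles(2, 5) = 10 cycles in total.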