Processors (for example, central processing units (CPUs) or cores) execute various types of instructions. Two typical types are the memory load (LD) instruction and the arithmetic or mathematical instruction (e.g., an addition (ADD) instruction). To achieve high-performance processor execution, it is often desirable to keep the latency of these instructions low.
Load instructions or operations are generally executed in a load/store unit (LSU) that interfaces directly with a level 1 data (L1D)-cache, whereas mathematical operations (e.g., ADD) are often executed in an arithmetic logic unit (ALU) or other mathematical execution unit (e.g., a floating-point unit (FPU)).
The latency of a load instruction in most processors typically varies between 3 and 5 cycles. Typically, such multi-cycle latency includes various complex operations, including, for example, translation lookaside buffer (TLB) address lookup, L1D-cache tag index lookup, tag physical address compare, L1D-cache data read, and alignment update of the data value. The alignment update is often needed because data is read out of the data-cache aligned to a certain byte boundary (e.g., a particular word-size), whereas the actual requested memory address may not fall on that pre-defined byte boundary (e.g., it may fall half-way through a word). Therefore, the data read out of the cache may need to be shifted in some fashion to achieve the proper alignment to satisfy the load instruction. Other operations may also be performed on the data during this alignment phase, including sign extension and big-endian/little-endian manipulation.
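The alignment phase described above can be illustrated with a minimal software sketch. It is not a description of any particular hardware implementation; the function name `align_load`, its parameters, and the 8-byte word used in the usage note are assumptions chosen for illustration only. The sketch selects the requested bytes from an aligned word, applies the byte order, and performs sign or zero extension.

```python
def align_load(aligned_word: bytes, offset: int, size: int,
               signed: bool = False, big_endian: bool = False) -> int:
    """Model the alignment update applied to data read from the data-cache.

    aligned_word: bytes read from the cache at an aligned boundary (hypothetical width).
    offset:       byte offset of the requested address within that word.
    size:         number of bytes requested by the load.
    """
    if offset < 0 or size <= 0 or offset + size > len(aligned_word):
        raise ValueError("requested bytes fall outside the aligned word")

    # Shift/select: keep only the bytes the load actually asked for.
    raw = aligned_word[offset:offset + size]

    # Endianness manipulation: interpret the bytes in the requested order.
    byteorder = "big" if big_endian else "little"

    # Sign extension (signed=True) or zero extension (signed=False).
    return int.from_bytes(raw, byteorder=byteorder, signed=signed)
```

For example, with the hypothetical word `bytes([0x11, 0x22, 0x33, 0x44, 0x80, 0xFF, 0x77, 0x88])`, a 2-byte load at offset 4 (half-way through the word) selects the bytes `0x80, 0xFF`; read as a signed little-endian value this is -128, and read as an unsigned big-endian value it is 0x80FF.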
Likewise, a mathematical instruction or operation may have its own latency from start to finish. For example, ADD instructions typically have single-cycle latency to execute the addition.
In some architectures, it is common for a load instruction to update a register value that is then used as a source for a subsequent ADD instruction. In other words, the processor may execute an arithmetic operation that uses the result of a memory read operation as a source operand. Typically, the combined latency of the load instruction and the mathematical instruction is the sum of the latencies of the individual instructions.
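The additive latency described above can be shown with a small worked example. The specific cycle counts below are assumptions drawn from the typical ranges mentioned earlier (a 3 to 5 cycle load and a single-cycle ADD), and the helper name `dependent_chain_latency` is hypothetical.

```python
def dependent_chain_latency(latencies_in_cycles):
    """Total latency of instructions that each wait on the previous result."""
    return sum(latencies_in_cycles)


# Assumed values for illustration: a 4-cycle load feeding a 1-cycle ADD.
LOAD_LATENCY_CYCLES = 4
ADD_LATENCY_CYCLES = 1

load_then_add = dependent_chain_latency([LOAD_LATENCY_CYCLES, ADD_LATENCY_CYCLES])

# Range of totals across the typical 3 to 5 cycle load latencies.
totals_by_load_latency = {
    load: dependent_chain_latency([load, ADD_LATENCY_CYCLES]) for load in (3, 4, 5)
}
```

Under these assumptions the load-then-ADD sequence takes 5 cycles, and across the 3 to 5 cycle load range the total falls between 4 and 6 cycles.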