Reduced instruction set computer (RISC) processors characteristically employ a load/store architecture. That is, they include a load instruction that loads an operand from memory into a register of the processor and a store instruction that stores an operand from a register of the processor into memory. In this paradigm, these are the only instructions that access memory. The other instructions, which perform arithmetic/logical operations, receive their operands from the registers and write their results to the registers. That is, the non-load/store instructions are not allowed to specify an operand in memory, which enables most of them to be executed in a single clock cycle, in contrast to a load instruction, which takes multiple clock cycles to access memory (i.e., cache or system memory). Thus, a common sequence of instructions might include a load instruction that fetches an operand from memory into a first register, followed by an arithmetic/logical instruction that performs an arithmetic/logical operation (e.g., add, subtract, increment, multiply, shift/rotate, Boolean AND, OR, NOT) on the operand in the first register and writes the result to a second register, followed by a store instruction that writes the result in the second register to memory. The advantages of the load/store architecture paradigm are well known.
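By way of illustration only, the load, arithmetic/logical, store sequence described above may be sketched as a toy software model. The register count, memory size, addresses, and function names below are assumptions chosen for the example and do not correspond to any particular instruction set.

```python
# Toy model of a load/store machine (illustrative assumptions throughout).
# Only load() and store() touch memory; add() works on registers alone.

memory = [0] * 16      # hypothetical 16-word memory
registers = [0] * 8    # hypothetical 8-entry register file


def load(rd, addr):
    """Load instruction: copy the operand at memory[addr] into register rd."""
    registers[rd] = memory[addr]


def store(rs, addr):
    """Store instruction: copy the value in register rs to memory[addr]."""
    memory[addr] = registers[rs]


def add(rd, rs1, rs2):
    """Arithmetic instruction: operands and result are registers only."""
    registers[rd] = registers[rs1] + registers[rs2]


def run_example():
    memory[0] = 5          # operand resident in memory
    registers[3] = 7       # second operand already in a register
    load(1, 0)             # first register <- memory
    add(2, 1, 3)           # second register <- first register + r3
    store(2, 4)            # memory <- second register
    return memory[4]
```

In this sketch the add function has no way to name a memory address, which mirrors the restriction that non-load/store instructions may not specify an operand in memory.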
A natural outgrowth of the load/store architecture is that many processors include distinct load/store units that are separate from execution units that perform the arithmetic/logical operations. That is, a load unit performs only loads of data from memory into a register; a store unit performs only stores of data from a register to memory; and the arithmetic/logical execution units perform arithmetic/logical operations on operands from source registers and write the results to a destination register. Thus, using the example instruction sequence above, the load unit executes the load instruction to fetch the operand from memory into the first register, an arithmetic/logical unit executes the arithmetic/logical instruction to perform the arithmetic/logical operation on the operand in the first register (perhaps using a second operand in another register) and writes the result to the second register, and the store unit executes the store instruction that writes the result in the second register to memory.
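The division of labor among the units may likewise be sketched as a simple routing step, in which each instruction is handed to the unit that handles its class. The unit names, the tuple encoding of instructions, and the route function are hypothetical and serve only to illustrate the separation described above.

```python
# Illustrative routing of instructions to separate units (names are assumptions).
from collections import defaultdict

# Loads and stores go to their dedicated units; everything else goes to
# an arithmetic/logical unit.
UNIT_FOR_OPCODE = {"load": "load_unit", "store": "store_unit"}


def route(program):
    """Group instructions by the unit that would execute them."""
    queues = defaultdict(list)
    for instr in program:
        opcode = instr[0]
        unit = UNIT_FOR_OPCODE.get(opcode, "alu")
        queues[unit].append(instr)
    return dict(queues)


# The example sequence: load an operand, operate on it, store the result.
program = [
    ("load", "r1", "0x10"),
    ("add", "r2", "r1", "r3"),
    ("store", "r2", "0x20"),
]
```

Calling route(program) on the example sequence places one instruction in each of the three queues, corresponding to the load unit, the arithmetic/logical unit, and the store unit.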
An advantage of having distinct load/store units and arithmetic/logical units is that each unit may be simpler and faster. However, a disadvantage is that valuable time is consumed in transferring results between the various units through the registers. This problem is partly addressed by forwarding buses that forward a result from one execution unit directly to another execution unit without going through the registers. However, the forwarding itself still consumes time, i.e., introduces delay. The amount of time consumed is predominantly a function of the distance the signals must travel on the forwarding buses between the different execution units and of the resistance-capacitance (RC) time constants associated with the signal traces. The time delay associated with result forwarding may amount to one or more clock cycles, depending upon the layout of the execution units and the process technology of a given design.
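The effect of the forwarding delay on a chain of dependent instructions may be approximated with a simple cycle count. The following is a rough sketch under stated assumptions: each unit has a fixed latency, each result crosses one forwarding bus to reach the next unit, and every forwarding hop costs the same number of cycles. The latency figures in the comments are hypothetical and are not taken from any actual design.

```python
# Rough cycle model for a chain of dependent instructions, each executed
# in a different unit (all numbers used with it are illustrative assumptions).

def chain_latency(unit_latencies, forwarding_delay):
    """Total cycles for a dependent chain.

    unit_latencies   -- cycles spent inside each unit, in program order
    forwarding_delay -- cycles to forward a result from one unit to the next
    """
    hops = max(len(unit_latencies) - 1, 0)   # one hop between adjacent units
    return sum(unit_latencies) + hops * forwarding_delay


# Hypothetical example: load takes 3 cycles, the arithmetic/logical
# operation 1 cycle, and the store 1 cycle.
example_latencies = [3, 1, 1]
no_delay = chain_latency(example_latencies, 0)
one_cycle_delay = chain_latency(example_latencies, 1)
```

Under these assumed figures the chain takes 5 cycles with no forwarding delay and 7 cycles with a one-cycle delay per hop, so the two forwarding hops account for the entire difference.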