In conventional reduced instruction set computer (RISC) systems whose cycle times have been driven down to the point where address generation consumes a significant share of the available time, the fastest memory access path for a machine level programming instruction that requires data from a particular memory address takes three instruction cycles. The first instruction cycle is spent generating the memory address. At the start of the second cycle, the memory address is provided to the memory, so that the data is accessed and made available at the start of the third instruction cycle.
A full instruction cycle may be needed in order to generate that memory address. For example, if the instruction identifies the memory address as a displacement from a known address, a full instruction cycle is required in order to add the displacement to the known address. Once the memory address is known, the memory can be accessed and the instruction execution completed.
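By way of illustration only, the timing described above can be sketched in Python. The names `effective_address`, `load_timeline` and `LoadTimeline` are hypothetical and are not taken from any particular machine. The sketch assumes a memory with single cycle access and numbers cycles from the cycle in which the instruction issues; the case in which no addition is needed is an assumption included only for contrast.

```python
from dataclasses import dataclass


@dataclass
class LoadTimeline:
    address_ready_cycle: int  # cycle at whose start the memory address is known
    data_ready_cycle: int     # cycle at whose start the data is available


def effective_address(known_address: int, displacement: int) -> int:
    """Form the memory address as a displacement from a known address."""
    return known_address + displacement


def load_timeline(issue_cycle: int, needs_address_add: bool) -> LoadTimeline:
    """Model when the address and the data become available for one load.

    Assumption: the addition takes one full instruction cycle, and the
    memory (e.g. a data cache) returns data one cycle after it is given
    the address.
    """
    address_generation_cycles = 1 if needs_address_add else 0
    address_ready = issue_cycle + address_generation_cycles
    data_ready = address_ready + 1
    return LoadTimeline(address_ready, data_ready)


if __name__ == "__main__":
    print(hex(effective_address(0x1000, 0x24)))
    print(load_timeline(issue_cycle=1, needs_address_add=True))
    print(load_timeline(issue_cycle=1, needs_address_add=False))
```

Under these assumptions, a load issued in cycle 1 that must add a displacement has its address at the start of cycle 2 and its data at the start of cycle 3, which corresponds to the three-cycle path described above. The full cycle spent on the addition is the portion of the path attributable to address generation.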
Although high speed memories, e.g. high speed data caches, are conventional and can be accessed quickly, the delay caused by generating the memory address may force the central processing unit (CPU) to stall the instruction. A full instruction cycle must then pass before the CPU may resume processing the instruction.
This delay, or stall, becomes more pronounced when the system implements Very Long Instruction Word (VLIW) technology. In VLIW, a compiler concatenates several (e.g. 8 to 16) instructions, or parcels, into one very long instruction word. If that one very long word is stalled, then all of its parcels are stalled. "Some Design Ideas for a VLIW Architecture for Sequential-Natured Software" by Kemal Ebcioglu, published in Proceedings of IFIP WG 10.3 Working Conference on Parallel Processing (M. Cosnard et al., eds.), North Holland, 1988, provides pertinent VLIW background and is formally incorporated herein by reference.
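The way a single stall is multiplied in a VLIW system can be illustrated with a minimal Python sketch. The function name `delayed_parcel_slots` is hypothetical, and the model is a deliberate simplification: it assumes that every parcel of the stalled word is delayed by the same number of cycles and it ignores any other source of delay.

```python
def delayed_parcel_slots(parcels_per_word: int, stall_cycles: int = 1) -> int:
    """Count the parcel-cycles of work delayed when one very long word stalls.

    Assumption: a stall of the word delays each of its parcels by
    `stall_cycles` cycles, so the delayed work scales with the word width.
    """
    if parcels_per_word < 1:
        raise ValueError("a word must contain at least one parcel")
    if stall_cycles < 0:
        raise ValueError("stall_cycles must not be negative")
    return parcels_per_word * stall_cycles


if __name__ == "__main__":
    # A single instruction compared with words of 8 and 16 parcels.
    for width in (1, 8, 16):
        print(width, delayed_parcel_slots(width))
```

In this simplified model, a one-cycle stall that delays a single instruction in a conventional machine delays 8 to 16 parcels when the same stall holds up a very long instruction word of that width.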
Designing computer systems of the Von Neumann type with high levels of performance involves considerations such as achieving the fastest possible cycle time for the central processing unit (CPU); providing instruction and data cache (I-cache and D-cache) subsystems which can be accessed with cycle times similar to that of the CPU ("single cycle access"); and having adequate bandwidth between the CPU and the caches, i.e. a sufficient number of bytes which may be transmitted with each clock cycle. Any one of these considerations, taken individually, may be sufficient to throttle the flow of data and instructions through the system.
As the cycle times for CPUs implemented in the fastest VLSI (Very Large Scale Integrated circuit) technologies approach the theoretical limits dictated by the speed of light, the number of operations which must take place along the critical path between the CPU and the cache must be kept to a minimum.