In an architecture such as the manifold array (ManArray) processor architecture, very long instruction words (VLIWs) are created from multiple short instruction words (SIWs) and are stored in VLIW memory (VIM). A mechanism suitable for accessing these VLIWs, formed from SIWs 1-n, is depicted in FIG. 1A. First, a special kind of SIW, called an "execute-VLIW" (XV) instruction, is fetched from the SIW memory (SIM 10) on an SIW bus 23 and stored in instruction register (IR1) 12. When an XV instruction is encountered in the program, the VLIW indirectly addressed by the XV instruction is fetched from VIM 14 on a VLIW bus 29 and stored in VLIW instruction register (VIR) 16 to be executed in place of the XV instruction by sending the VLIW from VIR 31 to the instruction decode-and-execute units.
Although this mechanism appears simple in concept, implementing it in a pipelined processor with a short clock period is not a trivial matter. In a pipelined processor, the execution of an instruction is broken up into a sequence of cycles, also called phases or stages, each of which can be overlapped with the cycles of another instruction's execution sequence in order to improve performance. For example, consider a reduced instruction set computer (RISC) type of processor that uses three basic pipeline cycles, namely, an instruction fetch cycle, a decode cycle, and an execute cycle which includes a write back to the register file. In this three-stage pipelined processor, the execute cycle of one instruction may be overlapped with the decode cycle of the next instruction and the fetch cycle of the instruction following the one in decode. To maintain short cycle times, i.e., high clock rates, the logic operations done in each cycle must be minimized and any required memory accesses kept as short as possible. In addition, pipelined operation requires the same timing for each cycle, with the longest timing path among the pipeline cycles setting the cycle time for the processor. The implication of the two serial memory accesses required for the aforementioned indirect VLIW operation of FIG. 1A is that a single cycle including both memory accesses would require a lengthy cycle time not conducive to a high clock rate machine. As suggested by analysis of FIG. 1A, wherein the VIM address Offset 25 is contained within the XV instruction, the VIM access cannot begin until the SIM access has been completed. Only at that point can the VIM address generation unit 18 create the VIM address 27, by adding a stored base address to the XV VIM Offset value, and thereby select the desired VLIW from VIM 14.
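The serial dependency between the two memory accesses can be illustrated with a minimal sketch. The class name, function name, memory contents, and base address below are hypothetical and are not taken from FIG. 1A; the sketch only models the ordering described above, in which the VIM address cannot be formed until the XV instruction has been read from SIM.

```python
from dataclasses import dataclass

# All names, sizes, and contents here are hypothetical illustrations.


@dataclass(frozen=True)
class XV:
    """An execute-VLIW short instruction word carrying a VIM offset."""
    vim_offset: int


def fetch_vliw(sim, vim, pc, vim_base):
    """Return (instructions, accesses) for the SIW at address pc.

    `accesses` records, in order, which memories had to be read.
    """
    accesses = ["SIM"]
    ir1 = sim[pc]                            # first access: SIW into IR1
    if not isinstance(ir1, XV):
        return (ir1,), accesses              # ordinary SIW: no VIM access
    vim_address = vim_base + ir1.vim_offset  # needs the offset just fetched
    accesses.append("VIM")
    vir = vim[vim_address]                   # second access: VLIW into VIR
    return vir, accesses


# Toy program: an ordinary SIW, an XV, and another ordinary SIW.
sim = ["add", XV(vim_offset=2), "sub"]
vim = {6: ("load", "mul", "store")}

vliw, accesses = fetch_vliw(sim, vim, pc=1, vim_base=4)
```

In this toy model the XV at address 1 causes a SIM read followed by a VIM read at address 4 + 2 = 6, and the second read cannot be issued before the first has returned the offset.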
This constraint means that if the number of stages in a typical three-stage (fetch, decode, execute) instruction pipeline is to be maintained, both accesses must be completed within a single clock cycle (i.e., the fetch cycle). However, due to the inherent delay associated with random memory accesses, even if the fastest semiconductor technologies available today are used, imposing this requirement on an actual implementation would restrict the maximum clock speed, and hence the maximum performance, that could be attained by the architecture.
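The effect on cycle time can be sketched as simple arithmetic. The delay values below are placeholders in arbitrary units, chosen only for illustration and not measured from any implementation; the delay of the VIM address addition is ignored. The sketch assumes only the rule stated above, that the slowest stage sets the clock period.

```python
def cycle_time(stage_delays):
    """The clock period must cover the slowest pipeline stage."""
    return max(stage_delays.values())


# Placeholder delays in arbitrary units (assumptions, not real data).
SIM_ACCESS, VIM_ACCESS, DECODE, EXECUTE = 3, 3, 4, 4

# Fetch cycle containing only the SIM access.
sim_only = {"fetch": SIM_ACCESS, "decode": DECODE, "execute": EXECUTE}

# Fetch cycle containing the SIM access followed by the VIM access.
both_in_fetch = {
    "fetch": SIM_ACCESS + VIM_ACCESS,
    "decode": DECODE,
    "execute": EXECUTE,
}

period_sim_only = cycle_time(sim_only)
period_both = cycle_time(both_in_fetch)
```

With these placeholder values, the fetch stage becomes the longest path once it holds both accesses, so the period of every stage grows from 4 to 6 units.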
On the other hand, if an additional pipeline stage were permanently added such that the memory accesses are divided across two pipeline fetch stages (F1 and F2), the result would be the even more undesirable effect of increasing the number of cycles it takes to execute a branch.
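The added branch cost can be illustrated with a simplified model. It assumes, hypothetically, that a taken branch is resolved in the execute stage and that there is no branch prediction or delay slot, so every instruction occupying an earlier stage at that moment is discarded. The function name and stage labels are illustrative only.

```python
def taken_branch_penalty(stages, resolve_stage):
    """Cycles lost on a taken branch in this simplified model.

    Each stage ahead of the resolving stage holds one instruction that
    was fetched after the branch and must be discarded.
    """
    return stages.index(resolve_stage)


three_stage = ["fetch", "decode", "execute"]
four_stage = ["F1", "F2", "decode", "execute"]

penalty_3 = taken_branch_penalty(three_stage, "execute")
penalty_4 = taken_branch_penalty(four_stage, "execute")
```

Under these assumptions the penalty rises from two cycles to three when the fetch is split into F1 and F2. Resolving the branch in a different stage changes both numbers, but the permanently added fetch stage still costs one extra cycle on every taken branch.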