The present invention relates generally to improvements in array processing, and more particularly to methods and apparatus for dynamically expanding and compressing the instruction pipeline of a very long instruction word (VLIW) processor.
In an architecture, such as the manifold array (ManArray) processor architecture, very long instruction words (VLIWs) are created from multiple short instruction words (SIWs) and are stored in VLIW memory (VIM). A mechanism suitable for accessing these VLIWs, formed from SIWs 1-n, is depicted in FIG. 1A. First, a special kind of SIW, called an “execute-VLIW” (XV) instruction, is fetched from the SIW memory (SIM 10) on an SIW bus 23 and stored in instruction register (IR1) 12. When an XV instruction is encountered in the program, the VLIW indirectly addressed by the XV instruction is fetched from VIM 14 on a VLIW bus 29 and stored in VLIW instruction register (VIR) 16 to be executed in place of the XV instruction by sending the VLIW from VIR 31 to the instruction decode-and-execute units.
Although this mechanism appears simple in concept, implementing it in a pipelined processor with a short clock period is not a trivial matter. In a pipelined processor, instruction execution is broken up into a sequence of cycles, also called phases or stages, each of which can be overlapped with the cycles of another instruction execution sequence in order to improve performance. For example, consider a reduced instruction set computer (RISC) type of processor that uses three basic pipeline cycles, namely, an instruction fetch cycle, a decode cycle, and an execute cycle which includes a write back to the register file. In this 3-stage pipelined processor, the execute cycle of one instruction may be overlapped with the decode cycle of the next instruction and the fetch cycle of the instruction following the instruction in decode. To maintain short cycle times, i.e. high clock rates, the logic operations done in each cycle must be minimized and any required memory accesses kept as short as possible. In addition, pipelined operations require the same timing for each cycle, with the longest timing path among the pipeline cycles setting the cycle time for the processor. The implication of the two serial memory accesses required for the aforementioned indirect VLIW operation in FIG. 1A is that a single cycle that includes both memory accesses would require a lengthy cycle time not conducive to a high clock rate machine. As suggested by analysis of FIG. 1A, wherein the VIM address Offset 25 is contained within the XV instruction, the VIM access cannot begin until the SIM access has been completed, at which point the VIM address generation unit 18 can create the VIM address 27 to select the desired VLIW from VIM 14 by adding a stored base address to the XV VIM Offset value.
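The dependence between the two memory accesses described above may be illustrated by a minimal software sketch. The sketch is not part of FIG. 1A; the names `XV`, `fetch`, `sim`, `vim`, and `vim_base`, as well as the memory contents and addresses, are hypothetical and chosen only for illustration. It assumes that the VIM address is formed as a stored base address plus the offset carried in the XV instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XV:
    """Hypothetical execute-VLIW instruction; only the VIM offset is modeled."""
    vim_offset: int


def fetch(sim, vim, pc, vim_base):
    """Return what is sent on to decode for the instruction at address pc."""
    ir1 = sim[pc]                  # first access: SIW read from SIM into IR1
    if not isinstance(ir1, XV):
        return ir1                 # ordinary SIW: no VIM access is needed
    # The offset is known only once the SIM access has returned the XV,
    # so the second access cannot start any earlier than this point.
    vim_address = vim_base + ir1.vim_offset
    return vim[vim_address]        # second access: VLIW read from VIM into VIR


# Made-up program and VIM contents, for illustration only.
sim = ["add", XV(vim_offset=2), "sub"]
vim = {18: ("load", "add", "mul", "store", "nop")}
vliw = fetch(sim, vim, pc=1, vim_base=16)
```

In this sketch the VIM read uses a value that exists only after the SIM read has completed, which is the serial dependence that prevents the two accesses from being overlapped.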
This constraint means that if the number of stages in a typical three-stage (fetch, decode, execute) instruction pipeline is to be maintained, both accesses would be required to be completed within a single clock cycle (i.e. the fetch cycle). However, due to the inherent delay associated with random memory accesses, even if the fastest semiconductor technologies available today are used, carrying this requirement to the actual implementation would restrict the maximum speed, and hence, the maximum performance, that could be attained by the architecture.
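The effect of this constraint on cycle time may be seen in a short worked example. The delay values below are invented purely for illustration and are not taken from any implementation; the sketch assumes only that the cycle time is set by the slowest pipeline stage and ignores the delay of the VIM address addition.

```python
def cycle_time(stage_delays_ns):
    """The slowest stage sets the clock period for every stage."""
    return max(stage_delays_ns.values())


# Hypothetical delays in nanoseconds (assumed values, not measured ones).
SIM_NS, VIM_NS, DECODE_NS, EXECUTE_NS = 4.0, 4.0, 5.0, 5.0

# Both serial memory accesses placed in a single fetch stage.
single_fetch = {"fetch": SIM_NS + VIM_NS, "decode": DECODE_NS, "execute": EXECUTE_NS}

# The same accesses divided across two fetch stages, F1 and F2.
split_fetch = {"F1": SIM_NS, "F2": VIM_NS, "decode": DECODE_NS, "execute": EXECUTE_NS}

single_ns = cycle_time(single_fetch)  # fetch stage dominates: 8.0 ns
split_ns = cycle_time(split_fetch)    # decode/execute dominate: 5.0 ns
```

With these assumed numbers, placing both accesses in one fetch cycle stretches the clock period from 5 ns to 8 ns for every instruction, including those that never access VIM.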
On the other hand, if an additional pipeline stage were to be permanently added such that the memory accesses are divided across two pipeline fetch stages (F1 and F2), an even more undesirable effect of increasing the number of cycles it takes to execute a branch would result.
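The branch cost of a permanently added fetch stage may be sketched under a simple model. The model is an assumption and is not drawn from the text above: it supposes that a taken branch is resolved in the decode stage, that no branch prediction or delay slots are used, and that every instruction fetched between the branch and its resolution is discarded. The function name `branch_bubbles` and the stage labels are hypothetical.

```python
def branch_bubbles(stages, resolve_stage):
    """Count wasted fetch slots after a taken branch.

    A branch fetched in cycle 0 occupies stage i in cycle i, so the target
    fetch starts i cycles later than the next sequential fetch would have.
    """
    return stages.index(resolve_stage)


fixed_three_stage = ["F", "D", "E"]
fixed_four_stage = ["F1", "F2", "D", "E"]

bubbles_three = branch_bubbles(fixed_three_stage, "D")  # 1 wasted slot
bubbles_four = branch_bubbles(fixed_four_stage, "D")    # 2 wasted slots
```

Under these assumptions, each taken branch costs one additional cycle in the four-stage pipeline, whether or not the program ever fetches a VLIW.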
The present invention is directed to a dynamically reconfigurable pipeline, and methods of its use, which avoid both of the above-described types of “delayed” and multi-cycle branch problems. Thus, this dynamically reconfigurable pipeline, as discussed further below, is highly advantageous.
A unique ManArray processor pipeline design in accordance with the present invention advantageously solves the indirect VLIW memory access problem without increasing branch latency by providing a dynamically reconfigurable instruction pipeline for SIWs requiring a VLIW to be fetched. By introducing an additional cycle in the pipeline only when a VLIW fetch is required, the present invention solves the VLIW memory access problem. The pipeline stays in an expanded state, in general, until a branch type or non-XV-VLIW type operation is detected returning the pipe to a compressed pipeline operation. By compressing the pipeline when a branch type operation is detected, the need for an additional cycle for the branch operation is avoided by the present invention. Consequently, the shorter compressed pipeline provides more efficient performance for branch intensive control code as compared to a fixed pipeline with an expanded number of stages.
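The expand and compress behavior summarized above may be sketched as a two-state machine. This is a simplified reading of the summary only, and the exact transition conditions of an actual design may differ; the names `PipeState`, `next_state`, and `trace`, and the instruction classes "xv", "branch", and "siw", are hypothetical labels introduced for illustration.

```python
from enum import Enum


class PipeState(Enum):
    COMPRESSED = "F-D-E"       # three stages, single fetch cycle
    EXPANDED = "F1-F2-D-E"     # extra fetch cycle for the VIM access


def next_state(op):
    """Pick the pipeline state for an instruction class (simplified)."""
    if op == "xv":
        return PipeState.EXPANDED      # a VLIW fetch is required
    return PipeState.COMPRESSED        # branch or other non-XV operation


def trace(ops):
    """Return the pipeline state selected for each instruction in order."""
    return [next_state(op) for op in ops]


# Made-up instruction stream: a run of XVs followed by a branch.
states = trace(["siw", "xv", "xv", "xv", "branch", "siw"])
```

In this sketch the pipeline is expanded for the run of XV instructions and has returned to the compressed state by the time the branch is handled, so the branch does not pay for the additional fetch cycle.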
In addition, the dynamically reconfigurable pipeline is scalable, allowing each processing element (PE) in an array of PEs to expand and compress the pipeline in synchronism while allowing independent indirect VLIW (iVLIW) operations in each PE. This is accomplished by having distributed pipelines in operation in parallel, one in each PE and one in the controller Sequence Processor (SP).
The present invention also allows the SIW memory and VLIW memory each to have a full cycle for memory access time. This approach enables an indirect VLIW processor to achieve a higher frequency of operation because it minimizes the logic operations and the number of memory accesses required per cycle. By using this approach, a more balanced pipeline design is obtained, resulting in a micro-architecture that is more suitable for manufacturing across a wide range of process technologies.
These and other advantages of the present invention will be apparent from the drawings and Detailed Description which follow.