The present invention relates generally to improved array processing using multi-cycle execution units in a single instruction multiple data stream (SIMD) very long instruction word (VLIW) array processor.
In an architecture, such as the manifold array (ManArray) processor, VLIWs are created from multiple short instruction words (SIWs), stored in a VLIW memory (VIM), and executed by an SIW execute VLIW (XV) instruction. The pipeline used in the processor is a dynamically reconfigured pipeline which supports a distributed VIM in each of the processing elements (PEs) in the array processor. See, for example, "Methods and Apparatus to Dynamically Reconfigure the Instruction Pipeline of An Indirect Very Long Instruction Word Scalable Processor" U.S. patent application Ser. No. 09/228,374 filed Jan. 12, 1999, now U.S. Pat. No. 6,203,328, and incorporated by reference herein in its entirety.
The execution phase of the pipeline is relatively simple, consisting of either single or dual execution cycles depending upon the instruction. This pipeline works well for relatively simple instruction types, but has certain limitations in its support of more complex instructions which cannot complete their execution within the two-cycle maximum limit specified by an initial ManArray implementation. A VLIW processor having variable execution periods can cause undesirable complexities for both implementation and programming. It thus became desirable to solve the problem of how to add more complex instruction types in a SIMD array indirect VLIW processor, such as the ManArray processor, to support the evolution of this processor to a further range of applications.
The present invention describes advantageous techniques for adding more complex instructions, and the multi-cycle execution units they require with execution times greater than two cycles, within a SIMD VLIW framework. Each PE in the array processor supports the technique, and a single XV instruction can initiate the execution of several multi-cycle instructions. In one aspect, the invention employs the initiation mechanism to also act as a resynchronization mechanism for reading the results of the multi-cycle execution. This multi-purpose mechanism operates with an SIW issue of the multi-cycle instruction, in the sequence processor (SP) alone, within a VLIW, and across all PEs individually or as an array of PEs. In addition, the multi-cycle instruction is an SIW which can be encapsulated within a VLIW, loaded indirectly with a load VLIW (LV) instruction, and then started by an XV instruction.
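As a rough illustration of the indirect VLIW flow described above, the following Python sketch models each PE as holding its own VIM. It is a minimal sketch under stated assumptions, not the ManArray implementation: the class `PE`, the methods `lv` and `xv`, the slot name "ALU", and the SIW mnemonic strings are hypothetical names chosen for illustration, and actual LV and XV encodings are not modelled.

```python
class PE:
    """Hypothetical processing element with its own local VIM."""

    def __init__(self):
        # VIM: maps a VIM address to a VLIW, held as {slot name: SIW text}.
        self.vim = {}

    def lv(self, address, siws):
        """Load VLIW: store the given SIWs at a VIM address."""
        self.vim[address] = dict(siws)

    def xv(self, address):
        """Execute VLIW: return the SIWs at the address, issued together."""
        return dict(self.vim[address])


# An array of four PEs, each loaded with the same two-slot VLIW.
# The DSU slot carries a (made-up) multi-cycle divide SIW.
pes = [PE() for _ in range(4)]
for pe in pes:
    pe.lv(0, {"ALU": "add R3, R1, R2", "DSU": "fdiv R4, R5, R6"})

# A single XV at VIM address 0 starts the stored VLIW in every PE.
issued = [pe.xv(0) for pe in pes]
```

In this model one XV step starts a multi-cycle SIW in the DSU slot of every PE, which mirrors the statement that a single XV instruction can initiate several multi-cycle instructions.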
The multi-cycle instruction, which by definition takes more than two cycles to complete, is allowed to execute within one of the existing execution unit modules, but independently of the SIW instructions in the other modules. The results of the multi-cycle instruction are stored in a separate storage register at completion of its operation. This approach differs from the normal single or dual execution cycle instructions, which write their result data to the compute register file (CRF) at completion of the execution cycle. Upon receipt of the next multi-cycle SIW in the SP or any PE, whether it is in a VLIW or is to be executed as an SIW, the contents of the multi-cycle instruction result register are transferred to the target register specified in that multi-cycle SIW. This approach allows complex execution units supporting different numbers of execution cycles to coexist within the same execution unit and within the same programming model. For example, a divide and square root unit, supporting multiple instruction types, is used in the SP and each PE in the ManArray processor with the following execution latencies for an exemplary implementation:
Dual 16-bit Integer Divide - 6-cycles
32-bit Integer Divide - 10-cycles
Single Precision Floating Point Divide - 8-cycles
Single Precision Floating Point Reciprocal - 8-cycles
Single Precision - 8-cycles
Single Precision Floating Point Reciprocal Square Root - 16-cycles
For implementation reasons, the divide and square root unit takes the indicated number of execution unit cycles to complete before another divide and square root type of instruction can be issued to the unit. In one aspect of the present invention, the programming model takes the execution latencies into account when scheduling the dispatch of new instructions. The divide and square root unit instructions are all advantageously implemented in a single execution module within a data select unit (DSU), as addressed further below, but the technique outlined is not limited to this design approach. More generally, in accordance with the present invention, a complex multi-cycle instruction can be instantiated within any of the VLIW execution unit slots.
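The result register handoff and the listed latencies can be modelled in a short Python sketch. This is a behavioural illustration under stated assumptions, not the hardware design: the class `MultiCycleUnit`, the operation keys, the register names, and the choice to raise an error when an instruction is issued while the unit is busy are all hypothetical. Only the cycle counts and the rule that the next multi-cycle SIW transfers the previous result to its own target register are taken from the text.

```python
# Latencies in execution-unit cycles, taken from the list above.
LATENCY = {
    "div32": 10,   # 32-bit Integer Divide
    "fdiv": 8,     # Single Precision Floating Point Divide
    "frecip": 8,   # Single Precision Floating Point Reciprocal
}

# Hypothetical functional models; hardware rounding is not modelled.
COMPUTE = {
    "div32": lambda a, b: a // b,
    "fdiv": lambda a, b: a / b,
    "frecip": lambda a, b: 1.0 / a,
}


class MultiCycleUnit:
    def __init__(self):
        self.crf = {}            # compute register file: name -> value
        self.result_reg = None   # separate multi-cycle result register
        self.pending = None      # value still being computed
        self.busy_until = 0      # first cycle at which the unit is free
        self.cycle = 0

    def tick(self, n=1):
        """Advance n cycles; latch the result once the latency has elapsed."""
        self.cycle += n
        if self.pending is not None and self.cycle >= self.busy_until:
            self.result_reg = self.pending
            self.pending = None

    def issue(self, op, target, a, b=None):
        """Issue a multi-cycle SIW (assumed to be an error while busy)."""
        if self.cycle < self.busy_until:
            raise RuntimeError("unit busy: schedule around the latency")
        # Resynchronization: the previous result is written to the
        # target register named by this newly issued SIW.
        if self.result_reg is not None:
            self.crf[target] = self.result_reg
        self.pending = COMPUTE[op](a, b)
        self.busy_until = self.cycle + LATENCY[op]


unit = MultiCycleUnit()
unit.issue("div32", "R1", 100, 7)    # starts 100 // 7; nothing to read yet
unit.tick(10)                        # wait out the 10-cycle latency
unit.issue("fdiv", "R2", 1.0, 4.0)   # starts 1.0 / 4.0, delivers 14 to R2
```

Note that in this model the quotient of the first divide lands in R2, the target of the second SIW, and R1 is never written, since the first issue had no earlier result to deliver.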
These and other features, aspects and advantages of the invention will be apparent to those skilled in the art from the following detailed description taken together with the accompanying drawings.