The present invention relates generally to improvements in digital processing and more particularly to methods and apparatus for supporting conditional execution in a very long instruction word (VLIW) based array processor with sub-word execution.
Conditional execution, also referred to as predicated execution, gives the programmer the ability to specify, for a non-branch type of instruction, whether it is to execute or not based upon a machine state generated previously. This data-dependent conditional execution capability minimizes the need for conditional branches. By avoiding the use of branches, which incur a branch delay penalty on pipelined processors, performance is improved. In addition, it is noted that many types of sequential control dependencies can be turned into parallel data dependencies. Consequently, it is desirable that a pipelined Single Instruction Multiple Data stream (SIMD) array processor support conditional execution in each processing element (PE) to provide a level of data-dependent parallelism unavailable on a SIMD machine that only supports conditional branching. With parallel conditional execution, the performance gain can be significant since multiple conditional branches can be avoided.
In creating the architecture of a parallel array indirect VLIW processor for a given range of operations, it is found that the format needed to specify the operations varies in requirements depending upon the type of operation. For example, the parallel array operations can be grouped into three types: control and branch operations, load and store operations, and arithmetic operations. Each of these types will have different encoding requirements for optimum implementation. Since the instruction format typically is of a fixed number of bits, it is difficult, without restricting functional capabilities for at least some of the operations, to define a mechanism supporting a single specification for conditional execution across all instructions in a processor. Given that it is desirable to support conditional execution, even if the degree of support must vary depending upon the instruction type, the problem is how to define a unified but variable-specification conditional execution mechanism based upon the instruction type.
For conditional branching or conditional execution to be more efficient, it is desirable that the conditional operation be based on complex conditions that are formed by a Boolean combination of relations such as [a greater than b OR c less than d]. This may be accomplished by sequentially using multiple single-test conditional branches that effectively achieve the desired result. The problem associated with using multiple single-test conditional branches is that there is a performance decreasing effect for each branch required due to the branch delay penalty. This performance decreasing effect can be reduced with non-branching complex conditional execution.
In machines with a SIMD architecture, it is desirable to generate independent conditional operations in the PEs as well as to transfer condition information between PEs to allow the gathering of conditional state information generated in the PEs. It is also desirable to provide conditional branching in the controller, or sequence processor (SP), of a SIMD array processor where the conditions are created in the array PEs. By allowing condition-state information to be moved between PEs, a condition-producing operation can take place in one PE and a conditional operation based upon the conditional result can take place in another PE. By allowing conditional information to be moved between the PEs and the SP, a conditional operation can take place in the SP based upon PE conditions. How to best add such capability into the architecture raises further issues.
In VLIW machines, a plurality of execution units exist that may execute in parallel, with each execution unit possibly producing condition information or state information for each sub-instruction of the multi-instruction VLIW. To make a data dependent conditional execution decision, it is necessary to reduce the total amount of machine state to the desired test condition. It is also desirable to have a mechanism to select condition results from one of the multiple execution units to control the execution of one or more of the other execution units. An example of this type of situation is a compare instruction followed by a conditionally dependent shift instruction where the compare is performed in a different execution unit than the shift. Consequently, the problems to be solved are how to reduce the amount of condition information to a specified test condition and how to provide a mechanism for interdependent conditional execution between the multiple execution units that operate in synchronism in a VLIW machine.
Sub-word execution refers to the multiple individual operations that simultaneously take place on pieces of data smaller than a word or double word within a single execution unit. The aggregate of the multiple sub-word operations is referred to as a packed data operation, where for example quad 16-bit operations or octal 8-bit operations occur in parallel on packed 64-bit data types. When performing sub-word execution in a machine that supports conditional execution, it is desirable to achieve a sub-word level of conditional execution granularity when executing the instruction. The question is how to support such a capability in the architecture.
The present invention advantageously addresses such problems, preferably utilizing a ManArray architecture, by providing a hierarchical conditional execution specification based upon instruction type, support in the controller Sequence Processor (SP) and PEs for complex conditions based upon present and previous condition state, a mechanism to distribute condition state information between the PEs and SP, a mechanism for interdependent conditional execution between the multiple execution units in a VLIW machine, and a mechanism for sub-word conditional execution.
In the ManArray architecture, as presently adapted, a three level hierarchical specification is used where one, two, or three bit conditional execution specifications are used in the instruction formats depending upon the instruction type and format encoding restrictions. The condition state to be operated upon, as specified by these bits, is a reduced set of state information separately produced from the normal side-effect state generated in parallel by executing instructions, be they packed data or VLIW operations. Conceptually, the normal side-effect state generated from an instruction execution is saved in the arithmetic scalar flags (ASFs), namely carry (C), overflow (V), sign (N), and zero (Z) flags. Some restrictions apply depending upon the data type. The separately produced conditional state is saved in the arithmetic condition flags (ACFs), namely F7-F0, where Fi corresponds to packed data element i. The ASFs can only be used for conditional branching while the ACFs are used in both conditional branching and for conditional execution. In addition, the ACFs contain state information that is set as a result of an instruction execution or set as a result of a Boolean combination of state information generated from a present compare instruction and previous instruction execution. These ACFs can be specified and tested for in the SP by conditional instructions thereby minimizing the use of conditional branches. In the simplest case, PE instructions may conditionally execute and SP instructions may conditionally execute or branch on the condition results of the immediately preceding instruction. If the immediately preceding instruction did not affect the flags, general conditional execution is based on the condition results of the last instruction that affected the ACFs or a Boolean combination of condition state information.
The ManArray, when constructed, programmed and operated in accordance with the present invention, uses the convention of the programmer specifying either how the ACFs are set by the instruction generating the condition or how to use the ACFs, rather than only specifying how to use the ACFs with an instruction operating on a condition. This convention produces a single True or False flag that contains a 1 or a 0 designated Fn per operation. For compare instructions, the programmer must specify which condition state, greater-than, equal, less-than, etc., to use in setting the ACFs. In addition, compare instructions operating in the arithmetic logic unit (ALU) can specify the setting of the flags based upon a Boolean combination of the present compare result state and past instruction ACF state. For arithmetic operations, in one embodiment of the ManArray architecture in accordance with the present invention, the ability to select how to update the ACF condition flags using one of the four ASF conditions C (carry flag), V (overflow flag), N (negative flag), or Z (zero flag) on an instruction by instruction basis is advantageously provided.
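By way of illustration only, the setting of the ACFs by a compare instruction, including the Boolean combination of the present compare result with the past ACF state, can be modeled in software as follows. This is a minimal sketch under stated assumptions and not the ManArray instruction encoding: the names compare, RELATIONS and COMBINE are hypothetical, the particular set of combining operations (set, and, or, xor) is assumed for the example, and the flags are held in a list in which index i stands for Fi.

```python
import operator

# Hypothetical relation and combine tables; the text names greater-than,
# equal and less-than, and the combining operations here are assumptions.
RELATIONS = {"gt": operator.gt, "eq": operator.eq, "lt": operator.lt}
COMBINE = {
    "set": lambda old, new: new,          # overwrite the flag with the new result
    "and": lambda old, new: old and new,  # past ACF state AND present compare
    "or": lambda old, new: old or new,    # past ACF state OR present compare
    "xor": lambda old, new: old != new,   # past ACF state XOR present compare
}


def compare(acf, xs, ys, relation, combine="set"):
    """Return the new flag list after an element-wise compare.

    acf[i] models flag Fi; xs[i] and ys[i] model packed data element i.
    """
    rel = RELATIONS[relation]
    comb = COMBINE[combine]
    return [int(comb(bool(f), rel(x, y))) for f, x, y in zip(acf, xs, ys)]


# Complex condition (a greater than b) OR (c less than d), formed per element
# with two compares and no branches.
a, b = [5, 1, 3, 0], [2, 4, 3, 1]
c, d = [1, 9, 0, 7], [2, 3, 1, 7]
acf = compare([0, 0, 0, 0], a, b, "gt")   # [1, 0, 0, 0]
acf = compare(acf, c, d, "lt", "or")      # [1, 0, 1, 0]
```

In this model the second compare both tests its own operands and folds in the flag state left by the first, so a single flag per data element carries the complex condition forward to any later conditional instruction.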
When executing VLIW operations, the programmer must select which of the arithmetic units is allowed to affect the single set of ACFs. The single set of flags can be used in VLIW execution to conditionally control the execution of each of the VLIW units. During each cycle, the ownership and setting of the condition flags are dynamically determined by the instruction in execution. Conditions that occur but are not selected to affect the ACFs or that affect the programmer's visible ASFs cause no effect and are not generally saved.
Another aspect of one embodiment of the ManArray instruction set is that instructions that execute conditionally do not affect the condition flags themselves. This feature gives the programmer the ability to execute C-style conditional expression operators of the form (a greater than b) ? z=x+y:r=q+s without worry that the first instruction after the comparison will alter the flags producing an undesired result. An instruction may either specify to conditionally execute based upon the ACFs or specify how to set the ACFs but not both.
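The conditional expression above can be illustrated with a small software model in which one compare sets a flag and two conditionally executed additions test it with opposite senses. The sketch assumes a single flag F0, as would suffice for non-packed operands; the class PE and its methods cmp_gt and add, together with the when parameter, are hypothetical names introduced for the example and are not ManArray mnemonics.

```python
class PE:
    """Toy processing element with one condition flag and named registers."""

    def __init__(self, **regs):
        self.f0 = 0            # models ACF F0
        self.regs = dict(regs)

    def cmp_gt(self, a, b):
        # A compare specifies how the flag is set; it does not test the flag.
        self.f0 = int(self.regs[a] > self.regs[b])

    def add(self, dst, s1, s2, when=None):
        # when=None: execute unconditionally.
        # when=1: execute only if F0 is true; when=0: only if F0 is false.
        if when is not None and self.f0 != when:
            return  # instruction is nullified, destination is left unchanged
        self.regs[dst] = self.regs[s1] + self.regs[s2]
        # The flag is deliberately left untouched, so the next conditional
        # instruction still sees the result of the compare.


# (a greater than b) ? z=x+y : r=q+s
pe = PE(a=7, b=3, x=1, y=2, q=10, s=20, z=0, r=0)
pe.cmp_gt("a", "b")
pe.add("z", "x", "y", when=1)   # executes, z becomes 3
pe.add("r", "q", "s", when=0)   # nullified, r stays 0
```

Because the first addition leaves F0 alone, the second addition tests the same compare result, and exactly one of the two assignments takes effect without any branch.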
It is desirable to have an efficient means to generate complex conditions in each PE that can be specified and tested for by conditional instructions. This has the effect of changing SP conditional branches into PE data-dependent execution operations. Having an effective means for parallel array conditional execution minimizes the need to have the PEs send condition signals back to the controller, which takes time and implementation expense, for the purposes of supporting conditional branching based on PE conditions. An implication of having parallel array conditional execution is that the approach chosen for providing PE condition feedback to the array controller can be simple in nature and less costly than providing condition signaling paths from each PE. By saving the condition flags in a programmer-accessible register space that can be copied or moved to a PE's register file, the flags can be easily communicated between PEs. In conjunction with a merged SP/PE as described more fully in U.S. application Ser. No. 09/169,072 filed Oct. 9, 1998, now U.S. Pat. No. 6,219,776, entitled Methods and Apparatus for Dynamically Merging an Array Controller with an Array Processor Element, flags saved in PE0 are easily transferred to the SP. Using a log N reduction method, where N is the number of PEs in the array, it is possible to exchange PE flag information between all PEs in log N steps. The transfer of condition information is consistent with the design of the existing ManArray network and does not require the addition of condition signaling paths between the PEs and the SP controller.
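One way to picture an exchange of flag information among all PEs in log N steps is the pairwise exchange pattern sketched here, in which at each step every PE swaps everything it has gathered so far with a partner whose index differs in one bit. This is an illustrative assumption only: the text does not specify the communication pattern used on the ManArray network, the function name exchange_flags is hypothetical, the logarithm is taken to base 2, and N is assumed to be a power of two.

```python
def exchange_flags(flags):
    """Gather every PE's flag word into every PE by pairwise exchanges.

    flags[i] is the flag word held by PE i. Returns (known, steps), where
    known[i] maps a PE index to the flag word that PE i has received, and
    steps is the number of exchange steps performed.
    """
    n = len(flags)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two PE count"
    known = [{i: flags[i]} for i in range(n)]
    steps = 0
    stride = 1
    while stride < n:
        # PE i and PE (i XOR stride) merge what each has gathered so far.
        known = [{**known[i], **known[i ^ stride]} for i in range(n)]
        stride <<= 1
        steps += 1
    return known, steps


# Four PEs, each holding a different flag pattern: two steps suffice.
known, steps = exchange_flags([0b0001, 0b0010, 0b0100, 0b1000])
```

After the final step each PE holds the flag words of all N PEs, so any one of them, for example PE0 in a merged SP/PE arrangement, could make the combined condition available to the controller.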
Because many applications need conditional sub-word execution, the three-bit form of conditional execution specifies, for specific instructions or specific groups of instructions, that the instruction is to operate only on those data elements of a packed data type whose corresponding ACF has the value required by the instruction-specified true or false test.
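Sub-word conditional execution can be illustrated with a quad 16-bit addition on packed 64-bit operands in which each data element i is updated only when flag Fi matches the specified test. The following is a sketch under stated assumptions rather than a description of any particular instruction: the helper names unpack4, pack4 and cond_add4 are hypothetical, element 0 is assumed to occupy the least significant 16 bits, and the addition is assumed to wrap modulo 2^16.

```python
MASK16 = 0xFFFF


def unpack4(word):
    """Split a packed 64-bit value into four 16-bit elements, element 0 lowest."""
    return [(word >> (16 * i)) & MASK16 for i in range(4)]


def pack4(elems):
    """Pack four 16-bit elements into one 64-bit value, element 0 lowest."""
    word = 0
    for i, e in enumerate(elems):
        word |= (e & MASK16) << (16 * i)
    return word


def cond_add4(dst, a, b, acf, test=1):
    """Quad 16-bit add performed only in elements whose flag equals test.

    acf[i] models flag Fi. Elements whose flag does not match keep the old
    destination value; matching elements receive a_i + b_i modulo 2**16.
    """
    out = []
    for i, (d, x, y) in enumerate(zip(unpack4(dst), unpack4(a), unpack4(b))):
        out.append((x + y) & MASK16 if acf[i] == test else d)
    return pack4(out)


dst = pack4([100, 200, 300, 400])
a = pack4([1, 2, 3, 0xFFFF])
b = pack4([10, 20, 30, 1])
acf = [1, 0, 1, 1]
on_true = unpack4(cond_add4(dst, a, b, acf, test=1))    # [11, 200, 33, 0]
on_false = unpack4(cond_add4(dst, a, b, acf, test=0))   # [100, 22, 300, 400]
```

The same flags thus steer a true-tested and a false-tested instruction to complementary sets of data elements, which gives if-then-else behavior at sub-word granularity.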