1. Field of the Invention
The invention relates generally to the field of processor chips and specifically to the field of single-instruction multiple-data (SIMD) processors. More particularly, the present invention relates to conditional and nested vector operations in a SIMD processor.
2. Description of the Background Art
In a scalar processor, conditional operations such as the if-then-else constructs of high-level languages may be implemented simply by changing the flow of program execution to skip over the instructions for which the tested condition is false. For example, consider the following code sequence:
Instruction 1: if (X>Y)
Instruction 2: A=B+C;
Instruction 3: else
Instruction 4: A=(B−C)/2;
In instruction 1, if the condition X greater than Y is true, then instruction 2 is executed, and a branch or jump instruction is used to skip over instruction 4. Alternatively, if the condition is false, instruction 2 is skipped and program execution continues with the else branch at instruction 3.
However, a vector in a vector processor contains multiple elements, and hence when two vectors are compared, the tested condition could be true for certain elements and false for others. Since all elements are operated on by the same SIMD instruction, the instruction execution flow of the processor cannot be modified as it is in a scalar processor. Furthermore, even if program flow could be changed, this would be costly in time, because each branch instruction typically takes three or more cycles to execute.
Intel's SSE and MMX handle this by using a compare-and-set-mask instruction. This instruction sets all bits of an element to ones if the condition corresponding to that element position is true, and to all zeros otherwise. This mask and its generated inverse must then be logically ANDed with a number, and the result can be used to implement conditional vector operations. For example, Intel SSE has four 32-bit elements in a 128-bit vector. Suppose we want to compare each of the four values in an XMM register to zero:
C Instruction 1: if (xmm[i] > 0)
C Instruction 2: xmm[i] = xmm[i] + 1;
C Instruction 3: else
C Instruction 4: xmm[i] = xmm[i] − 1;
Intel's assembly equivalent, shown below, uses SIMD compare and logical operations without any branching.
Assembly Instruction 1: movaps xmm3, [one]
Assembly Instruction 2: movaps xmm4, [minusone]
Assembly Instruction 3: movaps xmm0, [convert]
Assembly Instruction 4: movaps xmm1, xmm0
Assembly Instruction 5: cmpltps xmm0, [zero]
Assembly Instruction 6: andps xmm4, xmm0
Assembly Instruction 7: andnps xmm0, xmm3
Assembly Instruction 8: addps xmm1, xmm4
Assembly Instruction 9: addps xmm1, xmm0
The following explains the above assembly code.
Assembly Instruction 1: Loads four copies of 1 into xmm3.
Assembly Instruction 2: Loads four copies of −1 into xmm4.
Assembly Instruction 3: Loads four input values into xmm0.
Assembly Instruction 4: Since xmm0 is overwritten, it is saved into xmm1.
Assembly Instruction 5: Compares the four values of xmm0 with zero. Values greater than zero are changed to a mask of zeros. Values less than zero are changed to a mask of all ones.
Assembly Instruction 6: ANDs the mask with −1, keeping −1 only in elements whose mask is all ones.
Assembly Instruction 7: ANDs the inverted mask with +1, keeping +1 only in elements whose mask is all zeros.
Assembly Instruction 8: Adds −1 to the elements selected by instruction 6.
Assembly Instruction 9: Adds +1 to the elements selected by instruction 7.
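The compare-and-mask technique walked through above can be sketched in portable C. This is only an illustration under stated assumptions, not Intel's code: the four lanes are emulated with a loop rather than SSE instructions, the test follows the C listing (greater than zero), and the names LANES, cmp_gt_zero_mask, and cond_inc_dec are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4  /* assumed: four 32-bit elements, as in a 128-bit vector */

/* Hypothetical helper: per-lane compare against zero.
   Writes all ones where x[i] > 0 and all zeros otherwise. */
static void cmp_gt_zero_mask(const float x[LANES], uint32_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = (x[i] > 0.0f) ? 0xFFFFFFFFu : 0u;
}

/* Branch-free if/else per lane:
   x[i] = x[i] + 1 where x[i] > 0, otherwise x[i] = x[i] - 1. */
static void cond_inc_dec(float x[LANES])
{
    const float plus = 1.0f, minus = -1.0f;
    uint32_t mask[LANES], p, m;

    assert(x != NULL);
    cmp_gt_zero_mask(x, mask);

    /* Work on the raw bit patterns of +1.0f and -1.0f. */
    memcpy(&p, &plus, sizeof p);
    memcpy(&m, &minus, sizeof m);

    for (int i = 0; i < LANES; i++) {
        /* The mask keeps +1, the inverted mask keeps -1; exactly one survives. */
        uint32_t sel = (mask[i] & p) | (~mask[i] & m);
        float delta;
        memcpy(&delta, &sel, sizeof delta);
        x[i] += delta;
    }
}
```

Every lane executes the same sequence of operations; only the mask determines which constant reaches the final addition.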
The disadvantage of this approach is that it requires many vector instructions to implement, and it has no provision for handling nested vector if-then-else constructs. Several vector registers have to be used to save all the conditions, and it is therefore likely that there will not be enough vector registers, so that some values have to be loaded from memory. Overall, the overhead of the many instructions degrades performance significantly.
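To illustrate why nesting increases the number of masks that must be kept, the following is a minimal sketch in portable C of a two-level nested construct evaluated with masks only. The construct itself, the thresholds 0 and 10, and the names LANES and nested_select are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed vector width */

/* Hypothetical nested construct, evaluated per lane i without branching:
     if (x[i] > 0) { if (x[i] > 10) r[i] = 2; else r[i] = 1; }
     else r[i] = 0;                                               */
static void nested_select(const int32_t x[LANES], int32_t r[LANES])
{
    assert(x != NULL && r != NULL);
    for (int i = 0; i < LANES; i++) {
        /* One compare mask per nesting level must be held at the same time. */
        uint32_t outer = (x[i] > 0)  ? 0xFFFFFFFFu : 0u;
        uint32_t inner = (x[i] > 10) ? 0xFFFFFFFFu : 0u;

        /* Each leaf of the construct needs its own combined mask. */
        uint32_t m_then_then = outer & inner;
        uint32_t m_then_else = outer & ~inner;
        uint32_t m_else      = ~outer;

        r[i] = (int32_t)((m_then_then & 2u) |
                         (m_then_else & 1u) |
                         (m_else      & 0u));
    }
}
```

In this sketch, two levels of nesting already require two compare masks and three combined masks, each of which would occupy a vector register in a SIMD implementation.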
Spielman used a stack to store condition flags. Each processor element has an enable value and history values stored on a stack, and stack-handling and state-handling instructions are used to retrieve the status of the element enables in nested conditional constructs. Inversion of the condition is also required for the else portion. This is an improvement over Intel's approach, but it still requires many instructions and is not optimal. The present invention requires only one vector compare instruction for both the if and else parts of the conditional construct, and no stack or stack-handling instructions are needed. Furthermore, only a small number of bits is required in the instruction opcode to enable conditional vector instruction execution.
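For illustration only, the general idea of a per-element enable value with a history stack, as mentioned above, might be sketched as follows. This is a generic sketch and does not reproduce Spielman's actual design; the type EnableStack, the functions es_init, es_if, es_else, and es_endif, and the constants LANES and MAX_DEPTH are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 4      /* assumed number of processor elements */
#define MAX_DEPTH 8  /* assumed maximum nesting depth */

typedef struct {
    bool enable[LANES];              /* current per-element enable */
    bool history[MAX_DEPTH][LANES];  /* saved enables, one row per nesting level */
    int depth;
} EnableStack;

static void es_init(EnableStack *s)
{
    assert(s != NULL);
    s->depth = 0;
    for (int i = 0; i < LANES; i++)
        s->enable[i] = true;
}

/* Enter an "if": save the current enables, then narrow them by cond. */
static void es_if(EnableStack *s, const bool cond[LANES])
{
    assert(s->depth < MAX_DEPTH);
    for (int i = 0; i < LANES; i++) {
        s->history[s->depth][i] = s->enable[i];
        s->enable[i] = s->enable[i] && cond[i];
    }
    s->depth++;
}

/* Switch to "else": elements enabled in the parent but not in the "if" part. */
static void es_else(EnableStack *s)
{
    assert(s->depth > 0);
    for (int i = 0; i < LANES; i++)
        s->enable[i] = s->history[s->depth - 1][i] && !s->enable[i];
}

/* Leave the construct: restore the enables saved on entry. */
static void es_endif(EnableStack *s)
{
    assert(s->depth > 0);
    s->depth--;
    for (int i = 0; i < LANES; i++)
        s->enable[i] = s->history[s->depth][i];
}
```

Even in this simplified form, every if, else, and end of a construct costs a separate state-handling step in addition to the compare that produces the condition.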