1. Field of the Invention
The present invention relates generally to SIMD processing, and more particularly to SIMD operations utilizing register operations only.
2. Description of Related Art
Those of skill in the art are familiar with single instruction multiple data (SIMD) architectures. The instructions in the instruction sets used with these architectures apply the same operation to a plurality of operands.
For example, floating point registers FP0, FP1 (FIG. 1) are used to store source operands A0 to AN, and B0 to BN, respectively. For a particular function op, each source operand A_s, where s ranges from 0 to N, in register FP0 interacts with the identically positioned source operand B_s in the other register FP1 to produce a result that is stored in a corresponding location in result register R. For example, function op is performed on source operand A0 and source operand B0, and result A0opB0 is placed in the corresponding location in register R.
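The operand-by-operand behavior described above can be illustrated with a minimal sketch. It is a software model only, not a description of any particular hardware: the registers are modeled as Python lists, and the name simd_apply and the sample values are assumptions introduced for illustration.

```python
import operator


def simd_apply(op, fp0, fp1):
    """Apply op to each pair of identically positioned operands.

    fp0 and fp1 model source registers holding operands A0..AN and B0..BN.
    The returned list models result register R.
    """
    if len(fp0) != len(fp1):
        raise ValueError("both registers must hold the same number of operands")
    return [op(a, b) for a, b in zip(fp0, fp1)]


# Hypothetical register contents.
FP0 = [1, 2, 3, 4]
FP1 = [10, 20, 30, 40]

# One "instruction" performs the same operation on every operand pair.
R = simd_apply(operator.add, FP0, FP1)
print(R)  # [11, 22, 33, 44]
```

Each position of R depends only on the operands at the same position in FP0 and FP1, which is the property the remainder of this description relies on.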
One of the functions op that was used in the prior art was a compare function. Typical compare functions were “greater than,” “less than or equal to,” “equal,” “not equal,” “less than,” and “greater than or equal to.” FIG. 2A is an illustration of a compare operation for 16-bit source operands A0 to A3 and B0 to B3 in 64-bit floating point registers FP0 and FP1, respectively.
For any one of the compare functions in FIG. 2, operand A0 is compared with operand B0 and if the comparison is true, bit 3 in result register R is set to one, and if the comparison is false bit 3 is set to zero. Thus, the result of the comparison operation is available in bits zero to 3 of register R. Equivalent operations are defined for 32-bit source operands and 8-bit source operands.
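A minimal sketch of such a compare operation follows. The text above states only that the result for operands A0 and B0 is placed in bit 3; the sketch assumes that the remaining results follow in descending bit order (A1 in bit 2, and so on), and the name simd_compare and the sample values are hypothetical.

```python
import operator


def simd_compare(cmp, a_ops, b_ops):
    """Compare operand pairs and return the result bits as an integer.

    Assumption: with n operand pairs, the result for pair s is placed in
    bit (n - 1 - s), so that pair 0 of four pairs lands in bit 3.
    """
    n = len(a_ops)
    result = 0
    for s, (a, b) in enumerate(zip(a_ops, b_ops)):
        if cmp(a, b):
            result |= 1 << (n - 1 - s)
    return result


# Four hypothetical 16-bit operands per register.
A = [5, 1, 7, 3]
B = [2, 1, 9, 0]

r_gt = simd_compare(operator.gt, A, B)  # "greater than"
r_le = simd_compare(operator.le, A, B)  # "less than or equal to"
print(format(r_gt, "04b"), format(r_le, "04b"))  # 1001 0110
```

The same helper covers the 32-bit and 8-bit cases by passing two or eight operands per register, respectively.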
Another prior art function was the maximum function, which selected the maximum of two source operands and placed the result in the corresponding location in the result register. One way to implement the maximum function is given in Table 1.
TABLE 1
cmpgt32          FP0, FP1, MASK
store            FP1, [Address]
partial store    FP0, [Address], MASK
ldf              [Address], R
As illustrated in FIG. 3A, execution of compare instruction fcmpgt32 compares operand A0 in register FP0 with operand B0 in register FP1. If operand A0 is greater than operand B0, bit one in register MASK is set to 1 and otherwise to zero. Similarly, if operand A1 is greater than operand B1, bit zero in register MASK is set to 1 and otherwise to zero. For purposes of an example, assume that operand A0 is greater than operand B0 and operand A1 is not greater than operand B1. For this example, register MASK stores “10” in bits one and zero, respectively.
Instruction store (Table 1) stores the value in register FP1 at location [Address] (see FIG. 3B). Instruction partial store uses the values in register MASK to determine which operands in register FP0 to store at location [Address].
In this example, a bit that is one in register MASK indicates that the corresponding operand in register FP0 is the larger of the two operands compared. For each one in register MASK, the corresponding operand in register FP0 is stored in the corresponding position at location [Address], overwriting the operand from register FP1 that was stored there. In this example, operand A0 is stored as illustrated in FIG. 3C.
Instruction ldf loads the value at location [Address] into result register R. Thus, determining the maximum required a scratch memory location and three memory operations, which are undesirable.
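The four-instruction sequence of Table 1 can be modeled in software as follows. This is a sketch under stated assumptions, not the prior art implementation: memory is modeled as a Python dictionary, the address value 0x100 and the name max_via_memory are hypothetical, and the memory-operation counter is included only to make the cost visible.

```python
def max_via_memory(fp0, fp1):
    """Model the Table 1 sequence; return (R, number of memory operations)."""
    n = len(fp0)
    memory = {}
    address = 0x100  # hypothetical scratch location
    mem_ops = 0

    # Compare: bit (n - 1 - s) of MASK is one when operand s of FP0 is greater.
    mask = 0
    for s in range(n):
        if fp0[s] > fp1[s]:
            mask |= 1 << (n - 1 - s)

    # store FP1, [Address]
    memory[address] = list(fp1)
    mem_ops += 1

    # partial store FP0, [Address], MASK: only operands selected by MASK.
    for s in range(n):
        if mask & (1 << (n - 1 - s)):
            memory[address][s] = fp0[s]
    mem_ops += 1

    # ldf [Address], R
    r = list(memory[address])
    mem_ops += 1

    return r, mem_ops


# A0 > B0 and A1 <= B1, matching the example in which MASK holds "10".
R, ops = max_via_memory([9, 2], [4, 7])
print(R, ops)  # [9, 7] 3
```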
One approach to reducing the memory operations was to write the result of the compare function to a special graphics condition codes register gcc. To take advantage of register gcc, a new conditional move instruction cmove was defined that used register gcc. The instruction sequence in TABLE 2 obtains the same result as the instruction sequence in TABLE 1.
TABLE 2
cmpgt32          FP0, FP1
cmove32          FP0, FP1, R
In this example, instruction fcmpgt32 does the operand by operand comparison, as described above, and sets a corresponding bit of register gcc based upon the result of the comparison of each pair of operands. Instruction cmove32 uses register gcc as a mask. If a bit in register gcc is a one, instruction cmove32 moves the corresponding operand from register FP0 into the corresponding location in register R, and if the bit is a zero, moves the corresponding operand from register FP1 into the corresponding location in register R. See FIG. 4. Thus, the memory accesses and the scratch memory location required by the instruction sequence of Table 1 are replaced with register operations, with a corresponding enhancement in performance.
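The register-only sequence of Table 2 can be sketched in the same style. The function names cmpgt and cmove are stand-ins for the instructions of Table 2, and the bit ordering within register gcc is an assumption carried over from the MASK example (operand pair 0 in the highest-order bit).

```python
def cmpgt(fp0, fp1):
    """Return the value written to gcc: one bit per operand pair."""
    n = len(fp0)
    gcc = 0
    for s in range(n):
        if fp0[s] > fp1[s]:
            gcc |= 1 << (n - 1 - s)  # assumed bit ordering
    return gcc


def cmove(gcc, fp0, fp1):
    """Select each operand from FP0 where the gcc bit is one, else from FP1."""
    n = len(fp0)
    return [fp0[s] if gcc & (1 << (n - 1 - s)) else fp1[s] for s in range(n)]


FP0 = [9, 2]
FP1 = [4, 7]

gcc = cmpgt(FP0, FP1)      # no memory location is touched
R = cmove(gcc, FP0, FP1)
print(format(gcc, "02b"), R)  # 10 [9, 7]
```

The result matches the memory-based sequence, but the only state shared between the two instructions is the value of gcc.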
While the use of instruction cmove with register gcc enhanced performance, in some situations, a bottleneck developed. Consider the following computer code segment:
cmpxx X1, Y1
cmpxx X2, Y2
cmpxx X3, Y3
cmpxx X4, Y4
cmpxx X5, Y5
cmov X1, Y1
where xx is any of the comparisons described above, e.g., “greater than,” “less than or equal to,” “equal,” “not equal,” “less than,” and “greater than or equal to.”
For illustration purposes, assume instruction cmpxx has a five-cycle latency. Once instruction cmpxx X1, Y1 is started, none of the other compare instructions can start until instruction cmov completes. The other compare instructions are stalled waiting for register gcc to become available.
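The effect of the stall can be estimated with a simple cycle count. The sketch below is a rough model, not a description of any actual pipeline. It assumes that instruction cmov issues as soon as the first compare completes, that cmov takes one cycle (a hypothetical value; only the five-cycle compare latency comes from the text above), and that the remaining compares issue one per cycle once register gcc is free.

```python
def compare_start_cycles(num_compares, cmp_latency=5, cmov_latency=1):
    """Return the start cycle of each compare when all share register gcc.

    Assumptions: the first compare starts at cycle 0, cmov starts when the
    first compare completes, and later compares start one per cycle after
    cmov completes.
    """
    if num_compares < 1:
        return []
    cmov_start = cmp_latency
    gcc_free = cmov_start + cmov_latency
    starts = [0]
    for i in range(1, num_compares):
        starts.append(gcc_free + (i - 1))
    return starts


stalled = compare_start_cycles(5)
print(stalled)  # [0, 6, 7, 8, 9]

# For contrast, back-to-back issue with no wait on gcc would start at:
unstalled = list(range(5))
print(unstalled)  # [0, 1, 2, 3, 4]
```

Under these assumptions, the second through fifth compares each start five cycles later than they would if they did not have to wait for register gcc.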