1. Field of the Invention
The present invention relates to a processor used as a central processing unit of a computer and, more particularly, to a processor that uses a SIMD (Single Instruction Multiple Data) method for parallel processing of multiple arithmetic operations and the processor's arithmetic instruction processing method and arithmetic operation control method.
2. Description of the Related Art
The primary ways to enhance a processor's arithmetic performance are to increase its operating frequency and to improve its arithmetic performance per cycle. The arithmetic performance of present-day processors is generally improved by combining these two approaches.
The SIMD (Single Instruction Multiple Data) method is used to improve the arithmetic performance per cycle. A SIMD system is generally configured so that an arithmetic unit having a data width of (a) bits can be used as (m) arithmetic units each having a data width of (b) bits (a=bm). The instructions that support the mode in which multiple arithmetic units are used are called SIMD instructions.
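The relationship a=bm can be illustrated with a minimal C sketch, assuming a=64 and a lane width b that divides 64. The function name simd_add_lanes and the choice that each lane wraps around on overflow are assumptions made for illustration only; they are not taken from the cited manual or from any particular processor.

```c
#include <stdint.h>

/* Hypothetical model: treat one 64-bit word as m = 64 / b independent lanes
 * of b bits each and add the lanes pairwise. A carry out of a lane is
 * discarded (assumed wrap-around behavior). Returns 0 if b is not a
 * divisor of 64. */
uint64_t simd_add_lanes(uint64_t x, uint64_t y, unsigned b)
{
    if (b == 0 || b > 64 || 64 % b != 0)
        return 0;

    uint64_t mask = (b == 64) ? ~0ULL : ((1ULL << b) - 1);
    uint64_t result = 0;

    for (unsigned shift = 0; shift < 64; shift += b) {
        uint64_t sum = ((x >> shift) & mask) + ((y >> shift) & mask);
        result |= (sum & mask) << shift;   /* keep only the lane's own bits */
    }
    return result;
}
```

With b=32 the same 64-bit unit behaves as two 32-bit adders; with b=16 it behaves as four 16-bit adders.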
The SIMD instructions are described, for instance, in “Intel Architecture Software Developers Manual Volume 2: Instruction Set Reference” (Intel Corporation, 1999). One example is the PMULHUW instruction, which is described on pp. 3-522.
FIG. 1 shows a typical arithmetic unit data path that is based on the conventional SIMD method. In this example, the 64-bit SIMD adder 131 comprises two 32-bit adders. The reference numeral 101 indicates a register file for sixty-four 64-bit registers. Two-register read and one-register write operations can be performed simultaneously. Signal 111 provides read/write control of the register file 101. The reference numerals 121, 122, and 161 indicate flip-flops for 64 bits.
After the control signal 111 reads two values from the register file 101, the read values are input to flip-flops 121 and 122 in synchronism with a clock signal. Subsequently, the 32 high-order bits of flip-flops 121 and 122 are added by a SIMD adder 131, and then entered in the 32 high-order bits of flip-flop 161 in synchronism with a clock signal. At the same time, the 32 low-order bits of flip-flops 121 and 122 are added by the SIMD adder 131, and entered in the 32 low-order bits of flip-flop 161 in synchronism with a clock signal. The value entered in flip-flop 161 is written into the register file 101 by the control signal 111.
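The data path of FIG. 1 can be modeled roughly in C as follows. This is a sketch under stated assumptions: the type name datapath, the function names, and the wrap-around behavior of each 32-bit half are hypothetical, and the three clocked steps are collapsed into sequential statements.

```c
#include <stdint.h>

#define NUM_REGS 64              /* sixty-four 64-bit registers */

typedef struct {
    uint64_t regfile[NUM_REGS];  /* register file 101 */
    uint64_t ff121, ff122;       /* input flip-flops 121 and 122 */
    uint64_t ff161;              /* output flip-flop 161 */
} datapath;

/* SIMD adder 131: two independent 32-bit additions. The carry out of the
 * low-order half is not propagated into the high-order half (assumption:
 * each half wraps around on overflow). */
static uint64_t simd_adder_131(uint64_t x, uint64_t y)
{
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(y >> 32);
    uint32_t lo = (uint32_t)x + (uint32_t)y;
    return ((uint64_t)hi << 32) | lo;
}

/* One addition: read two registers, add, write the result back. */
void datapath_add(datapath *dp, unsigned rm, unsigned rn, unsigned rd)
{
    dp->ff121 = dp->regfile[rm % NUM_REGS];            /* read step  */
    dp->ff122 = dp->regfile[rn % NUM_REGS];
    dp->ff161 = simd_adder_131(dp->ff121, dp->ff122);  /* add step   */
    dp->regfile[rd % NUM_REGS] = dp->ff161;            /* write step */
}
```

Both halves of the result always land in the same halves of the destination register as their inputs occupied in the source registers.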
FIG. 2 shows the format of an addition instruction that is based on the conventional SIMD method. The assembler mnemonic of this instruction is DADD Rm, Rn, Rd. Rm and Rn are input registers. Rd is an output register. Elements 201 through 206 compose a 32-bit instruction code. Elements 201 and 203 are 6-bit and 4-bit bit fields, respectively, and serve as op codes. Element 206 is a 4-bit bit field, which is a reserved field. Elements 202, 204, and 205 are 6-bit bit fields. Elements 202 and 204 are operands that each specify an input register. Element 205 is an operand that specifies an output register.
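The field layout described above can be sketched in C as follows. The field widths come from the description; everything else is an assumption for illustration: the fields are taken to be packed in the order 201 through 206 starting from the most significant bit, the names simd_insn, encode_insn, and decode_insn are hypothetical, and no actual op code values are implied.

```c
#include <stdint.h>

/* Hypothetical view of the 32-bit instruction code of FIG. 2.
 * Assumed bit positions (most significant field first):
 *   201: bits 31..26 (op code, 6 bits)
 *   202: bits 25..20 (input register Rm, 6 bits)
 *   203: bits 19..16 (op code, 4 bits)
 *   204: bits 15..10 (input register Rn, 6 bits)
 *   205: bits  9..4  (output register Rd, 6 bits)
 *   206: bits  3..0  (reserved, written as 0 here) */
typedef struct {
    unsigned op_a;  /* field 201 */
    unsigned rm;    /* field 202 */
    unsigned op_b;  /* field 203 */
    unsigned rn;    /* field 204 */
    unsigned rd;    /* field 205 */
} simd_insn;

uint32_t encode_insn(const simd_insn *f)
{
    return ((uint32_t)(f->op_a & 0x3Fu) << 26) |
           ((uint32_t)(f->rm   & 0x3Fu) << 20) |
           ((uint32_t)(f->op_b & 0x0Fu) << 16) |
           ((uint32_t)(f->rn   & 0x3Fu) << 10) |
           ((uint32_t)(f->rd   & 0x3Fu) << 4);
}

simd_insn decode_insn(uint32_t word)
{
    simd_insn f;
    f.op_a = (word >> 26) & 0x3Fu;
    f.rm   = (word >> 20) & 0x3Fu;
    f.op_b = (word >> 16) & 0x0Fu;
    f.rn   = (word >> 10) & 0x3Fu;
    f.rd   = (word >> 4)  & 0x3Fu;
    return f;
}
```

Each 6-bit operand field can name one of 64 registers, which matches the sixty-four-entry register file. One operand selects one whole 64-bit register, so both 32-bit halves of an operation share the same register choice.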
FIG. 3 shows the relationship between the values in bit fields 202, 204, and 205 and the registers to be specified. Row 301 shows a bit pattern written in an operand. Row 302 indicates an associated register. As stated above, the conventional SIMD method specifies an operand normally on an individual register basis.
The input/output register bit field position is fixed for all simultaneously performed arithmetic operations. Therefore, when the operand is determined for any one of the simultaneously performed arithmetic operations, the operands of the other arithmetic operations are automatically determined.
For example, consider a process in which the elements of arrays a[ ], b[ ], and c[ ] in the memory that have the same subscript are added and the result is stored in array s[ ]. When such a process is written in C language, Equation (1) is obtained as follows:
    for (i = 0; i < MAX; i++) { s[i] = a[i] + b[i] + c[i]; }    (1)
Next, the above process is performed with the conventional SIMD processor described above. Assume that a logical operation, shift operation, and addition operation cannot be performed in parallel. FIG. 4 shows a process that is performed using a software pipelining technique. Note that load/store operations performed relative to the memory are omitted. The reason is that the necessity for considering load/store operations can be eliminated by properly arranging instructions in situations where an arithmetic operation and load/store operation can be performed in parallel.
Each of the reference numerals 501 to 504 indicates one clock cycle. The reference numerals 511 to 513 indicate 64-bit registers. These registers provide the input of an arithmetic operation performed on cycle 501. The reference numerals 521 to 523 indicate 64-bit registers. These registers receive the output of an arithmetic operation performed on cycle 501 and provide the input of an arithmetic operation to be performed on cycle 502. The reference numerals 531 to 533 indicate 64-bit registers. These registers receive the output of an arithmetic operation performed on cycle 502 and provide the input of an arithmetic operation to be performed on cycle 503. The reference numerals 541 to 543 indicate 64-bit registers. These registers receive the output of an arithmetic operation performed on cycle 503 and provide the input of an arithmetic operation to be performed on cycle 504. The reference numerals 551 to 553 indicate 64-bit registers. These registers receive the output of an arithmetic operation performed on cycle 504.
Registers 511, 521, 531, 541, and 551 do not separately exist. They represent the results of changes in the contents of the same register. In other words, the contents of the register sequentially change from 511 to 521 to 531 to 541 to 551 on cycles 501, 502, 503 and 504, respectively. The same holds true for a combination of registers 512, 522, 532, 542, and 552 and a combination of registers 513, 523, 533, 543, and 553.
The reference numerals 514 and 515 indicate adders. These are used as 32-bit adders, which are obtained by dividing a single 64-bit adder into two by the SIMD method. Adders 514 and 515 perform an arithmetic operation on cycle 501. The reference numerals 544 and 545 indicate adders. Adders 544 and 545 perform an arithmetic operation on cycle 504.
Adders 514 and 544 do not separately exist. They represent arithmetic operations that are performed respectively on cycles 501 and 504 by the same adder. The same holds true for a combination of adders 515 and 545. The reference numeral 524 indicates a 32-bit logical shifter, which performs an arithmetic operation on cycle 502. The reference numeral 534 indicates an arithmetic unit that performs a 64-bit OR operation on cycle 503.
On cycle 501, the addition processes for the “ith” element and “i−1th” element are simultaneously performed. On cycle 504, the addition processes for the “ith” element and “i+1th” element are simultaneously performed. The technique for processing different elements on the same cycle in this manner is called “software pipelining”. On cycles 502 and 503, the 32 high-order bits of register 523, which represent adder 514's output on cycle 501, are moved to the 32 low-order positions of register 533 by the shifter 524, ORed with the contents of register 531, and stored in the 32 low-order bits of register 541.
When software pipelining is conducted by the conventional SIMD method, a[i+1] must be stored in the 32 high-order bits of register 541, which is input to adder 544, and the result of an arithmetic operation performed by adder 514 must be stored in the 32 low-order bits of register 541. However, the result of an arithmetic operation performed by adder 514 is always stored in the register's 32 high-order bits. The arithmetic operations performed on cycles 502 and 503 are therefore required to move the stored result to the 32 low-order bits. As a result, three cycles are required per element, and the performance level is reduced to ⅓ of the ideal level of one cycle per element.
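The cost of the add, shift, and OR sequence can be made concrete with a small C model of the software-pipelined loop for Equation (1). This is a sketch under stated assumptions, not a reproduction of FIG. 4: the register allocation, the function names, and the handling of the first and last elements are hypothetical, and load/store operations are not counted, as in the description above.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t pack(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Two independent 32-bit additions in one 64-bit operation. */
static uint64_t simd_add32x2(uint64_t x, uint64_t y)
{
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(y >> 32);
    uint32_t lo = (uint32_t)x + (uint32_t)y;
    return ((uint64_t)hi << 32) | lo;
}

/* Computes s[i] = a[i] + b[i] + c[i] and returns the number of arithmetic
 * cycles used. High half: first addition for element i. Low half: second
 * addition for element i-1. */
size_t sum3_pipelined(const uint32_t *a, const uint32_t *b,
                      const uint32_t *c, uint32_t *s, size_t n)
{
    size_t cycles = 0;
    if (n == 0)
        return 0;

    uint64_t x = pack(a[0], 0);            /* prologue: low half idle */
    for (size_t i = 0; i <= n; i++) {
        uint64_t y = pack(i < n ? b[i] : 0, i > 0 ? c[i - 1] : 0);

        uint64_t z = simd_add32x2(x, y);   /* add cycle */
        cycles++;
        if (i > 0)
            s[i - 1] = (uint32_t)z;        /* finished element i-1 */
        if (i == n)
            break;                         /* epilogue: nothing left to move */

        uint64_t w = z >> 32;              /* shift cycle: high half to low half */
        cycles++;
        x = w | pack(i + 1 < n ? a[i + 1] : 0, 0);  /* OR cycle: merge a[i+1] */
        cycles++;
    }
    return cycles;
}
```

In this model n elements take 3n+1 cycles, that is, about three cycles per element. The shift and OR cycles do no useful arithmetic; they exist only because the partial sum is produced in the high-order half but is needed in the low-order half.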
FIG. 5 shows a case where two elements are simultaneously processed by the conventional SIMD method but without using the software pipelining technique.
Each of the reference numerals 601 and 602 indicates one clock cycle. The reference numerals 611 to 613 indicate 64-bit registers. These registers provide the input of an arithmetic operation to be performed on cycle 601. The reference numerals 621 to 623 indicate 64-bit registers. These registers receive the output of an arithmetic operation performed on cycle 601 and provide the input of an arithmetic operation to be performed on cycle 602. The reference numerals 631 to 633 indicate 64-bit registers. These registers receive the output of an arithmetic operation performed on cycle 602.
Registers 611, 621, and 631 do not separately exist. They represent the results of changes in the contents of the same register. In other words, the contents of the register sequentially change from 611 to 621 to 631 on cycles 601 and 602. The same holds true for a combination of registers 612, 622, and 632 and a combination of registers 613, 623, and 633.
The reference numerals 614 and 615 indicate adders. These are used as 32-bit adders, which are obtained by dividing a single 64-bit adder into two by the SIMD method. Adders 614 and 615 perform an arithmetic operation on cycle 601. The reference numerals 624 and 625 indicate adders. Adders 624 and 625 perform an arithmetic operation on cycle 602.
Adders 614 and 624 do not separately exist. They represent arithmetic operations that are performed respectively on cycles 601 and 602 by the same adder. The same holds true for a combination of adders 615 and 625.
On cycle 601, the processes for the “ith” element and “i+1th” element of arrays a[ ] and b[ ] are simultaneously performed. On cycle 602, the processes for the result of cycle 601 and the “ith” element and “i+1th” element of array c[ ] are simultaneously performed. In this case, the attained performance is 1 cycle per element. However, this approach is not adequate because the number of registers required for processing is increased by ½. For example, registers 611, 621, and 631 are regarded here as physical registers, and arrays a[ ], b[ ], and c[ ] correspond to data stored in two logical registers in a physical register.
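The two-cycle scheme of FIG. 5 can be sketched in C as follows. The names and the placement of element i in the high-order half and element i+1 in the low-order half are assumptions for illustration, the element count is assumed to be even, and load/store operations are again not counted.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t pack(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Two independent 32-bit additions in one 64-bit operation. */
static uint64_t simd_add32x2(uint64_t x, uint64_t y)
{
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(y >> 32);
    uint32_t lo = (uint32_t)x + (uint32_t)y;
    return ((uint64_t)hi << 32) | lo;
}

/* Computes s[i] = a[i] + b[i] + c[i] two elements at a time and returns
 * the number of arithmetic cycles used. n is assumed to be even; a trailing
 * odd element would not be processed by this sketch. */
size_t sum3_paired(const uint32_t *a, const uint32_t *b,
                   const uint32_t *c, uint32_t *s, size_t n)
{
    size_t cycles = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        /* Three 64-bit registers each hold elements i and i+1. */
        uint64_t ra = pack(a[i], a[i + 1]);
        uint64_t rb = pack(b[i], b[i + 1]);
        uint64_t rc = pack(c[i], c[i + 1]);

        uint64_t t = simd_add32x2(ra, rb);  /* cycle 601: a + b */
        cycles++;
        uint64_t r = simd_add32x2(t, rc);   /* cycle 602: (a + b) + c */
        cycles++;

        s[i]     = (uint32_t)(r >> 32);
        s[i + 1] = (uint32_t)r;
    }
    return cycles;
}
```

Here n elements take n cycles, with no shift or OR operations, because every partial sum stays in the half where it is next needed. The price is that all three operand registers must hold both elements of a pair at the same time.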
As described above, the actual performance of a SIMD processor may be lower than its peak performance when it performs certain types of processing operations. Performance deterioration occurs particularly when arithmetic operations comprising a process are interdependent and software pipelining is required. The peak performance can be maintained by simultaneously processing the “ith” and “i+1th” elements. However, such performance maintenance would make the required number of registers greater than in the ideal case since SIMD instructions do not provide a high degree of freedom in specifying input/output registers.