1. Field of the Invention
The present invention relates to a microprocessor including an arithmetic unit to perform a multiply-accumulate (MAC) operation on complex numbers, which is frequently used in performing a finite impulse response (FIR) filter operation on complex numbers, for example.
2. Description of Related Art
A general expression for an input and output relation of an FIR filter in terms of time is given by the following equation (1).
$$Y[n] = \sum_{k=0}^{m-1} W[k] \cdot X[n-k] \tag{1}$$
In the equation (1), m represents the number of taps of the FIR filter, and W[k] represents a filter coefficient associated with a k-th tap. Further, X[n−k] represents an input complex data sequence, and Y[n] represents an output of the FIR filter. Such an FIR filter operation is a so-called “convolution operation” which is executed by repeating an MAC operation.
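The relation between equation (1) and the repeated MAC operation can be illustrated by the following minimal sketch. The function name `fir_complex` and its argument layout are hypothetical and are introduced here only for illustration; Python's built-in complex type stands in for the real and imaginary parts.

```python
def fir_complex(w, x, n):
    """Evaluate equation (1) for one output sample Y[n].

    w: list of m complex filter coefficients W[0..m-1]
    x: list of complex input samples X[0..]
    n: output index; n >= m - 1 is assumed so that X[n-k] exists
    """
    m = len(w)
    if n < m - 1 or n >= len(x):
        raise ValueError("n must satisfy m - 1 <= n < len(x)")
    acc = 0 + 0j                     # accumulator for the MAC operation
    for k in range(m):               # one complex MAC operation per tap
        acc += w[k] * x[n - k]
    return acc
```

Each pass through the loop corresponds to one complex MAC operation, so an m-tap filter requires m such operations per output sample.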
There have been various proposals for causing a microprocessor to efficiently perform the FIR filter operation. For example, the application note on “AltiVec Complex FIR” (URL: http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCALTVCCFIR) of Freescale Semiconductor, Inc. discloses an example of a program for causing a processor employing a single instruction multiple data (SIMD) architecture capable of collectively processing 128-bit data to perform the FIR filter operation on complex numbers.
The processor disclosed in “AltiVec Complex FIR” employs the SIMD architecture. Specifically, the processor disclosed in “AltiVec Complex FIR” includes a plurality of similar MAC units for receiving a plurality of data items (vector data) in response to a single instruction to perform MAC operations on the plurality of data items in parallel and output a plurality of MAC operation result data items.
Hereinafter, a description is made of an example for executing the FIR filter operation assuming that the tap number m is 4 in line with the concept disclosed in “AltiVec Complex FIR”. Note that “AltiVec Complex FIR” discloses a microprocessor capable of processing 128-bit vector data containing eight 16-bit data items to be processed in parallel. To facilitate the explanation, a microprocessor dealing with 64-bit vector data containing four 16-bit data items to be processed in parallel will be described below. For the same reason, the convolution operation equation (1) is transformed into the following equation (2).
$$Y[n] = \sum_{k=0}^{3} W[k] \cdot X[n+k] \tag{2}$$
The output data Y[n] obtained by the equation (2) is represented by a total sum of four complex products. FIG. 12 shows data items Y[0] to Y[3]. When two data items are processed in parallel, the processor disclosed in “AltiVec Complex FIR” calculates Y[n] to Y[n+1] in parallel by using a single accumulator so as to enhance the effect of the parallel processing using the SIMD architecture. When four data items are processed in parallel, the processor disclosed in “AltiVec Complex FIR” calculates Y[n] to Y[n+3] in parallel with two accumulator registers.
For example, as shown in FIG. 12, in the case of calculating Y[0] and Y[1] in parallel, in a first step, a real part and an imaginary part of each of two complex products W[0]*X[0] and W[0]*X[1] are calculated, and the results are stored in an accumulator. In a second step, a real part and an imaginary part of each of two complex products W[1]*X[1] and W[1]*X[2] are calculated, and the results are added to the values held in the accumulator. Then, third and fourth steps are executed in a similar manner, with the result that the accumulator obtains a real part and an imaginary part of each of the data items Y[0] and Y[1]. FIG. 13 shows vector data groups used in the case of performing the calculations of the first to fourth steps shown in FIG. 12. Note that the register length of each of the registers R2 to R13 is 64 bits. Among a plurality of registers shown in FIG. 13, the registers R2 to R5 store real parts and imaginary parts of input data items X[0] to X[4]. For example, the register R2 stores a real part XR[0] and an imaginary part XI[0] of a data item X[0] as well as a real part XR[1] and an imaginary part XI[1] of a data item X[1]. The data length of each of XR[0], XI[0], XR[1], and XI[1] is 16 bits. The registers R6 to R13 hold filter coefficients W[0] to W[3].
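The four steps described above can be sketched as follows. This is a behavioral illustration only: the function name `fir_pair` is hypothetical, a two-element list stands in for the accumulator, and the register allocation of FIG. 13 is not modeled.

```python
def fir_pair(w, x, n=0):
    """Compute Y[n] and Y[n+1] of equation (2) together, one tap per step.

    w: complex coefficients W[0..3]
    x: complex inputs; X[n] to X[n+len(w)] must exist
    """
    acc = [0 + 0j, 0 + 0j]            # lane 0 holds Y[n], lane 1 holds Y[n+1]
    for k in range(len(w)):           # step k+1 applies W[k] to both lanes
        acc[0] += w[k] * x[n + k]
        acc[1] += w[k] * x[n + k + 1]
    return acc
```

With n = 0 and four taps, the first pass computes W[0]*X[0] and W[0]*X[1], the second adds W[1]*X[1] and W[1]*X[2], and so on, so that five input data items X[0] to X[4] are consumed in total.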
FIGS. 14A and 14B each show a specific example of an arithmetic unit of the SIMD architecture. FIGS. 14A and 14B each show the configuration in which a first MAC circuit is disposed in parallel with a second MAC circuit. The first MAC circuit includes multipliers 9320 and 9321 and adders 9330 and 9340. The second MAC circuit includes multipliers 9322 and 9323 and adders 9331 and 9341. Note that the configuration of the arithmetic unit shown in FIGS. 14A and 14B was devised by the inventors of the present invention during the course of study on the improvement of the conventional microprocessor based on the description of “AltiVec Complex FIR”. Accordingly, the configuration of the arithmetic unit is neither known nor disclosed in “AltiVec Complex FIR”.
FIG. 14A shows a procedure for performing calculation for obtaining the real part of each of two complex products W[0]*X[0] and W[0]*X[1] among the calculations of the first step shown in FIG. 12 in response to a single instruction to instruct execution of the MAC operation. The real part of W[0]*X[0] is stored in lower-order 32 bits of the register R0 used as an accumulator. The real part of W[0]*X[1] is stored in higher-order 32 bits of the register R0.
On the other hand, FIG. 14B shows a procedure for performing calculation for obtaining the imaginary part of each of two complex products W[0]*X[0] and W[0]*X[1] among the calculations of the first step shown in FIG. 12 in response to a single instruction to instruct execution of the MAC operation. The imaginary part of W[0]*X[0] is stored in lower-order 32 bits of the register R1 used as an accumulator, and the imaginary part of W[0]*X[1] is stored in higher-order 32 bits of the register R1.
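The two instructions of FIGS. 14A and 14B can be modeled roughly as below. The sketch assumes the usual complex product, in which the real part of W*X is WR*XR − WI*XI and the imaginary part is WR*XI + WI*XR, and assumes two's-complement 32-bit lanes packed into one 64-bit value. The helper names (`to_signed32`, `pack2x32`, `mac_real`, `mac_imag`) are hypothetical, and the exact operand arrangement of the figures is not reproduced.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(v):
    """Interpret the low 32 bits of v as a two's-complement integer."""
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

def pack2x32(lo, hi):
    """Pack two 32-bit lanes into one 64-bit register value."""
    return ((hi & MASK32) << 32) | (lo & MASK32)

def mac_real(acc, w, x0, x1):
    """Accumulate the real parts of W*X0 (lower lane) and W*X1 (higher lane).

    acc: packed 64-bit accumulator (models R0)
    w, x0, x1: (real, imaginary) tuples of 16-bit integers
    """
    lo = to_signed32(acc) + w[0] * x0[0] - w[1] * x0[1]
    hi = to_signed32(acc >> 32) + w[0] * x1[0] - w[1] * x1[1]
    return pack2x32(lo, hi)           # lanes wrap modulo 2**32

def mac_imag(acc, w, x0, x1):
    """Accumulate the imaginary parts of W*X0 and W*X1 (models R1)."""
    lo = to_signed32(acc) + w[0] * x0[1] + w[1] * x0[0]
    hi = to_signed32(acc >> 32) + w[0] * x1[1] + w[1] * x1[0]
    return pack2x32(lo, hi)
```

Calling `mac_real` and `mac_imag` once per tap, with the accumulators initially zero, reproduces the first to fourth steps of FIG. 12 for the real and imaginary parts separately.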
Note that, as shown in FIGS. 14A and 14B, when the register length of the accumulator storing the multiplication result is identical with the data length of the multiplication result, it is necessary to properly scale the accumulated value to be stored in the accumulator so as to avoid an overflow. The execution of such scaling causes a reduction in calculation accuracy. There is known a technique in which the register length of the accumulator storing the result of the MAC operation is set to be greater than the data length of the multiplication result to be cumulatively added (see Japanese Unexamined Patent Application Publication No. 10-134032 (Kubota et al.) and U.S. Pat. No. 7,120,783 (Fotland et al.)). For example, when the multiplication result is 32-bit data, the accumulator to perform the MAC operation for cumulatively adding the data has a register length of 48 or 64 bits.
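The overflow concern can be illustrated numerically. The following sketch assumes two's-complement wraparound on overflow (a saturating accumulator would behave differently); the names `wrap` and `accumulate` are hypothetical.

```python
def wrap(value, bits):
    """Interpret value as a two's-complement integer of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def accumulate(products, bits):
    """Cumulatively add products in an accumulator that is `bits` wide."""
    acc = 0
    for p in products:
        acc = wrap(acc + p, bits)
    return acc

# Four products of the largest positive 16-bit values.
products = [32767 * 32767] * 4
narrow = accumulate(products, 32)   # wraps: no longer equals the true sum
wide = accumulate(products, 48)     # equals the true sum
```

Here each product fits in 32 bits, but the sum of four does not, so the 32-bit accumulator yields a wrong value unless the data is scaled down beforehand, whereas the 48-bit accumulator holds the exact sum.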
The inventors of the present invention have found the following fact. That is, when the technique for increasing the register length of the accumulator storing the MAC operation result, which avoids the reduction in calculation accuracy due to the scaling, is applied to the technique for speeding up the FIR filter operation on complex numbers by using the SIMD architecture as illustrated in FIGS. 12, 13, 14A, and 14B, there arises a problem in that the number of registers to be allocated to accumulators for the MAC operation increases. For example, in the configuration shown in FIG. 14A, it is impossible for the register R0 of 64-bit length to hold two MAC operation results each having a length greater than 32 bits, and thus it is necessary to allocate another register to an accumulator for the MAC operation. In general, the number of operands of each instruction is limited, and the number of registers that can be specified by the operands of each instruction is usually limited. Accordingly, depending on instruction sets to be used, it is difficult to increase the number of accumulators for the MAC operation in some cases.
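The increase in the number of accumulator registers can be estimated with simple arithmetic. The sketch below assumes that each output data item needs one real and one imaginary accumulated value and that a value cannot straddle two registers; the function name `accumulator_registers` is hypothetical.

```python
import math

def accumulator_registers(n_outputs, acc_bits, reg_bits=64):
    """Registers needed to accumulate n_outputs complex results.

    Assumes two accumulated values (real, imaginary) per output and
    that each value must fit entirely within one register.
    """
    per_register = reg_bits // acc_bits
    if per_register == 0:
        raise ValueError("accumulator is wider than one register")
    return math.ceil(2 * n_outputs / per_register)
```

Under these assumptions, two outputs with 32-bit accumulated values fit in two 64-bit registers (R0 and R1 in FIGS. 14A and 14B), whereas 48-bit or 64-bit accumulated values require four registers.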