1. Field of the Invention
The present invention relates to an apparatus and method for performing SIMD (Single Instruction Multiple Data) multiply-accumulate (MAC) operations.
2. Description of the Prior Art
When it is necessary to perform a particular data processing operation on a number of separate data elements, one known approach for accelerating the performance of such an operation is to employ a SIMD (Single Instruction Multiple Data) approach. In accordance with the SIMD approach, several of the data elements are placed side-by-side within a register, and the operation is then performed in parallel on those data elements.
One type of operation which can benefit from the SIMD approach is the multiply-accumulate operation, which can take the form of A+B×C, or A−B×C. The multiplication operation B×C is typically performed multiple times for different values of B and C, with each multiplication result then being added to (or subtracted from) the running accumulate value A.
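By way of illustration only, the running accumulation described above can be sketched in scalar form as follows. The function name multiply_accumulate and its parameters are hypothetical and are chosen purely for this sketch; no particular hardware implementation is implied.

```python
def multiply_accumulate(a, bs, cs, subtract=False):
    """Return a + sum(b * c), or a - sum(b * c), over paired elements of bs and cs.

    Hypothetical helper for illustration: `a` is the initial accumulate
    value A, and `bs`/`cs` hold the successive B and C operands.
    """
    acc = a
    for b, c in zip(bs, cs):
        product = b * c  # one multiplication B x C
        # Fold the product into the running accumulate value.
        acc = acc - product if subtract else acc + product
    return acc
```

For example, with A=10, B values (1, 2, 3) and C values (4, 5, 6), the products are 4, 10 and 18, so the A+B×C form yields 42 and the A−B×C form yields −22.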
Considering the operations required to generate a single multiply-accumulate result, it will be appreciated from the above discussion that a plurality of separate multiply operations are required, and by using SIMD data processing circuitry, a plurality of those required multiplications can be performed in parallel to increase the throughput of the multiply-accumulate operation.
However, there are also certain types of operation where multiple separate multiply-accumulate operations need to be performed in order to produce multiple multiply-accumulate results, but where there is significant overlap between the input data used for each multiply-accumulate operation. One particular example of an operation where multiple multiply-accumulate operations are required is the finite impulse response (FIR) filter operation, which is a standard signal processing task implemented in digital signal processors (DSPs). The FIR filter operation is commonly used in many signal processing applications, such as communication, audio processing, video processing or image processing.
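The overlap between the input data of successive multiply-accumulate operations can be seen in a simple scalar sketch of an FIR filter. The names fir_reference, samples and coeffs are assumptions made for this illustration, as is the indexing convention (each output n uses samples n to n+T−1 for T coefficients, with no coefficient reversal).

```python
def fir_reference(samples, coeffs):
    """Scalar FIR sketch: one multiply-accumulate result per output.

    Assumed convention: output n is the sum over k of
    coeffs[k] * samples[n + k], computed only where the whole
    window lies inside `samples`.
    """
    taps = len(coeffs)
    outputs = []
    for n in range(len(samples) - taps + 1):      # outer loop: one output each
        acc = 0
        for k in range(taps):                     # inner loop: one term each
            acc += coeffs[k] * samples[n + k]
        outputs.append(acc)
    return outputs


def input_window(samples, taps, n):
    """Return the input samples consumed by output n."""
    return samples[n:n + taps]
```

With three coefficients, output 0 consumes samples 0 to 2 and output 1 consumes samples 1 to 3, so two of the three input values are shared between adjacent multiply-accumulate operations.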
Many contemporary digital signal processors, as well as general purpose microprocessors, use SIMD data processing circuitry in order to exploit the data-level parallelism present in operations such as the FIR filter operation. However, an important issue is how to effectively vectorise the FIR filter operation in order to exploit the SIMD capabilities of the data processing apparatus.
The article “Efficient Vectorization of the FIR Filter” by A Shahbahrami et al, Computer Engineering Laboratory, Delft University of Technology, the Netherlands (appearing on the Internet at http://ce.et.tudelft.nl/publicationfiles/1090—509_shahbahrami_prorisc2005.pdf) summarises various techniques for vectorising an FIR filter operation. In accordance with a first technique, the FIR filter is vectorised by vectorising the inner loop, such that the inner loop calculates several terms of a single output in parallel. Hence, by such an approach, multiple of the multiply operations required to form a single multiply-accumulate result are performed in parallel within the SIMD data processing circuitry during a single iteration, and accordingly each multiply-accumulate result is determined sequentially, with the SIMD capabilities of the processing circuitry being used to speed up the computation of each multiply-accumulate result. In accordance with an alternative technique described, the outer loop of the FIR filter is vectorised, such that the inner loop computes one term of several outputs in parallel. Hence, in accordance with this technique, in each iteration, one multiply-accumulate computation is performed in respect of each of the required multiply-accumulate results, so that all of the required multiply-accumulate operations are performed in parallel, and the final multiply-accumulate results for each of the multiply-accumulate operations become available following the final iteration of the process. The article also describes a third mechanism where the inner and outer loops are vectorised simultaneously.
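The difference between vectorising the inner loop and vectorising the outer loop can be sketched as follows. This is an illustrative model only and is not taken from the cited article: the SIMD lanes are simulated with ordinary Python lists, and the lane count LANES, the function names and the indexing convention are all assumptions of the sketch.

```python
LANES = 4  # assumed SIMD width, in data elements


def fir_inner_vectorised(samples, coeffs):
    """Inner loop vectorised: up to LANES terms of ONE output per step."""
    taps = len(coeffs)
    outputs = []
    for n in range(len(samples) - taps + 1):
        acc = 0
        for k0 in range(0, taps, LANES):
            # One simulated SIMD multiply producing several products
            # that all belong to the same output n.
            products = [coeffs[k] * samples[n + k]
                        for k in range(k0, min(k0 + LANES, taps))]
            acc += sum(products)  # reduce the lanes to a single value
        outputs.append(acc)       # results become available one at a time
    return outputs


def fir_outer_vectorised(samples, coeffs):
    """Outer loop vectorised: ONE term of up to LANES outputs per step."""
    taps = len(coeffs)
    count = len(samples) - taps + 1
    outputs = []
    for n0 in range(0, count, LANES):
        width = min(LANES, count - n0)
        acc = [0] * width  # one accumulator lane per output
        for k in range(taps):
            # Lane i needs samples[n0 + i + k], so the input vector
            # slides along by one element for each successive term.
            window = samples[n0 + k:n0 + k + width]
            acc = [a + coeffs[k] * x for a, x in zip(acc, window)]
        outputs.extend(acc)  # all lanes finish after the final term
    return outputs
```

Both functions produce the same outputs. In the first, each result is completed before the next is started, whereas in the second a group of results only becomes available after the final iteration over the coefficients, and the input vector has to be re-aligned by one element between iterations.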
One technique for vectorising the inner loop is described in the article “AltiVec™ Technology: A second Generation SIMD Microprocessor Architecture” by M Phillip, Motorola Inc, Austin, Tex. (appearing on the Internet at http://www.hotchips.org/archives/hc10/2_Mon/HC10.S5/HC10.5.3.pdf), where sum-across type instructions are used. This document describes techniques for vectorising either the inner or the outer FIR loop using the AltiVec multiply instructions. However, the outer loop technique uses vector multiply (or multiply-accumulate) operations that do not perform a data re-arrangement function at the same time.
The publication “A Programmable DSP for Low-Power, Low-Complexity Baseband Processing” by H Naess, Norwegian University of Science and Technology, Department of Electronics and Telecommunications (appearing on the Internet at http://www.diva-portal.org/ntnu/abstract.xsql?dbid=1095) describes a technique for vectorising the outer loop, giving rise to repeated vector accumulate and shift operations. In particular, FIG. 9 of that publication shows an operation using two vector inputs and an internal shift register. This operation is executed multiple times through the issuance of multiple instructions within a repeat loop (as for example discussed in Table 10 of that document). Whilst the use of the internal shift register allows some internal rearrangement of data, it is necessary to iterate through the repeat loop multiple times in order to perform the required computations, and on each pass through the repeat loop, instructions need to be decoded and executed, and new data values need to be accessed from memory.
The prior art techniques described above are generally aimed at improving performance of the FIR computations. However, another significant issue is power consumption. The inventors of the present invention realised that when performing sequences of MAC operations, such as are required when performing FIR operations, there are three key activities, namely instruction fetch and decode, the multiply-accumulate computations, and vector data re-arrangement computations required to order the data elements appropriately prior to each iteration. Further, the inventors noted that significant power was being consumed in the instruction fetch and decode and the vector data re-arrangement computations, for example 25-40% of the total power consumed.
Accordingly, it would be desirable to provide an improved technique for performing SIMD multiply-accumulate operations which reduces the power consumption when compared with the known prior art techniques.