Multiplications are part of most digital signal processing algorithms. Often, hardware multipliers contribute significantly to the total energy and area cost of designs. Therefore, especially for the growing market of battery-powered high-volume devices, there is a need to further enhance the energy and area efficiency of multiplier implementations.
A multiplication can be separated into two basic operations, namely the generation of partial products and the accumulation of partial products. In general, multipliers can be divided into three main classes: array multipliers, parallel multipliers and iterative multipliers. An array multiplier comprises an array of identical cells which generate and accumulate partial products simultaneously. The circuits for generation and accumulation of partial products are merged. Array multipliers are primarily optimized for maximum speed, while area and energy efficiency are of lesser importance. Because of the high degree of parallelization, array multipliers consume a large area. The practical application of array multipliers is usually limited to high-performance computing.
Parallel multipliers generate partial products in parallel. In contrast to the array multiplier, a common multi-operand adder is employed for the accumulation. Parallel multipliers are slower than array multipliers, but are typically more area and energy efficient.
Iterative multipliers generate and add the partial products sequentially. For each iteration, the same set of hardware blocks is utilized. Iterative multipliers are characterized by low area, low pin count, short wire length and high clock frequency. The short wire length is also beneficial with regard to technology scaling. Because the same hardware blocks are typically utilized for several clock cycles to complete a single multiplication, iterative multipliers are generally slower than parallel and array multipliers. Mainly due to the overhead of multiple register accesses, traditional iterative multipliers typically also consume more energy than parallel multipliers. However, by reducing the number of iterations, i.e. by making the number of iterations data-dependent, the energy efficiency gap can be greatly reduced.
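The effect of a data-dependent iteration count can be illustrated with a minimal software sketch. It only models the number of iterations and says nothing about registers, timing or energy; the function names `multiply_fixed` and `multiply_skipping` are hypothetical, and a non-negative multiplier is assumed.

```python
def multiply_fixed(multiplicand, multiplier, n_bits):
    """Shift-and-add with one iteration per multiplier bit (always n_bits iterations)."""
    acc = 0
    for i in range(n_bits):
        if (multiplier >> i) & 1:
            acc += multiplicand << i
    return acc, n_bits


def multiply_skipping(multiplicand, multiplier):
    """Shift-and-add with one iteration per non-zero multiplier bit (data-dependent)."""
    acc = 0
    iterations = 0
    while multiplier:
        lowest = multiplier & -multiplier      # isolate the lowest set bit
        shift = lowest.bit_length() - 1        # its bit position
        acc += multiplicand << shift           # add the corresponding partial product
        multiplier ^= lowest                   # clear that bit
        iterations += 1
    return acc, iterations


if __name__ == "__main__":
    print(multiply_fixed(13, 0b10010, 8))
    print(multiply_skipping(13, 0b10010))
```

For the multiplier 0b10010 the second variant needs two iterations instead of eight, because only two partial products are non-zero.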
The cost of a multiplication depends on the number of required partial products. The number of required partial products corresponds to the number of non-zero bits in the multiplier. A coding of the multiplier can reduce the number of non-zero bits and therefore the cost of the multiplication. The most common coding formats are Canonical Signed Digit (CSD), Booth and Signed Powers-of-Two (see for example “A simplified signed powers-of-two conversion for multiplierless adaptive filters”, Chao-Liang Chen, IEEE Int. Symp. on Circuits and Systems (ISCAS), 1996, vol. 2, pp. 364-367). The CSD format is well known in the art, see e.g. “Multiplier Policies For Digital Signal Processing” (Gin-Kou Ma, IEEE ASSP Magazine, vol. 7, issue 1, pp. 6-20, January 1990), and is presented in more detail below. Patent documents like EP1866741 B1 and US2006/155793 also relate to canonical signed digit multipliers.
Multiplications can generally be categorized into constant and variable multiplications. For constant multiplications, the multiplier is known/fixed at design/compile time. Thus, the recoding (encoding) of the multiplier can be done a priori, i.e. offline. By applying the Dempster-Macleod algorithm or similar methods, the efficiency can be further improved. In certain applications, such as transposed-form finite impulse response (FIR) filters, a multiplicand has to be multiplied with several constants. Instead of encoding and optimizing each constant separately, a common multiplier block can be generated. This technique, known as Multiple Constant Multiplication (MCM), can additionally reduce the cost significantly. In general, for constant multiplications, a huge optimization potential exists.
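Offline recoding of a constant multiplier can be sketched as follows. The sketch derives a fixed list of shift-and-add/subtract terms from the CSD form of the constant once, and then reuses that list for every multiplicand. It does not implement the Dempster-Macleod algorithm or MCM; the names `shift_add_plan` and `apply_plan` and the example constant 231 are chosen purely for illustration.

```python
def shift_add_plan(constant):
    """Return [(sign, shift), ...] such that sum(sign * (x << shift)) == constant * x.

    The terms are the non-zero digits of the CSD form of the constant, LSB first.
    """
    plan = []
    shift = 0
    while constant != 0:
        if constant & 1:
            sign = 2 - (constant & 3)   # +1 if constant % 4 == 1, -1 if constant % 4 == 3
            constant -= sign
            plan.append((sign, shift))
        constant >>= 1
        shift += 1
    return plan


def apply_plan(multiplicand, plan):
    """Evaluate a precomputed plan for one multiplicand."""
    return sum(sign * (multiplicand << shift) for sign, shift in plan)


if __name__ == "__main__":
    plan = shift_add_plan(231)          # computed once, "at design time"
    for x in (1, 5, -12):
        print(x, apply_plan(x, plan))   # reused for every multiplicand
```

For the constant 231 (binary 11100111, six non-zero bits) the plan contains four terms, corresponding to 256 − 32 + 8 − 1.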
For variable multiplications, the multiplier is unknown/not fixed at design/compile time. Therefore, encoding and optimizations have to be carried out in hardware, i.e. on-line. The conversion from two's complement to CSD format can be implemented with look-up tables, with a canonical recoding/encoding algorithm or with more complicated digital circuits. Because of their high complexity, optimizations such as additive or multiplicative factoring are rarely implemented in hardware. Besides, if only one multiplication with the same multiplier has to be performed, the high cost of applying such optimizations would not be justified. Hence, the optimization potential is lower compared to constant multiplications. Implementing a multiplier with asynchronous techniques can further increase the efficiency.
It is to be noted that it is common in the art to use the same word ‘multiplier’ to refer either to one of the data values involved in the multiplication (the other one usually being named the ‘multiplicand’) or to the actual device that performs the multiplication. In this description this convention is followed, as it is always clear from the context in which meaning the word ‘multiplier’ is used.
FIG. 1 shows a conventional multiplier. In a conventional iterative multiplier, the value of the multiplier is encoded iteratively, i.e. parts of the multiplier are encoded at every iteration. Immediately afterwards, the corresponding output product bits are computed. The encoded multiplier bits are never stored (at least not in such a way that they are available for reuse in a further multiplication). This means that there is a common loop between the encoding hardware part (20) and the actual multiplication hardware part (40), wherein a previously computed value is fed back to the input of the multiplier (as illustrated with a looped arrow). In a prior art parallel/array multiplier there is no loop (i.e. the straight arrow). Between the encoding block (if present) and the actual multiplication block, a pipeline register may be present. However, this register does not allow the encoded value to be reused at a subsequent iteration; it is only used to increase the performance. So for every multiplication both blocks are active.
Traditional hardware multipliers, which operate in a binary system, compute the product with shift-and-add operations. The number of required shift/add operations depends on the number of non-zero bits (i.e. 1's) in the multiplier. To reduce the number of non-zero bits, and hence the cost of the multiplication, CSD coding can be applied. The CSD format extends the binary format by adding the digit ‘−1’. Hence, a CSD number is represented by a digit set of {1, 0, −1}. The CSD format reduces the number of non-zero digits by replacing a string of 1's with a single ‘1’ at the next higher position and a single ‘−1’ at the lowest position of the string (for example, binary 0111 becomes 1 0 0 −1, i.e. 8 − 1). This means that a series of additions is replaced by one addition and one subtraction. The CSD multiplier hardware needs to support shift-and-add/subtract operations. The CSD format is a radix-two number system. It has the “canonical” property that of any two consecutive digits at least one is zero. The probability of a CSD digit $c_j$ being non-zero is given by

$$P(|c_j| = 1) = \frac{1}{3} + \frac{1}{9n}\left[1 - \left(-\frac{1}{2}\right)^{n}\right] \qquad (1)$$
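The CSD recoding and the shift-and-add/subtract multiplication described above can be sketched in software. This is a minimal illustration, not a model of any particular hardware; the function names `to_csd` and `csd_multiply` are hypothetical, and the recoding rule used (inspect the two lowest bits of the remaining value) is one common way to obtain the canonical form.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first.

    Each digit is -1, 0 or +1, and no two adjacent digits are both non-zero.
    """
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit            # makes the next digit zero
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(multiplicand, multiplier):
    """Multiply using one add or subtract per non-zero CSD digit of the multiplier."""
    product = 0
    for shift, digit in enumerate(to_csd(multiplier)):
        if digit == 1:
            product += multiplicand << shift
        elif digit == -1:
            product -= multiplicand << shift
    return product


if __name__ == "__main__":
    print(to_csd(7))            # [-1, 0, 0, 1], i.e. 8 - 1
    print(csd_multiply(5, 7))   # 35
```

With the multiplier 7 (binary 0111), three additions in plain binary become one addition and one subtraction.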
From the canonical property it follows that the number of non-zero digits of an n-bit CSD number never exceeds ⌈n/2⌉, i.e. about n/2. Moreover, from (1) it can be seen that, as the word length grows, the number of non-zero digits approaches n/3 on average. Compared to a binary number, the maximum number of non-zero digits is reduced from n to about n/2, i.e. by 50%, and the average number of non-zero digits is reduced from n/2 to n/3, i.e. by 16.67% of the word length. The gain of CSD is most significant when long strings of 1's are present in the binary number.
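Equation (1) and the bound on the number of non-zero digits can be checked numerically for small word lengths. The sketch below assumes that "n-bit number" means the n-bit two's complement range, from −2^(n−1) to 2^(n−1) − 1, and that n times the probability in (1) gives the average number of non-zero digits; all function names are hypothetical.

```python
from fractions import Fraction


def to_csd(value):
    """CSD digits (-1, 0, +1) of an integer, least significant digit first."""
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def nonzero_counts(n):
    """Number of non-zero CSD digits for every n-bit two's complement value."""
    values = range(-(1 << (n - 1)), 1 << (n - 1))
    return [sum(1 for d in to_csd(v) if d) for v in values]


def average_nonzero_digits(n):
    """Exact average over all n-bit values, as a Fraction."""
    return Fraction(sum(nonzero_counts(n)), 1 << n)


def predicted_nonzero_digits(n):
    """n times the per-digit probability of equation (1)."""
    return Fraction(n, 3) + Fraction(1, 9) * (1 - Fraction(-1, 2) ** n)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, average_nonzero_digits(n), predicted_nonzero_digits(n),
              max(nonzero_counts(n)))
```

Exact fractions are used so that the enumerated average and the closed form can be compared without rounding error.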
The paper “A Multiplier Structure Based on a Novel Real-time CSD Encoding” (Y. Wang et al, 2007 IEEE Intl Symp. on Circuits and Systems, May 2007, pp. 3195-3198) proposes an iterative hardware multiplier that exploits the benefit of CSD coding. However, instead of generating the multiplier in CSD format, the Difference Form Signed (DFS) coding is used. The multiplier is scanned in groups of two bits, so at most n/2 iterations are required (where n denotes the number of bits). To reduce the energy in the adder circuit, the adder circuit can be bypassed when zero partial products are detected. However, because the encoding/scanning of the multiplier and the add/sub circuit are directly coupled, the effective number of iterations is unchanged. Furthermore, whenever the add/sub circuit is bypassed, it cannot be reused for performing other operations of the application. The design is a hardware solution where most parameters, such as data path or multiplier word length, are fixed at design time. The paper “Iterative Radix-8 Multiplier Structure Based on a Novel Realtime CSD Encoding” (Y. Wang et al, 2007 Conf. Record of the Forty-First Asilomar Conference on Signals, Systems and Computers, November 2007, pp. 977-981) proposes a multiplier which has similar characteristics to the aforementioned design. Because it uses radix-8 instead of radix-4, the minimum number of required iterations is reduced. In the prior art, an asynchronous iterative multiplier has also been proposed which leverages Booth encoding and exploits the multiplier value to avoid unnecessary iterations. This proposal is again a hardware solution in which the encoding circuit and the adder circuit are coupled together and the word length is fixed at design time.
Many different multiplier techniques have been proposed in the art. In order to increase the multiplier speed, a self-clocked, asynchronous, parallel CSD multiplier has been proposed. Parallel multipliers which leverage Binary CSD (BCSD) encoding are known in the art, as well as parallel multipliers in which the word length is programmable. FIR filters with programmable CSD coefficients have been described, as have FIR filters in which a limited set of CSD coefficients is stored in a look-up table and can be selected at run time.
In the above-mentioned prior art designs the encoding of the multiplier and the actual computation of the product are coupled in the same control loop. For this reason, the effort of the multiplier encoding and the instruction computation step cannot be reused for other multiplications with the same multiplier. Furthermore, the time for encoding/scanning the multiplier also affects the time for the addition. Hence, there is a need for a solution in which the encoding and the actual multiplication are clearly separated, i.e. wherein they can be executed independently of each other.