1. Technical Field
The present invention relates to a method for performing calculation operations using a pipelined calculation device comprising a group of at least two pipeline stages, at least one data interface for input of data, and at least one data interface for output of data, said pipeline stages having at least one data interface for input of data and at least one data interface for output of data, in which method data for performing a first and a second calculation operation is input to the device. The invention also relates to a device and a system for performing the method.
2. Discussion of Related Art
There are many applications in which multiplication, multiply-accumulate (MAC) and other calculation operations are needed. As a non-restrictive example, many signal processing applications, such as digital signal filtering and video/image processing, implement multiplication operations in real time. Other applications that need vector and/or matrix operations also use multiplication and MAC operations. Multiplication operations are normally implemented as summation and bit-wise shifting operations. Such multiplication operations are resource-demanding tasks because a single multiplication of two operands requires many summation and shifting operations to calculate the result.
Specific to video/image processing algorithms is the vast amount of computation that has to be performed in real time. Therefore, high-speed performance has been the driving force in developing parallel specialized structures (accelerators) for different video/image processing algorithms or subtasks. A prior art video/image processing system involves several such accelerators (e.g. Sum of Absolute Differences (SAD), cosine transform, etc.), each composed of a large number of hardware elements. However, with the development of mobile communication systems, the hardware area, which affects the cost of the system, and the power/energy consumption have become as important as high-speed performance. One way towards satisfying all these criteria is further modernization of Digital Signal Processors (DSPs) and reduction of the number of specialized accelerators. Although some improvements exist in this area, the developed systems still do not always meet the high-speed and power consumption requirements.
Table 1 summarizes some core arithmetic patterns along with examples of video/image processing algorithms where these patterns are frequently implemented. The operations involved in these patterns are basic and well known, and a vast amount of literature is devoted to their implementation. Two specifics of these operations in the video/image processing context are emphasized here. First, the operands are most often medium-precision (8- to 16-bit) integers. Secondly, most of the algorithms use massively parallel operations. In some cases these parallel operations share the same operands. For example, in scalar quantization the same number is multiplied by many pixels of the image, in a matrix-vector product different rows of the matrix are multiplied by the same input vector, in Finite Impulse Response (FIR) filtering the same coefficients are involved in a number of MAC operations, etc.
TABLE 1

| Arithmetic pattern | Description | Algorithms |
|---|---|---|
| Parallel additions/subtractions | $d_i = a_i \pm b_i$, $i = 1, \ldots, K$ | Motion compensation, luminance changing, suboperation in DCT, DWT, SAD, etc. |
| Accumulation | $s = \sum_{i=1}^{K} a_i$ | Averaging filter in pre- and post-processing, suboperation in DWT, vector-vector and matrix-vector inner products, convolution, etc. |
| Parallel multiplications | $m_i = a_i x_i$ or $m_i = a_i x$, $i = 1, \ldots, K$ | Quantization, suboperation in DCT, DWT, vector-vector and matrix-vector inner products, convolution, etc. |
| Multiply-accumulate (MAC) | $s_i = s_{i-1} + a_i x_i$, $i = 1, \ldots, K$, $s_0$ is a known integer | Basic operation in FIR filtering and matrix-vector operations. |
| Vector-vector inner product | $s = \sum_{i=1}^{K} a_i x_i$ | Pre- and post-processing, suboperation in DCT, DWT, vector-vector and matrix-vector inner products, convolution, etc. |
| Matrix-vector product | $s_i = \sum_{j=1}^{P} a_{i,j} x_j$, $i = 1, \ldots, K$ | Color conversions, geometric manipulations, affine motion estimation, pre- and post-processing, suboperation in DCT, etc. |
| FIR filtering (convolution) | $s_i = \sum_{j=1}^{P} a_j x_{i-j}$, $i = 1, \ldots, K$ | Pre- and post-processing (image filtering enhancement, interpolation, extrapolation), basic operation of DWT. |
| SAD (sum of absolute differences) | $s = \sum_{i=1}^{K} \lvert a_i - b_i \rvert$ | Motion estimation, an image fidelity criterion (MAE). |
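Three of the patterns of Table 1 can be illustrated with a minimal Python sketch. The function names `mac`, `sad` and `matvec` are hypothetical and chosen only for this illustration; the sketch models the arithmetic, not any hardware structure.

```python
def mac(a, x, s0=0):
    """Multiply-accumulate: s_i = s_{i-1} + a_i * x_i, starting from s_0."""
    s = s0
    for a_i, x_i in zip(a, x):
        s += a_i * x_i
    return s


def sad(a, b):
    """Sum of absolute differences: s = sum |a_i - b_i|."""
    return sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))


def matvec(matrix, x):
    """Matrix-vector product: every row is multiplied by the same vector x."""
    return [mac(row, x) for row in matrix]
```

In `matvec` every row reuses the same operand vector `x`, which is the operand sharing mentioned in the text above.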
The prior art architectures for video and image processing and other signal processing tasks are usually based on conventional multiplier structures. Many multiplication methods and a very large variety of devices for implementing multiplication and/or multiply-accumulate operations have been developed. In the following, the discussion is limited to multiplication methods and general multiplier structures for the case where both operands (the multiplicand and the multiplier) are unknown, to multiplication of two fixed-point signed integers represented in two's complement arithmetic, and to the so-called radix-T methods.
The two's complement representation of the n-bit (including the sign) multiplier a will be denoted as $\tilde{a} = a_{n-1}a_{n-2}\ldots a_1a_0$, and the two's complement representation of the m-bit (including the sign) multiplicand x will be denoted as $\tilde{x} = x_{m-1}x_{m-2}\ldots x_1x_0$. The relation between a and $\tilde{a}$ (and a similar relation between x and $\tilde{x}$) is as follows:
$$a = -a_{n-1}2^{n-1} + \sum_{r=0}^{n-2} a_r 2^r, \qquad x = -x_{m-1}2^{m-1} + \sum_{l=0}^{m-2} x_l 2^l. \tag{1}$$
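As an illustration of Equation (1), the following Python sketch converts between an integer and its n-bit two's complement bit list. The helper names `twos_complement_bits` and `value_from_bits` are assumptions made for this example, and the bits are stored least significant bit first, so that `bits[r]` corresponds to $a_r$.

```python
def twos_complement_bits(value, n):
    """Return [a_0, a_1, ..., a_{n-1}] for an n-bit two's complement integer."""
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError("value does not fit in n bits")
    # Python's >> on negative integers is arithmetic, so this yields
    # the two's complement bits directly.
    return [(value >> r) & 1 for r in range(n)]


def value_from_bits(bits):
    """Evaluate Equation (1): -a_{n-1} 2^{n-1} + sum_{r=0}^{n-2} a_r 2^r."""
    n = len(bits)
    return -bits[n - 1] * (1 << (n - 1)) + sum(bits[r] << r for r in range(n - 1))
```

For example, the 4-bit representation of −3 is 1101, i.e. −8 + 4 + 1.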
In a radix-T parallel multiplication method the two's complement $\tilde{y} = y_{m+n-1}y_{m+n-2}\ldots y_1y_0$ of the product $y = a \cdot x$ is obtained according to the formula
$$\tilde{y} = \sum_{r=0}^{n_{\text{radix-}T}-1} (A_r x)\, 2^{r t_{\text{radix-}T}} \tag{2}$$
in the following two main steps:
    Step 1. Generation of the partial products (PP) $A_r \cdot x$, $r = 0, \ldots, n_{\text{radix-}T}-1$, such that Equation (2) is valid.
    Step 2. Summing up all the partial products in parallel, after first shifting the rth partial product $A_r \cdot x$, $r = 0, \ldots, n_{\text{radix-}T}-1$, by $r t_{\text{radix-}T}$ positions to the left.
A radix-T MAC unit operates in a similar way, with the difference that another number (the accumulating term) is added along with the partial products at Step 2.
Step 1 will now be considered in more detail. Depending on how the numbers $A_r \cdot x$, $r = 0, \ldots, n_{\text{radix-}T}-1$, are defined and obtained, different multiplication methods can be derived. In turn, the choice of these numbers is dictated by the representation of the multiplier a. The simplest multiplication method is the radix-2 method, which uses the basic two's complement representation of a given in the left equation of (1). In this case, the two's complement of the product is obtained as:
$$\tilde{y} = \sum_{r=0}^{n_{\text{radix-}2}-1} (A_r x)\, 2^{r t_{\text{radix-}2}} = \sum_{r=0}^{n-2} (a_r x)\, 2^{r} - (a_{n-1} x)\, 2^{n-1}, \tag{3}$$
that is, $n_{\text{radix-}2} = n$, and the partial products $A_r \cdot x$, $r = 0, \ldots, n-1$, are defined by $A_r = a_r$ for $r = 0, \ldots, n-2$, and $A_{n-1} = -a_{n-1}$ for $r = n-1$. These partial products may simply be (and usually are) formed using an array of 2-input AND gates between every two's complement bit of the multiplier $\tilde{a}$ and of the multiplicand $\tilde{x}$. The value of $A_r \cdot x$, $r = 0, \ldots, n-1$, is multiplied by $2^r$ (i.e. shifted to the left by r positions) before being accumulated at the second step. It should be noted that in this method the partial product $A_{n-1} \cdot x$, which is sometimes also called a correction factor, is treated differently from the other partial products.
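The radix-2 non-recoded method of Equation (3) can be modeled with a short Python sketch. The function name `radix2_multiply` is hypothetical; the sketch works on Python integers rather than on bit-level AND gates, and only mirrors the arithmetic of the two steps.

```python
def radix2_multiply(a, x, n):
    """Multiply per Equation (3); a must fit in n two's complement bits."""
    if not -(1 << (n - 1)) <= a < (1 << (n - 1)):
        raise ValueError("multiplier does not fit in n bits")
    bits = [(a >> r) & 1 for r in range(n)]  # bits[r] = a_r
    y = 0
    for r in range(n - 1):
        # A_r = a_r: the partial product is either 0 or x, shifted by r.
        y += (bits[r] * x) << r
    # A_{n-1} = -a_{n-1}: the correction term is subtracted, not added.
    y -= (bits[n - 1] * x) << (n - 1)
    return y
```

The last line is the non-uniform partial product mentioned above: it is the only one that enters the sum with a negative sign.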
The non-uniform nature of the partial products is avoided in another radix-2 multiplication method based on Booth recoding of the two's complement bits $\tilde{a}$ of the multiplier into redundant signed digit numbers. The product can now be expressed as:
$$\tilde{y} = \sum_{r=0}^{n_{\text{radix-}2}-1} (A_r x)\, 2^{r} = \sum_{r=0}^{n-1} (-a_r + a_{r-1})\, x\, 2^{r}, \qquad a_{-1} = 0. \tag{4}$$
That is, $n_{\text{radix-}2} = n$ as before, but the partial products $A_r \cdot x$, $r = 0, \ldots, n-1$, are now all defined by $A_r = -a_r + a_{r-1}$. Similarly to the previous method, the value of $A_r \cdot x$, $r = 0, \ldots, n-1$, is multiplied by $2^r$ before being added at the second step. In this scheme the partial products are selected from among $0, \pm x$. Two of these values (0 and x) are readily available, while finding −x requires inverting the bits of $\tilde{x}$ and adding unity. Normally, the addition of unity is performed in Step 2, where the partial products are summed.
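A minimal Python sketch of the radix-2 Booth recoding of Equation (4) follows. The names `booth_radix2_digits` and `booth_radix2_multiply` are assumptions of this example; the digits $A_r = -a_r + a_{r-1}$ are returned least significant first.

```python
def booth_radix2_digits(a, n):
    """Return [A_0, ..., A_{n-1}] with A_r = -a_r + a_{r-1}, a_{-1} = 0."""
    if not -(1 << (n - 1)) <= a < (1 << (n - 1)):
        raise ValueError("multiplier does not fit in n bits")
    digits = []
    previous = 0  # a_{-1} = 0
    for r in range(n):
        current = (a >> r) & 1
        digits.append(-current + previous)
        previous = current
    return digits


def booth_radix2_multiply(a, x, n):
    """Sum the partial products A_r * x, each shifted left by r positions."""
    return sum((A * x) << r for r, A in enumerate(booth_radix2_digits(a, n)))
```

Every digit lies in {−1, 0, 1}, so each partial product is one of 0, x or −x, and all rows are treated uniformly.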
There are in total $n_{\text{radix-}2} = n$ partial products to be summed in a radix-2 multiplication method, irrespective of whether the Booth recoded or the non-recoded method is used. In order to reduce the number of partial products and, hence, the delay of the second step (summing up the partial products), the radix-4 Modified Booth Algorithm (MBA) based method has been developed. The MBA is one of the most popular multiplication methods and has been extensively studied and optimized.
In order to simplify the formulae below, in every case where a term like n/k occurs, it is assumed that n is an integer multiple of k. This is a valid assumption since a two's complement number may be extended by an arbitrary number of bits (by repeating the most significant bit) without changing its value.
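The sign-extension argument can be checked with a small Python sketch. The helper names `to_bits`, `from_bits` and `sign_extend` are hypothetical, and bit lists are stored least significant bit first.

```python
def to_bits(value, n):
    """n-bit two's complement bits of value, least significant bit first."""
    return [(value >> r) & 1 for r in range(n)]


def from_bits(bits):
    """Value of a two's complement bit list (the last element is the sign bit)."""
    n = len(bits)
    return -bits[-1] * (1 << (n - 1)) + sum(b << r for r, b in enumerate(bits[:-1]))


def sign_extend(bits, new_length):
    """Lengthen the number by repeating its most significant bit."""
    if new_length < len(bits):
        raise ValueError("new_length must not be shorter than the input")
    return bits + [bits[-1]] * (new_length - len(bits))
```

Extending a 4-bit number to 6 bits, for instance, makes its length a multiple of 3 and leaves its value unchanged.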
In MBA, the two's complement of the product is obtained as the sum
$$\tilde{y} = \sum_{r=0}^{n_{\text{radix-}4}-1} (A_r x)\, 2^{2r} = \sum_{r=0}^{n/2-1} \left( \left[ -2a_{2r+1} + a_{2r} + a_{2r-1} \right] x \right) 2^{2r}, \qquad a_{-1} = 0 \tag{5}$$
of $n_{\text{radix-}4} = n/2$ partial products, where the value of $A_r \in \{-2, -1, 0, 1, 2\}$, $r = 0, 1, \ldots, n/2-1$, is chosen according to three consecutive bits $a_{2r+1}, a_{2r}, a_{2r-1}$ ($a_{-1} = 0$) of the two's complement representation $\tilde{a}$ of the multiplier. The partial product $A_r x$, $r = 0, 1, \ldots, n/2-1$, is multiplied by $2^{2r}$ (i.e. shifted in hardware to the left by 2r positions) before being added at Step 2.
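The recoding of Equation (5) can be sketched in Python as follows. The names `mba_digits` and `mba_multiply` are assumptions made for this illustration, and the sketch models only the arithmetic, not the encoder and decoder circuits.

```python
def mba_digits(a, n):
    """Radix-4 digits A_r = -2*a_{2r+1} + a_{2r} + a_{2r-1}, with a_{-1} = 0."""
    if n % 2 != 0:
        raise ValueError("n must be even (sign-extend the multiplier if needed)")
    if not -(1 << (n - 1)) <= a < (1 << (n - 1)):
        raise ValueError("multiplier does not fit in n bits")

    def bit(k):
        return 0 if k < 0 else (a >> k) & 1

    return [-2 * bit(2 * r + 1) + bit(2 * r) + bit(2 * r - 1) for r in range(n // 2)]


def mba_multiply(a, x, n):
    """Sum the n/2 partial products A_r * x, each shifted left by 2r positions."""
    return sum((A * x) << (2 * r) for r, A in enumerate(mba_digits(a, n)))
```

For the 4-bit multiplier 7 (0111) the digits are −1 and 2, i.e. 7 = −1 + 2·4, so only two partial products are summed instead of four.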
It is also possible to use radices higher than 2 with non-recoded multiplication methods to reduce the number of partial products. For example, in the radix-4 non-recoded multiplication method the partial products $A_r x$, $A_r \in \{0, 1, 2, 3\}$, $r = 0, 1, \ldots, n/2-1$, are chosen according to two consecutive bits $a_{2r+1}, a_{2r}$ of the multiplier. There are $n_{\text{radix-}4} = n/2$ partial products in this method. The potential partial product 2x can be generated by shifting the potential partial product x once to the left. The odd partial product 3x needs an additional summation of x (3x = 2x + x). If negative numbers are also multiplied, sign extension must be used, in which the most significant bit (i.e. the sign bit) of each partial product is copied as many times as necessary to achieve the required bit-length.
In the radix-8 non-recoded multiplication method the partial products $A_r x$, $A_r \in \{0, 1, 2, 3, 4, 5, 6, 7\}$, $r = 0, 1, \ldots, n/3-1$, are chosen according to three consecutive bits of the multiplier. The list of potential partial products is 0, x, 2x, 3x, . . . , 7x, all of which become available by implementing three independent additions/subtractions in order to obtain 3x = x + 2x, 5x = x + 4x, 7x = 8x − x. The potential partial product 6x can be formed by shifting the potential partial product 3x one position to the left. For higher radices (≥16), however, there are some potential partial products (e.g. 11x and 13x) that cannot be obtained in one addition/subtraction.
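The radix-8 non-recoded scheme can be sketched in Python as below. The function names are hypothetical, and for simplicity the sketch assumes a non-negative multiplier a of n bits (n a multiple of 3); the handling of a signed multiplier is not modeled here.

```python
def radix8_potential_partial_products(x):
    """Return [0, x, 2x, ..., 7x] using shifts and three additions/subtractions."""
    x3 = x + (x << 1)   # 3x = x + 2x
    x5 = x + (x << 2)   # 5x = x + 4x
    x7 = (x << 3) - x   # 7x = 8x - x
    return [0, x, x << 1, x3, x << 2, x5, x3 << 1, x7]  # 6x = 3x shifted once


def radix8_nonrecoded_multiply(a, x, n):
    """Select one potential partial product per 3-bit group of the multiplier."""
    if n % 3 != 0 or not 0 <= a < (1 << n):
        raise ValueError("assumes 0 <= a < 2**n with n a multiple of 3")
    table = radix8_potential_partial_products(x)
    y = 0
    for r in range(n // 3):
        A = (a >> (3 * r)) & 0b111  # three consecutive multiplier bits
        y += table[A] << (3 * r)
    return y
```

The table of potential partial products is computed once per multiplicand, after which each of the n/3 partial products is a pure selection followed by a shift.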
FIG. 1 presents a general device 101 for performing the Modified Booth Algorithm. There are n/2 Booth encoding-decoding rows, each row comprising a Booth encoder 102 and m+1 Booth decoders 103, which may be grouped by two. Every Booth encoder 102 analyzes three consecutive bits of the two's complement $\tilde{a}$ of the multiplier, with an overlap of one bit, and outputs q signals to the corresponding row of decoders 103. In some recent prior art designs q=3. According to these q signals, the decoder rows form the partial products $(A_r x) \in \{0, \pm x, \pm 2x\}$, having the bits $\tilde{x}$ of the multiplicand at their inputs. The nonnegative multiples of x are readily available since 2x is formed by a hardwired shift. The negative multiples of x are formed by inverting the bits of the corresponding positive multiples of x and then adding 1, which is usually performed at Step 2. For example, U.S. Pat. No. 6,173,304 describes such a system implementing the Booth encoders and decoders. In the radix-2 method the partial products can be found more easily than in the Modified Booth Algorithm, but the Modified Booth Algorithm reduces the number of partial products to n/2, which leads to significant advantages in speed performance, area, and power consumption.
In order to further reduce the number of partial products, Booth encoding has been extended to multibit (arbitrary radix-T) recoding. The general equation for the product is now given as:
$$\tilde{y} = \sum_{r=0}^{n_{\text{radix-}T}-1} (A_r x)\, 2^{rt} = \sum_{r=0}^{n/t-1} \left( \left[ -a_{tr+t-1} 2^{t-1} + \sum_{i=0}^{t-2} a_{tr+i} 2^{i} + a_{tr-1} \right] x \right) 2^{tr}, \qquad a_{-1} = 0, \quad T = 2^{t} = 2^{t_{\text{radix-}T}}. \tag{6}$$
That is, there are $n_{\text{radix-}T} = n/t$ partial products ($T = 2^t$), and every partial product is selected according to t+1 consecutive bits of the multiplier $\tilde{a}$ from the list of potential partial products Ax, with A ranging between $-2^{t-1}$ and $2^{t-1}$. Every potential partial product may be relatively easily formed by addition of two (for T=8, 16) or more (for T>16) power-of-2 multiples of x and, possibly, inversion of the bits followed by addition of 1 (at Step 2). For example, in the case of radix-8 recoding, the list of potential partial products is 0, ±x, ±2x, ±3x, ±4x. All the nonnegative multiples from this list are readily available except for 3x, which may be obtained in one addition: 3x = x + 2x. Negative multiples may be found by the invert-add-1 method, as before. In the case of radix-16 recoding, the list of potential partial products is 0, ±x, ±2x, ±3x, . . . , ±8x, all of which become available by implementing three independent additions/subtractions in order to obtain 3x = x + 2x, 5x = x + 4x, 7x = −x + 8x. The potential partial product 6x can be formed by shifting the potential partial product 3x one position to the left. For higher radices, however, there are some potential partial products (e.g. 11x and 13x) that cannot be obtained in one addition/subtraction.
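The general recoding of Equation (6) and the remark on 11x and 13x can be illustrated with the following Python sketch. All function names are assumptions of this example. The last helper only tests whether a multiple can be written as a sum or difference of two power-of-2 multiples, which is one way of reading "one addition/subtraction" here.

```python
def booth_radix_t_digits(a, n, t):
    """Digits A_r of Equation (6) for radix T = 2**t, least significant first."""
    if n % t != 0:
        raise ValueError("n must be a multiple of t")
    if not -(1 << (n - 1)) <= a < (1 << (n - 1)):
        raise ValueError("multiplier does not fit in n bits")

    def bit(k):
        return 0 if k < 0 else (a >> k) & 1  # a_{-1} = 0

    digits = []
    for r in range(n // t):
        A = -bit(t * r + t - 1) * (1 << (t - 1))
        A += sum(bit(t * r + i) << i for i in range(t - 1))
        A += bit(t * r - 1)
        digits.append(A)
    return digits


def booth_radix_t_multiply(a, x, n, t):
    """Sum the n/t partial products A_r * x, each shifted left by t*r positions."""
    return sum((A * x) << (t * r) for r, A in enumerate(booth_radix_t_digits(a, n, t)))


def is_sum_or_difference_of_two_powers_of_two(k, max_shift=8):
    """True if k = 2**i + 2**j or k = 2**i - 2**j for some i, j <= max_shift."""
    powers = [1 << i for i in range(max_shift + 1)]
    return any(p + q == k or p - q == k for p in powers for q in powers)
```

With t = 1 the digits reduce to those of Equation (4) and with t = 2 to those of Equation (5). Among the odd multiples up to 15x, only 11x and 13x fail the last test.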
FIG. 2 presents the general structure 201 of prior art radix-T (T ≥ 8) multibit Booth recoded and radix-T (T ≥ 4) new non-recoded (“radix-higher-than-four”) multipliers. This structure comprises an array of adders 202 for computing the list of potential partial products 0, ±x, ±2x, . . . , ±Tx, a selection block 203 for selecting n/t partial products according to the multiplier bits, and a summation block 204 for summing up the selected partial products. The final adder 205 forms the product $\tilde{y}$ from the sum S and carry C terms produced by the summation block 204.
The array of adders of a typical prior art radix-higher-than-four multiplier comprises s adders/subtractors, where s is the number of odd positive multiples of x involved in the list of potential partial products (s=1 in the cases of T=8 Booth recoded and T=4 non-recoded multipliers, and s=3 in the cases of T=16 Booth recoded or T=8 non-recoded multipliers, etc.). Usually, fast carry-look-ahead (CLA) adders are used since forming the list of potential partial products is rather time consuming part of such multipliers. In a patent U.S. Pat. No. 5,875,125 a special x+2x adder has been proposed which may be used in radix-8 multipliers. It should be noted that mixed radix-4/8 multipliers have also been proposed, for example in U.S. Pat. No. 4,965,762, which, however, are mainly useful for iterative (not parallel) multipliers where the partial products are generated and accumulated serially. U.S. Pat. No. 5,646,877 describes a multiplier structure where all the potential partial products for an arbitrary radix are obtained as sums or differences of shifted versions of 3x and of x within the array of adders comprises an x+2x adder for generating 3x, two shifters and an adder/subtracter.
The selection block of a typical prior art radix-higher-than-four multiplier comprises n/t radix-T Booth encoders and an equal number of decoder rows. Each encoder analyzes the corresponding (t+1)-tuple of the multiplier and outputs a plurality of control signals according to which the corresponding partial products are formed by the decoder rows. Remarks on how to extend the radix-4 Booth encoders and decoders to higher radices are given, for example, in U.S. Pat. No. 6,240,438.
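The role of the encoders may be illustrated with the following Python sketch, which is only a behavioural model and not the circuit of any cited patent. The names `booth_digits` and `booth_multiply` are hypothetical. Each digit is derived from a (t+1)-tuple of multiplier bits and selects one entry A·x from the list of potential partial products; here the selection is modelled by the product `d * x` for brevity, and the multiplier is sign-extended to a multiple of t bits, which is an assumption of the sketch.

```python
def booth_digits(a, n, t):
    """Recode an n-bit two's complement multiplier into radix-2**t Booth digits.

    Digit r is computed from bits t*r-1 .. t*r+t-1 of the multiplier, i.e. from
    t+1 consecutive bits, and lies between -2**(t-1) and 2**(t-1).
    """
    groups = -(-n // t)                       # ceil(n / t) digits
    total_bits = groups * t
    a &= (1 << n) - 1                         # n-bit two's complement pattern
    if (a >> (n - 1)) & 1:                    # sign-extend negative multipliers
        a |= ((1 << total_bits) - 1) ^ ((1 << n) - 1)

    def bit(i):
        return 0 if i < 0 else (a >> i) & 1

    digits = []
    for r in range(groups):
        base = t * r
        d = bit(base - 1) - (bit(base + t - 1) << (t - 1))
        d += sum(bit(base + i) << i for i in range(t - 1))
        digits.append(d)
    return digits


def booth_multiply(a, x, n, t):
    """Sum the selected partial products d*x, each shifted by t*r positions."""
    total = 0
    for r, d in enumerate(booth_digits(a, n, t)):
        total += (d * x) << (t * r)           # hardware would pick d*x from the list
    return total
```

For example, `booth_digits(a, 13, 3)` always returns five digits, so a 13-bit multiplier yields five partial products in the radix-8 case.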
In the following, the summing up of the partial products, i.e. Step 2, will be considered in more detail. Most of the parallel multiplier/MAC unit structures use summation blocks composed of a compression array followed by a fast adder (final adder) for summing up the partial products formed at Step 1 (see FIGS. 1 and 2). The compression array reduces the n_radix-T partial product rows to two rows corresponding to sum S and carry C terms that are added with the final adder. The compression array is usually composed of either full and half adders (a carry-save-adder-tree or a Wallace tree) or 4:2 compressors. The final adder is usually a fast carry-look-ahead adder, which is carefully designed according to the delays of different bits from the compression array.
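As an illustration of how a compression array built from full adders reduces the rows to a sum term and a carry term, consider the following Python sketch. It is a simplified word-level model under stated assumptions: the names `carry_save_add` and `compress` are hypothetical, rows are plain nonnegative integers, and each pass of the loop stands for one level of full adders working on groups of three rows.

```python
def carry_save_add(a, b, c):
    """Bitwise full adders (3:2 compression): a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def compress(rows):
    """Reduce a list of rows to two rows (S, C) and count the adder levels used."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):      # every complete group of three rows
            nxt.extend(carry_save_add(*rows[i:i + 3]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])   # leftover rows pass through
        rows = nxt
        levels += 1
    rows += [0] * (2 - len(rows))                  # pad if fewer than two rows were given
    return rows[0], rows[1], levels
```

The final adder then computes S + C. In this model, 7, 5 and 4 input rows require 4, 3 and 2 levels respectively.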
It should be noted that, if a Booth recoding scheme is utilized, then, as a result of performing the unity addition in Step 2 instead of in Step 1, every partial product row is accompanied by a one-bit value, which is zero if the partial product is a nonnegative multiple of the multiplicand and is unity otherwise. Thus, the number of rows is actually 2·n_radix-T. These one-bit values may be merged into the partial product rows in such a way that the number of rows is again n_radix-T or perhaps n_radix-T+1, but at the price of increasing the length of the partial product rows (by one bit) and making them irregular. In a non-recoded scheme, there is at most one extra one-bit value, so simpler compression arrays can be designed.
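The deferred unity addition can be sketched in Python as follows. The helper names `rows_with_one_bits` and `sum_rows` are hypothetical, the Booth digits are supplied directly as a list of signed integers, and the multiplicand is assumed nonnegative; these are simplifications made for the illustration.

```python
def rows_with_one_bits(digits, x, t, width):
    """Form one (row, one_bit) pair per Booth digit in a `width`-bit word.

    A negative multiple is produced by inverting the bits of |d|*x only; the
    missing "+1" is carried along as a separate one-bit value for Step 2.
    """
    mask = (1 << width) - 1
    pairs = []
    for r, d in enumerate(digits):
        magnitude = abs(d) * x
        if d < 0:
            pairs.append(((~magnitude << (t * r)) & mask, 1))   # inverted row, flag set
        else:
            pairs.append(((magnitude << (t * r)) & mask, 0))
    return pairs


def sum_rows(pairs, t, width):
    """Add all rows and their one-bit values (each at the weight of its row)."""
    mask = (1 << width) - 1
    total = 0
    for r, (row, one_bit) in enumerate(pairs):
        total += row + (one_bit << (t * r))
    return total & mask
```

With radix-8 digits [-1, 2, -3] (value −177) and x=5, the rows plus their one-bit values add up to the 16-bit two's complement pattern of −885.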
Another problem associated with the summation block in a Booth recoded multiplier is how to handle sign extensions, since the partial product rows are shifted with respect to each other before adding them. In a straightforward implementation every partial product (after shifting) would have to be extended to an (n+m)-bit number, which is a very wasteful approach. Special sign extension methods and circuits have been developed to reduce the number of sign extended bits to two in every row. In the case of non-recoded multipliers, sign extensions may be handled more easily, with no extra sign bits, since all but possibly one of the partial products are of the same sign.
There are principally two ways of extending multiplier structures to MAC units as depicted in FIGS. 3a and 3b. In the first case (FIG. 3a), the two outputs (the sum S and carry C terms) of the compression array 301 are fed back to its inputs so that the current partial product values are accumulated with the two addends of the current accumulation value. The final sum S and carry C terms are then added within the final adder 302. In the second case (FIG. 3b) these outputs are fed to another compression array 303 outputs of which are fed back to its (the second compression array 303) input. Now, the sum S and the carry C terms of the current product are accumulated to the current accumulation value until the last cycle when the final sum S and carry C terms are added within the final adder 302. The depth (and, therefore, the overall delay) of the whole compression array may be smaller in the first case while the width and, therefore, the area and power consumption may be smaller in the second case.
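The two accumulation schemes can be compared with a small behavioural Python model. All names (`csa`, `compress`, `partial_products`, `mac_single_array`, `mac_two_arrays`) are hypothetical, and the sketch assumes unsigned operands and simple radix-2 partial products so that only the feedback structure is shown.

```python
def csa(a, b, c):
    """One row of full adders: a + b + c == s + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compress(rows):
    """Reduce any number of rows to a (sum, carry) pair."""
    rows = list(rows) + [0, 0]                # guarantee at least two rows
    while len(rows) > 2:
        s, c = csa(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    return rows[0], rows[1]


def partial_products(a, x, n):
    """Unsigned radix-2 partial products of an n-bit multiplier a."""
    return [x << i for i in range(n) if (a >> i) & 1]


def mac_single_array(pairs, n):
    """First case: S and C are fed back to the inputs of the same array."""
    s = c = 0
    for a, x in pairs:
        s, c = compress(partial_products(a, x, n) + [s, c])
    return s + c                              # final adder


def mac_two_arrays(pairs, n):
    """Second case: a second array accumulates the S and C terms of each product."""
    s_acc = c_acc = 0
    for a, x in pairs:
        s, c = compress(partial_products(a, x, n))
        s_acc, c_acc = compress([s, c, s_acc, c_acc])
    return s_acc + c_acc                      # final adder
```

In both cases the carry-propagating final addition is performed only once, after the last cycle.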
As a summary of high-radix multipliers, it should be noted that the higher the radix, the higher the complexity of Step 1 (generating partial products) but the lower the complexity of Step 2 (summing up partial products). The “radix-higher-than-four” multiplication methods have not gained popularity, perhaps due to the necessity of having rather time and area consuming partial product generators, including both the array of adders and the selection block. Commonly, the radix-4 MBA is considered the best prior art parallel multiplication method and is used in many industrial multipliers.
A method called pipelining can be used in connection with calculation operations. Hence, a device utilizing pipelining comprises two or more pipeline stages. Each pipeline stage is intended to perform a certain part or parts of calculation operations (i.e. sub-operations). In the prior art the calculation operations of the pipeline stages relate to each other so that each pipeline stage performs one or more sub-operations of the calculation operation to be performed, and the output of the last pipeline stage provides the result of the calculation operation. In such a device different pipeline stages operate in succession, where the next pipeline stage begins the calculation of the sub-operation after the previous pipeline stage has finished the calculation of its sub-operation. If pipeline stages are poorly balanced (i.e. some stages are significantly faster than the others), this means that all but one pipeline stage is waiting or is in an idle state most of the time. Furthermore, all the pipeline stages are reserved for a certain task (calculation of a certain sub-operation) and they cannot be used for performing other calculation operations.
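The effect of poor balancing can be quantified with a small Python sketch. The function name `pipeline_profile` and the example delays are hypothetical; the sketch assumes a synchronous pipeline whose cycle time is set by its slowest stage.

```python
def pipeline_profile(stage_delays, operations):
    """Return (cycle time, total time, per-stage utilization) of a pipeline.

    The cycle time equals the delay of the slowest stage, so every faster
    stage is idle for the remainder of each cycle.
    """
    cycle = max(stage_delays)
    total_time = (len(stage_delays) + operations - 1) * cycle
    utilization = [delay / cycle for delay in stage_delays]
    return cycle, total_time, utilization
```

For stage delays of 2, 8, 4 and 2 time units, the cycle time is 8 units and the first and last stages are busy only a quarter of the time.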
In the following, some multiplier/MAC unit features, which are desired from a video and image processing point of view but are absent or weak in prior art solutions, will be presented. First, the most popular radix-4 Booth recoded multiplier/MAC method will be considered. A general drawback of this method is that it is more power consuming than higher radix methods. Another general drawback is that, even though the number of partial products is reduced to half compared to radix-2 multiplication, it could be reduced further using higher radices. That is, the complexity of this method is mainly concentrated in Step 2 (summing up partial products). When pipelining a radix-4 Booth-recoded multiplier/MAC structure, usually the partial product generation block is considered as the first pipeline stage which, however, is poorly balanced with (i.e. faster than) the other pipeline stages.
Considering the “radix-higher-than-four” Booth recoded multipliers, it has been shown that different realizations of these multipliers, when considered for implementation of only the multiplication operation, perform competitively to the radix-4 multipliers with respect to the area and time criteria while outperforming those with respect to the power consumption. The main drawback of the “radix-higher-than-four” methods is the necessity of having an array of adders at the partial product generation block.
The Booth recoded “radix-higher-than-four” multipliers also have a drawback related to the necessity of handling the negative multiples of the multiplicand as well as the sign extensions.
A radix-T non-recoded multiplier involves the same number of adders in the potential partial product generation block as the radix-(2T) Booth recoded one. When “radix-higher-than-eight” non-recoded or “radix-higher-than-sixteen” Booth recoded multipliers are used, more than one-level addition is needed to generate potential partial products.
TABLE 2a
Multiplier type         BR, T=4                       BR, T=8                       BR, T=16
AA width, s             —                             1                             3
# of potential PPs      5                             9                             17
Components of the SB:
  Encod.                n/2 BR4                       n/3 BR8                       n/4 BR16
  Decod.                n/2 (m+1)−BD4                 n/3 (m+2)−4:1 t/c             n/4 (m+3)−8:1 t/c
  SE                    Yes                           Yes                           Yes
Delay of the SB         6t                            12t                           16t
# of inputs to CA       n/2 (m+1)-bit + 3n/2 1-bit    n/3 (m+2)-bit + 4n/3 1-bit    n/4 (m+3)-bit + 5n/4 1-bit
# of inputs/levels/delay of the FA-CA:
  n=13, x               7/4/8t                        5/3/6t                        4/2/4t
  n=13, MAC             9/5/10t                       8/4/8t                        6/3/6t
  n=16, x               8/4/8t                        6/3/6t                        4/2/4t
  n=16, MAC             10/5/10t                      8/4/8t                        6/3/6t
  n=64, x               32/8/16t                      22/7/14t                      16/6/12t
  n=64, MAC             34/9/18t                      24/7/14t                      18/6/12t
# of inputs/levels/delay of the 4:2-CA:
  n=13, x               7/2/6t                        5/(4:2)+FA/5t                 4/1/3t
  n=13, MAC             9/2(4:2)+FA/8t                7/2/6t                        6/2/6t
  n=16, x               8/2/6t                        6/2/6t                        4/1/3t
  n=16, MAC             10/3/9t                       8/2/6t                        6/2/6t
  n=64, x               32/4/12t                      22/4/12t                      16/3/9t
  n=64, MAC             34/5/15t                      24/4/12t                      18/4/12t
TABLE 2b
Multiplier type         NR1, T=4                      NR1, T=8                      NR2, T=4                      NR2, T=8
AA width, s             2                             4                             1                             3
# of potential PPs      4                             8                             4                             8
Components of the SB:
  Encod.                No                            No                            1 BR4                         1 BR8
  Decod.                n/2 (m+1)−4:1                 n/3 (m+2)−8:1                 (m+1)−(BD4+n/2(4:1))          (m+2)(4:1 t/c+n/3(8:1))
  SE                    No                            No                            No                            No
Delay of the SB         5t                            6t                            6t                            12t
# of inputs to CA       ((n−1)/2+1) (m+2)-bit         ((n−1)/3+1) (m+3)-bit         (n−1)/2 (m+4)-bit + 1 1-bit   (n−1)/3 (m+6)-bit + 1 1-bit
# of inputs/levels/delay of the FA-CA:
  n=13, x               7/4/8t                        5/3/6t                        6/3/6t                        4/2/4t
  n=13, MAC             9/4/8t                        7/4/8t                        8/4/8t                        6/3/6t
  n=16, x               9/4/8t                        6/3/6t                        8/4/8t                        5/3/6t
  n=16, MAC             11/5/10t                      8/4/8t                        10/5/10t                      7/4/8t
  n=64, x               33/8/16t                      22/7/14t                      32/8/16t                      21/7/14t
  n=64, MAC             35/9/18t                      24/7/14t                      34/9/18t                      23/8/16t
# of inputs/levels/delay of the 4:2-CA:
  n=13, x               7/2/6t                        5/(4:2)+FA/5t                 6/2/6t                        4/1/3t
  n=13, MAC             9/2(4:2)+FA/8t                7/2/6t                        8/2/6t                        6/2/6t
  n=16, x               9/2(4:2)+FA/8t                6/2/6t                        8/2/6t                        5/(4:2)+FA/5t
  n=16, MAC             11/3/9t                       8/2/6t                        10/3/9t                       7/2/6t
  n=64, x               33/4(4:2)+FA/14t              22/4/12t                      32/4/12t                      21/4/12t
  n=64, MAC             35/5/15t                      24/4/12t                      34/5/15t                      23/4/12t
TABLE 2c
BR                            Booth recoded radix-T multiplier
NR1                           Non-recoded radix-T multiplier of type 1
NR2                           Non-recoded radix-T multiplier of type 2
SE                            Sign extension circuitry
BR4, BR8, BR16                Booth recoder circuitries for the corresponding radix
BD4                           Radix-4 Booth decoder circuitry
4:1, 8:1, 4:1 t/c, 8:1 t/c    Multiplexers or true/complement multiplexers with corresponding number of inputs
SB                            Selection Block
CA, FA-CA, 4:2 CA             Compression array, CA composed of Full (FA) and Half Adders (HA), CA composed of 4:2 compressors
Table 2a presents various characteristics of different blocks used in prior art Booth recoded radix-T multipliers/MAC units, and Table 2b presents various characteristics of different blocks used in new Non-recoded radix-T multipliers/MAC units, in both cases for multiplying an n-bit multiplier and an m-bit multiplicand. Table 2c presents the acronyms utilized in Tables 2a and 2b. Analyzing Tables 2a and 2b, one can see that there are essential differences in the delays of the different blocks of every multiplier/MAC unit type for most of the values of n and m. That is, a direct pipelined realization of these multipliers would suffer from poor balancing between pipeline stages. Trying to achieve better balancing between these pipeline stages, one could flexibly increase the throughputs of the first and the last pipeline stages by designing carry-look-ahead (CLA) adders with different numbers of FAs within one carry-ripple block. This is why the delays of these blocks are not indicated in Tables 2a and 2b. In some cases, e.g. small n, higher radix T, this may mean very small carry-ripple blocks, and therefore large area. Anyhow, using CLAs may be a solution for speeding up these two stages, though this solution is not the most efficient. The situation is different for the middle two pipeline stages (the SB and the CA) since the state-of-the-art circuits for these blocks are designed by optimizing critical paths in such a way that internal pipelining of these circuits would not be reasonable. Moreover, the relative differences between the delays of these blocks vary widely across the types (BR (T=8,16), NR1 (T=4,8)) of prior art multiplier/MAC unit structures and across different values of n.
For the cases of smaller n and higher radix T (e.g., T=16 and arbitrary n) the Selection Block is slower than the Compression Array while in other cases the situation is the opposite. This means that designing faster circuits for one or another block will not give a general solution to the problem of balancing these stages. All of these make it difficult to derive a systematic method for pipelining prior art multiplier/MAC structures with well-balanced pipeline stages.
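The observation above can be checked directly against the figures of Table 2a. The following Python sketch copies the Selection Block delays and the FA-based Compression Array delays of the plain multiplier rows (marked x) for the Booth recoded types, in units of t; the names `SB_DELAY`, `FA_CA_DELAY` and `slower_block` are hypothetical.

```python
# Delays in units of t, taken from Table 2a (Booth recoded multipliers, rows "x").
SB_DELAY = {4: 6, 8: 12, 16: 16}                 # radix T -> delay of the SB
FA_CA_DELAY = {                                  # n -> radix T -> delay of the FA-CA
    13: {4: 8, 8: 6, 16: 4},
    16: {4: 8, 8: 6, 16: 4},
    64: {4: 16, 8: 14, 16: 12},
}


def slower_block(n, radix):
    """Name the slower of the two middle pipeline stages for the given n and T."""
    sb, ca = SB_DELAY[radix], FA_CA_DELAY[n][radix]
    if sb == ca:
        return "balanced"
    return "SB" if sb > ca else "CA"
```

With these figures the Selection Block is the slower stage for T=16 at every listed n, the Compression Array is the slower stage for T=4, and for T=8 the answer changes between n=13 and n=64.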
On the other hand, if a parallel array of independent prior art multipliers/MAC units were to be employed to implement a plurality of corresponding operations, a large silicon area would be required even though the faster blocks could be shared without affecting overall throughput. In addition, as illustrated above, the pipeline stages would be poorly balanced if pipelined multipliers/MAC units were to be employed in such an array.
Prior art parallel architectures for matrix-vector arithmetic are designed so that they use a plurality of independent multipliers combined with adders or a plurality of independent MAC units. It is also usual to have a plurality of specific circuits for different kinds of arithmetic operations. However, high radix multipliers include blocks (pipeline stages) that may be reused for other operations such as additions/subtractions, accumulations, etc. It should also be mentioned that, in matrix-vector arithmetic, especially that used in video/image processing, there are many situations where one multiplicand is to be multiplied with several multipliers, meaning that the potential partial products of the multiplicand could be reused in all of these multiplications if high-radix multiplication methods were used. For example, in vector and matrix calculation operations in which the same multiplicand is used many times, the same potential partial products are needed more than once during the calculation. When such calculation operations are implemented in prior art architectures, the partial products are calculated each time they are needed, leading to inefficient utilization of hardware resources and additional power consumption. In addition, as was mentioned above, there are many situations in which the pipeline stages of prior art multipliers are poorly balanced, thus reducing the efficiency of the device.
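The cost of recomputing potential partial products may be illustrated with the following Python sketch. The class name `Radix4Multiplier`, its `reuse` flag and its `additions` counter are hypothetical; the model assumes an unsigned non-recoded radix-4 multiplier, whose list 0, x, 2x, 3x needs one addition in the array of adders.

```python
class Radix4Multiplier:
    """Unsigned non-recoded radix-4 multiplier that counts array-of-adders additions."""

    def __init__(self, reuse):
        self.reuse = reuse          # True: keep the list for a repeated multiplicand
        self.additions = 0
        self._lists = {}

    def _potential_pps(self, x):
        if self.reuse and x in self._lists:
            return self._lists[x]
        self.additions += 1                         # the addition 3x = x + 2x
        pps = [0, x, x << 1, x + (x << 1)]
        self._lists[x] = pps
        return pps

    def multiply(self, a, x, n):
        """Multiply the n-bit multiplier a by x, selecting one multiple per 2 bits."""
        pps = self._potential_pps(x)
        total = 0
        for r in range(0, n, 2):
            total += pps[(a >> r) & 3] << r
        return total
```

Scaling a four-element vector by one multiplicand costs four additions in the array of adders when the list is rebuilt for every multiplication, but only one when it is reused.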