1. Field of the Invention
The invention relates to an area-efficient realization of a coefficient block [A], or architecture [A], with hardware sharing techniques and optimizations applied to this block. The block [A] is connected to coefficient lines CLin_0, CLin_1 . . . CLin_n and BLin_0, BLin_1, . . . BLin_n coming from block [E] and/or [F], to perform a filtering operation or a mathematical computing operation with optimized hardware, and provides a zero latency output. The invention also gives the area-minimal realization of digital filters based on coefficient block [A] when operated in bit-serial fashion. The optimization techniques and structure of the present invention are suited to bit-serial digital filters, typically a finite impulse response (FIR) filter or an infinite impulse response (IIR) filter, and to other filters and applications based on combinational logic consisting of delay elements (T), multipliers (M), serial adders (SA) and serial subtractors (SS).
2. Description of the Related Art
Details of Elements/Symbol Used in the Description
The symbols of the basic components used in the design are shown in “FIG. 2” of the drawings. In addition, the operation and usage of these components are explained in the text below and depicted in “FIG. 3” and “FIG. 4” of the drawings.
Unit Delay (T)
It is a one-bit delay element. It also performs the function of a multiplier by a factor of 2. For example, for the serial input frame 0101011 in binary (43 in integer representation), the output of this block is 01010110 in binary (86 in integer representation). This element is usually a flip-flop (D flip-flop, J-K flip-flop, etc.).
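By way of illustration only, the doubling behaviour of the unit delay can be checked with a small software model. The sketch assumes that the frame is shifted least significant bit first and that the flip-flop is cleared at the start of the frame; the helper names to_bits, from_bits and unit_delay are hypothetical and do not appear in the drawings.

```python
def to_bits(value, width):
    """Return `value` as a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from an LSB-first list of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def unit_delay(bits, initial=0):
    """Model one flip-flop [T]: each output bit is the previous input bit."""
    state = initial  # assumption: cleared at the start of the frame
    out = []
    for bit in bits:
        out.append(state)  # emit the bit stored on the previous clock
        state = bit        # store the incoming bit for the next clock
    return out


frame = to_bits(43, 8)                  # 43 = 00101011 in an 8-bit frame
doubled = from_bits(unit_delay(frame))  # the delayed frame read back as an integer
print(doubled)                          # prints 86
```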
Full Adder (FA)
It performs binary addition. The inputs to this element are A, B, Cin (Carryin) while the outputs are Z and Cout (Carryout). The truth table for full adder functionality is shown in “FIG. 3” of the drawings.
Full Subtractor (FS)
It performs binary subtraction. The inputs to this element are A, B, Cin (Carryin) while the outputs are Z and Cout (Carryout). The truth table for full subtractor functionality is shown in “FIG. 3” of the drawings.
Serial Adder (SA) and Serial Subtractor (SS)
It performs addition/subtraction of two serial frames, x1(nT) and x2(nT), to generate the output y(nT), represented as x1(nT)+x2(nT) or x1(nT)−x2(nT). The serial adder (or subtractor) is implemented using a full adder (or subtractor) with a flip-flop, as shown in “FIG. 3” of the drawings. The output Cout of [FA/FS] is delayed using the [T] element and is applied to the Cin line of [FA/FS]. This enables the [FA/FS] and [T] together to function as a serial adder/subtractor (SA/SS), where A and B are the inputs to this element and Z is the output. As an example of serial addition, if x1(nT)=0110 (6 in integer representation) and x2(nT)=0111 (7 in integer representation), then y(nT)=01101 (13 in integer representation).
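The serial adder described above can be sketched in software as a full adder whose carry output is held for one clock and fed back to its carry input. The sketch is illustrative only; it assumes LSB-first frames and a carry flip-flop cleared at the start of each frame, and the function names are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder [FA]: returns (Z, Cout)."""
    z = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return z, cout


def serial_add(x1_bits, x2_bits):
    """Add two LSB-first bit frames with one [FA] and one carry flip-flop [T]."""
    carry = 0  # assumption: the carry flip-flop is cleared at frame start
    out = []
    for a, b in zip(x1_bits, x2_bits):
        z, carry = full_adder(a, b, carry)  # Cout is stored and reused as Cin
        out.append(z)
    return out


def to_bits(value, width):
    """Return `value` as a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from an LSB-first list of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


y = from_bits(serial_add(to_bits(6, 5), to_bits(7, 5)))
print(y)  # prints 13
```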
Serial Multiplier (M)
It multiplies a serial input frame X(nT) by a coefficient m. The output is the function Y(nT)=X(nT)*m. A serial coefficient multiplier (M) can be implemented by a shift register built from [T] elements together with adder elements [SA] (one shift means multiplication by a factor of 2). As shown in “FIG. 3” of the drawings, the multiplier is formed by adding the shift register outputs corresponding to the ones in the binary representation of the coefficient.
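The shift-and-add principle can be sketched at the integer level as follows. This is a minimal model rather than the circuit of “FIG. 3”: each left shift stands for one [T] stage of the shift register and each accumulation stands for one [SA]; the function name shift_add_multiply is hypothetical.

```python
def shift_add_multiply(x, m):
    """Multiply x by a non-negative integer coefficient m using shifts and adds."""
    total = 0
    shifted = x  # tap of the shift register after k [T] stages, i.e. x * 2**k
    while m:
        if m & 1:
            total += shifted  # add the tap for every one in the coefficient
        shifted <<= 1         # one more [T] stage: multiply by 2
        m >>= 1
    return total


print(shift_add_multiply(43, 11))  # 43*8 + 43*2 + 43*1
```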
Delay (Z−1)
A delay by one frame of data is provided by a shift register (a series of flip-flops (T) connected to store and shift the input frame). The number of unit delays (T) in one delay element is equal to the frame size of the input.
The following description discusses the elements used for implementation of the architecture and the existing implementations of digital filters. The proposed minimization is extendable to other applications in the digital signal processing field and in digital design.
From here onwards, all illustrations use an FIR filter; they are extendable to other filters as described earlier. “FIG. 4” shows the existing structure of a bit-serial FIR filter with coefficient lines CLin_0, CLin_1 . . . CLin_n and the coefficient block [A] having the coefficients c(0), c(1), c(2), . . . c(n). The coefficient block is connected to delay elements [Z−1] and serial adders [SA] to form the filter structure.
Stating the FIR filter equation in the time and frequency domains:
Y(n) = c(0)X(n) + c(1)X(n−1) + c(2)X(n−2) + . . . + c(n)X(0)
Y(z) = X(z)[c(0) + c(1)Z^−1 + c(2)Z^−2 + c(3)Z^−3 + c(4)Z^−4 + c(5)Z^−5 + c(6)Z^−6 + . . . + c(n)Z^−n]
where X and Y are the input and output respectively, c(0), c(1) . . . c(n) represent the coefficient values which define the characteristics of the filter, and each delay block [Z−1] represents a delay of one sample. The filter equation can be implemented in two ways, as shown in “FIG. 4” of the drawings.
In implementation 1, the coefficient lines CLin_0, CLin_1, . . . CLin_n are common and connected to the input X[n]. The output lines CLout_0, CLout_1 . . . CLout_n are connected to block [E], consisting of delay elements [Z−1] and serial adders [SA]. This structure allows easy realization of a share-able multiplier in the coefficient block [A]. An example of a share-able multiplier with coefficient values 3 and 11 is illustrated in “FIG. 4”. Realizing these coefficients separately would require 4[T] and 3[SA] elements. Because CLin_0, CLin_1, . . . are common, the hardware is realized using 3[T] and 2[SA] elements. Another feature of the structure is that it inherently requires more storage area, represented by [Z−1], than implementation 2, since the storage is done after the multiplication. For an input frame of n bits and a coefficient of size m bits, the storage area of each delay element [Z−1] is (m+n). The total storage space of the delay elements is (m+n)*(number of coefficients −1).
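The element counts quoted for the coefficient values 3 and 11 can be reproduced with an integer-level sketch. The particular sharing shown (reusing the partial sum for 3 when forming 11) is an assumption chosen to match the stated 3[T], 2[SA] total; it is not taken from “FIG. 4”, and the function names are hypothetical.

```python
def separate_products(x):
    """Each coefficient has its own shift register and adders."""
    three = x + (x << 1)               # 3 = 2+1    -> 1 [T], 1 [SA]
    eleven = x + (x << 1) + (x << 3)   # 11 = 8+2+1 -> 3 [T], 2 [SA]
    return three, eleven               # total: 4 [T], 3 [SA]


def shared_products(x):
    """Common input line: one shift register serves both coefficients."""
    x2 = x << 1            # first [T]
    x4 = x2 << 1           # second [T] (tap unused, needed to reach x8)
    x8 = x4 << 1           # third [T]
    three = x + x2         # first [SA]
    eleven = three + x8    # second [SA], reusing the partial sum for 3
    return three, eleven   # total: 3 [T], 2 [SA]


print(separate_products(7), shared_products(7))
```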
In implementation 2, the coefficient lines CLin_0, CLin_1, . . . are not common. Because different input lines are connected to the coefficient elements [c(0), c(1), . . . ], the coefficient block [A] cannot be realized using share-able elements. Another feature of this structure is that it inherently requires less storage space, represented as [Z−1], because, unlike in the previous implementation, the storage is done before multiplication. For an input frame of m bits and a coefficient of size n bits, the storage area of each delay element [Z−1] is (m). The total storage space is (m)*(number of coefficients −1).
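The two storage formulas can be compared numerically. The sketch below is illustrative only; the 16-bit frame, 8-bit coefficient and 7 coefficients are assumed example values, not figures from the description, and the function names are hypothetical.

```python
def storage_impl1(frame_bits, coeff_bits, num_coeffs):
    """Implementation 1: delays store products, so each holds frame + coefficient bits."""
    return (frame_bits + coeff_bits) * (num_coeffs - 1)


def storage_impl2(frame_bits, num_coeffs):
    """Implementation 2: delays store raw input frames only."""
    return frame_bits * (num_coeffs - 1)


# Assumed example: 16-bit input frame, 8-bit coefficients, 7 coefficients.
print(storage_impl1(16, 8, 7))  # flip-flops needed by implementation 1
print(storage_impl2(16, 7))     # flip-flops needed by implementation 2
```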
The invention proposes to reduce the area of the coefficient block [A] and to have share-able elements in the coefficients, even if the coefficient lines CLin_0, CLin_1 . . . are not commonly connected. For the existing configuration shown in “FIG. 7” and “FIG. 8”, the limited share-ability of hardware in block [A] is a limitation.
Also, as described in the previous section, implementation 2 is area efficient with respect to implementation 1 due to the reduced size of its delay elements. Over and above this, with a share-able multiplier or reduced coefficient block [A], which are the key features of the invention, implementation 2 becomes still more area-efficient. This reduction is extendable to other filters based on coefficient block [A], as stated in the first section. The present invention operates on integer-valued coefficients.
Further, to quote Norsworthy and Crochiere (Delta-Sigma Data Converters, IEEE Press, p. 435, copyright 1997):
“Bit-serial architecture reduce the interprocessor communication down to 1 bit. Generally the number of processors is very large, but because each processor is so small, the overall economy is very high. Bit serial architectures are usually most effective for filters having a few state variables, such as IIR filters and the wave-digital filters. For this reason, bit-serial techniques are less frequently applied to FIR structures, especially when the filter length is relatively long . . . ”.
However, the present invention applies a number of optimization techniques to FIR/IIR filter structures in order to reduce the area required for large-sized coefficients.
To elaborate the applicant's optimization techniques, consider an FIR filter with the coefficients 5, 14, 25, 30, 25, 14, and 5. Though the coefficients in this example are small, they are enough to illustrate the minimization proposals. In most practical cases, the coefficients are symmetrical.
Stating the FIR filter equation in the time and frequency domains:
Y(n) = c(0)X(n) + c(1)X(n−1) + c(2)X(n−2) + . . . + c(n)X(0)
Y(z) = X(z)[c(0) + c(1)Z^−1 + c(2)Z^−2 + c(3)Z^−3 + c(4)Z^−4 + c(5)Z^−5 + c(6)Z^−6 + . . . + c(n)Z^−n]
where X and Y are the input and output respectively and c(0), c(1) . . . c(n) represent the coefficient values.
Using the coefficient values in the above equation:
Y(n) = 5X(n) + 14X(n−1) + 25X(n−2) + 30X(n−3) + 25X(n−4) + 14X(n−5) + 5X(n−6)
Y(z) = X(z)[5 + 14Z^−1 + 25Z^−2 + 30Z^−3 + 25Z^−4 + 14Z^−5 + 5Z^−6]  (EQ 1)
The Existing Method and Minimization
“FIG. 5” of the drawings shows the FIR filter structure of implementation 2. The figure illustrates the realization of the FIR filter represented by “Equation 1”.
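As a behavioural reference for “Equation 1”, the example filter can be sketched at the integer sample level. This is a minimal model, not the bit-serial hardware of “FIG. 5”; samples before the start of the input are assumed to be zero, and the names COEFFS and fir are hypothetical.

```python
COEFFS = [5, 14, 25, 30, 25, 14, 5]  # example coefficients c(0) .. c(6)


def fir(samples, coeffs=COEFFS):
    """Direct evaluation of Y(n) = sum over k of c(k) * X(n-k)."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:  # assumption: input is zero before sample 0
                acc += c * samples[n - k]
        out.append(acc)
    return out


impulse = [1, 0, 0, 0, 0, 0, 0]
print(fir(impulse))  # the impulse response reproduces the coefficients
```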
In one known optimization technique, advantage is taken of the symmetry in the coefficients. The streams which have to be multiplied by the same coefficient can be added first and then multiplied. For a large filter structure, this leads to a reduction of 45% in the coefficient block (see “FIG. 6” of the accompanying drawings).
This is done by restructuring the equation as under:
Y(z) = X(z)[5*(1 + Z^−6) + 14*(Z^−1 + Z^−5) + 25*(Z^−2 + Z^−4) + 30*Z^−3]  (EQ 2)
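The restructured form can be sketched by first adding the streams that share a coefficient and then multiplying. The sketch is illustrative only: it works on integer samples, assumes zero input before sample 0, and uses the hypothetical name folded_fir; the local names s1 to s4 correspond to the pre-added streams.

```python
def folded_fir(samples):
    """Evaluate EQ 2: add symmetric taps first, then multiply once per coefficient."""
    def x(i):
        # assumption: the input is zero outside the supplied samples
        return samples[i] if 0 <= i < len(samples) else 0

    out = []
    for n in range(len(samples)):
        s1 = x(n) + x(n - 6)      # streams sharing coefficient 5
        s2 = x(n - 1) + x(n - 5)  # streams sharing coefficient 14
        s3 = x(n - 2) + x(n - 4)  # streams sharing coefficient 25
        s4 = x(n - 3)             # centre tap, coefficient 30
        out.append(5 * s1 + 14 * s2 + 25 * s3 + 30 * s4)
    return out


print(folded_fir([1, 0, 0, 0, 0, 0, 0]))  # same impulse response as EQ 1
```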
For the rest of the optimization proposals, the discussion covers only the multiplier-adder series shown in the dotted box, referred to as coefficient block [A]. “FIG. 7” of the drawings shows the traditional implementation of the example structure for block [A], wherein S1 to S4 represent the lines connected to the delay blocks [Z−1] through lines CLin_0 to CLin_6 depicted in “FIG. 6” of the drawings. The lines S1 to S4 are separately connected to [T] elements to perform multiplication by a factor of 2, and [SA] elements are used to perform serial addition of data. This represents the multiplier-less realization of the filter coefficient block [A], in which the property of the flip-flop [T] as a multiplier by a factor of two is used.
Mathematically, the restructured equation according to the structure is stated as:
Y(nT) = (4+1)S1 + (8+4+2)S2 + (16+8+1)S3 + (16+8+4+2)S4  (EQ 3)
In this implementation, the lines S1, S2, S3, S4 are not commonly connected, which prevents share-able hardware in coefficient block [A]. Thus every function/operation of this block is realized in its own hardware. The elements required by the terms are:
    First term = 2[T], 1[SA]
    Second term = 3[T], 2[SA]
    Third term = 4[T], 2[SA]
    Fourth term = 4[T], 3[SA]
The final addition of all four terms requires 3[SA].
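The per-term element counts follow from the binary representation of each coefficient, which can be checked with a short sketch. The counting rule used here (one [T] per bit position up to the highest one, and one [SA] fewer than the number of ones) is an assumption inferred from the listed figures, and the function name is hypothetical.

```python
def elements_for_coefficient(c):
    """Return ([T] count, [SA] count) for one unshared shift-and-add multiplier."""
    ones = bin(c).count("1")
    t_count = c.bit_length() - 1  # shift register must reach the highest one
    sa_count = ones - 1           # adding k taps needs k-1 serial adders
    return t_count, sa_count


coeffs = [5, 14, 25, 30]
per_term = [elements_for_coefficient(c) for c in coeffs]
total_t = sum(t for t, _ in per_term)
# Summing the four terms costs one further [SA] per extra term.
total_sa = sum(sa for _, sa in per_term) + (len(coeffs) - 1)

print(per_term)           # [(2, 1), (3, 2), (4, 2), (4, 3)]
print(total_t, total_sa)  # 13 11
```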
The generalized structure of “The Existing Method and Minimization” is depicted in “FIG. 8”. In the structure, each column represents a coefficient value. The [T] elements shown as T1_1 to T1_m in column 1 define the connectivity with line S1. In similar fashion, the [T] elements shown as Tn_1 to Tn_m in column n define the connectivity with line Sn.
The presence of any one of the elements in columns 1 to n (i.e. T1_1 to T1_m, T2_1 to T2_m . . . Tn_1 to Tn_m) is determined by the coefficient value. Thus, depending on the coefficient value on lines S1 to Sn, the number of [T] elements in a column is determined. The serial adders/subtractors [SA/SS] in the columns are represented as (SA1_1 to SA1_m, SA2_1 to SA2_m . . . SAn_1 to SAn_m). The presence of any one of these elements is again defined by the coefficient value.
In the structure, the [T] elements are arranged in shift register form. The input to the first [T] element is connected to one of the S lines, while the inputs to each [SA/SS] are connected to the inputs S1 to Sn and/or to the output of one of the [T] elements of the shift register, depending on the coefficient value. Finally, using the elements SAe_1 to SAe_n-1, the addition/subtraction of all the coefficient terms depicted in the columns is done. The final output is the output of the last addition/subtraction [SA/SS].
Among the lines S1 to Sn, the [T] elements are not share-able, and the [SA] elements in each column are not share-able either. Thus only limited minimization is possible in this structure.
Minimization (Already Applied as Patent)
This structure reduces the hardware of the coefficient block [A] by having share-able elements in the coefficients, even if the coefficient lines CLin_0, CLin_1 . . . are not commonly connected. This structure reduces the area by approximately 30–50% relative to “FIG. 7” of the drawings by reducing the number of components and by sharing components. Here the optimization techniques are illustrated with examples, and the end of this section gives the generalized equation and structure of the device.
Continuing the same example of the FIR filter and using “Equation 3” of the previous section:
y(nT) = 5*S1 + 14*S2 + 25*S3 + 30*S4
Y(nT) = (4+1)S1 + (8+4+2)S2 + (16+8+1)S3 + (16+8+4+2)S4
The applicants proceed to share the shift registers (multiply by 2) of the design:
= (S3+S4)*16 + (S2+S3+S4)*8 + (S1+S2+S4)*4 + (S2+S4)*2 + (S1+S3)
= (S1+S3) + 2*(S2+S4 + 2*(S1+S2+S4 + 2*(S2+S3+S4 + 2*(S3+S4))))  (EQ 4)
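The grouping by powers of two can be derived mechanically from the binary representation of the coefficients. The sketch below is illustrative only and uses the hypothetical name bit_plane_groups; for each bit position it lists the S lines whose coefficient has a one in that position.

```python
def bit_plane_groups(coeffs):
    """Map each bit position to the S lines whose coefficient has that bit set."""
    width = max(c.bit_length() for c in coeffs)
    return {
        bit: [f"S{i + 1}" for i, c in enumerate(coeffs) if (c >> bit) & 1]
        for bit in range(width)
    }


groups = bit_plane_groups([5, 14, 25, 30])
for bit in sorted(groups, reverse=True):
    # One line per power of two, highest weight first.
    print(2 ** bit, "+".join(groups[bit]))
```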
Finding the common additive factors:
A1 = S2 + S4
A2 = S3 + S4
“Equation 4” can be further reduced as:
y(nT) = (S1+S3) + 2*(A1 + 2*(S1+A1 + 2*(S2+A2 + 2*A2)))  (EQ 5)
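That the factored form is equal to the original weighted sum can be confirmed by exhaustive evaluation over a small input range. The sketch treats S1 to S4 as plain integers; the function names eq3 and eq5 are hypothetical labels for the two forms.

```python
from itertools import product


def eq3(s1, s2, s3, s4):
    """Unshared form: each coefficient expanded into powers of two."""
    return (4 + 1) * s1 + (8 + 4 + 2) * s2 + (16 + 8 + 1) * s3 + (16 + 8 + 4 + 2) * s4


def eq5(s1, s2, s3, s4):
    """Shared form: common sums A1, A2 and nested multiplications by 2."""
    a1 = s2 + s4
    a2 = s3 + s4
    return (s1 + s3) + 2 * (a1 + 2 * (s1 + a1 + 2 * (s2 + a2 + 2 * a2)))


# Compare both forms for every combination of inputs in 0..7.
mismatches = [s for s in product(range(8), repeat=4) if eq3(*s) != eq5(*s)]
print(len(mismatches))  # 0 when the two forms agree everywhere
```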
The implementation flow for this equation is illustrated here, and the hardware implementation is shown in “FIG. 9” and “FIG. 10” of the drawings [SA(1), SA(2), etc. represent adders; T(1), T(2), etc. represent unit delays]. In the flow of implementation, S1, S2, S3, S4 represent the four inputs. The primary addition is done using serial adders SA(1), SA(3), SA(9), representing the addition of the terms S1+S3, S2+S4, S3+S4. The secondary and tertiary additions are done using the adders SA(5), SA(7), SA(8), SA(6), SA(4), SA(2). The multiplication by a factor of two is done using the elements T(1), T(2), T(3), T(4).
Implementation Flow of Equation 
The hardware implementation is shown in “FIG. 9” of the drawings, wherein the input lines S1 to S4 represent the lines connected to the delay blocks [Z−1] through coefficient lines CLin_0 to CLin_6 depicted in “FIG. 6” of the drawings. The lines S1 to S4 are connected to block [B] for performing the serial addition/subtraction, for which (SA) and (SS) elements are used within block [B]. The output of each block [B] is terminated with a [T] block, which represents the block [B] output being multiplied by a factor of 2. The output b_1 of block [B], which is at bit position 0, is fed to the input of T(1); in turn, the output line t_1 of element [T(1)] is fed to the next section of block [B]. Thus all additions belonging to one bit position are completed before the result is multiplied by 2 and moved to the next bit position. All [T] elements are represented by block [C]. In this structure, the flip-flop [T] representing multiplication by a factor of 2 is shared between the various coefficient values, which reduces the number of flip-flops [T].
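A clock-by-clock model of a structure of this kind, built from nine serial adders and four unit delays wired according to “Equation 5”, is sketched below. It is an illustration under stated assumptions, not a transcription of “FIG. 9”: frames are LSB first, all flip-flops start cleared, the 16-bit frame width is an assumed value, and the element indices do not follow the SA(n)/T(n) numbering of the drawings.

```python
class SerialAdder:
    """Full adder whose carry output is fed back through a flip-flop."""

    def __init__(self):
        self.carry = 0

    def step(self, a, b):
        z = a ^ b ^ self.carry
        self.carry = (a & b) | (self.carry & (a ^ b))
        return z


class UnitDelay:
    """Flip-flop [T]: outputs the bit stored on the previous clock."""

    def __init__(self):
        self.state = 0

    def step(self, d):
        q, self.state = self.state, d
        return q


def coefficient_block(s1, s2, s3, s4, width=16):
    """Bit-serial evaluation of EQ 5 with 9 serial adders and 4 unit delays."""
    sa = [SerialAdder() for _ in range(9)]
    t = [UnitDelay() for _ in range(4)]
    y = 0
    for i in range(width):  # one clock per bit, LSB first
        b1, b2, b3, b4 = ((s >> i) & 1 for s in (s1, s2, s3, s4))
        a1 = sa[0].step(b2, b4)                                    # A1 = S2+S4
        a2 = sa[1].step(b3, b4)                                    # A2 = S3+S4
        inner = sa[3].step(sa[2].step(b2, a2), t[3].step(a2))      # S2+A2+2*A2
        mid = sa[5].step(sa[4].step(b1, a1), t[2].step(inner))     # S1+A1+2*inner
        outer = sa[6].step(a1, t[1].step(mid))                     # A1+2*mid
        out_bit = sa[8].step(sa[7].step(b1, b3), t[0].step(outer)) # S1+S3+2*outer
        y |= out_bit << i  # output bit i is available in clock i
    return y


print(coefficient_block(1, 2, 3, 4))  # 5*1 + 14*2 + 25*3 + 30*4
```

The frame width must be large enough to hold the result; with a narrower frame the model returns the result truncated to that width.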
In the minimization of “FIG. 9” of the drawings, the approximate area is 9 serial adders + 4 [T] = 22 units, whereas the area of “FIG. 7” of the drawings is 11 serial adders + 13 [T] = 35 units (assuming 1 unit = 1 FA = 2 HA = 1 T, and 1 serial adder = 2 units). This results in a 37% saving in area (13/35*100).
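The area comparison can be reproduced with the stated unit costs. The sketch is illustrative only; the constant and function names are hypothetical, and the unit costs are those assumed in the text (1 [T] = 1 unit, 1 serial adder = 2 units).

```python
SA_UNITS = 2  # one serial adder = full adder + carry flip-flop
T_UNITS = 1   # one unit delay


def area(num_sa, num_t):
    """Approximate area in units for a given element count."""
    return num_sa * SA_UNITS + num_t * T_UNITS


existing = area(11, 13)  # structure of FIG. 7
proposed = area(9, 4)    # structure of FIG. 9
saving_percent = 100 * (existing - proposed) / existing

print(existing, proposed, round(saving_percent))  # 35 22 37
```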