The cascaded biquad infinite impulse response (IIR) digital filter has been widely used in the field of communications. For example, such digital filters are used to remove noise, enhance communication signals, and/or synthesize communication signals. Compared to the FIR (finite impulse response) filter, an IIR filter can often be much more efficient in terms of attaining certain performance characteristics with a given filter order. This is because the IIR filter incorporates feedback and is capable of realizing both poles and zeroes of a system transfer function, whereas the FIR filter is only capable of realizing the zeroes.
Higher-order IIR filters can be obtained by cascading several biquad sections (or biquad IIR filters) with appropriate coefficients. Another way to design higher-order IIR filters is to use only a single complicated section. This approach is called the direct form implementation. The biquad implementation executes more slowly than the direct form implementation but generates smaller numerical errors. The biquad sections can be scaled separately and then cascaded in order to minimize the coefficient quantization and the recursive accumulation errors. The coefficients and data in the direct form implementation must be scaled all at once, which gives rise to larger errors. Another disadvantage of the direct form implementation is that the poles of such single-stage high-order polynomials become increasingly sensitive to quantization errors as the order grows. The second-order polynomial sections (i.e., biquads) are less sensitive to quantization effects.
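By way of illustration only, the relationship between the two approaches can be sketched as follows: multiplying the second-order polynomials of two biquad sections yields the single fourth-order polynomial that a direct form implementation would have to quantize as a whole. The function name `poly_mul` and the example coefficient values are assumptions chosen for this sketch and do not come from any particular filter design.

```python
def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

# Hypothetical denominators of two biquad sections (poles at 0.5 and 0.25).
den1 = [1.0, -0.5, 0.0]
den2 = [1.0, -0.25, 0.0]

# Equivalent single fourth-order denominator of the direct form.
direct_den = poly_mul(den1, den2)
```

In the cascade, each short coefficient list is quantized on its own; in the direct form, the longer product list is quantized instead.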
By way of example, a cascaded biquad IIR filter may be implemented on a very long instruction word (VLIW) digital signal processor (DSP), such as the StarCore SC140. Operations of a cascaded biquad IIR filter on a fixed-point DSP, such as the StarCore SC140 DSP, may include MAC (multiply and accumulate) and scaling operations. As is known, the StarCore SC140 is a third-generation DSP architecture that deploys a variable length execution set (VLES) execution model. It contains four data arithmetic and logic units (DALUs) and two address generation units (AGUs), and can therefore run up to six instructions per clock cycle (four on the DALUs and two on the AGUs). The StarCore SC140 was jointly developed by Lucent Technologies Inc. (Murray Hill, N.J.) and Motorola Semiconductor (Schaumburg, Ill.). Still further, the cascaded biquad IIR digital filter algorithm has been selected by Berkeley Design Technology Inc. (BDTI) as one of the twelve algorithms used to benchmark processor performance (e.g., that of the StarCore SC140) for the DSP industry.
A fourth-order cascaded biquad IIR filter has the following transfer function (wherein each stage of the cascaded IIR filter is itself a second-order IIR filter):

\[
H(z) = \prod_{i=1}^{2} \frac{1 + b_{i1} z^{-1} + b_{i2} z^{-2}}{1 - a_{i1} z^{-1} - a_{i2} z^{-2}} \tag{1}
\]
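A floating-point reference for equation (1) may help fix the notation before the fixed-point structure is considered. The following is a minimal sketch only: the function name `biquad_cascade`, the tuple layout `(b1, b2, a1, a2)`, and the example coefficients are assumptions made for illustration. Each stage keeps two feedback states, corresponding to w(n−1) and w(n−2).

```python
def biquad_cascade(samples, sections):
    """Filter samples through cascaded biquads.

    sections: list of (b1, b2, a1, a2) tuples, one per stage, using the
    sign convention of equation (1).
    """
    states = [[0.0, 0.0] for _ in sections]  # [w(n-1), w(n-2)] per stage
    out = []
    for x in samples:
        for (b1, b2, a1, a2), w in zip(sections, states):
            w0 = x + a1 * w[0] + a2 * w[1]   # feedback (pole) part
            x = w0 + b1 * w[0] + b2 * w[1]   # feedforward (zero) part
            w[1], w[0] = w[0], w0            # shift the state
        out.append(x)
    return out

# Hypothetical example: stage 1 has a single pole at 0.5, stage 2 is a pass-through.
impulse_response = biquad_cascade([1.0, 0.0, 0.0, 0.0],
                                  [(0.0, 0.0, 0.5, 0.0), (0.0, 0.0, 0.0, 0.0)])
```

The output of one stage is simply fed to the next, which is what the product over i in equation (1) expresses.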
It is known that in order to make the filter and the inverse filter stable, both the poles and the zeros of H(z) are restricted to be inside the unit circle. This means that coefficients bi1, bi2, ai1, ai2 will be in the range of [−2, 2]. FIG. 1 illustrates the architecture of a conventional fixed-point structure of a cascaded biquad IIR filter implementing the transfer function represented in equation (1) above.
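The coefficient range stated above can be checked with a short derivation. This is a sketch only; the symbols p_{i1}, p_{i2} (poles of stage i) and q_{i1}, q_{i2} (zeros of stage i) are introduced here for illustration. Factoring the denominator and the numerator of stage i of equation (1) gives:

\[
z^{2} - a_{i1} z - a_{i2} = (z - p_{i1})(z - p_{i2})
\;\Rightarrow\; a_{i1} = p_{i1} + p_{i2}, \quad a_{i2} = -p_{i1} p_{i2},
\]
\[
z^{2} + b_{i1} z + b_{i2} = (z - q_{i1})(z - q_{i2})
\;\Rightarrow\; b_{i1} = -(q_{i1} + q_{i2}), \quad b_{i2} = q_{i1} q_{i2}.
\]

With all poles and zeros strictly inside the unit circle,

\[
|a_{i1}| < 2, \quad |a_{i2}| < 1, \quad |b_{i1}| < 2, \quad |b_{i2}| < 1,
\]

so every coefficient lies within [−2, 2].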
It is to be understood that, for fixed-point implementation, an (m+n) bit number is represented in a Qm.n format. The highest bit represents the sign bit. The next (m−1) bits represent the integer part. The lowest n bits represent the fractional part. A multiplication of two fixed-point numbers, Qm1.n1 and Qm2.n2, produces a Q(m1+m2−1).(n1+n2+1) number. In the conventional implementation of the cascaded biquad IIR filter, (m+n) equals 16 or 32. The input data, coefficients, states and output data are represented with 16 bit precision, and the intermediate data is kept in 32 bit precision. Truncating the lower 16 bits of a 32 bit value produces a 16 bit value. To keep a filter coefficient in the range [−2, 2], the coefficients are represented in a Q2.14 format. The input data is represented in a Q1.15 format.
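The Q-format arithmetic described above can be illustrated with a small sketch. The helper names `to_q`, `from_q`, and `frac_mul` are hypothetical, and the saturation on conversion is an assumption of this sketch rather than something stated above. The fractional multiply shifts the integer product left by one bit, which is what turns a Q1.15 by Q2.14 product into a Q2.30 value, consistent with the Q(m1+m2−1).(n1+n2+1) rule.

```python
def to_q(value, frac_bits, total_bits=16):
    """Quantize a real number to a signed fixed-point integer (assumed saturating)."""
    scaled = int(round(value * (1 << frac_bits)))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def from_q(raw, frac_bits):
    """Convert a fixed-point integer back to a real number."""
    return raw / (1 << frac_bits)

def frac_mul(a, b):
    """Fractional multiply: integer product shifted left by one bit."""
    return (a * b) << 1

x = to_q(0.5, 15)      # input sample in Q1.15
c = to_q(1.5, 14)      # coefficient in Q2.14
p = frac_mul(x, c)     # 32 bit product in Q2.30
p16 = p >> 16          # dropping the lower 16 bits leaves Q2.14
```

Here `from_q(p, 30)` and `from_q(p16, 14)` both recover 0.75, the product of 0.5 and 1.5.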
As mentioned, FIG. 1 illustrates the architecture of the conventional fixed-point structure of the cascaded biquad IIR filter implementing the transfer function represented in equation (1) above. It is to be understood that certain of the filter operations are first generally described below in the context of FIG. 1 and then all the filter operations are described in detail in the context of FIG. 3 with respect to the instruction code shown in FIG. 2. Thus, with reference to FIG. 1, the “Put_h” operation (reference numeral 2 in FIG. 1) deposits the 16 bit value into the higher 16 bits of the 32 bit register. The “Div_2” operation (reference numeral 4 in FIG. 1) scales down a 32 bit value by one bit. The “Mul_2” operation (reference numeral 6 in FIG. 1) scales up a 32 bit value by one bit. The “Ext_h” operation (reference numeral 8 in FIG. 1) extracts the higher 16 bits of a 32 bit value. Further, w1(n−1), w1(n−2), w2(n−1), and w2(n−2) (denoted by reference numerals 10, 12, 14 and 16, respectively, in FIG. 1) are the four 16-bit feedback state values for the cascaded biquad IIR filter. It is to be understood that FIG. 1 actually contains two biquad IIR filter stages 1 and 3. The two filter stages 1 and 3 are in a cascaded configuration, thus forming a cascaded biquad IIR filter. The term “biquad” refers to the fact that each filter stage is a second (bi) order filter with four (quad) filter coefficients.
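The four scaling operations named above can be modeled as plain shifts. The following is a minimal sketch under stated assumptions: the Python function names mirror the operation labels of FIG. 1 but are otherwise hypothetical, and because Python integers are unbounded, 32 bit overflow and saturation are not modeled.

```python
def put_h(x16):
    """Put_h: place a 16 bit value in the higher 16 bits of a 32 bit word."""
    return x16 << 16

def div_2(x32):
    """Div_2: scale a 32 bit value down by one bit (arithmetic shift right)."""
    return x32 >> 1

def mul_2(x32):
    """Mul_2: scale a 32 bit value up by one bit."""
    return x32 << 1

def ext_h(x32):
    """Ext_h: extract the higher 16 bits of a 32 bit value."""
    return x32 >> 16

sample = -12345                       # hypothetical 16 bit input sample
acc = div_2(put_h(sample))            # scaled down before accumulation
round_trip = ext_h(mul_2(acc))        # scaled up again and truncated to 16 bits
```

One way to read the pairing is that Div_2 makes room for the products of 16 bit states and Q2.14 coefficients, and Mul_2 restores the original scale before Ext_h truncates the result.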
The corresponding SC140 assembly code is shown in FIG. 2. It is to be understood that the code shown in FIG. 2 is the optimized SC140 kernel code when the conventional fixed-point structure shown in FIG. 1 is used to implement the IIR filter. Register “r0” contains the address for the four Q2.14 format filter coefficients. Register “r1” contains the address for the four Q1.15 format filter states. Execution of this kernel code takes seven cycles per input sample.
In order to explain the data flow associated with the execution of the optimized assembly code of FIG. 2, the nomenclature associated with the filter structure of FIG. 1 is modified, as shown in FIG. 3. Thus, FIG. 3 represents a flow diagram illustrating the data flow of the conventional fixed-point cascaded biquad IIR filter structure shown in FIG. 1. It is to be appreciated that x1(i) denotes that the data is 32 bit long, and xs(i) denotes that the data is 16 bit long.
When the conventional fixed-point structure is used to implement the IIR filter, the SC140 DSP takes 7 cycles per input sample. The following is a detailed analysis of the execution of the assembly code. In accordance with the following explanation of FIG. 3, it is to be understood that d0, d1, . . . , d15 are the SC140 DSP's data registers, and r0 and r1 are the pointer registers. A simplified block diagram of the SC140 is shown in FIG. 4, wherein PDB is the program data bus, PAB is the program address bus, ABA is the address bus A, ABB is the address bus B, DBA is the data bus A, and DBB is the data bus B. Further, it is to be understood that the functionality of the adders (each denoted by reference numeral 18 in FIG. 3) and the multipliers (each denoted by reference numeral 19 in FIG. 3) is apparent in the description below of the operation of the IIR filter structure of FIG. 3 when executing the code in FIG. 2. That is, an addition operation provided by an adder 18 is denoted below as “+”, while a multiplication operation provided by a multiplier 19 is denoted below as “*”. Also, during the description below, reference will be made to FIG. 5, which is a flow diagram summarizing the operation of each step of the filtering process.

1. Initially, data register “d0” keeps (saves, stores, holds, etc.) the value of “x1(0).” Pointer register “r0” points to the address where the 8 coefficient values, b11, b12, a11, a12, b21, b22, a21, and a22, are held. Pointer register “r1” points to the address where the 4 state values, w1(n−1), w1(n−2), w2(n−1), and w2(n−2), are held.

2. During cycle 1, SC140 executes the following instruction code (line 20 in FIG. 2): “asr d0,d0 move.4f(r0)+,d4:d5:d6:d7 move.4f(r1),d8:d9:d10:d11” where:
    “asr d0,d0” executes “Div_2” and data register “d0” keeps the value of “x1(1);”
    “move.4f(r0)+,d4:d5:d6:d7” loads b11, b12, a11, and a12 to data registers “d4,” “d5,” “d6,” and “d7;” and
    “move.4f(r1),d8:d9:d10:d11” loads the 4 state values w1(n−1), w1(n−2), w2(n−1), and w2(n−2) to data registers “d8,” “d9,” “d10,” and “d11.”
The above filter operation is summarized in step 33 of FIG. 5.

3. During cycle 2, SC140 executes the following instruction code (line 22 in FIG. 2): “mac d6, d8, d0 mpy d4, d8, d1 move.4f(r0), d12:d13:d14:d15” where:
    “mac d6, d8, d0” executes “x1(1)+a11*w1(n−1)=x1(1)+x1(3),” and keeps the result in data register “d0;”
    “mpy d4, d8, d1” executes “b11*w1(n−1)=x1(7),” and keeps the result in data register “d1;” and
    “move.4f(r0), d12:d13:d14:d15” loads b21, b22, a21, and a22 to data registers “d12,” “d13,” “d14,” and “d15.”
The above filter operation is summarized in step 34 of FIG. 5.

4. During cycle 3, SC140 executes the following instruction code (line 24 in FIG. 2): “mac d7,d9,d0 mac d5,d9,d1 mpy d14,d10,d2 mpy d12,d10,d3” where:
    “mac d7,d9,d0” executes “d0+a12*w1(n−2)=x1(1)+x1(3)+x1(2)=x1(5),” and keeps the result in data register “d0;”
    “mac d5,d9,d1” executes “d1+b12*w1(n−2)=x1(7)+x1(8)=x1(9),” and keeps the result in data register “d1;”
    “mpy d14,d10,d2” executes “a21*w2(n−1)=x1(12),” and keeps the result in data register “d2;” and
    “mpy d12,d10,d3” executes “b21*w2(n−1)=x1(16),” and keeps the result in data register “d3.”
The above filter operation is summarized in step 35 of FIG. 5.

5. During cycle 4, SC140 executes the following instruction code (line 26 in FIG. 2): “mac d15,d11,d2 mac d13,d11,d3 add d0,d0,d0 add d1,d1,d1” where:
    “mac d15,d11,d2” executes “d2+a22*w2(n−2)=x1(12)+x1(11)=x1(13),” and keeps the value of “x1(13)” in data register “d2;”
    “mac d13,d11,d3” executes “d3+w2(n−2)*b22=x1(16)+x1(17)=x1(18),” and keeps the value of “x1(18)” in data register “d3;”
    “add d0,d0,d0” executes “Mul_2” on the value of “x1(5),” and keeps the result “2*x1(5)=x1(16)” in data register “d0;” and
    “add d1,d1,d1” executes “Mul_2” on the value of “x1(9),” and keeps the result “2*x1(9)” in data register “d1.”
The above filter operation is summarized in step 36 of FIG. 5.

6. During cycle 5, SC140 executes the following instruction code (line 28 in FIG. 2): “add d0,d1,d10 add d2,d2,d2 add d3,d3,d3 tfr d10,d7” where:
    “add d0,d1,d10” executes “2*x1(5)+2*x1(9)=2*x1(10),” and keeps the result in data register “d10;”
    “add d2,d2,d2” executes “2*x1(13),” and keeps the result in data register “d2;”
    “add d3,d3,d3” executes “2*x1(18),” and keeps the result in data register “d3;” and
    “tfr d10,d7” transfers the current state “w2(n−1)=xs(4)” from data register “d10” to data register “d7,” which becomes the state “w2(n−2)” for the next input sample.
The above filter operation is summarized in step 37 of FIG. 5.

7. During cycle 6, SC140 executes the following instruction code (line 30 in FIG. 2): “add d10,d2,d6 tfr d0,d4 move.f(r1),d5” where:
    “add d10,d2,d6” performs “2*x1(10)+2*x1(13)=2*x1(14)=x1(15),” and keeps the result in data register “d6.” The value of “xs(3)” is in the higher 16 bits of data register “d6.” The value of “xs(3)” is used to update the state “w2(n−1);”
    “tfr d0,d4” transfers the value of “x1(16)” from data register “d0” to data register “d4.” The value of “xs(1)” is kept in the higher 16 bits of data register “d4.” The value of “xs(1)” updates the state w1(n−2). This operation puts the value of “xs(1)” in the correct order for updating the 4 states w1(n−1), w1(n−2), w2(n−1), and w2(n−2) in one SC140 instruction “moves.4f d4:d5:d6:d7,(r1);” and
    “move.f(r1),d5” loads the current state “w1(n−1)=xs(2)” into data register “d5” to update the state for the next input sample. The sample “w1(n−1)” will become “w1(n−2)” for the next input sample.
The above filter operation is summarized in step 38 of FIG. 5.

8. During cycle 7, SC140 executes the following instruction code (line 32 in FIG. 2): “add d6,d3,d0 moves.4f d4:d5:d6:d7,(r1)” where:
    “add d6,d3,d0” performs “2*x1(14)+2*x1(18)=2*x1(19)=x1(20),” and the higher 16 bits of the value of “x1(20)” in data register “d0” hold the filter output “y(n);” and
    “moves.4f d4:d5:d6:d7,(r1)” saves the 4 new states w1(n−1), w1(n−2), w2(n−1), and w2(n−2) in the memory pointed to by pointer register “r1.”
The above filter operation is summarized in step 39 of FIG. 5.
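The seven-cycle data flow walked through above can be approximated by a short behavioral model. This is a sketch under stated assumptions, not the SC140 kernel itself: the function names are hypothetical, states and coefficients are held as plain 16 bit integers rather than in the upper halves of registers, 32 bit overflow and saturation are not modeled, and the final store is assumed to write d4:d5:d6:d7 back in the same order in which the states were loaded, so that d4 supplies the next w1(n−1).

```python
def frac_mul(a, b):
    """Fractional multiply used by mac/mpy: product shifted left by one bit."""
    return (a * b) << 1

def ext_h(v):
    """Upper 16 bits of a 32 bit value."""
    return v >> 16

def sc140_model_step(x, coeffs, state):
    """Process one Q1.15 sample; returns (y, new_state)."""
    b11, b12, a11, a12, b21, b22, a21, a22 = coeffs   # Q2.14 integers
    w11, w12, w21, w22 = state                        # 16 bit states
    d0 = (x << 16) >> 1                 # cycle 1: Put_h, then asr (Div_2)
    d0 += frac_mul(a11, w11)            # cycle 2: mac d6,d8,d0
    d1 = frac_mul(b11, w11)             # cycle 2: mpy d4,d8,d1
    d0 += frac_mul(a12, w12)            # cycle 3: mac d7,d9,d0
    d1 += frac_mul(b12, w12)            # cycle 3: mac d5,d9,d1
    d2 = frac_mul(a21, w21)             # cycle 3: mpy d14,d10,d2
    d3 = frac_mul(b21, w21)             # cycle 3: mpy d12,d10,d3
    d2 += frac_mul(a22, w22)            # cycle 4: mac d15,d11,d2
    d3 += frac_mul(b22, w22)            # cycle 4: mac d13,d11,d3
    d0 += d0                            # cycle 4: Mul_2
    d1 += d1                            # cycle 4: Mul_2
    d10 = d0 + d1                       # cycle 5: first-stage output
    d2 += d2                            # cycle 5: Mul_2
    d3 += d3                            # cycle 5: Mul_2
    d6 = d10 + d2                       # cycle 6: new second-stage state
    y32 = d6 + d3                       # cycle 7: filter output, 32 bit
    new_state = [ext_h(d0), w11, ext_h(d6), w21]
    return ext_h(y32), new_state

# Hypothetical example: only a11 is non-zero (0.5 in Q2.14).
coeffs = (0, 0, 8192, 0, 0, 0, 0, 0)
y0, s0 = sc140_model_step(16384, coeffs, [0, 0, 0, 0])   # input 0.5 in Q1.15
y1, s1 = sc140_model_step(0, coeffs, s0)
```

With these example values the second output is 8192, i.e., 0.25 in Q1.15, which is the expected response of a single pole at 0.5 to an input of 0.5 followed by zero.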
As mentioned above, the StarCore SC140 DSP has four DALUs and two AGUs. Unfortunately, the kernel code illustrated and described above in the context of FIG. 2 is not able to fully utilize all the available functional units. This is because of a bottleneck condition that is known to occur in the updating operation of the w2(n−1) state which is evidenced by the fact that the conventional IIR filter of FIG. 3 requires seven clock cycles per input sample to execute. This bottleneck is illustrated in the data flow (or dependency) of FIG. 6. As shown, each of the seven operations executed for an input sample can only be performed when the result from the lower level operation becomes available. As seen from the data flow, there are seven levels of dependency for FIG. 3. This means that at least seven clock cycles are needed to filter one sample. To reduce the number of clock cycles, this operation dependency must be broken. However, attempting to break this dependency introduces the problem of updating the w2(n−1) state, which will not be available until the sixth cycle and therefore will not be updated until the seventh cycle, in accordance with the conventional filter implementation.
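The dependency depth discussed above can be reproduced with a small longest-path computation. The graph below is one reading of the arithmetic operations in the seven-cycle walk-through, not a transcription of FIG. 6; the node names are hypothetical, and register loads and stores are left out.

```python
from functools import lru_cache

# Each operation maps to the operations whose results it consumes.
DEPS = {
    "asr_d0": (),
    "mac_a11": ("asr_d0",),
    "mpy_b11": (),
    "mac_a12": ("mac_a11",),
    "mac_b12": ("mpy_b11",),
    "mpy_a21": (),
    "mpy_b21": (),
    "mac_a22": ("mpy_a21",),
    "mac_b22": ("mpy_b21",),
    "dbl_d0": ("mac_a12",),
    "dbl_d1": ("mac_b12",),
    "dbl_d2": ("mac_a22",),
    "dbl_d3": ("mac_b22",),
    "add_d10": ("dbl_d0", "dbl_d1"),
    "add_d6": ("add_d10", "dbl_d2"),    # new w2(n-1) state
    "add_out": ("add_d6", "dbl_d3"),    # filter output y(n)
}

@lru_cache(maxsize=None)
def level(op):
    """Earliest cycle in which op can issue, assuming one cycle per operation."""
    return 1 + max((level(dep) for dep in DEPS[op]), default=0)

critical_path = max(level(op) for op in DEPS)
```

Under these assumptions the output sits at level seven and the new w2(n−1) state at level six, so adding functional units alone cannot shorten the loop; only a change to the chain of dependent operations can.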
Accordingly, there is a need for a cascaded biquad IIR filter structure that overcomes such a bottleneck condition and thus increases the processing speed of the DSP or other processing circuitry with which it is implemented.