1. Field of the Invention
This invention relates to software implementations of discrete time filters, and in particular to software implementations of a Finite Impulse Response (FIR) filter on a general purpose processor.
2. Description of the Relevant Art
Traditional implementations of discrete time filters for signal processing applications have used a custom Digital Signal Processor (DSP) instruction to implement an N-tap filter. Such a DSP instruction is executed to perform a multiply-accumulate operation and to shift the delay line in a single cycle (assuming the delay line is entirely in zero-wait state memory or on-chip). For example, on a TI 320C50 DSP, a finite impulse response (FIR) filter is implemented by successive evaluations of an MACD instruction, each evaluation computing an element, $y_n$, of the filtered signal vector, i.e., of the output vector, y[K], such that:

$$y_n = \sum_{i=0}^{N-1} h_i \, x_{n-i} \qquad (1)$$
where h[N] is the N-tap filter coefficient vector and x[K] is an input signal vector.
Unfortunately, for many portable device applications such as Personal Digital Assistants (PDAs), portable computers, and cellular phones, power consumption, battery life, and overall mass are important design figures of merit. In addition, very small part counts are desirable for extremely small, low-cost consumer devices. Signal processing capabilities are desirable in many such portable device applications, for example to provide a modem or other communications interface, for speech recognition, etc. However, traditional DSP implementations of such signal processing capabilities increase power demands, increase part counts, and, because of the power consumption of a discrete DSP, typically require larger, heavier batteries.
An efficient implementation of a Finite Impulse Response (FIR) filter on a general purpose processor allows a discrete Digital Signal Processor (DSP), together with the cost, size, weight, and power implications thereof, to be eliminated in device configurations (such as communications device configurations) requiring signal processing functionality and digital filter structures. In particular, an efficient implementation of an FIR in accordance with the present invention allows a single general purpose processor (e.g., any of a variety of processors including MIPS R3000, R4000, and R5000 processors, processors conforming to the Sparc, PowerPC, Alpha, PA-RISC, or x86 processor architectures, etc.) to execute instructions encoded in a machine readable media to provide not only application-level functionality, but also the underlying signal processing functionality and digital filter structures for a communications device implementation. Of course, multiprocessor embodiments (i.e., embodiments including multiple general-purpose processors) which similarly eliminate a DSP are also possible. In one embodiment in accordance with the present invention, an FIR filter implementation on a general purpose processor provides digital filter structures for a software implementation of a V.34 modem without use of a DSP.
In general, a general purpose processor provides an instruction set architecture for loading data to and storing data from general purpose registers, for performing logical and scalar arithmetic operations on such data, and for providing instruction sequence control. Application programs, as well as operating systems and device drivers, are typically executed on such a general purpose processor. In contrast, a digital signal processor is optimized for vector operations on vector data, typically residing in large memory arrays or special purpose register blocks, and is not well suited to the demands of application programs or operating system implementations. Instead, a digital signal processor typically provides a vector multiply-accumulate operation which exploits highly-optimized vector addressing facilities. A general purpose processor, however, provides neither a vector multiply-accumulate operation nor the vector addressing facilities necessary for computing an element $y_n$ and shifting through vector data in a single cycle. Consequently, an N-tap filter implemented in a straightforward manner for execution on a general purpose processor computes each output vector element using 2N reads from memory to processor registers, N multiply-accumulates, and one write to memory. To calculate K elements, such an N-tap filter implementation makes K(2N+1) memory accesses and KN multiply-accumulates. That is, each multiply-accumulate costs (2N+1)/N memory accesses, slightly more than two.
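The straightforward implementation described above can be sketched in C as follows. This is an illustrative sketch only, not code from this disclosure: the function name `fir_naive`, the `int`/`long` sample types, the layout of `x` (N-1 samples of history followed by the K new samples), and the returned access counter are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical direct-form FIR: y[n] = sum over i < N of h[i] * x[N-1+n-i].
 * x must hold K+N-1 samples: N-1 samples of history, then K new samples.
 * Returns the number of accesses made to h, x, and y, so that the
 * K(2N+1) figure from the text can be checked. */
size_t fir_naive(const int *h, size_t N, const int *x, long *y, size_t K)
{
    size_t accesses = 0;

    assert(h && x && y && N > 0);

    for (size_t n = 0; n < K; n++) {
        long acc = 0;
        for (size_t i = 0; i < N; i++) {
            /* one coefficient read, one sample read, one multiply-accumulate */
            acc += (long)h[i] * x[N - 1 + n - i];
            accesses += 2;
        }
        y[n] = acc;          /* one write per output element */
        accesses += 1;
    }
    return accesses;
}
```

For a 3-tap filter producing 4 outputs, the counter returns 4(2·3+1) = 28, matching K(2N+1).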
It has been discovered that a Finite Impulse Response (FIR) filter can be implemented in software on a general purpose processor in a manner which reduces the number of memory accesses. In particular, an efficient implementation for a general purpose processor having a substantial number of registers includes inner and outer loop code which together make

$$K\left[\left(\frac{L_1 + L_2}{L_1 L_2}\right) N + \frac{L_2}{L_1} + 1\right]$$

memory accesses and KN multiply-accumulates, where $L_1$ is the number of output vector elements computed during each pass through the outer loop and where $L_2$ is the number of taps per output vector element computed during each pass through the inner loop. The efficient implementation exploits $L_1 + 2L_2$ general purpose registers. For an exemplary embodiment wherein $L_1 = L_2 = 8$, i.e., using 24 general purpose registers, inner and outer loop code make

$$K\left(\frac{N}{4} + 2\right)$$

memory accesses, which for filter implementations with large numbers of taps, approaches a 4× reduction in the number of memory accesses.
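One way such inner and outer loop code could be organized is sketched below in C. It is a hedged illustration under stated assumptions, not the implementation of this disclosure: it assumes L1 = L2 = `BLK` = 8, requires N and K to be multiples of `BLK`, and uses small local arrays (`acc`, `hreg`, `win`) as stand-ins for general purpose registers, which a compiler would only keep in registers if the loops were fully unrolled. The name `fir_blocked`, the data layout of `x` (N-1 history samples followed by K new samples), and the access counter are hypothetical. Products are visited by diagonal (constant output index minus tap index), since every product on a diagonal shares one input sample.

```c
#include <assert.h>
#include <stddef.h>

enum { BLK = 8 };   /* assumption: L1 = L2 = BLK */

/* Hypothetical blocked FIR: y[n] = sum over i < N of h[i] * x[N-1+n-i].
 * x holds K+N-1 samples. Returns the number of accesses to h, x, and y. */
size_t fir_blocked(const int *h, size_t N, const int *x, long *y, size_t K)
{
    size_t accesses = 0;

    assert(h && x && y && N > 0 && N % BLK == 0 && K % BLK == 0);

    for (size_t n0 = 0; n0 < K; n0 += BLK) {        /* outer loop: BLK outputs */
        long acc[BLK] = {0};                        /* BLK accumulators */
        int win[BLK];                               /* win[1..BLK-1]: carried samples */
        const int *top = x + (N - 1) + n0;          /* sample for output n0, tap 0 */

        /* Preload the BLK-1 newest samples needed by the first tap block. */
        for (int d = 1; d < BLK; d++) { win[d] = top[d]; accesses++; }

        for (size_t i0 = 0; i0 < N; i0 += BLK) {    /* inner loop: BLK taps */
            const int *xa = top - i0;               /* sample for output n0, tap i0 */
            int hreg[BLK];                          /* BLK coefficients */

            for (int i = 0; i < BLK; i++) { hreg[i] = h[i0 + i]; accesses++; }

            /* Diagonals d = j - i > 0 reuse samples carried in win[]. */
            for (int d = BLK - 1; d >= 1; d--)
                for (int j = d; j < BLK; j++)
                    acc[j] += (long)hreg[j - d] * win[d];

            /* Diagonals d <= 0 each need one newly loaded sample. */
            for (int d = 0; d > -BLK; d--) {
                int s = xa[d]; accesses++;
                for (int j = 0; j < BLK + d; j++)
                    acc[j] += (long)hreg[j - d] * s;
                if (d < 0) win[d + BLK] = s;        /* carry to next tap block */
            }
        }

        for (int j = 0; j < BLK; j++) { y[n0 + j] = acc[j]; accesses++; }
    }
    return accesses;
}
```

Under these assumptions each outer pass makes BLK-1 preloads, 2·BLK loads per inner pass, and BLK stores, for (K/BLK)(2N + 2·BLK - 1) accesses in total. With BLK = 8 that is one access per outer pass fewer than K(N/4 + 2), while the working set remains 8 accumulators, 8 coefficients, and 8 sample values (7 carried plus the one in use).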