1. Field of the Invention
The present invention relates generally to signal processing and, more particularly, to digital filtering.
2. Description of the Related Art
A digital filter receives an input sequence of samples denoted by x(n) and performs a convolution with the filter's impulse response denoted by h(n) to produce the filtered output y(n). When the filter's impulse response is finite in duration, the filter is referred to as a Finite Impulse Response (FIR) filter. When the filter's impulse response is infinite in duration, the filter is referred to as an Infinite Impulse Response (IIR) filter.
The filtering operation for an FIR filter with impulse response h of length M can be expressed mathematically as
$$y(n)=\sum_{k=0}^{M-1} h(k)\,x(n-k) \qquad (1)$$
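As a minimal sketch of equation (1), the sum can be evaluated directly in Python. The function name fir_direct is hypothetical, and the sketch assumes that input samples before n = 0 are zero, since the text does not specify initial conditions.

```python
def fir_direct(h, x):
    """Evaluate y(n) = sum_{k=0}^{M-1} h(k) x(n-k) for n = 0 .. len(x)-1.

    Assumption of this sketch: x(n) = 0 for n < 0.
    """
    M = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(M):
            if n - k >= 0:          # skip samples before the start of the input
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


print(fir_direct([1, 2, 3], [1, 0, 0, 1]))  # -> [1, 2, 3, 1]
```

Each output sample costs M multiplications in this form, which is the baseline against which the structures discussed below are compared.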
There are two conventional classes of methods for implementing the filter in (1): time-domain methods and transform-domain methods, both outlined below. Time-domain methods process the signal directly in the time domain. Transform-domain methods transform the input signal to another domain, usually the frequency domain, perform the equivalent of the filtering operation in that domain, and then transform the result back to the original domain of the signal, usually the time domain. The transform used to convert a signal from the domain of the input signal to another domain is referred to as the input transform. The transform used to convert a signal from some domain to the domain of the output signal is referred to as the output transform. In general the input and output domains are the same, which makes the input transform and the output transform inverses of each other.
Time-Domain Structures
The two traditional structures for implementing the FIR filter (1) in the time domain are the direct form and the transposed form. These structures can be found in Proakis et al. (J. G. Proakis, D. G. Manolakis, “Digital Signal Processing”, third edition, Prentice Hall, ISBN 0-13-373762-4) and Oppenheim et al. (A. V. Oppenheim, R. W. Schafer, “Discrete-Time Signal Processing”, second edition, Prentice Hall, ISBN 0-13-754920-2). The number of multipliers in these structures can be reduced when the impulse response of the filter is symmetric (h(0)=h(M−1), and in general h(n)=h(M−n−1)) or antisymmetric (h(0)=−h(M−1), and in general h(n)=−h(M−n−1)). The structures that take advantage of the symmetry or antisymmetry of the impulse response are also illustrated in Proakis et al. and Oppenheim et al.
Other time-domain structures that make the filter implementation more efficient are also known. They are based on pre-processing the input, sub-filtering the pre-processed signals, and post-processing the sub-filtered signals to generate the output. These techniques are described in Parker et al. (D. A. Parker, K. K. Parhi, “Low-Area/Power Parallel FIR Digital Filter Implementations,” in Journal of VLSI Signal Processing 17, 75-92, 1997) and Mou et al. (Z.-J. Mou, P. Duhamel, “Short-Length FIR Filters and Their Use in Fast Nonrecursive Filtering” in IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 39, NO. 4, JUNE 1991). In these techniques, the overall throughput rate at the input of the pre-processor is generally lower than the overall throughput rate at its output. For example, if the sampling rate of the pre-processor input(s) and output(s) is the same, then usually the number of output streams will be greater than the number of input streams.
The input signal to the pre-processor can be contained in one stream (i.e., one signal) and passed to the pre-processor as a single input, or it can be contained in multiple streams (i.e., multiple signals) that collectively represent the same input signal. For example, a time-domain signal can be passed to the pre-processor as a single input, or it can be passed as two distinct streams, the first comprising all the even-time samples (0, 2, 4, 6, . . . ) of the input signal and the second comprising all the odd-time samples (1, 3, 5, 7, . . . ). It is also possible to construct signals that share common components but collectively represent the input signal. Therefore, in general the pre-processor introduces some redundancy in representing the input signal. This may seem inefficient, but the redundancy enables the overall filtering operation to be broken into sub-filters that operate on the multiple streams produced by the pre-processor. The sub-filters operating on the pre-processor output(s) are derived from the original filter and have impulse responses that may be shorter than that of the original filter. Finally, the outputs of the sub-filters are combined in post-processing to construct the filter output. The general properties of the post-processor, with respect to the relationship between its input(s) and output(s), are the reverse of those of its corresponding pre-processor. The post-processor can be thought of as removing the redundancy from the sub-filtered signals (which is related to the redundancy introduced by the pre-processor) to construct the output signal. Again, the output signal can be presented at the output of the post-processor as a single stream (i.e., one signal) or as multiple streams collectively representing the output signal.
The efficiency of this technique comes from the fact that the sub-filters are shorter (requiring less processing) and operate at lower sampling rates than the overall sampling rate of the input (which, in a hardware implementation, translates to power savings and/or hardware savings when resources are shared to perform the computations). Different decompositions of the filter into sub-filters yield different pre-processing and post-processing structures, different numbers of sub-filters, different impulse response lengths, and different processing speeds for the sub-filters. We will refer to this family of filters as Reduced-Complexity N-Parallel (RCNP) filter structures. Two example structures for RCNP decompositions that create three sub-filters are shown in FIG. 2a and FIG. 2b. In FIG. 2a the input signal is processed by the pre-processor 210, generating three streams, each of which is processed by one of the sub-filters 221, 222, 223, and the outputs of the sub-filters are processed by the post-processor 230 to generate the output signal. The sub-filters 221, 222, 223 operate at half the rate of the original filter. Similarly, FIG. 2b depicts a structure with a pre-processor 240, sub-filters 251, 252, 253, and a post-processor 260. Note that the same (or a different) decomposition can also be applied to each of the sub-filters h0, h0+h1, and h1. Here h0 denotes the impulse response h0(n)=h(2n) (i.e., the even-indexed samples of h), h1 denotes the impulse response h1(n)=h(2n+1) (i.e., the odd-indexed samples of h), and h0+h1 denotes h0(n)+h1(n). If h has an even length, then h0, h1, and h0+h1 all have half the length of h. The efficiencies are drawn from the facts that these sub-filters are half the length of h and that they operate at half the rate of h. For example, using power consumption as the resource of interest, if h0, h1, and h0+h1 each consume ¼ of the power consumed by h, then collectively they consume ¾ of the power consumed by h. Assuming the pre- and post-processing consume negligible power, this particular RCNP decomposition yields approximately 25% power savings.
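The three-sub-filter decomposition can be illustrated with a short Python sketch. It shows one well-known two-parallel arrangement using the sub-filters h0, h0+h1, and h1 defined above; it is not necessarily the exact structure of FIG. 2a or FIG. 2b. The function names fir and fir_two_parallel are hypothetical, and the sketch assumes that h and x both have even length and that the input is zero before n = 0.

```python
def fir(h, x):
    """Direct-form FIR filter, output truncated to len(x) samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_two_parallel(h, x):
    """Filter x with h using three half-length sub-filters at half rate."""
    assert len(h) % 2 == 0 and len(x) % 2 == 0  # assumption of this sketch

    # Pre-processing: even stream, odd stream, and their sum (3 streams).
    x0, x1 = x[0::2], x[1::2]
    xs = [a + b for a, b in zip(x0, x1)]

    # Sub-filters derived from h: h0(n)=h(2n), h1(n)=h(2n+1), and h0+h1.
    h0, h1 = h[0::2], h[1::2]
    hs = [a + b for a, b in zip(h0, h1)]

    u = fir(h0, x0)   # h0 applied to the even stream
    v = fir(hs, xs)   # h0+h1 applied to the sum stream
    w = fir(h1, x1)   # h1 applied to the odd stream

    # Post-processing: remove the redundancy and rebuild the two outputs.
    w_delayed = [0] + w[:-1]                       # one-sample delay at half rate
    y_even = [a + b for a, b in zip(u, w_delayed)]
    y_odd = [b - a - c for a, b, c in zip(u, v, w)]

    y = [0] * len(x)
    y[0::2] = y_even
    y[1::2] = y_odd
    return y


h = [1, 2, 3, 4]
x = [1, 2, 3, 4, 5, 6]
print(fir_two_parallel(h, x) == fir(h, x))  # -> True
```

In this sketch each pair of output samples costs 3 × M/2 multiplications instead of 2 × M, which is the ¾ ratio used in the power example above.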
Transform-Domain Structures
In addition to time-domain techniques, there are known transform-domain techniques, referred to as overlap-and-add and overlap-and-save, which are described in Proakis et al. and Oppenheim et al. The efficiencies in these transform-domain structures come from the fact that the filtering operations process a block of input samples and/or generate a block of output samples at a time. This means the operations performed are shared across the processing of all input samples in the block and/or the generation of all samples in the output block.
Let us assume the filter implemented by the overlap-and-save method has an impulse response h(n) of length M. We define a vector h of length T=L+M−1, obtained by taking the impulse response h(n) and appending L−1 zeros to it. The result is the vector h of length T whose first M samples are those of h(n) and whose last L−1 samples are zeros; that is, the vector h is the zero-padded version of h(n). We then obtain the vector H of length T by taking the T-point transform, commonly the Fast Fourier Transform (FFT), of the zero-padded vector h. The vector H does not change from iteration to iteration as long as the filter's impulse response is not changing and the filter is simply operating on its input stream. The vector H needs to change only when the filter coefficients (i.e., the impulse response h(n)) need to change. Assume the vectors are column vectors, with the top element corresponding to the earliest sample and the bottom element corresponding to the latest sample. To process the input signal using the overlap-and-save method, a vector x of length T is constructed at iteration t by taking the last (or bottom) M−1 samples of the vector x from the previous iteration t−1, positioning these M−1 samples as the first (or top) M−1 samples of x, and filling the remaining L samples with the next L samples of the filter's input stream x(n) to complete the T samples of x. The construction of the block x from the stream x(n) is accomplished by the input stream to block constructor 310 in FIG. 3. Note that at each iteration L samples are taken from the filter's input stream. With these definitions of x and H, the filter generates L samples of the filter output as follows. The vector x is passed to an input transform 320 in FIG. 3 (a T-point FFT), which generates the T-point vector X.
Then X and H are processed with the equivalent of the filtering operation in the transform domain; for the FFT, X and H are multiplied element-by-element 360 in a transform domain processor 330 to generate the T-point vector Y. Then Y is passed to an output transform 340 (a T-point Inverse Fast Fourier Transform (IFFT)) to generate the T-point vector y. After discarding the first (or top) M−1 points of y, the remaining last (or bottom) L points of y form the L output samples of the filter in the output stream y(n), which is accomplished in the output block to stream constructor 350 in FIG. 3. This process is repeated to process the next set of L input samples and generate the next set of L output samples. It should also be noted that for the very first iteration zeros are used for the first (or top) M−1 samples of x.
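The overlap-and-save iteration described above can be sketched in Python as follows. This is an illustrative sketch only: the function names dft and overlap_save are hypothetical, a naive T-point DFT stands in for the FFT and IFFT (same result, more operations), and a final partial block is zero-padded, which the text does not specify.

```python
import cmath


def dft(v, inverse=False):
    """Naive T-point DFT or inverse DFT, used here in place of an FFT/IFFT."""
    T = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[n] * cmath.exp(sign * 2j * cmath.pi * k * n / T)
               for n in range(T)) for k in range(T)]
    return [c / T for c in out] if inverse else out


def overlap_save(h, x, L):
    """Filter stream x with impulse response h, L new samples per iteration."""
    M = len(h)
    T = L + M - 1
    H = dft(list(h) + [0] * (L - 1))   # zero-padded h, transformed once
    block = [0] * T                    # zeros supply the first M-1 samples
    y = []
    for start in range(0, len(x), L):
        new = list(x[start:start + L])
        new += [0] * (L - len(new))            # pad a final partial block
        block = block[T - (M - 1):] + new      # last M-1 old samples + L new
        X = dft(block)                         # input transform
        Y = [a * b for a, b in zip(X, H)]      # element-by-element multiply
        yb = dft(Y, inverse=True)              # output transform
        y.extend(c.real for c in yb[M - 1:])   # discard the first M-1 points
    return y[:len(x)]
```

Calling overlap_save(h, x, L) should agree, up to floating-point rounding, with a direct evaluation of equation (1) on the same input.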
The overlap-and-add method is similar to the overlap-and-save method in the sense that it also processes a block of input samples to generate a block of output samples. The main difference is that in the overlap-and-add method the input vector x is constructed by taking L samples from the filter's input stream and padding them with M−1 zeros. At the output, instead of discarding points, the first M−1 points of the output vector are added to the last M−1 points of the output vector from the previous iteration. For more details one may refer to Proakis et al. and Oppenheim et al. Both the overlap-and-add and the overlap-and-save methods can be depicted as in FIG. 3, where the FFT 320, the element-by-element multiplication 360 in the transform domain processor 330, and the IFFT 340 operate on vectors (or blocks of samples). Note that in some implementations the operations within these blocks can be performed sequentially, but the overall operations accomplish the equivalent of the vector (or block) operations.
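For comparison, a minimal Python sketch of the overlap-and-add iteration is given below. As before, the function names dft and overlap_add are hypothetical, a naive DFT stands in for the FFT and IFFT, and a final partial block is zero-padded as an assumption of the sketch.

```python
import cmath


def dft(v, inverse=False):
    """Naive T-point DFT or inverse DFT, used here in place of an FFT/IFFT."""
    T = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[n] * cmath.exp(sign * 2j * cmath.pi * k * n / T)
               for n in range(T)) for k in range(T)]
    return [c / T for c in out] if inverse else out


def overlap_add(h, x, L):
    """Filter stream x with impulse response h, L new samples per iteration."""
    M = len(h)
    H = dft(list(h) + [0] * (L - 1))   # zero-padded h, length T = L + M - 1
    tail = [0] * (M - 1)               # last M-1 points of the previous output
    y = []
    for start in range(0, len(x), L):
        new = list(x[start:start + L])
        new += [0] * (L - len(new))            # pad a final partial block
        block = new + [0] * (M - 1)            # L samples padded with M-1 zeros
        Y = [a * b for a, b in zip(dft(block), H)]
        yb = [c.real for c in dft(Y, inverse=True)]
        for i in range(M - 1):                 # add the overlap from the
            yb[i] += tail[i]                   # previous iteration
        y.extend(yb[:L])                       # L finished output samples
        tail = yb[L:]                          # carry the last M-1 points over
    return y[:len(x)]
```

The only differences from overlap-and-save are how the input block is built and how the output block is turned back into a stream; the transforms and the element-by-element multiplication are the same.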
The overlap-and-add and overlap-and-save methods described in Proakis et al. and Oppenheim et al. use a transform size that is larger than the length of the impulse response of the filter. A technique of partitioning the filter and applying the transform-domain filter (overlap-and-add or overlap-and-save) to the partitions is shown in Joho et al. (M. Joho, G. S. Moschytz, “Connecting Partitioned Frequency-Domain Filters in Parallel or in Cascade” in IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, VOL. 47, NO. 8, AUGUST 2000), where the FFT size can be shorter than the filter impulse response. This technique is based on partitioning the impulse response into sections of length P, where the first P samples of h constitute h0, the next P samples constitute h1, and so on. If the filter h has a length that is not a multiple of P, it can be extended by zero padding to a length that is a multiple of P. The key observation is that each section of the impulse response can now be implemented via a transform-domain implementation, so the transform size needs to be greater than P (not M), and P can be made arbitrarily small. Furthermore, the transforms required for each section of the filter can be shared, so only one forward and/or one inverse transform needs to be implemented. This technique, and the derivations of the structures based on it, are found in Joho et al., and we will refer to the structures as the Partitioned Time Domain (PTiD) and the Partitioned Transform Domain (PTrD) structures. FIG. 4a shows an exemplary filter having an impulse response of length M=NP, while FIGS. 4b and 4c show the corresponding PTiD and PTrD structures, respectively, for the choice M=4P (i.e., N=4). This choice is for illustration purposes only, and the present invention applies to arbitrary choices of M, P, and N.
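The partitioning can be written out explicitly as a restatement of equation (1), assuming M=NP. The section index i and the notation h_i(k) for the i-th section are introduced here for illustration only.

$$h_i(k)=h(iP+k), \qquad 0\le k\le P-1,\quad 0\le i\le N-1$$
$$y(n)=\sum_{i=0}^{N-1}\sum_{k=0}^{P-1} h_i(k)\,x(n-iP-k)$$

Each inner sum is an FIR filter of length P applied to the input delayed by iP samples, which is why the transform size can be tied to P rather than to M.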
The PTrD example structure of FIG. 4c comprises the input stream to block constructor 401, the input transform 402, the partitioned transform domain processor 413, the output transform 414, and the output block to stream constructor 415. The partitioned transform domain processor 413 comprises block delays 403, 404, 405, element-by-element multipliers 406, 407, 408, 409 operating on blocks, one for each partition, and block combiners 410, 411, 412 that produce the output block of the partitioned transform domain processor 413. A functionally equivalent alternative to the structure of FIG. 4c is obtained by applying transposition to the partitioned transform domain processor 413. The resulting PTrD structure is depicted in FIG. 4d, where the partitioned transform domain processor 453 is obtained by transposing the partitioned transform domain processor 413 of FIG. 4c. The PTrD structure of FIG. 4d comprises the input stream to block constructor 441, the input transform 442, the partitioned transform domain processor 453, the output transform 454, and the output block to stream constructor 455. The partitioned transform domain processor 453 comprises block delays 447, 448, 449, element-by-element multipliers 443, 444, 445, 446 operating on blocks, one for each partition, and block combiners 450, 451, 452 that produce the output block of the partitioned transform domain processor 453. More general representations of the PTrD structures are depicted in FIG. 4e and FIG. 4f, where N partitions are used. The PTiD structure of FIG. 4b also has an equivalent structure based on transposition and can be generalized in the same way that FIG. 4e and FIG. 4f generalize FIG. 4c and FIG. 4d.
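A minimal Python sketch in the spirit of the FIG. 4c arrangement (block delays on the transformed input, one element-by-element multiplier per partition, block combiners, and a single input and output transform) is given below. It is an illustration under stated assumptions, not the structure of Joho et al. or of the figures: the function names dft and partitioned_overlap_save are hypothetical, the block advance is chosen equal to P with transform size T=2P, overlap-and-save is used, and a naive DFT stands in for the FFT and IFFT.

```python
import cmath


def dft(v, inverse=False):
    """Naive T-point DFT or inverse DFT, used here in place of an FFT/IFFT."""
    T = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[n] * cmath.exp(sign * 2j * cmath.pi * k * n / T)
               for n in range(T)) for k in range(T)]
    return [c / T for c in out] if inverse else out


def partitioned_overlap_save(h, x, P):
    """Filter x with h using N = ceil(M/P) partitions and transform size 2P."""
    h = list(h) + [0] * (-len(h) % P)      # zero-pad h to a multiple of P
    N = len(h) // P
    T = 2 * P                              # assumption: T = 2P, advance = P
    # One transformed, zero-padded section per partition (computed once).
    H = [dft(h[i * P:(i + 1) * P] + [0] * (T - P)) for i in range(N)]
    X_hist = [[0] * T for _ in range(N)]   # block delay line of input transforms
    block = [0] * T
    y = []
    for start in range(0, len(x), P):
        new = list(x[start:start + P])
        new += [0] * (P - len(new))            # pad a final partial block
        block = block[P:] + new                # last P old samples + P new
        X_hist = [dft(block)] + X_hist[:-1]    # single input transform, then delay
        # Multiply each delayed block by its partition and combine the blocks.
        Y = [sum(H[i][k] * X_hist[i][k] for i in range(N)) for k in range(T)]
        yb = dft(Y, inverse=True)              # single output transform
        y.extend(c.real for c in yb[P:])       # keep the last P points
    return y[:len(x)]
```

With these choices a delay of one block equals a delay of P samples, so the i-th delayed block lines up with the i-th section of the impulse response, and only one forward and one inverse transform are computed per iteration regardless of N.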