The present invention relates to digital signal processing apparatus and methods, and, more particularly, to apparatus and a method for block floating point Fast Fourier Transform.
Fast Fourier Transform (FFT) is an efficient algorithm for transforming discrete time signals from a time domain representation to a frequency domain representation, and vice versa. The term "Fast Fourier Transform" is actually a generic term for an entire set of efficient Discrete Fourier Transform (DFT) algorithms. The principle upon which these algorithms are based is that a DFT can be recursively decomposed into smaller DFTs. The most popular decomposition method is the radix-2 Decimation In Time (DIT) FFT.
It is customary to represent a DIT FFT by a prior art flow graph made up of simple graph units known as "butterflies", as shown in FIG. 1. A butterfly flow graph has lines (such as a line 102), circles (such as a circle 104), and arrows (such as an arrow 106) as elements. The circles represent summation and the arrows show the direction of data flow. If an arrow has an adjacent number or expression (such as the constant -1 or the expression $W_N^k$), the data value is multiplied by that number or expression. Wherever an arrow does not have an adjacent number or expression, this is considered as an implicit multiplier of 1. Thus, in FIG. 1, an output data value vector 108 (c, d) can be expressed in terms of an input vector 110 (a, b) of data values as follows:
$c = a + W_N^k \cdot b \qquad (1)$
$d = a - W_N^k \cdot b \qquad (2)$
where all the numbers are in general complex numbers. Note that each output of the butterfly is a sum of two numbers, for example c is the sum of a and $W_N^k \cdot b$, where the term $W_N^{kn}$ is known in the art as a "twiddle factor" and is defined as
$W_N^{kn} = e^{-2\pi j n k / N} \qquad (3)$
where j in Equation (3) represents the imaginary unit. The magnitude of $W_N^k$ is unity. An important characteristic of the product $F = W_N^k \cdot b$ such as in Equation (1) and Equation (2) is that the multiplication can change the magnitude of the components (real or imaginary) by a factor of up to $\sqrt{2}$, because the multiplication involves complex numbers. That is, the larger component of the product F can grow up to $\sqrt{2}$ times the larger component of b:
$\max(|F_{\mathrm{real}}|, |F_{\mathrm{imag}}|) \le \sqrt{2} \, \max(|b_{\mathrm{real}}|, |b_{\mathrm{imag}}|) \qquad (4)$
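The component growth described by Equation (4) can be checked numerically. The following is a minimal sketch, assuming a 45-degree twiddle factor and an input whose real and imaginary parts are equal; the helper names `twiddle` and `larger_component` are illustrative and do not appear in the source.

```python
import cmath
import math


def twiddle(k, n):
    """Twiddle factor W_N^k = exp(-2*pi*j*k/N), i.e. Equation (3) with n = 1."""
    return cmath.exp(-2j * math.pi * k / n)


def larger_component(z):
    """Magnitude of the larger of the real and imaginary components of z."""
    return max(abs(z.real), abs(z.imag))


w = twiddle(1, 8)   # rotation by -45 degrees, magnitude 1
b = 0.5 + 0.5j      # equal components: the rotation concentrates them in one axis
f = w * b           # the product F = W_N^k * b of Equations (1) and (2)

growth = larger_component(f) / larger_component(b)
print(abs(w), growth)
```

In this example the magnitude of the product is unchanged, yet its larger component grows by the full factor allowed by Equation (4).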
The main problem of the FFT is that the dynamic range of the complex output data values can grow by a factor of 2 at each butterfly. The dynamic range is the range between the maximum possible value and the minimum possible value:
$\mathrm{abs}(c) \le 2 \cdot \max(\mathrm{abs}(a), \mathrm{abs}(b)) \qquad (5)$
$\mathrm{abs}(d) \le 2 \cdot \max(\mathrm{abs}(a), \mathrm{abs}(b)) \qquad (6)$
Such growth can cause overflow when writing numbers to data memory, and it is necessary to prevent such overflows.
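As an informal illustration of Equations (5) and (6), the sketch below evaluates the butterfly of Equations (1) and (2) on random inputs and records how close the outputs come to the bound. The random test setup and the names `butterfly` and `worst_ratio` are assumptions made for this example only.

```python
import cmath
import math
import random


def butterfly(a, b, w):
    """Radix-2 DIT butterfly: c = a + w*b, d = a - w*b."""
    f = w * b
    return a + f, a - f


random.seed(0)
worst_ratio = 0.0
for _ in range(10000):
    a = complex(random.uniform(-1, 1), random.uniform(-1, 1))
    b = complex(random.uniform(-1, 1), random.uniform(-1, 1))
    w = cmath.exp(-2j * math.pi * random.randrange(8) / 8)
    bound = 2 * max(abs(a), abs(b))
    if bound == 0.0:
        continue  # degenerate all-zero input, nothing to compare
    c, d = butterfly(a, b, w)
    worst_ratio = max(worst_ratio, abs(c) / bound, abs(d) / bound)

# The bound is reached when a and w*b are equal: c doubles and d cancels.
c_eq, d_eq = butterfly(0.5 + 0j, 0.5 + 0j, 1 + 0j)
print(worst_ratio, c_eq, d_eq)
```

The equality case shows why a value of magnitude 0.5 can already reach the edge of a [-1, 1) range after a single unscaled butterfly.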
A flow graph of a prior art DIT FFT is shown in FIG. 2. The FFT size in this figure is N=8. As seen in FIG. 2, an input 208 to the FFT flow graph is a vector x(n) of complex numbers. This vector passes through $\log_2(N)$ stages, each of which is made of N/2 butterflies. In this example, there is a first stage 202, a second stage 204, and a third stage 206, culminating in an output 210. Each output node of each stage involves the addition of two data values, as illustrated in FIG. 1 and in Equation (1) and Equation (2). The data values involved are, in general, complex random variables with similar distributions. Although the standard deviation of each output node grows by a factor of $\sqrt{2}$ (assuming the inputs are independent), the dynamic range of the output grows by a factor of 2. When working with a fixed point digital signal processor (DSP), such an increase in the dynamic range of the numbers might cause overflows when writing them to the data memory, unless there is a mechanism for overflow protection. (An overview of the field is given in Digital Signal Processing, by A. V. Oppenheim and R. W. Schafer, Prentice-Hall, in the chapter covering quantization effects in fixed-point FFT algorithms.) The dynamic range of the magnitude of the complex output grows by a factor of 2, but the dynamic range of the larger of the output's components (real or imaginary) grows by a factor of $1+\sqrt{2}$. The term "processor" herein denotes any data processing device, including, but not limited to, digital signal processors.
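The stage structure described above can be sketched as an iterative radix-2 DIT FFT that records the largest output magnitude after each of the $\log_2(N)$ stages. This is a floating point illustration only, not a fixed point implementation; the function names `bit_reversed` and `fft_dit_stage_peaks` are hypothetical.

```python
import cmath
import math


def bit_reversed(x):
    """Reorder x into bit-reversed index order (input order of a DIT FFT)."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, "0{}b".format(bits))[::-1], 2)] for i in range(n)]


def fft_dit_stage_peaks(x):
    """Return (FFT output, list of peak magnitudes after each stage)."""
    n = len(x)  # assumed to be a power of two
    data = bit_reversed([complex(v) for v in x])
    peaks = []
    half = 1  # distance between the two inputs of a butterfly
    while half < n:
        step = n // (2 * half)
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-2j * math.pi * k * step / n)
                a = data[start + k]
                f = w * data[start + k + half]
                data[start + k] = a + f         # Equation (1)
                data[start + k + half] = a - f  # Equation (2)
        peaks.append(max(abs(v) for v in data))
        half *= 2
    return data, peaks


out, peaks = fft_dit_stage_peaks([1.0] * 8)
print(peaks)
```

A constant input is used here because it drives one output node to the full factor-of-2 growth at every stage.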
The simplest solution for the overflow problem is to divide the FFT input vector by N. This guarantees no overflow for all stages of the algorithm, but suffers from low performance in terms of signal to quantization noise ratio (SQNR), where quantization noise refers to the effects of finite word length. Another solution is to divide ("scale down") the output data values of each stage by 2, which also guarantees no overflow for all stages of the algorithm. This solution has better SQNR performance than dividing the input vector by N, but still does not have enough performance for some applications, such as ADSL (Asymmetric Digital Subscriber Line), especially in 16 bit processors. A third solution, which is the best in terms of SQNR, is to adopt the block floating point technique and attain what is known as block floating point (BFP) FFT. In BFP FFT, the scaling is not done at the output of every stage but only at those stages where overflow occurs (or might occur). That is, if overflow occurs in stage k, then the whole stage is recalculated such that every output is recalculated, scaled and then stored in the data memory. This improves the SQNR while preserving the dynamic range to prevent overflow.
The problem with this approach is that it is not efficient for real time implementations. The number of cycles varies from execution to execution and depends on the number of scaled stages in each execution.
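The recalculation behaviour of the classical BFP FFT, and the resulting variation in work per execution, can be sketched as follows. This is a floating point model under stated assumptions: overflow is taken to mean a real or imaginary component reaching the hypothetical `limit` of 1.0, and the names `compute_stage`, `classic_bfp_fft`, and `passes` are illustrative only.

```python
import cmath
import math


def bit_reversed(x):
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, "0{}b".format(bits))[::-1], 2)] for i in range(n)]


def compute_stage(data, half, scale):
    """Compute one radix-2 DIT stage, multiplying every output by scale."""
    n = len(data)
    out = list(data)
    step = n // (2 * half)
    for start in range(0, n, 2 * half):
        for k in range(half):
            w = cmath.exp(-2j * math.pi * k * step / n)
            a = data[start + k]
            f = w * data[start + k + half]
            out[start + k] = (a + f) * scale
            out[start + k + half] = (a - f) * scale
    return out


def classic_bfp_fft(x, limit=1.0):
    """Return (data, block exponent, number of stage computations)."""
    n = len(x)
    data = bit_reversed([complex(v) for v in x])
    exponent = 0
    passes = 0
    half = 1
    while half < n:
        out = compute_stage(data, half, 1.0)
        passes += 1
        if any(abs(v.real) >= limit or abs(v.imag) >= limit for v in out):
            # Overflow detected: recalculate the whole stage with scaling.
            out = compute_stage(data, half, 0.5)
            passes += 1
            exponent += 1
        data = out
        half *= 2
    return data, exponent, passes


loud = classic_bfp_fft([0.5] * 8)
quiet = classic_bfp_fft([0.01] * 8)
print(loud[1], loud[2], quiet[1], quiet[2])
```

For the same FFT size, the large input needs twice as many stage computations as the small one, which is the execution time variability that makes this approach awkward in real time.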
The best prior art solution in the current DSPs available on the market is to modify the decision law for scaling such that recalculation will not be required. As previously noted, in the classical BFP FFT the decision law states that if overflow occurs in the current stage, recalculate and scale before storing to the data memory. In the best prior art solution, recalculation is avoided by making a decision whether to scale down the output data value of stage k on the basis of the output data values of stage k-1. By determining in advance that an overflow might occur in a stage before actually performing the computation of that stage, the output of that stage is scaled down regardless of the result, and the time otherwise wasted in performing an unnecessary computation can be saved. This solution is used, for example, in the Motorola DSP56xxx family and 56lxx family. Processors which support this algorithm contain a scale-before-store (SC) bit and a sticky status bit. (A sticky status bit is a status bit that can be set, but not cleared, by a particular condition, and which can be cleared only by a software command, such as a program instruction, or by a hardware reset, and which therefore retains a record of a positive test for the particular condition regardless of subsequent negative tests for that condition.) The term "set" and the terms "clear" (or "cleared") herein denote distinct preassigned logical states, without any limitation on their specific respective binary or physical representations. For example, a set bit can be represented by a binary 1 and a cleared bit can be represented by a binary 0; alternatively, a set bit can be represented by a binary 0 and a cleared bit can be represented by a binary 1. Likewise, as another example, a set bit can be represented by a high voltage level and a cleared bit can be represented by a low voltage level; alternatively, a set bit can be represented by a low voltage level and a cleared bit can be represented by a high voltage level. Any consistent distinct states may be utilized to represent a "set" bit and a "cleared" bit.
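The behaviour of a sticky status bit as defined above can be modelled in a few lines. This is a software illustration only; the class name `StickyBit` and its methods are hypothetical, and a Python boolean stands in for whichever physical representation of "set" and "cleared" a processor uses.

```python
class StickyBit:
    """A status bit that a condition can set but only software can clear."""

    def __init__(self):
        self.is_set = False

    def observe(self, condition):
        # A positive test sets the bit; a negative test leaves it unchanged.
        if condition:
            self.is_set = True

    def clear(self):
        # Models an explicit program instruction or a hardware reset.
        self.is_set = False


bit = StickyBit()
history = []
for condition in (False, True, False, False):
    bit.observe(condition)
    history.append(bit.is_set)
bit.clear()
print(history, bit.is_set)
```

Once the condition has tested positive, later negative tests do not erase the record; only the explicit clear does.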
The prior art decision process is illustrated conceptually in FIG. 3. Data required for the process includes an FFT sticky status bit (FFTS) 320, a scale-before-store (SC) bit 322, a rounding adjustment 324, and a comparison constant 308. Rounding adjustment 324 is a predetermined constant. Rounding adjustment 324 is not part of the FFT block floating point algorithm, but is needed for good performance.
FIG. 5 illustrates the execution unit output data value partition for the example of a 16-bit processor for both the present invention and for the prior art. For a 16-bit processor, an EU (execution unit) output data value 502 has 40 bits, consisting of an 8-bit extension 504 from bit 32 to bit 39 (sometimes referred to as "guard bits"), a 16-bit high part 506 from bit 16 to bit 31, and a 16-bit low part 508 from bit 0 to bit 15. Bit 0 is the least significant bit in the EU output data value. It is high part 506 which is stored in data memory (unless there is a specific command or mode to do otherwise). In order to ensure that high part 506 is rounded to the nearest least significant bit (bit 16), the value 8000 hexadecimal is added to low part 508. Thus, in this non-limiting example, rounding adjustment 324 equals 8000 hexadecimal. If bit 15 equals 0, addition of rounding adjustment 324 will not alter high part 506, but if bit 15 equals 1, addition of rounding adjustment 324 results in a carry into high part 506. Thus, the addition operation results in high part 506 being rounded to the nearest least significant bit (bit 16). (Later, prior to being stored in memory in a step 315, low part 508 is truncated in a truncation operation 319.)
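The rounding step can be illustrated on raw bit patterns. The sketch below treats the 40-bit EU output as an unsigned integer, adds the rounding adjustment of 8000 hexadecimal, and then extracts bits 16 to 31; the function name `rounded_high_part` and the sample values are assumptions made for illustration.

```python
ROUNDING_ADJUSTMENT = 0x8000   # added to the low part (bits 0 to 15)
MASK_40_BITS = (1 << 40) - 1   # 8 guard bits + 16-bit high + 16-bit low


def rounded_high_part(eu_output):
    """Return the 16-bit high part after rounding and truncating the low part."""
    rounded = (eu_output + ROUNDING_ADJUSTMENT) & MASK_40_BITS
    return (rounded >> 16) & 0xFFFF  # dropping bits 0 to 15 is the truncation


below_half = rounded_high_part(0x0012347FFF)  # bit 15 is 0: no carry
at_half = rounded_high_part(0x0012348000)     # bit 15 is 1: carry into bit 16
print(hex(below_half), hex(at_half))
```

When bit 15 is clear the high part is stored unchanged, and when bit 15 is set the carry increments the high part by one least significant bit.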
The process starts by clearing FFTS 320 in a clearing step 300, and selectively setting or clearing SC 322 in a step 301. Following this, the process moves to the beginning of an outer loop 302-B, which handles each stage. The process continues with the beginning of an inner loop 304-B, which handles each output component from the stage.
Within each stage, there is the beginning of another loop 304-B for each vector component. At the start is a test 311 to see if SC 322 is set. If SC 322 is set, then 2 times rounding adjustment 324 is added in a step 323. This compensates for the scale down by a factor of 2 in a step 313. These steps are implemented in some prior art processors, such as those of the Motorola DSP56xxx family and 56lxx family, in which the rounding is done before the scale down by 2. In such cases, the rounding unit adds twice the value of rounding adjustment 324, to compensate for the scale down.
Next, as previously discussed, rounding adjustment 324 is added in a step 702 if SC 322 is not set, and 2 times rounding adjustment 324 is added in a step 323 if SC 322 is set. The magnitude of the real and imaginary parts of each output component is next compared with comparison constant 308 in a decision point 306. If the real or imaginary part of any component is greater than comparison constant 308, then FFTS 320 will be set in a step 310. (This will result in the output of the next stage being scaled down by a factor of 2 before being stored in data memory.) In step 319 low part 508 (FIG. 5) is truncated, as previously discussed. Then, in a step 315, the component is stored in data memory, thereby ending the inner loop at 304-E. Note that for the first stage, the scale down by 2 depends on whether or not SC 322 has been preset by the programmer.
Prior to the end of each loop of 302-E, a check is made to see if FFTS is set, at a decision point 303. If FFTS 320 is set, then SC 322 is set in a step 305, and FFTS 320 is cleared in a step 307. Note that this part of the process could alternatively be done prior to the beginning of loop 304-B. It is noted in general that FIG. 3 and the accompanying description are conceptual, and that different practical implementations are possible.
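The decision process of FIG. 3 can be approximated in software as shown below. This is a conceptual floating point sketch, not the hardware behaviour: rounding and truncation are omitted, and the text does not state what happens to SC when FFTS is found clear at the end of a stage, so the sketch assumes SC is cleared in that case. The names `prior_art_bfp_fft`, `schedule`, and `exponent` are hypothetical.

```python
import cmath
import math


def bit_reversed(x):
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, "0{}b".format(bits))[::-1], 2)] for i in range(n)]


def prior_art_bfp_fft(x, compare_constant=0.25, sc_initial=False):
    """Return (data, block exponent, per-stage record of the SC bit)."""
    n = len(x)
    data = bit_reversed([complex(v) for v in x])
    ffts = False     # FFT sticky status bit, cleared at the start
    sc = sc_initial  # scale-before-store bit, preset by the programmer
    exponent = 0
    schedule = []
    half = 1
    while half < n:  # outer loop: one iteration per stage
        scale = 0.5 if sc else 1.0
        schedule.append(sc)
        exponent += 1 if sc else 0
        step = n // (2 * half)
        out = list(data)
        for start in range(0, n, 2 * half):
            for k in range(half):  # inner loop: each output component
                w = cmath.exp(-2j * math.pi * k * step / n)
                a = data[start + k]
                f = w * data[start + k + half]
                for idx, v in ((start + k, a + f), (start + k + half, a - f)):
                    v *= scale
                    if max(abs(v.real), abs(v.imag)) > compare_constant:
                        ffts = True  # sticky: never cleared inside the stage
                    out[idx] = v
        data = out
        sc = ffts     # ASSUMPTION: SC is cleared when FFTS was not set
        ffts = False
        half *= 2
    return data, exponent, schedule


fixed = prior_art_bfp_fft([0.2] * 8)                         # constant 0.25
raised = prior_art_bfp_fft([0.2] * 8, compare_constant=0.5)  # hypothetical value
print(fixed[2], raised[2])
```

With the same input, the larger comparison constant leads to fewer scaled stages, which is the trade-off between SQNR and overflow risk that the choice of constant controls.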
The value of comparison constant 308 is predefined and has an impact on the total SQNR of the FFT. Lowering the value of comparison constant 308 will increase, on the average, the number of stages that are scaled and therefore lower the SQNR. Raising the value of comparison constant 308 will improve the SQNR but will increase the probability that an overflow will occur in spite of the precautionary scaling. In the existing DSPs which implement this solution, comparison constant 308 is fixed in the hardware and is equal to 0.25 (where the dynamic range is [-1, 1)). It can be shown that 0.25 is a lower bound that guarantees no overflow for all stages of the algorithm.
The problem with this solution is that comparison constant 308 is fixed in the hardware and there is no way for the programmer to change the comparison constant to accommodate different conditions. In some applications a comparison constant with the fixed value of 0.25 achieves insufficient SQNR (for example DMT ADSL), especially in 16 bit processors. Better values of the comparison constant are theoretically available, however, in many cases. First, a larger general lower bound exists (for any FFT length) that improves, by default, the SQNR of the algorithm. Second, for any FFT length and input signal distribution, a different "highest lower bound" exists.
For example, in a 16 point (four-stage) FFT, the worst case input (the case where the output receives the maximum value) is
x=c*[1.0000xe2x88x921.0000i
1.0000xe2x88x921.0000i
1.0000xe2x88x921.0000i
1.0000xe2x88x921.0000i
xe2x88x921.0000xe2x88x921.0000i
xe2x88x921.0000xe2x88x921.0000i
xe2x88x921.0000xe2x88x921.0000i
xe2x88x921.0000xe2x88x921.0000i
xe2x88x921.0000+1.0000i
xe2x88x921.0000+1.0000i
xe2x88x921.0000+1.0000i
xe2x88x921.0000+1.0000i
1.0000+1.0000i
1.0000+1.0000i
1.0000+1.0000i
1.0000+1.0000i]
for c=1 and without scaling at any stage, the FFT output is
[ 0  0  0  -5.9864-4.0000i  0  0  0  -0.7956-4.0000i  0  0  0  2.6727-4.0000i  0  0  0  20.1094-4.0000i ]
Because the dynamic range is [-1, 1], the output overflows. Introducing the input vector x to a BFP FFT with a comparison constant value of $2^3/20.1094$, and hence defining $c \le 2^3/20.1094$, guarantees no overflow at any stage of this example. Since this is a worst case for this 16 point FFT, the value $2^3/20.1094$ is a lower bound for this FFT size. The fixed comparison constant value of 0.25 used in prior art processors is thus not optimum for this example, since 0.25 < 8/20.1094 ≈ 0.3978. There is thus room for improvement in the SQNR by raising the comparison constant, but the programmer has no means of realizing this improvement, because there is no way to control this important aspect of DSP performance. This example illustrates the limitations of the prior art block floating point FFT implementations in achieving optimum performance and giving the programmer control over the block floating point FFT execution. Furthermore, the prior art solution is limited to supporting only Radix-2 block floating point FFT.
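The figures in this 16 point example can be reproduced with a direct DFT. The sketch below is an independent numerical check, assuming c = 1 and the input vector listed above; the helper name `dft` and the variable names are illustrative.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform, with no scaling."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


# Four samples of each value, in the order given for the worst case input.
pattern = [1 - 1j, -1 - 1j, -1 + 1j, 1 + 1j]
x = [value for value in pattern for _ in range(4)]

X = dft(x)
peak = max(max(abs(v.real), abs(v.imag)) for v in X)
ratio = 8 / peak  # the 2^3 / 20.1094 value discussed in the text

for k, v in enumerate(X):
    print(k, round(v.real, 4), round(v.imag, 4))
print(round(peak, 4), round(ratio, 4))
```

Only every fourth output bin is nonzero, and the largest component appears in the last bin, which is the value that sets the bound in this example.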
There is thus a widely recognized need for, and it would be highly advantageous to have, a mechanism for block floating point FFT which achieves a better signal to quantization noise ratio by permitting optimal run-time adjustment of the comparison constant as well as support for other FFT structures. This goal is attained by the present invention.
The present invention solves the problem of limitations on the signal to quantization noise ratio and lack of flexibility for the programmer to control the performance of block floating point FFT applications.
The present invention is of a mechanism for improving the SQNR in BFP FFT algorithms implemented on a DSP processor by giving the programmer run-time control over the value of the comparison constant, and therefore over the algorithm's performance. This is done by adding a user-loadable FFT compare (FFTC) register to the processor, either to the processor's execution unit (EU) or else outside the execution unit itself, in addition to the dedicated mode FFT bit (FFTB), the compare absolute value unit, and the FFT sticky status bit (FFTS) that already exist in current processors that support BFP FFT algorithms.
The programmer is thus able to write a program executed by the processor, in which the program can control the loading of the FFTC register. The FFTC register can thus contain a programmable comparison constant.
The FFT compare unit compares the absolute value of each number that is stored to the data memory to the FFT compare register FFTC, and, depending on the result of the comparison, the FFTS may be set. For example, if the absolute value of a number written to data memory is bigger than the FFTC value and if the BFP FFT mode is set, then the FFTS is set. This informs the programmer that at least one of the outputs of the current stage exceeds the compare value. Under program control, the FFTS will be cleared and a shifter unit will be configured to scale down by 2 (shift right by one) all the numbers written to data memory in the next stage.
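The interaction between the FFTC register, the FFTB mode bit, and the FFTS bit can be modelled in software as follows. This is a behavioural sketch only: the class `FftCompareUnit`, its method names, the initial FFTC value of 0.25, and the sample values are assumptions made for illustration and do not describe an actual register interface.

```python
class FftCompareUnit:
    """Software model of the compare-on-store behaviour described above."""

    def __init__(self):
        self.fftc = 0.25   # FFT compare register (initial value is assumed)
        self.fftb = False  # BFP FFT mode bit
        self.ffts = False  # FFT sticky status bit

    def load_fftc(self, value):
        """Program-controlled load of the comparison constant."""
        self.fftc = value

    def on_store(self, value):
        """Called for each number written to data memory."""
        if self.fftb and abs(value) > self.fftc:
            self.ffts = True  # sticky: stays set until cleared by the program

    def end_of_stage(self):
        """Read and clear FFTS; True means scale the next stage down by 2."""
        scale_next_stage = self.ffts
        self.ffts = False
        return scale_next_stage


unit = FftCompareUnit()
unit.fftb = True
unit.load_fftc(0.3978)  # example value taken from the 16 point discussion
for stored in (0.10, -0.39, 0.20):
    unit.on_store(stored)
first_decision = unit.end_of_stage()

for stored in (0.10, -0.45, 0.20):
    unit.on_store(stored)
second_decision = unit.end_of_stage()
print(first_decision, second_decision)
```

Because the constant is loaded by the program, the same stored values can lead to a different scaling decision than with a constant fixed at 0.25.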
Therefore, according to the present invention there is provided a processor for performing a block floating point FFT on a plurality of data values, the processor executing a program, the processor including: (a) an FFT compare register, the FFT compare register operative to contain a programmable comparison constant which can be loaded under control of the program, the programmable comparison constant having a first magnitude; (b) an execution unit having an output data value, the output data value having a second magnitude; (c) a compare absolute value unit for comparing the second magnitude to the first magnitude; (d) a scale down by 2 unit for dividing the output data value by a factor of 2; (e) a scale-before-store mode for activating the scale down by 2 unit; and (f) an FFT sticky status bit for indicating that the second magnitude has exceeded the first magnitude.