The present invention relates to an improved block floating point mechanism for an FFT (Fast Fourier Transform) processor.
The FFT is probably one of the most important algorithms in digital signal-processing (DSP) applications. There are two approaches to computing the transform: software running on a programmable DSP, and a dedicated FFT processor. Real-time DSP favors the latter, which offers parallel processing capability.
One of the important parts of an FFT processor hardware system is the butterfly processor for arithmetic operation. The FFT butterfly computation operates on data in sets of r points, where r is called the radix. A P-point FFT uses P/r butterfly units per stage for $\log_r P$ stages. The computational result of one butterfly stage is the input data of the next butterfly stage. For example, a signal flow diagram of a basic radix-2 butterfly unit is illustrated in FIG. 1, and a signal flow diagram of an 8-point radix-2 FFT processor is illustrated in FIG. 2. The relationship between the inputs A, B and the outputs A', B' of the radix-2 butterfly unit is expressed as:
$$A' = A + B\,W_N^k$$
$$B' = A - B\,W_N^k$$
where $W_N^k$ is the so-called "twiddle" factor, and all parameters A, B, A', B', and $W_N^k$ are complex variables. The butterfly computation of the 8-point FFT is performed by three butterfly stages I, II, and III, and each stage includes four butterfly units, as shown in FIG. 2. The computational requirements of one butterfly unit are one complex multiply, one complex add, and one complex subtract. As is known, these complex computations have to be changed into real computations, including three real additions, three real subtractions, and four real multiplications.
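The operation count stated above can be checked with a minimal sketch of the radix-2 butterfly written out in real arithmetic. The function name `butterfly_real` and its argument names are illustrative assumptions, not part of the described hardware.

```python
def butterfly_real(ar, ai, br, bi, wr, wi):
    """Radix-2 butterfly on separate real and imaginary parts.

    Computes A' = A + B*W and B' = A - B*W, where A = ar + j*ai,
    B = br + j*bi, and W = wr + j*wi (hypothetical argument names).
    """
    # Complex product T = B*W:
    # four real multiplications, one real subtraction, one real addition.
    tr = br * wr - bi * wi
    ti = br * wi + bi * wr
    # A' = A + T: two real additions.
    a_out = (ar + tr, ai + ti)
    # B' = A - T: two real subtractions.
    b_out = (ar - tr, ai - ti)
    return a_out, b_out
```

Counting the operators gives four multiplications, three additions, and three subtractions, in agreement with the text. For instance, with A = 1 + 2j, B = 3 - j, and W = -j, the call `butterfly_real(1, 2, 3, -1, 0, -1)` yields A' = -j and B' = 2 + 5j.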
The block floating point algorithm is widely used in butterfly computation due to its high-speed processing for blocked data. As described above, the butterfly unit includes several multiply, add, and subtract operations, and thus the data range may grow, resulting in an overflow. In general, however, the butterfly processor is made up of fixed-point multipliers and adders. Therefore, guard bits must be provided in the butterfly processor to prevent an overflow error from occurring in the computational result of any butterfly computation. In addition, the overflow has to be detected so that the overflowed data can be shifted appropriately, whereby the overflows will not accumulate during multiple-stage butterfly computations. In this manner, the overflow bits will never exceed the guard bits and cause errors.
Since the butterfly units in the same butterfly stage have different data inputs, the overflow bit number of a computational result may differ from one butterfly unit to another. For example, a two-bit overflow, a one-bit overflow, or no overflow may occur in a radix-2 butterfly unit. Because the decimal points of all computing data in each butterfly stage have to be aligned when the fixed-point butterfly processor is used, these different overflows cannot be shifted individually by different numbers of bits. Therefore, the overflows of all resultant data from the butterfly units in the same stage have to be detected to obtain the largest overflow bit number. These resultant data have to be shifted by the largest overflow bit number before entering the next-stage butterfly computation. This processing method is called the block floating point algorithm.
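As a rough illustration of the per-stage scaling just described, the following sketch finds the largest overflow bit number in a block of results and shifts every value by that same amount. It assumes signed two's-complement integers and an m-bit data word; the names `overflow_bits` and `block_scale` are hypothetical.

```python
def overflow_bits(value, m):
    """Smallest right shift that makes `value` fit in an m-bit signed word."""
    lo, hi = -(1 << (m - 1)), (1 << (m - 1)) - 1
    shift = 0
    # Python's >> on int is an arithmetic (sign-preserving) shift.
    while not (lo <= (value >> shift) <= hi):
        shift += 1
    return shift


def block_scale(values, m):
    """Shift all values by the largest overflow bit number in the block."""
    largest = max(overflow_bits(v, m) for v in values)
    # Every value gets the same shift, so the binary points stay aligned.
    return largest, [v >> largest for v in values]
```

With an 8-bit word (range -128 to 127), `block_scale([100, 300, -130, 5], 8)` detects overflows of 0, 2, 1, and 0 bits, so all four values are shifted by 2 bits, giving `(2, [25, 75, -33, 1])`. The right shift rounds toward negative infinity here; actual hardware may round differently.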
A conventional mechanism for implementing the block floating point algorithm is illustrated in FIG. 3. The block floating point mechanism includes a shifter 10, a butterfly processor 20 coupled to the shifter 10, and an overflow detector 30 coupled to the butterfly processor 20 and the shifter 10. The shifter 10 receives the source data to be computed from the memory. The source data for the first-stage butterfly computations are not shifted by the shifter 10, but are sent to the butterfly processor 20 directly. The butterfly processor 20 performs the butterfly computations, and sends out the resultant data at its output. The overflow detector 30 coupled to the output of the butterfly processor 20 detects the overflow of all resultant data. When the final butterfly computation of the first stage is completed, and the final resultant data have been examined by the overflow detector 30, the largest overflow bit number $M_1$ is obtained and sent to the shifter 10. The resultant data of the first-stage butterfly computations are sent to the memory, and act as the source data for the second-stage butterfly computations. The shifter 10 retrieves the source data for the second-stage butterfly computations from the memory, and shifts them by $M_1$ bit(s). The shifted data are sent to the butterfly processor 20 for butterfly computations, and the resultant data are also examined by the overflow detector 30 to obtain the largest overflow bit number $M_2$, which is in turn sent to the shifter 10. The resultant data of the second-stage butterfly computations are sent to the memory, and act as the source data for the third-stage butterfly computations. In sum, during the kth-stage butterfly computations, the shifter 10 retrieves the resultant data of the (k-1)th-stage butterfly computations from the memory, and shifts them by $M_{k-1}$ bit(s) to ensure computational correctness.
The shifted data are then sent to the butterfly processor 20 for the kth-stage butterfly computations, and the resultant data are also examined by the overflow detector 30 to obtain the largest overflow bit number $M_k$, which is in turn sent to the shifter 10. The resultant data of the kth-stage butterfly computations are sent to the memory, and act as the source data for the (k+1)th-stage butterfly computations. These processes repeat until the butterfly computations for all stages are completed. To prevent the overflows from causing computational errors, $M_g$ guard bits are provided in the butterfly processor 20, where $M_g$ is not smaller than any $M_k$. Suppose that the data path width of the butterfly processor 20 is m bits. The bit numbers of the processing binary data in the block floating point mechanism of FIG. 3 change as follows: ##STR1##
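The stage-by-stage control flow of FIG. 3 can be sketched in software as follows. This is only a behavioral model under stated assumptions: data are signed integers, the butterfly stage is passed in as a callable, and `toy_stage` is a hypothetical stand-in that uses a twiddle factor of 1 everywhere so that the arithmetic stays in integers. It is not the FFT butterfly processor itself.

```python
def overflow_bits(value, m):
    """Smallest right shift that makes `value` fit in an m-bit signed word."""
    lo, hi = -(1 << (m - 1)), (1 << (m - 1)) - 1
    shift = 0
    while not (lo <= (value >> shift) <= hi):
        shift += 1
    return shift


def run_block_floating_point(data, stages, m, butterfly_stage):
    """Model of the shifter / butterfly processor / overflow detector loop."""
    memory = list(data)
    shift = 0      # first-stage source data are not shifted
    exponent = 0   # total bits shifted so far (the block exponent)
    for k in range(stages):
        # Shifter 10: apply the largest overflow found in the previous stage.
        shifted = [v >> shift for v in memory]
        exponent += shift
        # Butterfly processor 20: results may grow into the guard bits.
        memory = butterfly_stage(k, shifted)
        # Overflow detector 30: known only after the whole stage is done.
        shift = max(overflow_bits(v, m) for v in memory)
    return memory, exponent, shift


def toy_stage(k, x):
    """Hypothetical add/subtract stage with all twiddle factors equal to 1."""
    n = len(x)
    half = n >> (k + 1)
    out = list(x)
    for start in range(0, n, 2 * half):
        for i in range(start, start + half):
            a, b = x[i], x[i + half]
            out[i], out[i + half] = a + b, a - b
    return out
```

For eight inputs of 100 and an 8-bit word, `run_block_floating_point([100] * 8, 3, 8, toy_stage)` returns `([200, 0, 0, 0, 0, 0, 0, 0], 2, 1)`: the first output 200 scaled by 2 to the power 2 equals the true sum 800, and the final value of 1 is the overflow still held in the guard bits after the last stage. Note that the overflow detection for a stage sits after the loop over all of that stage's butterflies, which is the dependency the following paragraph criticizes.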
According to the conventional block floating point mechanism described above, the first butterfly computation of the kth stage cannot start until the final butterfly computation of the (k-1)th stage is completed. This results in several pipeline wait cycles if the butterfly processor 20 is implemented with pipeline technology. A more finely pipelined butterfly processor incurs more wait cycles, and thus a further reduction in computational efficiency.
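The dependence of the wait cycles on pipeline depth can be estimated with a deliberately simple cycle-count model. It assumes one butterfly is issued per clock cycle, a result emerges `pipeline_depth` cycles after issue, and memory and overflow-detector latencies are ignored. These assumptions and the function name `stage_barrier_cycles` are illustrative only and do not come from the source.

```python
def stage_barrier_cycles(butterflies_per_stage, stages, pipeline_depth):
    """Return (total cycles with stage barriers, wait cycles lost to them).

    Assumed model: one butterfly issued per cycle; each result appears
    `pipeline_depth` cycles after issue; a stage cannot begin until the
    previous stage's last result has left the pipeline.
    """
    # With a barrier, every stage pays the pipeline fill/drain time.
    per_stage = butterflies_per_stage + pipeline_depth - 1
    with_barrier = stages * per_stage
    # Idealized reference: stages overlap, so the pipeline drains only once.
    overlapped = stages * butterflies_per_stage + pipeline_depth - 1
    return with_barrier, with_barrier - overlapped
```

For the 8-point radix-2 example (four butterflies per stage, three stages), a 5-deep pipeline gives `(24, 8)` and a 10-deep pipeline gives `(39, 18)`. Under this model the wait cycles grow as (stages - 1) times (pipeline_depth - 1), which is consistent with the observation that finer pipelining costs more waiting.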