Algorithms that perform discrete transforms such as Fast Fourier Transforms (FFTs) are well known. The Fourier transform is a mathematical operator for converting a signal from a time-domain representation to a frequency-domain representation. The inverse Fourier transform is an operator for converting a signal from a frequency-domain representation to a time-domain representation. The Discrete Fourier Transform (DFT) may be viewed as a special case of the continuous form of the Fourier transform. The DFT determines a set of spectrum amplitudes and phases or coefficients from a time-varying signal defined by samples taken at discrete time intervals.
As is well known, in the mid-1960's techniques were developed for more rapid computation of the discrete Fourier transform. These techniques became known as the fast Fourier transform (FFT), first described in a paper by J. W. Cooley and J. W. Tukey, entitled “An Algorithm for the Machine Calculation of Complex Fourier Series,” Mathematics of Computation (1965), Vol. 19, No. 90, pp. 297–301. Some patents in the field of processing FFTs include U.S. Pat. No. 3,673,399 to Hancke et al for FFT PROCESSOR WITH UNIQUE ADDRESSING; U.S. Pat. No. 6,035,313 to Marchant for a MEMORY ADDRESS GENERATOR FOR AN FFT; U.S. Pat. No. 6,247,034 B1 to Nakai et al for a FAST FOURIER TRANSFORMING APPARATUS AND METHOD, VARIABLE BIT REVERSE CIRCUIT, INVERSE FAST FOURIER TRANSFORMING APPARATUS AND METHOD, AND OFDM RECEIVER AND TRANSMITTER; U.S. Pat. No. 4,823,297 to Evans for a DIGIT-REVERSAL METHOD AND APPARATUS FOR COMPUTER TRANSFORMS; U.S. Pat. No. 5,329,474 to Yamada for an ELEMENT REARRANGEMENT METHOD FOR FAST FOURIER TRANSFORM; U.S. Pat. No. 5,473,556 to Aguilar et al for DIGIT REVERSE FOR MIXED RADIX FFT; and U.S. Pat. No. 4,977,533 to Miyabayashi et al for a METHOD FOR OPERATING AN FFT PROCESSOR.
In performing a fast Fourier transform of the type known as a radix-two decimation-in-time FFT, the size of the transform is successively halved at each stage. In the illustrative circuit described in FIG. 2, a 32-point FFT is split into a pair of 16-point FFT's, which are in turn split into four 8-point FFT's, then eight 4-point FFT's, and finally sixteen 2-point FFT's. The resulting computation for a 32-point FFT is shown in the signal flow graph of FIG. 2. The quantities on the left-hand side of the signal flow graph, ranging from x(0) to x(31), are the sampled inputs to the FFT, while the signals appearing at the right-hand side of the signal flow graph and numbered 0 through 31 are the resulting FFT coefficients. The signal flow graph illustrates that there are five passes or phases of operation, derived from the relationship that the number 32 is two to the fifth power.
The convention used in the signal flow graph is that an arrowhead represents multiplication by the complex quantity W^k adjacent to the arrowhead. The small circles represent addition or subtraction as indicated in FIG. 2a. If the inputs to each of the butterfly computational modules shown in FIG. 2a are indicated by signal names A and B, and the outputs are indicated by signal names C and D, then the computations performed in the butterfly module are: C=A+BW and D=A−BW. The W values are usually referred to as “twiddle factors” and represent phasors of unit length and an angular orientation which is an integral multiple of 2π/32.
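The butterfly relations C=A+BW and D=A−BW can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `twiddle` and `butterfly` are hypothetical, and the negative sign in the exponent is an assumed convention that the text does not specify.

```python
import cmath

def twiddle(k, n=32):
    """Unit-length phasor at angle k * (2*pi/n).

    The negative sign in the exponent is an assumed convention.
    """
    return cmath.exp(-2j * cmath.pi * k / n)

def butterfly(a, b, w):
    """Return the outputs (C, D) of one butterfly: C = A + B*W, D = A - B*W."""
    t = b * w  # one complex multiply, shared by both outputs
    return a + t, a - t

# Example: twiddle(8) is a quarter turn (-j under the assumed convention),
# so B*W = j * (-j) = 1, giving C close to 2 and D close to 0.
c, d = butterfly(1 + 0j, 0 + 1j, twiddle(8))
```

Because each butterfly reads two values and writes two values, the outputs can overwrite the inputs without any temporary array.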
An aspect of FFT computation is that the results of each butterfly computation may be stored back in memory in the same location from which the inputs to the butterfly were obtained. More specifically, the C and D outputs of each butterfly may be stored back in the same locations as the A and B inputs of the same butterfly. This FFT computation is referred to as an “in-place” algorithm. Most discrete transforms are executed “in-place” to conserve memory, which in turn reduces system size, power consumption, and cost, and frees memory for other tasks. For such “in-place” FFTs, the reordering required to counteract the effect of the transform decompositions is achieved by a particular permutation of the elements of the data sequence.
Bit-reversed address mapping is commonly used in performing radix-2 FFTs. When the radix-2 FFT is computed, data must be rearranged in bit-reversed order. If the FFT is performed entirely by software, the FFT process uses an algorithm to pre-place data in memory in bit-reversed order prior to executing the butterfly computations.
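As a rough illustration of a software-only radix-2 FFT that pre-places data in bit-reversed order and then runs the butterflies in place, consider the following sketch. The names `bit_reverse` and `fft_in_place` are hypothetical, the twiddle sign convention is assumed, and the reordering step copies through a temporary list for simplicity rather than reordering in place.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_in_place(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    # Pre-place the data in bit-reversed order (via a temporary copy here).
    x[:] = [x[bit_reverse(i, bits)] for i in range(n)]
    # log2(n) passes; each butterfly overwrites its own two inputs.
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b
                x[start + k + half] = a - b
        half *= 2
    return x
```

For a unit impulse at index 0, every output coefficient comes out equal to one, which is a quick sanity check on the pass structure.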
Obtaining FFT efficiency is a high priority in the computer processor industry. The FFT algorithm has high intrinsic value and is widely used. The instruction cycle requirement of custom optimized FFT software is the accepted benchmark standard for measuring a processor's computational efficiency. For a specific type of FFT (e.g., in-place, using relocatable data memory, single precision, radix 2, complex, 256 point, unconditional ½ scaling per butterfly, etc.) the number of FFTs/sec executed is a more accurate relative measure of a processor's computational power than MIPs (millions of instructions per second). FFT software requiring fewer resources enhances both the real and projected capabilities of the processor.
Because an optimized FFT computation includes bit reversed addressing, many DSPs (Digital Signal Processors) include customized instructions to facilitate an efficient implementation of bit reversed addressing. Typically, this is done by special instructions that allow address registers to be incremented so that carry (or borrow) bits propagate toward less significant bits (backward). For normal addition carry bits must propagate toward more significant bits. The present invention is primarily intended to optimize FFT software implemented on a processor capable of bit-reversed address register incrementing in the described manner. However, the invention also has applications on processors that lack this capability.
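The backward carry propagation described above can be emulated in software. The following sketch uses a hypothetical helper, `reverse_carry_increment`, and assumes the address is an offset within an aligned buffer, so that only the low `bits` bits take part.

```python
def reverse_carry_increment(addr, bits):
    """Add binary 100...0 (the top bit of a `bits`-bit offset) to addr,
    propagating any carry toward the less significant bits."""
    bit = (1 << bits) >> 1          # start at the most significant bit
    while bit and (addr & bit):
        addr ^= bit                 # bit overflows: clear it, carry downward
        bit >>= 1
    return addr | bit               # set the first clear bit, or wrap to zero

# Walk all eight offsets of a 3-bit address space.
addr = 0
sequence = []
for _ in range(8):
    sequence.append(addr)
    addr = reverse_carry_increment(addr, 3)
# sequence is [0, 4, 2, 6, 1, 5, 3, 7]; addr has wrapped back to 0
```

A processor with hardware support performs the equivalent of this loop in a single address register update, which is why the feature matters for cycle counts.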
Reference is made to Table I, listing a binary address, contents of memory before bit reversed ordering, the corresponding bit reversed binary addresses, and contents of memory after bit reversed ordering. Assume an input array is stored in 2^(log 2N+M) contiguous words of memory, beginning at start address S_in, where log 2N denotes the base-two logarithm of N. The array has 2^(log 2N) elements (i.e., N elements) and each element is stored in 2^M contiguous words of data memory. For example, with M=2, four words of contiguous memory would accommodate two words of precision for both the real and imaginary part of complex input data elements. An arbitrary address for data memory containing the input array can be expressed in the form,
AR1 = S_in + [B_(log 2N−1)*2^(log 2N−1) + B_(log 2N−2)*2^(log 2N−2) + . . . + B_0*2^0]*2^M + P
(each binary B_k coefficient can be zero or one, and P=0,1,2, . . . (2^M)−1). The corresponding bit reversed address is obtained by reversing the order of the B_k values:
AR2 = bit_rev(AR1) = S_out + [B_0*2^(log 2N−1) + B_1*2^(log 2N−2) + . . . + B_(log 2N−1)*2^0]*2^M + P.
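The address mapping can be checked numerically with a short sketch. The function name `bit_rev_address` and its parameter names are hypothetical; the code assumes word addresses are plain integers and follows the decomposition into start address, element index bits B_k, and word offset P given above.

```python
def bit_rev_address(ar1, s_in, s_out, log2n, m):
    """Map an input-array address AR1 to its bit reversed address AR2."""
    offset = ar1 - s_in
    p = offset & ((1 << m) - 1)     # P: word position inside the element
    index = offset >> m             # element index, bits B_(log2N-1) .. B_0
    rev = 0
    for _ in range(log2n):          # reverse the order of the B_k bits
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return s_out + (rev << m) + p

# Example: 8 elements of 4 words each (log2n=3, m=2).
# Element 1, word 3 of the input maps to element 4, word 3 of the output.
ar2 = bit_rev_address(0x107, s_in=0x100, s_out=0x200, log2n=3, m=2)
# ar2 == 0x213
```

Note that only the element index is reversed; the word offset P inside each element is carried over unchanged.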
An array has been “bit reversed” after all input data is copied from its original location at address AR1 to its new location at address AR2=bit_rev(AR1). Sequential output array elements are rearranged in bit reversed order relative to the input array. Table I illustrates a bit reversed array for the case log 2N=3, M=S_in=S_out=0. The sequential addresses in the bit reversed address column are obtained by incrementing the prior address by binary 100 and propagating any resulting carry bit backwards, i.e., toward less significant bits. Self-reversed addresses occur when AR1=bit_rev(AR1). The fourth column in Table I illustrates the two ways in which a bit reversed address can equal an AR1 address: by self reversal of the AR1 address, such as binary AR1=1,1,1; or by equaling some AR1 address other than one that is self reversed, such as binary address 0,0,1, whose bit reversed address is binary 1,0,0. For typical processors and software, the output buffer must be “aligned”, i.e., S_out or S_in must be a multiple of 2^(log 2N+M) for bit reversed address register incrementation to work properly.
TABLE I
Bit Reversed Mapping of an Exemplary Array
Out of place bit reversal (OOPBR) refers to the technique of bit reversing an input data array so that the output data array falls elsewhere in data memory, i.e., S_in ≠ S_out, whereas in place bit reversal (IPBR) refers to the technique of re-ordering elements of an input data array in bit reversed order so that the output array overwrites the input array, i.e. S_in=S_out. For some applications, OOPBR may be advantageous if input data is located in slower, hence cheaper, memory, and faster “scratch” or “volatile” memory is available to generate the bit reversed output array. The subsequent FFT operations on the bit reversed array exploit the faster memory. For this case the cycles required may exceed the benchmark OOPBR FFT cycles, because the digital signal processor (DSP) manufacturer will measure the benchmark case with both the input and output OOPBR array in the fastest memory. An FFT using OOPBR may have a hidden cycle penalty beyond the bit reversal itself, when the output is eventually copied back to the location of the input array. Computational processes that use more of the available scratch memory than necessary can lead to future problems when converting to an operating system that permits multiple computational processes to interrupt each other.
For other applications, the input data for the FFT is already located in fast data memory. For example, the input data may be arrived at as the result of many computations, and for adequate optimization of MIPs, the FFT input array may already be in fast memory. In that event, OOPBR increases the amount of fast data memory required by the entire FFT by a factor of two. This is the case because the rest of the FFT embodies an intrinsically in place algorithm, requiring no additional data memory other than the input array itself. In the event that the cycles required for IPBR can be made more competitive relative to OOPBR, for many applications the additional data memory requirement of OOPBR cannot be justified.
The second and third columns of Table II illustrate the same sequence of address pairs given in columns one and three of Table I. The conventional IPBR address generator yields these address pairs for N=8. The fourth column indicates which address pairs are needed for IPBR, i.e., unique address pairs referencing data that needs to be swapped. The fourth column of Table II also illustrates that for an array of eight elements, the address pair generator conventionally used for IPBR produces useful address pairs for address pair numbers two and four, which is only two out of eight bit reversed pairs.
TABLE II
Conventional IPBR Address Pair Generator Results for an N = 8 Element Array

Address pair   Binary    Bit reversed     Address pair needed for IPBR mapping
number         address   binary address   array in bit reversed order?
1              000       000              No, self-reversed
2              001       100              YES
3              010       010              No, self-reversed
4              011       110              YES
5              100       001              No, redundant with address pair 2
6              101       101              No, self-reversed
7              110       011              No, redundant with address pair 4
8              111       111              No, self-reversed
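The rows of Table II can be regenerated with a brief sketch. The function name `classify_pairs` is hypothetical, and the sketch treats the first occurrence of a non-self-reversed pair (the one whose natural address is the smaller of the two) as the useful one.

```python
def classify_pairs(log2n):
    """Return (pair number, address, bit reversed address, verdict) rows."""
    rows = []
    for addr in range(1 << log2n):
        binary = format(addr, f"0{log2n}b")
        rev = int(binary[::-1], 2)              # bit reversed address
        if addr == rev:
            verdict = "No, self-reversed"
        elif addr < rev:
            verdict = "YES"                     # first occurrence of the pair
        else:
            # the same pair was already generated when addr equalled rev
            verdict = f"No, redundant with address pair {rev + 1}"
        rows.append((addr + 1, binary, binary[::-1], verdict))
    return rows

rows = classify_pairs(3)
useful = sum(1 for row in rows if row[3] == "YES")
# useful == 2: only two of the eight generated pairs call for a swap
```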
A flawed IPBR algorithm is now described to illustrate the problems encountered in attempting to optimize IPBR. A first address register is initialized to S_in, and on each iteration this register is advanced linearly to reference the next array element in natural order. A second address register is also initialized to S_in and is incremented each iteration in a bit reversed manner to obtain the corresponding bit reversed version of the first address. Thus a new pair of addresses is generated each iteration, as illustrated by columns 2 and 3 of Table II. After each bit reversed address pair is generated, the contents of memory referenced by the first and second address registers are exchanged. This technique will work for OOPBR. But for IPBR, all the self-reversed address contents are needlessly exchanged once, and all the non-self-reversed address contents are erroneously exchanged twice. The first address register at some point references every element in the array, so if the address pair (A, B) is generated, (B, A) is also generated somewhere in the sequence of address pairs. This flawed IPBR approach exchanges the data referenced by any non-self-reversed address and its bit reversed complement not once but twice, resulting in an output array that is identical to the input array.
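The double exchange is easy to reproduce. The sketch below is a software emulation under stated assumptions: the names `reverse_carry_increment` and `flawed_ipbr` are hypothetical, S_in is taken as zero, and each element occupies one word.

```python
def reverse_carry_increment(addr, bits):
    """Add the top bit of a `bits`-bit offset, carrying toward the LSB."""
    bit = (1 << bits) >> 1
    while bit and (addr & bit):
        addr ^= bit
        bit >>= 1
    return addr | bit

def flawed_ipbr(data):
    """Exchange the contents of every generated address pair, unconditionally."""
    n = len(data)                   # assumed to be a power of two
    bits = n.bit_length() - 1
    first = second = 0
    for _ in range(n):
        data[first], data[second] = data[second], data[first]
        first += 1                                      # linear increment
        second = reverse_carry_increment(second, bits)  # bit reversed increment
    return data

result = flawed_ipbr(list(range(8)))
# result == [0, 1, 2, 3, 4, 5, 6, 7]: every swap has been undone
```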
The conventional IPBR algorithm in the prior art involves a modification of this flawed approach. The conventional IPBR algorithm generates address pairs in a manner identical to the described flawed algorithm. However, instead of always swapping the contents referenced by each address pair that is generated, the swap is only executed if the address generated by linear incrementing is less than the address produced by bit-reversed incrementing. Note the criterion of the first address being less than the second identifies the first occurrences of useful address pairs for IPBR in Table II. This condition for swapping eliminates transferring data from self-reversed addresses and prevents swapping for one of the redundant pairs of non-self-reversed addresses. Implementing the conditional swapping typically requires transferring both address registers into accumulators, subtracting, and conditionally branching. For this reason, typical IPBR implementations require two to ten times as many instruction cycles as OOPBR implementations.
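A minimal sketch of the conditional-swap approach follows, again as a software emulation with hypothetical names (`reverse_carry_increment`, `conventional_ipbr`), S_in taken as zero, and one word per element. The comparison inside the loop stands in for the accumulator transfer, subtraction, and conditional branch mentioned above.

```python
def reverse_carry_increment(addr, bits):
    """Add the top bit of a `bits`-bit offset, carrying toward the LSB."""
    bit = (1 << bits) >> 1
    while bit and (addr & bit):
        addr ^= bit
        bit >>= 1
    return addr | bit

def conventional_ipbr(data):
    """Bit reverse `data` in place; return the number of swaps performed."""
    n = len(data)                   # assumed to be a power of two
    bits = n.bit_length() - 1
    first = second = 0
    swaps = 0
    for _ in range(n):
        if first < second:          # skips self-reversed and redundant pairs
            data[first], data[second] = data[second], data[first]
            swaps += 1
        first += 1
        second = reverse_carry_increment(second, bits)
    return swaps

data = list(range(8))
swaps = conventional_ipbr(data)
# data == [0, 4, 2, 6, 1, 5, 3, 7] and swaps == 2
```

All eight address pairs are still generated and tested even though only two lead to a swap, which is the overhead the text attributes to this method.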
The conventional IPBR method is inefficient because it relies on an address pair generator that yields extraneous address pairs.