One of the challenges in implementing a Digital Radio Mondiale (DRM) application in a communication unit is the implementation of a Fast Fourier Transform (FFT) algorithm and/or an Inverse Fast Fourier Transform (IFFT) algorithm that are required in the baseband processing of an orthogonal frequency division multiplexed (OFDM) receiver. This is a challenging task as several of the supported FFT lengths are not a ‘power-of-two’, and the FFT when decomposed does not yield a multiple-of-2 FFT. Therefore, in these cases, a non-standard implementation approach to FFTs needs to be adopted. Table 1 below provides some examples of the FFT lengths that need to be supported with different DRM transmission modes.
TABLE 1
DRM transmission mode    FFT length
A                        576
B                        512
C                        352
D                        224
E                        432
It is known that FFTs with power-of-2 lengths are easily realizable and typically exist as libraries supplied by the vendor of a digital signal processor (DSP) that supports FFT and/or IFFT functionality. However, some of the FFT lengths required by DRM in Table 1 are not powers of 2 and are not available as libraries. Therefore, an efficient realization of such an FFT on a DSP requires special techniques that make the best use of the given processor architecture.
The FFT is a faster implementation of a Discrete Fourier Transform (DFT), which, for a sequence of N complex numbers x0, x1, . . . xN−1, is defined by equation [1]:
$$X_k \overset{\text{def}}{=} \sum_{n=0}^{N-1} x_n \cdot e^{-2\pi i k n / N}, \qquad k \in \mathbb{Z}\ (\text{integers}) \qquad [1]$$
The DFT computes the frequency-domain values (namely X_k) of a given input time-domain sequence (namely x_n), and the terms e^(−2πikn/N) used in equation [1] are referred to as twiddle factors. A twiddle factor, in FFT algorithms, is any of the trigonometric constant coefficients that are multiplied by the data in the course of the algorithm. It is known that the FFT efficiently implements the DFT by exploiting symmetry in its twiddle factors.
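As a minimal sketch of equation [1], the following Python evaluates the DFT directly, with one twiddle-factor multiplication per term. The function name `dft` and the two test signals are illustrative choices, not part of the described system.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of equation [1] for one period k = 0..N-1."""
    n_points = len(x)
    return [
        sum(
            x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)  # twiddle factor
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


# A unit impulse has a flat spectrum; a constant puts all energy in bin 0.
impulse_spectrum = dft([1, 0, 0, 0])
constant_spectrum = dft([1, 1, 1, 1])
```

The direct form needs N² complex multiplications, which is the cost that FFT decompositions reduce.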
A well-known FFT algorithm is the “divide and conquer” approach proposed by Cooley and Tukey in ‘J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, pp. 297-301, April 1965’. This method is used for FFTs whose length is a power of the radix (e.g., two for radix-2). If other lengths are required, a mixed-radix algorithm can be used. For example, an FFT-288 can be re-expressed with radix-2 and radix-3 FFTs (e.g. the 288-point FFT can be decomposed to FFT-32×FFT-9). A more efficient approach was introduced by Good in ‘I. J. Good, “The Interaction Algorithm and Practical Fourier Analysis”, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 20, No. 2 (1958), pp. 361-372’ in order to eliminate the intermediate twiddle-factor multiplications required in the Cooley-Tukey approach. This algorithm is sometimes known as the Prime Factor Algorithm (PFA).
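The mixed-radix idea can be sketched for the 288 = 32 × 9 example as follows. This is a textbook two-factor Cooley-Tukey split under illustrative assumptions: the helper names are hypothetical, and the small DFTs are evaluated directly rather than with optimized radix-2 or radix-3 kernels. The point of interest is the intermediate twiddle multiplication between the two stages, which the PFA avoids.

```python
import cmath


def dft(x):
    """Direct DFT, used here as a stand-in for the small FFT kernels."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def cooley_tukey_two_factor(x, n1, n2):
    """Compute a DFT of length n1*n2 from n2 DFTs of length n1 and n1 DFTs of length n2."""
    n_total = n1 * n2
    assert len(x) == n_total
    # Stage 1: n2 DFTs of length n1 over the decimated inputs x[n2*i + r].
    inner = [dft(x[r::n2]) for r in range(n2)]  # inner[r][k1]
    # Intermediate twiddle multiplications W_N^(r*k1).
    for r in range(n2):
        for k1 in range(n1):
            inner[r][k1] *= cmath.exp(-2j * cmath.pi * r * k1 / n_total)
    # Stage 2: n1 DFTs of length n2, written to output index k1 + n1*k2.
    out = [0j] * n_total
    for k1 in range(n1):
        column = dft([inner[r][k1] for r in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = column[k2]
    return out


signal = [complex((7 * n) % 13, (3 * n) % 5) for n in range(288)]
spectrum = cooley_tukey_two_factor(signal, 32, 9)
```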
It is known that non-power-of-two FFTs can be generalized to a group of 2-dimensional PFA-decomposable DFTs of the form of equation [2]:
$$N = N_1 \cdot N_2 = (2p+1) \cdot 2^{q} \qquad [2]$$
Table 2 provides an overview of a selection of the FFT lengths that can be generated with parameters ‘p’ and ‘q’ of the PFA equation [2], with those required for DRM marked with an asterisk.
TABLE 2
         q = 4    q = 5    q = 6    q = 7
p = 2       80      160      320      640
p = 3      112     224*      448      896
p = 4      144      288     576*     1152
p = 5      176     352*      704     1408
p = 6      208      416      832     1664
p = 7      240      480      960     1920
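The entries of Table 2 follow directly from N = (2p+1)·2^q. The short sketch below regenerates them and checks which of the DRM lengths from Table 1 fall inside the tabulated range; the variable names are illustrative.

```python
def pfa_length(p, q):
    """FFT length N = (2p + 1) * 2**q from equation [2]."""
    return (2 * p + 1) * 2 ** q


# Rows are p = 2..7, columns are q = 4..7, as in Table 2.
table_2 = {p: [pfa_length(p, q) for q in range(4, 8)] for p in range(2, 8)}

drm_lengths = {576, 512, 352, 224, 432}  # Table 1
drm_in_table = sorted(n for row in table_2.values() for n in row if n in drm_lengths)

for p, row in table_2.items():
    print(p, row)
print(drm_in_table)  # [224, 352, 576]
```

The lengths 512 and 432 are absent from the table because 512 is a pure power of two, while 432 = 27·2^4 corresponds to p = 13, which lies outside the tabulated range of p.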
For DRM software to have a good performance in terms of the FFT computation time, memory, and power, an efficient non-power-of-two FFT implementation is required. Known reconfigurable co-processors have been developed to support non-power-of-two FFT realizations, as illustrated in FIG. 1 with the simplified arrangement 100 of a known PFA decomposed FFT. This FFT algorithm recursively re-expresses a DFT of length N=N1×N2 as smaller DFTs of size N1 120 and N2 130. The lengths of the small DFTs N1 120 and N2 130 have to be co-prime, and each small DFT can be implemented with an arbitrary algorithm. Good's mapping in equation [2] is used to convert an N=N1×N2× . . . ×NL point DFT into an L-dimensional DFT equation and optimizes the PFA for the number of calculations to be performed. However, Good's mapping in equation [2] assumes that the input data 102 is ordered in Ruritanian Correspondence (RC) order by RC function 110, and that the output data is ordered in Chinese Remainder Theorem (CRT) order by CRT function 150, or vice versa. Thus, the simplified arrangement 100 routes 112 the respective ordered data bits to a first DFT 120 of size N1 and thereafter to a second DFT 130 of size N2, before the output data 152 is reordered by CRT function 150.
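A minimal sketch of the two-factor PFA arrangement is given below, assuming the common textbook form of Good's mapping: the input index is n = (N2·n1 + N1·n2) mod N (RC order) and the output index is the unique k with k ≡ k1 (mod N1) and k ≡ k2 (mod N2) (CRT order). The function names are hypothetical and the small DFTs are evaluated directly. No twiddle multiplications occur between the two stages.

```python
import cmath
from math import gcd


def dft(x):
    """Direct DFT, standing in for the small N1-point and N2-point DFTs."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def pfa_two_factor(x, n1, n2):
    """Prime factor algorithm for N = n1 * n2 with co-prime n1 and n2."""
    n_total = n1 * n2
    assert len(x) == n_total and gcd(n1, n2) == 1
    # RC input ordering followed by the n1-point DFTs (one per n2 index j).
    stage1 = [
        dft([x[(n2 * i + n1 * j) % n_total] for i in range(n1)]) for j in range(n2)
    ]  # stage1[j][k1]
    # n2-point DFTs (one per k1), with no twiddle factors in between.
    stage2 = [dft([stage1[j][k1] for j in range(n2)]) for k1 in range(n1)]
    # CRT output ordering: u1 is 1 mod n1 and 0 mod n2; u2 is the reverse.
    u1 = n2 * pow(n2, -1, n1)
    u2 = n1 * pow(n1, -1, n2)
    out = [0j] * n_total
    for k1 in range(n1):
        for k2 in range(n2):
            out[(k1 * u1 + k2 * u2) % n_total] = stage2[k1][k2]
    return out


signal = [complex((7 * n) % 13, (3 * n) % 5) for n in range(352)]
spectrum = pfa_two_factor(signal, 11, 32)  # 352 = 11 x 32, co-prime factors
```

The three-argument `pow` with exponent −1 computes a modular inverse and requires Python 3.8 or later.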
However, for many applications, such as an application that is required to support the five DRM transmission modes, using a co-processor solely to implement an FFT function that supports non-power-of-2 FFT computations, including FFTs that do not decompose into multiple-of-2 FFTs, adds an undesirable increase in the cost of the solution.
Implementations of FFTs of length 576, 512, and 432 are available as libraries from the vendors of processors. However, the DRM FFT lengths of 352 and 224 are not available as third party libraries. Hence, a solution is needed to implement (at least for a DRM solution) the FFT 352 and FFT 224, for example optimized for a particular single instruction, multiple data (SIMD) vector processor. A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. Here, each element of the vector feeds a single unique processing element, or the processing elements are lined up in a vector form to operate on the vector data. This arrangement is in contrast to a scalar processor, whose instructions operate on single data items.
Referring now to FIG. 2, a flowchart 200 illustrates a known operation for implementing a regular 352-point FFT using a PFA decomposed FFT on a DSP. Here, using a PFA decomposed FFT on a DSP, the FFT 352 can be decomposed into smaller FFTs, namely: FFT 11×FFT 4×FFT 8, where N1=11, N2=4, N3=8, and where the implementation of FFT4 and FFT8 is readily achievable. However, the FFT designer is required to devise techniques for deriving the best use of the processor or processing elements in order to implement the FFT11 operation. The flowchart 200 commences in step 202 with input data of 352 data points. At a first stage in 204, 32 instances of 11-point FFT are computed as [3]:
$$\mathrm{FFT11}_a(k) = \mathrm{FFT11}\big(x(32n + a)\big) \qquad [3]$$
Where:
        a is the instance number = 0, 1, . . . , 31;
        n = 0, 1, . . . , 10 to generate the 11 input values;
        k = 0, 1, . . . , 10 to generate the 11 output values.
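The output of FFT11 is processed by FFT4 in a second stage in 206, and, for a given FFT11, each of the 11 outputs goes to a different FFT4 module, as illustrated in FIG. 1. Thus, the second stage in 206 contains 88 instances of the 4-point FFT, which are computed as [4]:
$$\mathrm{FFT4}_b(l) = \mathrm{FFT4}\big(\mathrm{FFT11}_a(b),\ \mathrm{FFT11}_{a+8}(b) \cdot \mathrm{tw0}(b),\ \mathrm{FFT11}_{a+16}(b) \cdot \mathrm{tw1}(b),\ \mathrm{FFT11}_{a+24}(b) \cdot \mathrm{tw2}(b)\big) \qquad [4]$$
Where:
        b is the instance number = 0, 1, . . . , 87;
        a is used in the computation of the instance number of FFT11;
        a = 0, 1, . . . , 7;
        l = 0, 1, . . . , 3 generates the 4 output values; and
        tw0(b), tw1(b), and tw2(b) are twiddle factors for the ‘b’th instance of FFT4 in 206, as illustrated in [5], [6], [7].
$$\mathrm{FFT4}_b(l) = \mathrm{FFT4}\big(\mathrm{FFT11}_0(k),\ \mathrm{FFT11}_8(k) \cdot \mathrm{tw0}(b),\ \mathrm{FFT11}_{16}(k) \cdot \mathrm{tw1}(b),\ \mathrm{FFT11}_{24}(k) \cdot \mathrm{tw2}(b)\big) \qquad [5]$$
        for k = 0, 1, . . . , 10; and for b = 0, 1, . . . , 10
$$\mathrm{FFT4}_b(l) = \mathrm{FFT4}\big(\mathrm{FFT11}_1(k),\ \mathrm{FFT11}_9(k) \cdot \mathrm{tw0}(b),\ \mathrm{FFT11}_{17}(k) \cdot \mathrm{tw1}(b),\ \mathrm{FFT11}_{25}(k) \cdot \mathrm{tw2}(b)\big) \qquad [6]$$
        for k = 0, 1, . . . , 10; and for b = 11, 12, . . . , 21
$$\mathrm{FFT4}_b(l) = \mathrm{FFT4}\big(\mathrm{FFT11}_7(k),\ \mathrm{FFT11}_{15}(k) \cdot \mathrm{tw0}(b),\ \mathrm{FFT11}_{23}(k) \cdot \mathrm{tw1}(b),\ \mathrm{FFT11}_{31}(k) \cdot \mathrm{tw2}(b)\big) \qquad [7]$$
        for k = 0, 1, . . . , 10; and for b = 77, 78, . . . , 87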
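The wiring between the first and second stages implied by [5], [6] and [7] can be sketched as a small index calculation. The helper name `fft4_inputs` is hypothetical. The source fixes which block of 11 values of b belongs to each value of a, but not how the output index k is assigned inside a block, so the rule k = b mod 11 used here is an assumption.

```python
def fft4_inputs(b):
    """Return (k, FFT11 instance numbers) feeding the b-th 4-point FFT.

    Assumes b = 11*a + k, so that b = 0..10 uses a = 0, b = 11..21 uses a = 1,
    and b = 77..87 uses a = 7, matching the ranges given with [5], [6] and [7].
    """
    a, k = divmod(b, 11)
    return k, [a, a + 8, a + 16, a + 24]


wiring = [fft4_inputs(b) for b in range(88)]

# Every (FFT11 instance, output index) pair should be consumed exactly once.
consumed = {(instance, k) for k, instances in wiring for instance in instances}
```

With this numbering, the 88 instances of FFT4 together consume all 32 × 11 = 352 first-stage outputs without overlap.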
The flowchart 200 then comprises a third stage in 208 that consists of 44 instances of 8-point FFT that are computed as in [8]:
$$\mathrm{FFT8}_c(m) = \mathrm{FFT8}\big(\mathrm{FFT4}_a(b),\ \mathrm{FFT4}_{a+12}(b) \cdot \mathrm{tw0}(c),\ \mathrm{FFT4}_{a+24}(b) \cdot \mathrm{tw1}(c),\ \mathrm{FFT4}_{a+36}(b) \cdot \mathrm{tw2}(c),\ \mathrm{FFT4}_{a+48}(b) \cdot \mathrm{tw3}(c),\ \mathrm{FFT4}_{a+60}(b) \cdot \mathrm{tw4}(c),\ \mathrm{FFT4}_{a+72}(b) \cdot \mathrm{tw5}(c),\ \mathrm{FFT4}_{a+84}(b) \cdot \mathrm{tw6}(c)\big) \qquad [8]$$
Where:
        c is the instance number = 0, 1, . . . , 43;
        a is used in the computation of the instance number of FFT4;
        b = 0, 1, . . . , 3;
        m = 0, 1, . . . , 7 to generate the 8 output values; and
        tw0(c), tw1(c), tw2(c), tw3(c), tw4(c), tw5(c) and tw6(c) are twiddle factors for the ‘c’th instance of FFT8.
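The three stages can be put together in a short reference sketch. It reproduces the instance counts of the flowchart (32 × FFT11, 88 × FFT4, 44 × FFT8) but uses a textbook mixed-radix arrangement with its own index bookkeeping and twiddle placement, which are assumptions and are not taken from equations [4] to [8]. The function names are hypothetical and the small FFTs are evaluated as direct DFTs.

```python
import cmath


def dft(x):
    """Direct DFT, standing in for the FFT11, FFT4 and FFT8 kernels."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def twiddle(n_total, exponent):
    return cmath.exp(-2j * cmath.pi * exponent / n_total)


def fft352_three_stage(x):
    assert len(x) == 352
    counts = {"fft11": 0, "fft4": 0, "fft8": 0}
    # Stage 1: instance a transforms the 11 samples x(32n + a), as in [3].
    stage1 = []
    for a in range(32):
        stage1.append(dft(x[a::32]))  # stage1[a][k1]
        counts["fft11"] += 1
    out = [0j] * 352
    for k1 in range(11):
        # Twiddles linking the 11-point stage to the remaining 32-point DFT.
        y = [stage1[a][k1] * twiddle(352, a * k1) for a in range(32)]
        # Stage 2: 4-point FFTs over the inputs a, a+8, a+16, a+24.
        stage2 = []
        for a in range(8):
            z = dft([y[a], y[a + 8], y[a + 16], y[a + 24]])
            stage2.append([z[kc] * twiddle(32, a * kc) for kc in range(4)])
            counts["fft4"] += 1
        # Stage 3: 8-point FFTs, followed by the output re-arrangement.
        for kc in range(4):
            column = dft([stage2[a][kc] for a in range(8)])
            counts["fft8"] += 1
            for k8 in range(8):
                out[k1 + 11 * (kc + 4 * k8)] = column[k8]
    return out, counts


signal = [complex((7 * n) % 13, (3 * n) % 5) for n in range(352)]
spectrum, counts = fft352_three_stage(signal)
```

A sketch of this kind is mainly useful as a bit-exactness reference when an optimized DSP kernel is being developed.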
The output of the third stage at 208, when re-arranged, generates the overall FFT output at 210. However, the inventor of the present invention has recognised and appreciated that such a known approach will not provide an optimal implementation, primarily because the fetching and processing of data will not be done in multiples of ‘Y’ data points, e.g. in multiples of four data points in the above FFT 352 scenario where the vector processor under consideration has four parallel multiplier units. Fetching and processing data in such multiples would significantly ease the complexity and improve the speed of FFT processing.
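The alignment issue can be illustrated with a simple, hypothetical utilization model: if the data points of one small FFT are fetched as consecutive vectors of Y = 4 elements, an 11-point FFT leaves part of the last vector unused, whereas an 8-point FFT fills every lane. The function below and its fetch model are assumptions for illustration only and do not describe any particular processor.

```python
from math import ceil


def lane_utilization(points, lanes):
    """Return (vector fetches needed, fraction of lane slots doing useful work)."""
    vector_ops = ceil(points / lanes)
    return vector_ops, points / (vector_ops * lanes)


ops_11, util_11 = lane_utilization(11, 4)  # 3 fetches, 11 of 12 lane slots used
ops_8, util_8 = lane_utilization(8, 4)     # 2 fetches, all lane slots used
```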
Thus, an efficient DSP implementation is desired for an embedded system, and/or a communication unit, together with methods for implementing FFTs that support non-power of 2 FFT computations and where FFTs of a particular length are not available as libraries from the processor vendor.