1. Field of Invention
The present invention relates to a digital signal processing apparatus, and more particularly to a digital signal processing apparatus for implementing Fast Fourier Transformation (FFT) and Inverse Fast Fourier Transformation (IFFT).
2. Description of Related Art
The Orthogonal Frequency Division Multiplexing (OFDM) technique is generally used in wired and wireless communication systems (for example, ADSL, VDSL, IEEE 802.11a, HIPERLAN/2, DAB, and DVB-T). The processing unit that performs Fast Fourier Transformation (FFT) and Inverse Fast Fourier Transformation (IFFT) operations is one of the important modules in the OFDM technique. Since FFT/IFFT requires a large number of complex computations (for example, a DVB-T communication system requires an 8192-point FFT operation), it is well suited to hardware implementation.
There is a great variety of FFT algorithms, such as fixed-radix FFT (FRFFT) and split-radix FFT (SRFFT) algorithms. All of these algorithms are derived to make the Discrete Fourier Transform (DFT) computation more efficient. Among traditional FFT algorithms, the split-radix FFT has the lowest computational complexity. However, its L-shaped butterfly structure renders it less suitable for implementation on digital signal processors. Unlike the irregular butterfly structure of SRFFT, FRFFT is simple to analyze and to implement in hardware owing to its structural regularity. Therefore, FRFFT is by far the more widely used, although it involves more computations from the algorithmic point of view. Digital signal processing architectures are of two types: a pipeline architecture and a single processing unit architecture. For very high-speed systems, a pipeline FFT configuration is required because of its high throughput rate. However, the pipeline architecture requires more hardware than the single processing unit architecture, especially for long-size FFT processing, and thus its manufacturing cost is higher. The single processing unit is the simplest memory-based architecture. It is an area-efficient, high-performance, and low-power architecture. However, its control mechanism is rather complicated. From the hardware design point of view, the single processing unit architecture is more reliable than the pipeline structure for designing a long-length FFT digital signal processing structure that operates within a standard-specified latency.
FIG. 1A is the signal flow graph of a 16-point FFT algorithm. FIG. 1B is the architecture of a general single processing unit. In FIG. 1A, ⊕ and ⊗ represent complex addition and complex multiplication, respectively. The input data for the FFT operation must be stored in a memory 110 in advance. The basic computation performed at every stage is called a butterfly; it involves one complex multiplication and two complex additions. In stage S1, the processing unit 120 sequentially reads data pairs [x(0), x(8)], [x(1), x(9)], [x(2), x(10)], . . . , [x(7), x(15)], one pair at a time, from the memory 110. It then performs the butterfly computation and returns the results to the same locations in the memory 110; that is, the computations are done in place. After completing the butterfly operations of stage S1, the processing unit repeats a similar procedure for stage S2, except that the index distance between the two data of a pair is cut in half. That is, the processing unit sequentially performs the butterfly operation on data pairs [x(0), x(4)], . . . , [x(3), x(7)], [x(8), x(12)], . . . , [x(11), x(15)]. Stages S3 and S4 are processed in a similar way. Thus the 16-point FFT algorithm can be carried out using the single processing unit structure. The same method may be applied to FFT operations of any data length that is a power of 2.
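The stage-by-stage pairing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flow graph of FIG. 1A is taken to be a decimation-in-frequency radix-2 graph with natural-order input (which matches the pairs [x(0), x(8)], then [x(0), x(4)], and so on), the twiddle-factor placement follows that textbook form, and the names `stage_pairs`, `fft_in_place`, and `fft_natural_order` are hypothetical. A Python list stands in for the memory 110.

```python
import cmath

def stage_pairs(n, stage):
    """Index pairs handled in `stage` (1-based) of an n-point radix-2 flow graph."""
    dist = n >> stage                      # 8, 4, 2, 1 for n = 16
    return [(i, i + dist)
            for start in range(0, n, 2 * dist)
            for i in range(start, start + dist)]

def fft_in_place(mem):
    """Overwrite `mem` with its DFT; results end up in bit-reversed order."""
    n = len(mem)
    stages = n.bit_length() - 1            # log2(n); n must be a power of 2
    for stage in range(1, stages + 1):
        dist = n >> stage
        for a, b in stage_pairs(n, stage):
            # Assumed decimation-in-frequency twiddle factor for this pair.
            twiddle = cmath.exp(-2j * cmath.pi * (a % dist) / (2 * dist))
            # One butterfly: two complex additions and one complex multiplication,
            # with both results returned to the locations they were read from.
            mem[a], mem[b] = mem[a] + mem[b], (mem[a] - mem[b]) * twiddle
    return mem

def fft_natural_order(x):
    """Run the in-place FFT on a copy and undo the bit-reversed output order."""
    n = len(x)
    bits = n.bit_length() - 1
    mem = fft_in_place(list(x))
    return [mem[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]
```

Calling `stage_pairs(16, 1)` lists the pairs (0, 8) through (7, 15) of stage S1, and `stage_pairs(16, 2)` lists the half-distance pairs of stage S2. Because each butterfly overwrites its own two operands, no second buffer is needed, which is the point of the in-place scheme.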
The First Conventional Art:
A digital signal processing architecture with two radix-2 FFT processing units and feedback paths is disclosed in Design and implementation of a scalable fast Fourier transform core (ASIC, 2002. Proceedings. 2002 IEEE Asia-Pacific Conference on, 6-8 Aug. 2002), as shown in FIG. 2A. FIG. 2B describes the conflict-free memory addressing technique for the 64-point FFT algorithm. The data arrangement and the corresponding memory addresses form a circularly symmetric pattern. This allocation mechanism suits any power-of-2 FFT algorithm, and no data conflict occurs when it is applied to the single processing unit architecture.
Take the 64-point FFT operation as an example. First, the input data must be stored into the memory banks according to the memory addressing technique shown in FIG. 2B. The processing unit 220 performs a 4-point FFT operation at a time, so three stages (log₄64 = 3) of butterfly operations are required. In Stage 1, the address generator 230 generates four addresses 00, 04, 08, 12 for memory banks RAM-A, RAM-B, RAM-C and RAM-D respectively. The kernel then reads the data group [00, 16, 32, 48] from the memory at a time. It then executes two radix-2 butterfly computations, feeds the results back to perform the second butterfly operation, and saves the results in place. Repeating the above operations, the processing unit 220 sequentially reads data groups [49, 01, 17, 33], [34, 50, 02, 18], . . . [47, 63, 15, 31] from the memory banks RAM-A to RAM-D, and rotates the data sequences to [01, 17, 33, 49], [02, 18, 34, 50], . . . [15, 31, 47, 63]. It then uses the two processing units to perform the butterfly operations of the first and second steps. Each time the butterfly operations of the two steps have been completed, the results are written back in place to the memory banks RAM-A to RAM-D using the rotator.
Next, in Stage 2, similar operations are performed. The processing unit 220 sequentially reads data groups [00, 04, 08, 12], . . . [07, 11, 15, 03], [28, 16, 20, 24], . . . [19, 23, 27, 31], [40, 44, 32, 36], . . . [47, 35, 39, 43], [52, 56, 60, 48], . . . [59, 63, 51, 55] from the memory banks RAM-A to RAM-D, and rotates the data sequences to [00, 04, 08, 12], . . . [03, 07, 11, 15], [16, 20, 24, 28], . . . [19, 23, 27, 31], [32, 36, 40, 44], . . . [35, 39, 43, 47], [48, 52, 56, 60], . . . [51, 55, 59, 63]. It then uses the two processing units to perform the butterfly operations of the third and fourth steps. Each time the butterfly operations of the two steps have been completed, the results are written back in place to the memory banks RAM-A to RAM-D using the rotator.
Finally, in Stage 3, similar operations are repeated. The processing unit 220 sequentially reads data groups [00, 01, 02, 03], [07, 04, 05, 06], . . . [62, 63, 60, 61] from the memory banks RAM-A to RAM-D, and rotates the data sequences to [00, 01, 02, 03], [04, 05, 06, 07], . . . [60, 61, 62, 63]. It then uses the two processing units to perform the butterfly operations of the fifth and sixth steps. Each time the butterfly operations of the two steps have been completed, the results are written back in place to the memory banks RAM-A to RAM-D using the rotator. At this point, the 64-point FFT operation is complete.
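The read orders quoted in the three stages above are consistent with one simple bank-assignment rule, sketched below in Python. The rule itself is an assumption inferred from those sequences, since FIG. 2B is not reproduced here: the bank is taken as the sum of the base-4 digits of the data index modulo 4, and the address within a bank as the index divided by 4 (which matches the addresses 00, 04, 08, 12 quoted for the group [00, 16, 32, 48]). The function names are hypothetical, and banks 0 to 3 stand for RAM-A to RAM-D.

```python
def bank_of(index, radix=4, digits=3):
    """Assumed rule: bank = (sum of the base-4 digits of the index) mod 4."""
    total = 0
    for _ in range(digits):
        total += index % radix
        index //= radix
    return total % radix

def address_of(index, radix=4):
    """Assumed address of a datum inside its bank."""
    return index // radix

def stage_groups(stage, radix=4, digits=3):
    """Groups of four indices combined by one radix-4 operation in `stage` (1-based)."""
    n = radix ** digits
    stride = radix ** (digits - stage)          # 16, 4, 1 for stages 1, 2, 3
    return [[base + k * stride for k in range(radix)]
            for base in range(n)
            if (base // stride) % radix == 0]

def read_order(group):
    """Order in which banks 0..3 deliver the data of one group (before rotation)."""
    by_bank = {bank_of(i): i for i in group}
    return [by_bank[b] for b in range(len(group))]
```

For example, `read_order([1, 17, 33, 49])` gives `[49, 1, 17, 33]`, the order read from the banks before the rotator restores `[1, 17, 33, 49]`. Under this rule every group in every stage touches four different banks, so the four data can be fetched in the same cycle.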
The Second Conventional Art:
In A low-power, high performance, 1024-point FFT processor (IEEE J. Solid State Circuits, vol. 34, pp. 380-387, March 1999), a cache is added between the FFT processing unit and the main memory, as shown in FIG. 3. A main memory 310 stores the input data for the FFT operation. The cache includes a 0th group of caches (cache unit 0A and cache unit 0B) and a 1st group of caches (cache unit 1A and cache unit 1B). Suppose each group has a storage capacity of 8 data, that is, 8 data can be read at a time, so that a processing unit 320 can be viewed as a radix-8 processing unit. Once the 0th group of caches has loaded 8 data from the main memory 310, 4 data for each of the cache units 0A and 0B, it uses the processing unit 320 to perform the butterfly operation. Meanwhile, the 1st group of caches begins to load the next 8 data from the main memory 310, 4 data for each of the cache units 1A and 1B. When the 1st group of caches begins to use the processing unit 320 to perform the butterfly operations, the 0th group of caches writes the previous results back in place to the corresponding addresses of the main memory 310 and also loads the next 8 data from the main memory 310.
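The alternation between the two cache groups can be summarized as a simple schedule, sketched below in Python. This is a simplified model, not the cited processor's actual control logic: each step is assumed to handle one 8-data block, block 0 is assumed to be loaded before step 0, and the function name `ping_pong_schedule` and the dictionary keys are hypothetical.

```python
def ping_pong_schedule(num_blocks):
    """List, per step, what each of the two cache groups is doing."""
    schedule = []
    for step in range(num_blocks):
        computing = step % 2            # group currently feeding the processing unit
        idle = 1 - computing            # the other group talks to main memory
        schedule.append({
            "step": step,
            "compute_group": computing,
            "compute_block": step,
            "memory_group": idle,
            # The other group writes back what it computed in the previous step...
            "write_back_block": step - 1 if step >= 1 else None,
            # ...and loads the block it will compute in the next step.
            "load_block": step + 1 if step + 1 < num_blocks else None,
        })
    return schedule

for entry in ping_pong_schedule(4):
    print(entry)
```

In this model the processing unit is busy in every step, because one group always has a block ready while the other group exchanges data with the main memory. The price is the second group of caches and the logic that switches between the two.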
The Third Conventional Art:
A digital signal processing apparatus with fewer caches is disclosed in A dynamic scaling FFT processor for DVB-T applications (IEEE J. Solid-State Circuits, vol. 39, pp. 2005-2013, November 2004), as shown in FIG. 4. In FIG. 4, a processing unit 420 performs a radix-8 butterfly operation, so a main memory 410 correspondingly has 8 memory banks, and a cache 430 is a cache matrix with 8×8 cache units. FIG. 5 is the signal flow graph of a 64-point FFT algorithm with radix-8 butterfly operations. In FIG. 5, each radix-8 butterfly operation (indicated by the circle 500 in the figure, for example) is regarded as performing a three-stage (log₂8 = 3) radix-2 butterfly operation. In such a digital signal processing apparatus, the cache 430 reads 64 data from the main memory 410 at a time, and the data are sequentially written to the cache 430 in the column direction. After the cache 430 is fully occupied, 8 data at a time are provided to the processing unit 420 along the column direction of the cache via a bus BUS for the radix-8 butterfly operation, and this procedure is repeated 8 times. The cache 430 is then updated with the results via the bus BUS; this is Stage 1 in FIG. 5. Then, in Stage 2, the cache 430 outputs the updated 64 data in 8 passes along the row direction, 8 data at a time being provided to the processing unit 420 via the bus BUS for the radix-8 butterfly operation, and the results are written back to the main memory 410 via the bus BUS and a normalized unit 440. Because the cache 430 reads 64 data from the main memory 410 at a time for the butterfly operation, the processing unit 420 and the cache 430 together can be regarded as a radix-64 butterfly operation processor.
FIG. 6A illustrates the sequence of data stored in the cache matrix with 8×8 cache units of the cache 430. Each circle in the figure indicates a cache unit, and the numeral in the circle indicates the order in which the data are input in the signal flow graph. Referring to FIGS. 5 and 6A, in Stage 1, the cache 430 outputs the data to the processing unit 420 column by column. For example, the cache 430 outputs the 1st column (i.e., the 0th to 7th data) to the processing unit 420 for the radix-8 butterfly operation indicated by the circle 500, and so forth. Each time one radix-8 butterfly operation has been completed, the processing unit 420 writes the results back in place to the cache 430 so as to update the data stored in the cache 430.
FIG. 6B illustrates the sequence of data outputted from the cache matrix of the cache 430 in Stage 2. In Stage 2, the cache 430 outputs data to the processing unit 420 row by row. For example, the cache 430 outputs the 1st row (i.e., the 0th, 8th, 16th, 24th, 32nd, 40th, 48th, 56th data) to the processing unit 420 to perform the first radix-8 butterfly operation in Stage 2. After the operation has been completed, the processing unit 420 writes the results of that row back to the main memory 410 through the bus BUS via the normalized unit 440, and then the cache 430 outputs the 2nd row (i.e., the 1st, 9th, 17th, 25th, 33rd, 41st, 49th, 57th data) to the processing unit 420 to perform radix-8 butterfly operation in Stage 2, and so forth.
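The column pass followed by the row pass can be checked numerically with the Python sketch below, which computes a 64-point DFT from sixteen 8-point DFTs on an 8×8 array. Several details are assumptions, because the source only specifies the column-then-row access order: column c of the cache is assumed to hold the samples x[c], x[8+c], . . . , x[56+c], the twiddle factors are assumed to be applied between the two passes, a direct 8-point DFT stands in for the radix-8 butterfly, and the normalized unit 440 is not modeled. The names `dft` and `fft64_via_cache` are hypothetical.

```python
import cmath

def dft(values):
    """Direct DFT, used here as a stand-in for one radix-8 butterfly operation."""
    n = len(values)
    return [sum(values[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft64_via_cache(x):
    """64-point DFT as 8 column operations followed by 8 row operations."""
    assert len(x) == 64
    # cache[row][col]; column c is assumed to hold x[c], x[8 + c], ..., x[56 + c].
    cache = [[x[8 * r + c] for c in range(8)] for r in range(8)]

    # Stage 1: one radix-8 operation per column, written back in place.
    for c in range(8):
        column = dft([cache[r][c] for r in range(8)])
        for r in range(8):
            cache[r][c] = column[r]

    # Stage 2: one radix-8 operation per row; results leave the cache.
    out = [0j] * 64
    for r in range(8):
        # Assumed twiddle factors between the two passes.
        row = [cache[r][c] * cmath.exp(-2j * cmath.pi * r * c / 64) for c in range(8)]
        result = dft(row)
        for k in range(8):
            out[r + 8 * k] = result[k]      # row r yields outputs r, r+8, ..., r+56
    return out
```

Comparing `fft64_via_cache(x)` with `dft(x)` for any 64-element input shows agreement to within rounding error, which is why the processing unit together with the 8×8 cache behaves like a single radix-64 operation.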
To process a large number of points with the FFT, the signal flow graph of the FFT is divided into many data blocks of 64 points each, and the even-numbered and odd-numbered blocks follow different reading/writing rules so as to utilize the cache efficiently. The operating procedure is classified into three configurations: the main memory writes the data to the cache; the cache and the processing unit perform the butterfly operation on the data; and the processing unit writes the results back to the main memory. When the main memory writes the data to the cache through the bus BUS, the even-numbered blocks are all sequentially written to the cache along the column direction in 8 steps (column 1, column 2, . . . column 8) with 8 data for each step, whereas the odd-numbered blocks are all sequentially written to the cache along the row direction in 8 steps (row 1, row 2, . . . row 8) with 8 data for each step. Then, the cache exchanges data with the processing unit through the bus BUS to perform the 2-stage operation (i.e., the row operation and the column operation). For the data in the even-numbered blocks, the butterfly operation in the column direction (column 1, column 2, . . . column 8) is sequentially performed first, and then the butterfly operation in the row direction (row 1, row 2, . . . row 8) is performed; for the data in the odd-numbered blocks, the butterfly operation in the row direction (row 1, row 2, . . . row 8) is performed first, and then the butterfly operation in the column direction (column 1, column 2, . . . column 8) is performed. Similarly, after the butterfly operation of the second stage has been completed, the processing unit sequentially writes the results of the even-numbered blocks in the row direction back to the main memory via the bus BUS.
Each time the results of one row have been written back to the main memory, the main memory simultaneously writes 8 data of the next odd-numbered block to that row of the cache, until the cache is fully occupied by the data of the new odd-numbered block. For the odd-numbered blocks, the processing unit sequentially writes the results in the column direction back to the main memory via the bus BUS. Each time the results of one column have been written back to the main memory, the main memory simultaneously writes 8 data of the next even-numbered block to that column of the cache, until the cache is fully occupied by the data of the new even-numbered block. In actual operation, the column and row processing directions of the even-numbered and odd-numbered blocks may be the opposite of the above.
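The reason a single 8×8 cache suffices can be illustrated with the small Python simulation below, which tracks only which block owns each cache unit. It is a sketch under simplifying assumptions: block 0 is assumed to be fully loaded before processing starts, the write-back of a line and its refill are treated as one step, and the name `simulate_blocks` is hypothetical.

```python
def simulate_blocks(num_blocks, size=8):
    """Track which block owns every cache unit while blocks stream through."""
    cache = [[0] * size for _ in range(size)]   # block 0 preloaded, column by column
    passes = []                                 # (block, direction) in execution order

    def cells(direction, i):
        if direction == "column":
            return [(r, i) for r in range(size)]
        return [(i, c) for c in range(size)]

    for block in range(num_blocks):
        # Even-numbered blocks: columns then rows. Odd-numbered blocks: rows then columns.
        first = "column" if block % 2 == 0 else "row"
        second = "row" if first == "column" else "column"
        for direction in (first, second):
            passes.append((block, direction))
            for i in range(size):
                line = cells(direction, i)
                # Every operand of this radix-8 operation must belong to `block`.
                assert all(cache[r][c] == block for r, c in line)
                if direction == second:
                    # Results go to main memory; the freed line is refilled with
                    # 8 data of the next block (None once the input is exhausted).
                    nxt = block + 1 if block + 1 < num_blocks else None
                    for r, c in line:
                        cache[r][c] = nxt
    return passes, cache
```

Running `simulate_blocks(3)` completes without tripping the assertion: the direction in which a block is written out is the direction in which the next block is written in, and that is also the direction of the next block's first pass. No line is ever processed while it holds data from two different blocks.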
In summary, the conventional arts have at least the following disadvantages:
In the above first conventional art, no caches are provided in the architecture, and the processing unit must frequently access data in the main memory. Frequent access to the main memory reduces working efficiency and increases power consumption.
In the above second conventional art, adding caches reduces how often data are transferred between the processing unit and the main memory, and thereby reduces power consumption. With two groups of caches, the parallelism of data processing is further enhanced. However, more circuit area is occupied and manufacturing costs increase. In addition, a complicated control mechanism is required to switch between the two groups of caches in a ping-pong manner.
In the above third conventional art, the main memory writes blocks of data to the cache in the column or row direction; the processing unit carries out an alternating 2-stage operation on the data in the cache; and then the results of the blocks are written back to the main memory along the row or column direction. As the butterfly operation of the block data is repeatedly used, only one group of caches is required, such that the power consumption of the third conventional art is less than that of the second conventional art. However, as the processing unit 420, the cache 430, and the normalized unit 440 are all coupled to the same bus BUS, the routing complexity in physical manufacturing is significantly increased, and accordingly, manufacturing costs also increase. Furthermore, since almost all the members are coupled to the same bus BUS, a more complicated control mechanism is required to control the bus BUS.