FFT (Fast Fourier Transform) processing is carried out in a base station apparatus of a portable telephone system or a broadcast device for digital broadcasting. The high-throughput and efficient execution of FFT is sought in such devices.
A method of using a radix-2 or radix-4 butterfly arithmetic unit to carry out butterfly computation is known as one method for executing high-throughput FFT.
FIG. 1 shows the processing configuration of a 16-point FFT that uses a butterfly arithmetic unit. As shown in FIG. 1, 16-point FFT can be executed by a radix-4 butterfly computation in two stages. FIG. 2 shows the processing configuration of a 32-point FFT that uses a butterfly arithmetic unit. As shown in FIG. 2, a 32-point FFT can be executed by two stages in a radix-4 butterfly computation and one stage of a radix-2 butterfly computation. In FIGS. 1 and 2, the intersection of two lines represents a radix-2 butterfly computation and the intersection of four lines at a blank circle represents a radix-4 butterfly computation.
In FIGS. 1 and 2, the butterfly computations of each stage are hereinbelow considered to be processed in order from butterfly computations at the top of the figure.
To maximize FFT throughput, the butterfly arithmetic unit is preferably used efficiently by supplying data to it in every cycle with as few interruptions as possible. It is effective to treat a plurality of data as row data and to supply data to the butterfly arithmetic unit while reading and writing row data, which are input/output data or intermediate data, from and to a memory that can read and write one row of data in one cycle. For example, when carrying out four-parallel FFT processing, it is effective to treat four units of data as row data and to use a memory that stores the four units of data d(4k), d(4k+1), d(4k+2), and d(4k+3) at address k.
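The row-data layout described above can be sketched as follows. This is only an illustration: the helper name pack_rows and the constant ROW_WIDTH are hypothetical, and a Python list stands in for the row data memory.

```python
# Hypothetical illustration of a row data memory for four-parallel processing.
ROW_WIDTH = 4  # four units of data per row, as in the example in the text

def pack_rows(data, width=ROW_WIDTH):
    """Group a flat sequence so that address k holds d(width*k) .. d(width*k + width - 1)."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the row width")
    return [tuple(data[k * width:(k + 1) * width]) for k in range(len(data) // width)]

memory = pack_rows(list(range(16)))  # indices 0..15 stand in for d(0)..d(15)
print(memory[2])  # row at address k = 2 holds d(8), d(9), d(10), d(11)
```

Reading one address then yields one complete row per cycle, which is what allows the butterfly arithmetic unit to be fed without interruption as long as the required indices are contiguous.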
However, it is in the nature of FFT that the inputs of one butterfly computation of a succeeding stage are collected from the outputs of a plurality of butterfly computations of the preceding stage, and that the outputs of one butterfly computation of a preceding stage are distributed to the inputs of a plurality of butterfly computations of the succeeding stage. Accordingly, in FFT, butterfly computations must be carried out with data of discontinuous indices as the input and output. As a result, it is often impossible to achieve sufficient performance by means of a row data memory alone.
For example, in the 16-point FFT shown in FIG. 1, the output of the first butterfly computation of the first stage, i.e., the uppermost blank circle of the first stage in FIG. 1, becomes the input of all four butterfly computations of the second stage. In addition, the input of the first butterfly computation of the second stage is composed of the output of all four butterfly computations of the first stage.
In order to carry out such butterfly computations efficiently, the order of data among a plurality of row data must be efficiently rearranged or permuted. One method of rearranging data among a plurality of row data is to implement a transposition process upon memory input/output.
JP-A-2008-537655 discloses a technique of using a transposition memory to rearrange data. In JP-A-2008-537655, a transposition memory enables collection of data among different row data in one row data and distribution of data among a single row data to different row data.
As a more specific example, the transposition of a four-cycle portion of row data can be carried out as shown below.
First, a four-cycle portion of row data shown in Formula (1) is stored.
(4h,4h+1,4h+2,4h+3), (4i,4i+1,4i+2,4i+3), (4j,4j+1,4j+2,4j+3), (4k,4k+1,4k+2,4k+3)  (1)
Next, the transposition of the row data of Formula (1) converts the data to the row data shown in Formula (2).
(4h,4i,4j,4k), (4h+1,4i+1,4j+1,4k+1), (4h+2,4i+2,4j+2,4k+2), (4h+3,4i+3,4j+3,4k+3)  (2)
A case is considered in which these data are used in the 16-point FFT shown in FIG. 1. Input data in memory are typically arranged in numerical order as: x0, x1, x2, . . . . Row data shown in Formula (3) that have been read from the memory in which input data have been stored in this way are transposed to the data shown in Formula (4). The row data of Formula (4) become the input of the first-stage butterfly computation.
(x0,x1,x2,x3), (x4,x5,x6,x7), (x8,x9,x10,x11), (x12,x13,x14,x15)  (3)
(x0,x4,x8,x12), (x1,x5,x9,x13), (x2,x6,x10,x14), (x3,x7,x11,x15)  (4)
The input of the second-stage butterfly computation is obtained by carrying out the same transposition for the output of the first-stage butterfly computation.
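The transposition from Formula (3) to Formula (4) can be sketched as follows. The function name transpose_rows is hypothetical, and the strings "x0" to "x15" merely stand in for the data values.

```python
def transpose_rows(rows):
    """Transpose a square block of row data: element c of row r becomes element r of row c."""
    return [tuple(column) for column in zip(*rows)]

# Row data as read from memory in numerical order, as in Formula (3).
rows = [tuple(f"x{4 * r + c}" for c in range(4)) for r in range(4)]

# Row data after transposition, as in Formula (4).
stage1_input = transpose_rows(rows)
for row in stage1_input:
    print(row)
```

Applying the same function to a four-row block of first-stage outputs gives the corresponding regrouping for the second stage. Applying it twice to the same block restores the original order.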
JP-A-2003-150576 discloses a technique for efficient execution of rearrangement among row data by improving the method of mapping data to intermediate buffers. This technique also carries out transposition in small data units such as 2×2.
However, there are cases in existing data rearrangement methods in which rearrangement cannot be carried out efficiently when a plurality of FFTs of different numbers of points are mixed. More specifically, there are cases in which an interval must be opened between the data rearrangement of a particular set of row data and the data rearrangement of the next set of row data to avoid collision.
For example, when the second-stage process of the 32-point FFT shown in FIG. 2 is carried out by a radix-4 butterfly computation, the time taken for data rearrangement is six cycles. When the second-stage process of the 16-point FFT shown in FIG. 1 is carried out by a radix-4 butterfly computation, the time taken for data rearrangement is three cycles. The actual operations of data rearrangement in both cases will be described later.
When a 16-point FFT is carried out following a 32-point FFT, for which the time taken for data rearrangement differs, an interval of at least three cycles must be opened when switching the input of data to the data rearranging circuit in order to avoid data collision. Thus, when a plurality of FFTs having different numbers of points are mixed, throughput falls due to interruptions of data.
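The three-cycle interval can be checked with simple arithmetic. The sketch below assumes a simplified model that is not stated in the text: each row leaves the rearranging circuit a fixed number of cycles after it enters, and rows of the next FFT must not be output before the last row of the previous FFT. The helper name required_gap is hypothetical.

```python
def required_gap(prev_latency, next_latency):
    """Idle input cycles needed so that outputs of the next block do not
    overlap outputs of the previous block (simplified fixed-latency model)."""
    return max(0, prev_latency - next_latency)

LATENCY_32 = 6  # cycles taken for the 32-point rearrangement, from the text
LATENCY_16 = 3  # cycles taken for the 16-point rearrangement, from the text

gap = required_gap(LATENCY_32, LATENCY_16)

# With four units of data per row, 32 points occupy 8 input cycles and 16 points occupy 4.
busy_cycles = 32 // 4 + 16 // 4
utilization = busy_cycles / (busy_cycles + gap)

print(gap)          # 3
print(utilization)  # 0.8
```

Under this model, one 32-point FFT followed by one 16-point FFT keeps the input busy for 12 of 15 cycles.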
JP-A-10-283341 discloses the configuration and operation of an existing data rearranging circuit. In the technique disclosed in JP-A-10-283341, the data rearranging circuit uses delay circuits and a switch circuit (i.e., shuffle circuit) to rearrange data. FIG. 3 is a schematic view of the data rearranging circuit disclosed in JP-A-10-283341. In addition, JP-A-06-342449 and JP-A-2002-504250 also disclose the configurations and operations of data rearranging circuits that similarly employ delay circuits and shuffle circuits.
Referring to FIG. 3, a portion of the input data is directly applied as input to a shuffle circuit, and the remaining input data are applied as input to the shuffle circuit by way of first-stage delay circuits. A portion of the output of the shuffle circuit directly becomes output data, and the remaining output data becomes output data by way of second-stage delay circuits.
The data rearranging circuit of JP-A-10-283341 carries out rearrangement of data by 2-parallel rearrangement or 4-parallel rearrangement, and can carry out rearrangement according to the number of points that are processed by switching the arithmetic mode. In JP-A-10-283341, the amount of delay of each delay circuit is fixed within the same arithmetic mode. Although no mention is made regarding switching of the number of points of FFT, switching of the arithmetic mode must be carried out such that collisions of the output data of the data rearranging circuit are avoided. As a result, when the number of points of FFT is switched, the input of data to the data rearranging circuit following the switching must wait, and an increase in throughput is therefore not possible.
FIG. 4 is a timing chart showing the state of data rearrangement for a typical FFT process that uses a circuit for rearranging 4-parallel data. The circuit shown in FIG. 3 is one example of a circuit for rearranging data for a typical FFT process. In the example shown in FIG. 4, rearrangement is first carried out for obtaining data that are the input to the second-stage butterfly computation of 32-point FFT. Rearrangement is then carried out for obtaining data that are the input to the second-stage butterfly computation of 16-point FFT.
FIG. 4 shows the input to the first-stage delay circuits, input to a shuffle circuit, input to the second-stage delay circuits, and output of the second-stage delay circuits, in the circuit of FIG. 3. The input/output ports of each unit are the four ports #0 to #3. In addition, the input data for 32-point FFT are represented by the data names A0 to A31, and the input data for 16-point FFT are represented by the data names B0 to B15.
The input to the data rearranging circuit is the row data shown in Formula (5) and Formula (6) that have undergone transposition every four cycles.
(A0,A1,A2,A3), (A8,A9,A10,A11), (A16,A17,A18,A19), (A24,A25,A26,A27), (A4,A5,A6,A7), (A12,A13,A14,A15), (A20,A21,A22,A23), (A28,A29,A30,A31)  (5)
(B0,B1,B2,B3), (B4,B5,B6,B7), (B8,B9,B10,B11), (B12,B13,B14,B15)  (6)
In the data rearrangement for 32-point FFT shown in the first half of FIG. 4, delays of 0, 2, 4, and 6 cycles are conferred to the inputs of ports #0, #1, #2, and #3, respectively, in the first-stage delay circuits. The output of the first-stage delay circuits is then applied as input to the shuffle circuit. In the shuffle circuit, data are switched or permuted between ports in the same cycle and the output thereof is applied to the second-stage delay circuits. In the second-stage delay circuits, delays of 6, 4, 2, and 0 cycles are conferred to the inputs of ports #0, #1, #2, and #3, respectively.
In the data rearrangement for 16-point FFT shown in the second half of FIG. 4, delays of 0, 1, 2, and 3 cycles are conferred to the inputs of ports #0, #1, #2, and #3, respectively, in the first-stage delay circuits. In the shuffle circuit, data are switched or permuted among the ports in the same cycle. In the second-stage delay circuits, delays of 3, 2, 1, and 0 cycles are conferred to the inputs of ports #0, #1, #2, and #3, respectively.
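The 16-point case can be simulated cycle by cycle as a rough sketch. The delay amounts are taken from the text, but the shuffle pattern is not given there. The rotation used below, in which input port c is routed to output port (s - c) mod 4 at shuffle cycle s, is an assumption chosen so that the circuit as a whole transposes each block of four rows. The function name rearrange_16_point is hypothetical, and cycle numbering starts from 0 at the first input row rather than following FIG. 4.

```python
PORTS = 4

def rearrange_16_point(rows):
    """Simulate first-stage delays (0,1,2,3), a per-cycle shuffle, and
    second-stage delays (3,2,1,0). Returns the output row of every cycle."""
    first_delay = [0, 1, 2, 3]
    second_delay = [3, 2, 1, 0]
    total_cycles = len(rows) + max(first_delay) + max(second_delay)
    shuffle_in = [[None] * PORTS for _ in range(total_cycles)]
    out = [[None] * PORTS for _ in range(total_cycles)]

    # First-stage delay circuits: data on port p is delayed by first_delay[p] cycles.
    for t, row in enumerate(rows):
        for p, value in enumerate(row):
            shuffle_in[t + first_delay[p]][p] = value

    # Shuffle circuit (assumed rotation) followed by the second-stage delay circuits.
    for s in range(total_cycles):
        for c in range(PORTS):
            value = shuffle_in[s][c]
            if value is None:
                continue
            q = (s - c) % PORTS  # assumed port permutation within cycle s
            out[s + second_delay[q]][q] = value
    return out

rows = [[f"B{4 * t + p}" for p in range(PORTS)] for t in range(4)]  # Formula (6)
out = rearrange_16_point(rows)
for cycle, row in enumerate(out):
    print(cycle, row)
```

In this sketch the first complete output row, (B0, B4, B8, B12), appears three cycles after the first input row, which is consistent with the three-cycle rearrangement time stated for the 16-point FFT.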
By means of this rearrangement, data rearrangement is realized for the input of the second stage of 32-point FFT and for the input of the second stage of 16-point FFT. For example, (A0, A2, A4, A6) supplied as output in cycle 6 becomes the input of the uppermost butterfly computation of the second stage shown in FIG. 2.
Nevertheless, the delays differ between data rearrangement for 32-point FFT and data rearrangement for 16-point FFT, as described hereinabove. As a result, data for 16-point FFT cannot be applied as input to the data rearranging circuit immediately after the data for 32-point FFT. To avoid data collisions, data cannot be applied as input for an interval of three cycles, as shown in cycles 8 to 10 of the first-stage delay input in FIG. 4, and the throughput of the FFT process therefore drops. This drop in throughput becomes more significant as switching between FFTs of different numbers of points occurs more frequently.
JP-A-2005-235045 discloses a technique of using a ring buffer to carry out data rearrangement. However, JP-A-2005-235045 discloses a method in which rearrangement and butterfly computations are realized by software and makes no disclosure regarding a method of efficient rearrangement by hardware. In JP-A-2005-235045, input data of one series are stored in order in a ring buffer, and the data are rearranged by being supplied as output under the control of software. Although this method allows the time order of data to be switched, it is not practical for parallel implementation in hardware due to the large amount of hardware required. In JP-A-2005-235045, moreover, a degree of freedom is afforded to the order of execution of rearrangement and FFT through the use of both a ring buffer of the same size as the number of points of FFT and two data buffers for the data that are the object of computation. However, realizing the resulting total of three buffers in hardware is impractical due to the increase in the amount of hardware.