(1) Field of the Invention
The present invention relates generally to systems and methods for computing a discrete Fourier transform and, more particularly, to systems and methods for designing and providing systolic array architectures operable for computing the discrete Fourier transform or fast Fourier transform.
(2) Description of the Prior Art
Computational schemes for computing the discrete Fourier transform (DFT) have been intensively studied because of their central importance to many digital signal processing applications. A general purpose computer can be programmed to provide this computation but, due to the architectural limitations thereof, does not operate quickly enough for many applications. Specialized processing circuits are therefore frequently used for this purpose. Such specialized processing circuits of the prior art are primarily based on architectural computing structures for computing the DFT using the fast Fourier transform (FFT) method. These “pipelined” architectures, in which data flows through a chain of functional arithmetic/memory units, contain many different types of memory and logic components, require computing interconnections that usually span the entire circuit, and require formatting of input and output data. Thus, they are irregular and difficult to translate into VLSI hardware. In addition, they are only able to compute DFTs with limited choices in the number of transform points.
For the most part, the prior art for determining a DFT is not able to utilize or fully benefit from systolic processing structures that permit simultaneous operation of multiple processors to find the DFT by operating on smaller parts of the problem in parallel. The systolic model, introduced in 1978 by Kung and Leiserson, has proved itself to be a powerful tool for construction of integrated, special purpose processors. However, specialized prior art systolic arrays for computing the DFT, either directly or via FFT techniques, require excessive arithmetic hardware, especially for large transform sizes, and are slow both in terms of computational latency (time to do a single DFT computation) and computational throughput (time between successive DFT computations).
A systolic architecture as used herein refers to an array of systolic cells, such as simple processors, which can be implemented as integrated circuits to provide efficient, parallel use of a large number of processors to thereby perform multiple simultaneous computations. The data passes from processing element to processing element. Once data has been read from external memory into the array, several computations can be performed on the same datum. The mode of computation is architecturally different from that which performs a memory access each time a data item is used by one or relatively few processors—the latter being a characteristic of the von Neumann architecture. Thus, the systolic array has a major advantage over traditional architectures which are limited by the ‘von Neumann bottleneck.’
The processing elements in systolic arrays operate synchronously whereby each cell receives data from its neighboring cells, performs a simple calculation, and then transmits the results (only to its neighbors) one cycle later. Only those cells that are at the edge of the network communicate with the external world, e.g., with I/O circuitry. Thus, the processing elements connect only to the closest neighboring processing elements, which may typically be located north-south, east-west, northeast-southwest, and/or northwest-southeast. In a systolic array, local interconnections from each processing element are each of a length approximately equal to the size of each processing element. In other words, the interconnections are short compared to the dimensions of the integrated circuit. The simplified regular local interconnections of systolic arrays are relatively easier to implement in silicon wafers than the complex interconnections found in many circuits. Presently available field programmable gate array (FPGA) based hardware provides enormous flexibility in making design choices.
Prior art non-systolic arrays for computing DFTs or FFTs have utilized an irregular design structure that is not scalable. Because the structure of the prior art circuits is irregular, it is not practical to utilize the same prior designs, only with increased size, to produce a new circuit that would compute a DFT with a different transform length, i.e., the design is not scalable. Therefore, design costs for a specialized application of a non-systolic array remain high because an entirely new design is generally required any time the transform length is changed. Moreover, prior art non-systolic array designs for computing DFTs are severely limited in application because the transform length which they can calculate has to satisfy the restriction N=r^m where N is the transform length, r is the radix, and m is an integer. This exponential restriction significantly limits the practical transform lengths that can be computed, especially for large transform sizes. For example, if r=2, transform lengths between N=4096 (m=12) and N=8192 (m=13) would not be directly computable. Many prior art non-systolic designs use r=4 because this permits significant arithmetic simplification, but practical transform lengths are even more severely limited.
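By way of illustration only, the restriction N=r^m can be checked with a short sketch. The function name admissible_lengths and its parameters are hypothetical and are not part of any prior art design; the sketch merely enumerates the lengths that satisfy the restriction within a given range.

```python
def admissible_lengths(radix, low, high):
    """Return every N = radix**m (m a positive integer) with low <= N <= high.

    Hypothetical helper used only to illustrate the N = r^m restriction.
    """
    lengths = []
    n = radix
    while n <= high:
        if n >= low:
            lengths.append(n)
        n *= radix
    return lengths


# Radix 2: nothing strictly between 4096 and 8192 is admissible.
print(admissible_lengths(2, 4097, 8191))
# Radix 4: the admissible lengths are sparser still.
print(admissible_lengths(4, 4096, 8192))
```

With r=2 the range strictly between 4096 and 8192 is empty, and with r=4 the next admissible length after 4096 is 16384.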
Systolic array designs have been more costly to implement in the prior art for computing DFTs at least in part because they require at least the same number of fixed point complex multipliers as the transform length to be computed. Complex fixed point multiplier processing elements are more costly than complex fixed point adders because they consume considerably more chip area and power. Fixed point means that each transform point is represented by a specific number of binary digits. Thus a word of 16 bits can represent numbers up to 2^16. Floating point numbers allow a lot more flexibility because they use a set of exponent bits to provide a much wider range of values that a specific number of bits can represent. But working with two floating point numbers adds considerable complexity, because the exponents have to be compared and “aligned”, which involves a shift of one of the words. Consequently, floating point processing elements have considerable additional complexity that makes them less desirable in multiple processor applications. The efficiency of such designs in terms of the area of the circuit (which is mainly determined by the number of multipliers) and the time to do successive DFTs is relatively low. Moreover, the processing elements in prior art systolic array designs often have to be capable of performing both adding and multiplying functions, thereby further increasing the cost and circuit area of implementation.
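The exponent comparison and alignment step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular hardware: values are modeled as hypothetical (mantissa, exponent) pairs of non-negative integers with value mantissa × 2^exponent, and rounding, normalization, and sign handling are omitted.

```python
def fixed_add(a, b):
    """Fixed point addition: the two words are added directly."""
    return a + b


def float_add(a, b):
    """Toy floating point addition on (mantissa, exponent) pairs.

    Assumes non-negative integer mantissas; value = mantissa * 2**exponent.
    """
    (ma, ea), (mb, eb) = a, b
    # Compare exponents so that (ma, ea) holds the larger exponent.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    # Align: shift the other mantissa right by the exponent difference.
    # Low-order bits shifted out are lost in this simplified model.
    mb >>= ea - eb
    return (ma + mb, ea)


print(fixed_add(48, 16))
print(float_add((3, 4), (8, 1)))
```

The fixed point adder needs only the addition itself, whereas the floating point version needs a comparator and a shifter ahead of the adder.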
The following patents show attempts to solve the above or related problems:
U.S. Pat. No. 4,777,614, issued Oct. 11, 1988, to J. S. Ward, discloses a digital data processor for matrix-vector multiplication, and comprises a systolic array of bit level, synchronously clock activated processing cells each connected to its row and column neighbors. On each clock cycle, each cell multiplies an input bit of a respective vector coefficient by a respective matrix coefficient equal to +1, −1 or 0, and adds it to cumulative sum and carry input bits. Input vector coefficient bits pass along respective array rows through one cell per clock cycle. Contributions to matrix-vector product bits are accumulated in array columns. Input to and output from the array is bit-serial, word parallel, least significant bit leading, and temporally skewed. Transforms such as the discrete Fourier transform may be implemented by a two-channel device, in which each channel contains two processors of the invention with an intervening bit serial multiplier. Processors of the invention may be replicated to implement multiplication by larger matrices. This type of systolic array requires N complex multipliers and N complex adders, takes at least N time-steps to compute successive DFTs, and has a latency of 2N−1 to do a single DFT. These numbers are far higher than for the systolic method described hereinafter.
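The 2N−1 latency figure can be illustrated with a word-level simulation of a generic linear systolic array that computes the DFT directly as a matrix-vector product. This is a sketch only and is not the bit-level design of the Ward patent; the function name systolic_dft and the assumption of one complex multiply-accumulate cell per output point are introduced here for illustration.

```python
import cmath


def systolic_dft(x):
    """Simulate a linear array of N multiply-accumulate cells (assumed layout).

    Samples enter at cell 0 and move one cell to the right per time step.
    Cell k accumulates output point k. Returns (outputs, time_steps).
    """
    n = len(x)
    acc = [0j] * n        # one accumulator per cell
    pipe = [None] * n     # sample currently held at each cell
    steps = 0
    for t in range(2 * n - 1):
        # Shift right by one cell; feed the next sample, or a bubble.
        pipe = [x[t] if t < n else None] + pipe[:-1]
        for k, sample in enumerate(pipe):
            if sample is not None:
                idx = t - k  # index of the sample now at cell k
                acc[k] += sample * cmath.exp(-2j * cmath.pi * k * idx / n)
        steps += 1
    return acc, steps


outputs, steps = systolic_dft([1, 2, 3, 4])
print(steps)  # 2N - 1 time steps for N = 4
```

The last cell does not receive the last sample until time step 2N−1, which is why the single-transform latency of such an array grows as 2N−1 while it still needs N multipliers.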
U.S. Pat. No. 6,098,088, issued Aug. 1, 2000, to He et al., discloses a real-time pipeline processor, particularly suited for VLSI implementation, that is based on a hardware oriented radix-2^2 algorithm derived by integrating a twiddle factor decomposition technique in a divide and conquer approach. The radix-2^2 algorithm has the same multiplicative complexity as a radix-4 algorithm, but retains the butterfly structure of a radix-2 algorithm. A single-path delay-feedback architecture is used in order to exploit the spatial regularity in the signal flow graph of the algorithm. For a length-N DFT transform, the hardware requirements of the processor proposed by that patent are minimal on both dominant components: log_4(N)−1 complex multipliers, and N−1 complex data memory.
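For a sense of scale, the hardware counts quoted above can be tabulated for a given transform length. The following sketch simply evaluates the stated expressions, log_4(N)−1 multipliers and N−1 memory words for the pipeline processor and N multipliers for the prior art systolic array; the function name hardware_counts and the dictionary keys are hypothetical, and N is assumed to be a power of four.

```python
def hardware_counts(n):
    """Evaluate the quoted hardware expressions for N = 4**m (assumed)."""
    m, size = 0, 1
    while size < n:
        size *= 4
        m += 1
    if size != n or m < 1:
        raise ValueError("this sketch assumes N is a power of four")
    return {
        "pipeline_multipliers": m - 1,      # log_4(N) - 1
        "pipeline_memory_words": n - 1,     # N - 1
        "systolic_multipliers": n,          # N, per the prior art systolic array
    }


print(hardware_counts(1024))
```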
U.S. Pat. No. 6,023,742, issued Feb. 8, 2000, to Ebeling et al. discloses a configurable computing architecture with its functionality controlled by a combination of static and dynamic control, wherein the configuration is referred to as static control and instructions are referred to as dynamic control. A reconfigurable data path has a plurality of elements including functional units, registers, and memories whose interconnection and functionality is determined by a combination of static and dynamic control. These elements are connected together, using the static configuration, into a pipelined data path that performs a computation of interest. The dynamic control signals are suitably used to change the operation of a functional unit and the routing of signals between functional units. The static control signals are provided each by a static memory cell that is written by a host. The controller generates control instructions that are interpreted by a control path that computes the dynamic control signals. The control path is configured statically for a given application to perform the appropriate interpretation of the instructions generated by the controller. By using a combination of static and dynamic control information, the amount of dynamic control used to achieve flexible operation is significantly reduced.
U.S. Pat. No. 5,098,801, issued Mar. 3, 1992, to White et al., discloses a modular, arrayable, FFT processor for performing a preselected N-point FFT algorithm. The processor uses an input memory to receive and store data from a plurality of signal-input lines, and to store intermediate butterfly results. At least one Direct Fourier Transformation (DFT) element selectively performs R-point direct Fourier transformations on the stored data according to the FFT algorithm. Arithmetic logic elements connected in series with the DFT stage perform required phase adjustment multiplications and accumulate complex data and multiplication products for transformation summations. Accumulated products and summations are transferred to the input memory for storage as intermediate butterfly results, or to an output memory for transfer to a plurality of output lines. At least one adjusted twiddle-factor storage element provides phase adjusting twiddle-factor coefficients for implementation of the FFT algorithm. The coefficients are preselected according to a desired size for the Fourier transformation and a relative array position of the arrayable FFT processor in an array of processors. The adjusted twiddle-factor coefficients are those required to compute all mixed power-of-two, power-of-three, power-of-four, and power-of-six FFTs up to a predetermined maximum-size FFT point value for the array which is equal to or greater than N.
The non-systolic prior art designs discussed above for computing DFTs are irregular, non-scalable, difficult to design, and costly to implement. Prior art systolic designs are more costly and less efficient. Prior art systolic designs require at least as many complex multipliers as the transform length to be computed. It would be highly desirable to reduce the number of multipliers by at least a factor of four as compared to prior art systolic designs. Prior art systolic designs have not been able to take advantage of radix-4 structures or other bases. The prior art systolic designs require more complicated processing elements that must both multiply and add. It would be desirable to be able to provide a hardware implementation of a systolic array whereby the processing elements are only required to multiply or add, but not both, and the required number of multipliers is significantly reduced. The base-4 systolic design disclosed herein exploits the computational efficiency of a radix-4 butterfly, yet transform lengths need only be a multiple of sixteen, compared to traditional radix-4 designs which must satisfy N=4^m, where N is the transform length and m is an integer. Consequently, those skilled in the art will appreciate the present invention which provides solutions to the above and other problems.
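The practical difference between the two length restrictions can be illustrated by counting the admissible transform lengths up to a given maximum. The sketch below is illustrative only; the function name count_lengths is hypothetical, and powers of four are counted from m=1.

```python
def count_lengths(n_max):
    """Count lengths N <= n_max that are (a) multiples of 16, (b) 4**m, m >= 1."""
    multiples_of_16 = n_max // 16
    powers_of_4 = 0
    n = 4
    while n <= n_max:
        powers_of_4 += 1
        n *= 4
    return multiples_of_16, powers_of_4


print(count_lengths(4096))
```

Up to N=4096, there are 256 lengths that are multiples of sixteen but only six that satisfy N=4^m.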