The present invention relates to Digital Signal Processing (DSP) in general, and more particularly to a DSP architecture for performing Fast Fourier Transform (FFT) butterfly operations in two cycles.
A digital signal processor (DSP) is a special-purpose computer that is designed to optimize digital signal processing tasks such as Fast Fourier Transform (FFT) calculations, digital filters, image processing, and speech recognition. DSP applications are typically characterized by real-time operation, high interrupt rates, and intensive numeric computations. In addition, DSP applications tend to be intensive in memory access operations and require the input and output of large quantities of data.
Two advancements in DSP architecture, namely “in-place” memory management and the dual multiplier accumulator (dual-MAC), have led to increases in digital signal processing efficiency and speed. In order to reduce the amount of memory required for FFT calculations, an “in-place” memory management scheme may be employed whereby the FFT input data array is overwritten with the results of FFT calculations, thus eliminating the need for an additional memory array for storing the results at each stage of the FFT. The introduction of the dual-MAC, which is able to perform multiplication and addition operations simultaneously, has also greatly enhanced DSP performance in many applications.
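By way of illustration only, the “in-place” scheme described above may be sketched in software as follows. The sketch assumes a conventional radix-2 decimation-in-time FFT over a power-of-two-length list; the function name fft_in_place and the use of floating-point complex values are assumptions made for the example and are not taken from the present disclosure.

```python
import cmath


def fft_in_place(x):
    """Radix-2 decimation-in-time FFT that overwrites the list x with its transform."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")

    # Bit-reversal permutation, done by swapping elements inside x itself.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]

    # Butterfly stages: each pair of inputs is replaced by its two outputs,
    # so no second array is needed to hold the results of a stage.
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                a = x[start + k]
                t = w * x[start + k + half]
                x[start + k] = a + t
                x[start + k + half] = a - t
                w *= step
        size *= 2
    return x
```

Each butterfly reads two array locations and writes its two results back to the same two locations, which is the property that removes the need for a separate output array.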
In order to implement “in-place” memory management in a dual-MAC DSP architecture, designers have typically either increased the number of cycles required to complete a series of operations or have introduced complex hardware solutions such as dual-port random access memory (RAM). Unfortunately, increasing the number of cycles reduces DSP performance, while the introduction of dual-port RAM greatly increases the DSP memory size, negating the memory efficiencies of “in-place” FFT. For these reasons dual-MAC DSP architectures generally perform FFT calculations without “in-place” memory management.
Other difficulties surrounding FFT implementation in a dual-MAC DSP architecture relate to DSP internal resources. While performing an FFT butterfly operation, intermediate results are generally stored in internal registers. Unfortunately, in order to allow subsequent operations to be performed, some intermediate results must be written to memory. Should the memory be in a read cycle, the intermediate results must wait until the next cycle to be written to memory, thus degrading performance unless special hardware such as dual-port RAM is used.
The present invention seeks to provide a DSP architecture that overcomes disadvantages of the prior art. A dual-MAC DSP architecture is provided that is capable of performing Fast Fourier Transform (FFT) butterfly operations in two cycles and without the need for specialized memory.
There is thus provided in accordance with a preferred embodiment of the present invention a digital signal processing (DSP) architecture including at least two multipliers where each multiplier is operative to receive either of a real and an imaginary first data value and either of a real and an imaginary coefficient value and multiply the data and coefficient values to provide a multiplication result, at least two three-input arithmetic logic units (ALU) where each ALU is operative to receive each of the multiplication results from the multipliers and either of a real and an imaginary second data value and perform any of addition and subtraction upon each of the multiplication results and the second data value to provide a Fast Fourier Transform (FFT) calculation result, at least two first-cycle registers where each first-cycle register is operative to receive the FFT calculation result from one of the ALUs calculated during a first processing cycle of two consecutive processing cycles, at least two second-cycle registers where each second-cycle register is operative to receive the FFT calculation result from one of the ALUs calculated during a second processing cycle of the two consecutive processing cycles, and multiplexing apparatus operative to selectably retrieve and forward for storage in memory the FFT calculation results from one of the first-cycle registers and one of the second-cycle registers during a first memory-write cycle of two consecutive memory write cycles and the FFT calculation results from the other of the first-cycle registers and the other of the second-cycle registers during a second memory-write cycle of the two consecutive memory write cycles.
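The data flow recited above may be illustrated by the following software sketch, offered as one possible reading rather than as the actual hardware. It assumes the conventional radix-2 butterfly A' = A + W·B and B' = A − W·B, where B is the first data value, W is the coefficient, and A is the second data value. It further assumes that the real parts of both outputs are formed in the first cycle and the imaginary parts in the second. The function name and the tuple-based "registers" are hypothetical.

```python
def butterfly_two_cycle(a_re, a_im, b_re, b_im, w_re, w_im):
    """Model one butterfly as two processing cycles followed by two memory writes."""
    # Cycle 1: the two multipliers form Re(B)*Re(W) and Im(B)*Im(W).
    m0 = b_re * w_re
    m1 = b_im * w_im
    # Each three-input ALU combines both products with Re(A).
    first_cycle_regs = (a_re + m0 - m1,   # Re(A')
                        a_re - m0 + m1)   # Re(B')

    # Cycle 2: the two multipliers form Re(B)*Im(W) and Im(B)*Re(W).
    m0 = b_re * w_im
    m1 = b_im * w_re
    # Each three-input ALU combines both products with Im(A).
    second_cycle_regs = (a_im + m0 + m1,  # Im(A')
                         a_im - m0 - m1)  # Im(B')

    # Multiplexing: each memory write pairs one first-cycle register with one
    # second-cycle register, so a complete complex result is stored per write.
    first_write = (first_cycle_regs[0], second_cycle_regs[0])   # A'
    second_write = (first_cycle_regs[1], second_cycle_regs[1])  # B'
    return first_write, second_write
```

Under these assumptions, each memory write carries the real and imaginary parts of a single output value, drawn from one first-cycle register and one second-cycle register, which corresponds to the selection performed by the multiplexing apparatus.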
Further in accordance with a preferred embodiment of the present invention the apparatus further includes at least a first cosinusoidal register for receiving real cosinusoidal data input, at least a second cosinusoidal register for receiving imaginary cosinusoidal data input, and a multiplexer for selectably providing data from either of the cosinusoidal registers to either of the ALUs.
Still further in accordance with a preferred embodiment of the present invention the apparatus further includes rounding apparatus operative to concatenate a rounding constant to the multiplexed cosinusoidal data, thereby forming a low-ordered portion of concatenated input to either of the ALUs.
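The effect of concatenating a rounding constant may be sketched as follows. The bit widths are assumptions chosen for illustration and are not stated in this section: a 16-bit data value occupies the high-order half of a 32-bit ALU input, the rounding constant 0x8000 occupies the low-order half, and the result is taken from the high-order 16 bits of the ALU output. The function names are hypothetical.

```python
ROUNDING_CONSTANT = 0x8000  # assumed: one half of a least significant result bit


def concat_with_rounding(value16):
    """Place value16 in the high-order half and the constant in the low-order half."""
    return (value16 << 16) | ROUNDING_CONSTANT


def alu_rounded(value16, product_a, product_b):
    """Add two 32-bit products to the concatenated input and keep the high half."""
    total = concat_with_rounding(value16) + product_a + product_b
    return total >> 16


def alu_truncated(value16, product_a, product_b):
    """Same sum with a zero low-order half, shown only for comparison."""
    return ((value16 << 16) + product_a + product_b) >> 16
```

For example, with a data value of 3 and a single product of 0x8000 (one half in this scaling), alu_rounded returns 4 whereas alu_truncated returns 3. Because the constant arrives as part of an existing ALU input, rounding in this sketch requires no additional addition step.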
There is also provided in accordance with a preferred embodiment of the present invention a digital signal processing (DSP) method including the steps of receiving at at least two multipliers either of a real and an imaginary first data value and either of a real and an imaginary coefficient value, multiplying at the two multipliers the data and coefficient values to provide a multiplication result, receiving at at least two three-input arithmetic logic units (ALU) each of the multiplication results from the multipliers and either of a real and an imaginary second data value, performing at the ALUs any of addition and subtraction operations upon each of the multiplication results and the second data value to provide a Fast Fourier Transform (FFT) calculation result, receiving at at least two first-cycle registers the FFT calculation result from one of the ALUs calculated during a first processing cycle of two consecutive processing cycles, receiving at at least two second-cycle registers the FFT calculation result from one of the ALUs calculated during a second processing cycle of the two consecutive processing cycles, and selectably retrieving and forwarding for storage in memory the FFT calculation results from one of the first-cycle registers and one of the second-cycle registers during a first memory-write cycle of two consecutive memory write cycles and the FFT calculation results from the other of the first-cycle registers and the other of the second-cycle registers during a second memory-write cycle of the two consecutive memory write cycles.
Further in accordance with a preferred embodiment of the present invention the method further includes receiving real cosinusoidal data input at at least a first cosinusoidal register, receiving imaginary cosinusoidal data input at at least a second cosinusoidal register, and selectably providing data from either of the cosinusoidal registers to either of the ALUs.
Still further in accordance with a preferred embodiment of the present invention the method further includes concatenating a rounding constant to the multiplexed cosinusoidal data, thereby forming a low-ordered portion of concatenated input to either of the ALUs.