In Digital Signal Processing (DSP), numeric data must often be processed at a very high rate. In many DSP applications, high data rate processing is achieved through the use of multiple Arithmetic Units (AUs), such as combinations of adders, multipliers, accumulators, dividers, shifters, and multiplexors. However, there are two major difficulties with using many AUs in parallel: first, many control signals are needed to control the AUs; and, second, it is difficult to get the correct data words to each AU input on every clock cycle.
Some DSP architectures are Single Instruction Multiple Data (SIMD) architectures. A SIMD architecture, as defined here, groups its AUs into identical Processors, and these Processors perform the same operation on every clock cycle, but on different data. Hence, a Processor within a SIMD architecture can have multiple AUs, with each AU operating in bit-parallel fashion (some definitions of SIMD architectures have AUs operating in bit-serial fashion, which is not the case here). One application of a SIMD architecture is image compression, where an image is split into identically sized blocks. If there are four Processors, then four blocks within the image can be processed in parallel, with all four Processors controlled by the same instruction stream.
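The SIMD arrangement described above can be illustrated with a minimal software sketch. It assumes a 4x4 image split into four 2x2 blocks and a two-instruction stream; the names `split_into_blocks`, `run_simd`, `double`, and `add_one` are hypothetical and chosen only for illustration. The Python loop visits the Processors one after another, whereas the hardware would operate on all blocks in the same clock cycle.

```python
def split_into_blocks(image, block):
    """Split a 2-D list into block x block tiles, in row-major order."""
    tiles = []
    for r0 in range(0, len(image), block):
        for c0 in range(0, len(image[0]), block):
            tiles.append([row[c0:c0 + block] for row in image[r0:r0 + block]])
    return tiles


def run_simd(instruction_stream, blocks):
    """Apply one shared instruction stream to every block.

    Each block stands for the local data of one Processor (an assumption
    of this sketch). Every Processor executes the same operation at each
    step; only the data differs.
    """
    states = [[row[:] for row in b] for b in blocks]
    for op in instruction_stream:  # one instruction per step
        states = [[[op(v) for v in row] for row in s] for s in states]
    return states


def double(v):
    return v * 2


def add_one(v):
    return v + 1


IMAGE = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

blocks = split_into_blocks(IMAGE, 2)           # four 2x2 blocks
results = run_simd([double, add_one], blocks)  # same stream, four data sets
```

Because the instruction stream is shared, adding Processors in this model adds data paths but no additional control information.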
Many DSP applications, especially real-time applications, perform the same set of operations over and over. For example, in video compression, successive images are compressed continuously. Another example is signal detection, where an input data stream is continuously monitored for the presence of a signal. Within these more complex functions, there are usually simpler functions such as convolution, Fourier transform, and correlation, to name only a few. These simpler functions can be viewed as subroutines of the complex functions, and they can be broken down further into elementary subroutines; for example, the Discrete Fourier Transform (DFT) of an 8×8 data matrix can be implemented with 16 calls to an elementary subroutine which performs an 8-point DFT. The present patent describes a SIMD architecture which efficiently performs a wide variety of elementary subroutines.
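The 8×8 example can be sketched in software using the standard row-column decomposition: eight 8-point DFTs over the rows followed by eight over the columns, for 16 calls in total. The function names `dft8` and `dft2d_8x8` are hypothetical, and the direct O(N^2) DFT inside `dft8` is only a stand-in for whatever fast elementary subroutine the hardware would run.

```python
import cmath


def dft8(x):
    """Direct 8-point DFT, standing in for the elementary subroutine."""
    n = 8
    assert len(x) == n
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft2d_8x8(m):
    """2-D DFT of an 8x8 matrix via row DFTs, then column DFTs.

    Returns (result, calls), where calls counts the uses of dft8.
    """
    calls = 0

    # Pass 1: one 8-point DFT per row (8 calls).
    rows = []
    for r in range(8):
        rows.append(dft8(m[r]))
        calls += 1

    # Pass 2: one 8-point DFT per column of the row results (8 calls).
    cols = []
    for c in range(8):
        cols.append(dft8([rows[r][c] for r in range(8)]))
        calls += 1

    # cols[c][k] holds output element (row k, column c); transpose back.
    out = [[cols[c][k] for c in range(8)] for k in range(8)]
    return out, calls
```

For instance, `dft2d_8x8` applied to a matrix whose only nonzero entry is a 1 in the top-left corner returns a matrix of all ones (to within rounding), together with a call count of 16.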
Many elementary subroutines of interest can be described as a cascade of adds and multiplies, and architectures which are well matched to specific elementary subroutines can perform these elementary subroutines in a single pass. For example, Short Length Transforms (SLTs), such as the 8-point DFT, are described on pages 144 to 150 of H. J. Nussbaumer's book, Fast Fourier Transform and Convolution Algorithms (Second Edition), published by Springer-Verlag in 1990. In this book, SLTs are described as an add-multiply-accumulate-add process. That is, input data points are added and subtracted in various ways to form the first set of intermediate results; the first set of intermediate results is multiply-accumulated in various ways to form the second set of intermediate results; and the second set of intermediate results is added and subtracted in various ways to form the final results. The hardware for this three-stage process of add-subtract, followed by multiply-accumulate, followed by add-subtract, is shown in FIG. 1. The architecture of FIG. 1 was utilized in the SETI DSP Engine design done at Stanford University in 1984, and is, therefore, prior art for the present patent. The SETI DSP Engine architecture was published in four places: 1) "VLSI Processors for Signal Detection in SETI" by Duluk, et al., 37th International Astronautical Congress in Innsbruck, Austria, Oct. 4-11, 1986; 2) "Artificial Signal Detectors" by Linscott, Duluk, Burr, and Peterson, International Astronomical Union, Colloquium No. 99, Lake Balaton, Hungary, June, 1987; 3) "Artificial Signal Detectors", by Linscott, Duluk, and Peterson, included in the book "Bioastronomy--The Next Steps", edited by Marx, Kluwer Academic Publishing, pages 319-335, 1988; and 4) "The MCSA II--A Broadband, High Resolution, 60 Mchannel Spectrometer", by Linscott, et al. (including Duluk, the author of the present patent), 24th Annual Asilomar Conference on Circuits, Systems, and Computers, November 1990.
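The three-stage structure can be illustrated with a small software sketch. For brevity it uses a 3-point DFT rather than the 8-point case, arranged in the add-subtract, multiply, add-subtract form; this particular factorization and the names `dft3_three_stage` and `dft3_direct` are choices made for this illustration and are not taken from FIG. 1. Each middle-stage result here is a single product, the simplest case of a multiply-accumulate.

```python
import cmath
import math

U = 2 * math.pi / 3  # angle of the 3-point twiddle factor


def dft3_three_stage(x0, x1, x2):
    """3-point DFT as add-subtract, multiply, add-subtract."""
    # Stage 1: add-subtract the inputs (first intermediate results).
    t1 = x1 + x2
    a0 = x0 + t1
    d = x2 - x1

    # Stage 2: multiply by fixed coefficients (second intermediate results).
    m0 = 1 * a0                    # multiplied by one only to pass through
    m1 = (math.cos(U) - 1) * t1
    m2 = 1j * math.sin(U) * d

    # Stage 3: add-subtract to form the final results.
    s1 = m0 + m1
    return [m0, s1 + m2, s1 - m2]


def dft3_direct(x):
    """Reference 3-point DFT computed straight from the definition."""
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / 3) for t in range(3))
        for k in range(3)
    ]
```

Comparing `dft3_three_stage(1, 2, 3)` against `dft3_direct([1, 2, 3])` is a quick way to check that the staged form and the definition agree to within rounding.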
The architecture of FIG. 1, which is prior art, can perform three arithmetic operations in parallel: two add-subtracts and one multiply-accumulate. The add-subtracts are performed in the Upper Adder 1 and the Lower Adder 3, while the multiply-accumulate is performed in the Middle Multiply-Accumulate 5. The parallelism of three simultaneous arithmetic operations is achieved through the use of multiport Random Access Memories (RAMs), sometimes called multiport register files. Together, the three multiport RAMs in FIG. 1 can perform ten simultaneous read or write operations. The Upper Four-port RAM 7 simultaneously performs: two read operations for the operands of the Upper Adder 1; one write operation for the Upper Adder 1 result; and one write operation for data from the Input/Output Bus 9. The Middle Two-port RAM 11 and the Lower Four-port RAM 13 perform similar functions.
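The port budget of the upper stage can be modeled with a minimal sketch. The class `MultiportRAM`, the function `upper_stage_cycle`, the memory size, and the rule that writes become visible at the end of the cycle are all assumptions of this illustration and are not taken from FIG. 1. The model only counts accesses per clock cycle and rejects any that exceed the number of ports.

```python
class MultiportRAM:
    """Toy multiport RAM that allows at most `ports` accesses per cycle."""

    def __init__(self, name, ports, size=16):
        self.name = name
        self.ports = ports
        self.mem = [0] * size
        self.used = 0       # ports claimed in the current cycle
        self.pending = []   # writes committed when the cycle ends (assumed)

    def _claim_port(self):
        if self.used >= self.ports:
            raise RuntimeError(
                f"{self.name}: more than {self.ports} accesses in one cycle"
            )
        self.used += 1

    def read(self, addr):
        self._claim_port()
        return self.mem[addr]

    def write(self, addr, value):
        self._claim_port()
        self.pending.append((addr, value))

    def end_cycle(self):
        for addr, value in self.pending:
            self.mem[addr] = value
        self.pending.clear()
        self.used = 0


def upper_stage_cycle(ram, src_a, src_b, dst_sum, dst_bus, bus_word):
    """One clock cycle of the upper stage: 2 reads and 2 writes."""
    a = ram.read(src_a)            # operand 1 for the adder
    b = ram.read(src_b)            # operand 2 for the adder
    ram.write(dst_sum, a + b)      # adder result written back
    ram.write(dst_bus, bus_word)   # incoming word from the I/O bus
    ram.end_cycle()


# Four, two, and four ports give the ten simultaneous accesses.
TOTAL_PORTS = sum(r.ports for r in (
    MultiportRAM("upper", 4),
    MultiportRAM("middle", 2),
    MultiportRAM("lower", 4),
))
```

In this model a fifth access to the four-port RAM within one cycle raises an error, which mirrors the fixed limit on simultaneous operations that each multiport RAM imposes.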
The Processor architecture of FIG. 1, however, has limitations, including: i) data must be multiplied by the value "one" to get from the output of the Upper Adder 1 to the input of the Lower Adder 3; ii) there are no provisions for directly manipulating complex numbers; iii) only one multiply-accumulate is allowed per pass through the architecture, and hence only one multiply-accumulate per elementary subroutine; iv) there is only one AU per stage; v) there is only one Input/Output Bus 9; and vi) the architecture is highly specialized for SLT computation. The Processor architecture of the present patent overcomes these limitations.