1. Field of the Invention
The present invention relates generally to digital signal processing systems and, more particularly, to an improved digital signal processor architecture.
2. Background of the Invention
Digital signal processing is characterized by operating on sets of data elements which are continuously evolving in time. These data sets correspond to the digital representation of signals in the analog domain, and are referred to as vectors. Digital signal processing algorithms are characterized by frequently performing the same computation on each of the elements in a vector. For example, a filtering algorithm may multiply each element in a vector by a different factor, and accumulate the partial results into a single final result.
Elementary signal processing algorithms, known as signal processing kernels, are characterized by the execution of sequences of operations on the vector elements. As stated above, one example is the execution of multiplication followed by the execution of accumulation. In regard to the execution of such sequence of operations, the state of the art in the implementation of digital signal processors includes either performing a sequence of operations using the same arithmetic/logic unit for as many times as the number of operations in the sequence (e.g., a multiplication operation in one cycle followed by an accumulation operation in the next cycle), or structuring the hardware as a pipeline in which operands enter at one end, with the operations being performed as the data flows through the pipeline, and results obtained at the other end of the pipeline (e.g., a multiply-add pipeline).
A significant limitation in the state of the art, in particular the pipeline approach mentioned above, is the restricted flexibility for performing the operations that compose the signal processing kernels, due to the conventional pipeline organization. The schemes in the state of the art do not allow for intermediate results to be collected for processing further down in the pipeline, perhaps in a different order in which the intermediate results are generated, or for changing the sequence of operations in the pipeline at some arbitrary points in time. These limitations require more complex sequences of instructions and usually require more execution cycles, thereby restricting the maximum performance that can be obtained from the functional units.
For the purposes of executing digital signal processing algorithms on a programmable processor, the vectors of data elements may be grouped into smaller subsets, for example, of four elements per subset, and computations can be performed simultaneously (in parallel) on all the elements of the subset. Two alternative schemes are currently used for grouping the data elements and specifying such operation.
In the first approach, the data elements in one subset of a vector are located in separate registers, and a different instruction specifies the operation performed on each of the elements. Although multiple instructions are used to specify the operations performed simultaneously on the data elements, all these instructions correspond to a single program flow, and thus are treated as a single entity. This approach is known as Very-Long Instruction Word (VLIW), referring to the case wherein a single very long instruction word contains a plurality of basic instructions. In the case of computations for a subset of a vector, all the basic instructions are identical because the same operation is performed on all the data elements; the only difference among these basic instructions is the location of operands and results. This approach is used in various digital signal processors, such as the C64 from Texas Instruments Inc., SC140 from StarCore, and ADSP 2116x from Analog Devices, Inc.
In the second approach, all the data elements in one subset of a vector are located in the same register (“wide register”), and a single instruction specifies the operation performed on all such elements. This approach is known as Single Instruction, Multiple Data (SIMD) with subword parallelism. The term SIMD refers to the use of only one instruction for all the operations, whereas the term subword parallelism refers to the concatenation of multiple data elements in the same register. This approach is used in various multimedia and signal processing-oriented microprocessor extensions, such as MMX from Intel Corporation and ALTIVEC from Motorola. Inc.
These approaches suffer from various limitations. In particular, the use of SIMD with subword parallelism requires placing the data elements in the right order within the wide registers, and further requires mechanisms to move the data elements around as needed. These requirements translate into additional hardware resources and execution cycles. On the other hand, the use of the VLIW approach requires coding multiple instructions that perform the same operation, thus leading to longer instructions (the so-called VLIWs), which tend to require more space in instruction memory.