This invention relates to the instruction set architecture of a dual multiply-accumulator (MAC) based digital signal processor.
A digital signal processor (DSP) is a special-purpose CPU utilized for digital processing and analysis of signals from analog sources, such as sound. The analog signals are converted into digital data and analyzed using various algorithms, such as Fast Fourier Transforms. DSPs are designed for particularly fast performance of certain operations, such as multiplication, multiplying and accumulating, and shifting and accumulating, because the math-intensive processing applications for DSPs rely heavily on such operations. For this reason, a DSP will typically include special hardware circuits to perform multiplication, accumulation and shifting operations.
One popular form of DSP architecture is known as a Multiply-Accumulate or MAC processor. The MAC processor implements an architecture that takes advantage of the fact that the most common data processing operations involve multiplying two values, then adding the resulting value to another and accumulating the result. These basic operations are efficiently carried out utilizing specially configured, high-speed multipliers and accumulators, hence the “Multiply-Accumulate” nomenclature. In order to increase the processing power of MAC processors, they have been designed to perform different processes concurrently. Towards this end, DSP architectures with plural MAC structures have been developed. For example, a dual MAC processor is capable of performing two independent MAC operations concurrently.
A conventional multiply-accumulator (MAC) has a 2-input multiplier M which stores its output in a product register P. The product register is connected to one input of a two-input adder A whose output is stored in one of several accumulator registers. A second input of the adder is connected to the accumulator array to allow for a continuous series of cumulative operations. Conventional vector processors are made of several MAC processors operating in parallel. Each MAC operates on its own independent data stream and the parallel MACs are joined only by a common set of accumulators. The number of instructions available for each individual MAC is fairly limited and thus, even when several MACs are combined in a parallel vector processor, the total number of MAC commands which must be encoded is relatively small.
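The data path of a single conventional MAC described above can be modeled in a few lines. This is only an illustrative software sketch, not the hardware itself: the function name `mac_accumulate` and the use of unbounded Python integers (no register widths, saturation or pipelining) are assumptions made for clarity.

```python
def mac_accumulate(xs, ys, acc=0):
    """Model one conventional MAC unit over a stream of operand pairs.

    Hypothetical helper for illustration: each iteration mirrors one
    multiply-accumulate step of multiplier M, product register P and adder A.
    """
    for x, y in zip(xs, ys):
        p = x * y        # 2-input multiplier M stores its output in product register P
        acc = acc + p    # adder A sums P with the accumulator and writes the result back
    return acc


if __name__ == "__main__":
    # Dot product of two short vectors, one multiply-accumulate per pair.
    print(mac_accumulate([1, 2, 3], [4, 5, 6]))
```

Feeding the accumulator back into the second adder input is what allows a continuous series of cumulative operations, such as the dot product in the usage lines.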
The architecture of the newly designed dual-MAC processor shown in FIG. 1 differs from conventional parallel vector processors by the addition of cross-connecting data lines. The dual-MAC architecture of FIG. 1 includes two 32-bit input registers x and y (the 16-bit high and low data halves will be referred to as xh and yh, and xl and yl, respectively) which hold the operands for the two multipliers M0 and M1. The x and y registers are cross-connected to both of the multipliers so that each multiplier can operate on any two of the four possible input factors. The products p0 and p1 are accumulated with the contents of any of the accumulators a0 to a7 by the two adders A0 and A1. The p0 product is also cross-connected to the A1 adder, which is capable of 3-input addition. In the preferred embodiment, the dual-MAC processor is implemented in conjunction with an aligned double word memory architecture which can return two double words in a single 32-bit fetch.
This cross-connected dual-MAC architecture allows a single FIR or IIR digital filter applied to a single data stream to be processed by both MACs in parallel, two taps at a time, where each “tap” is a multiply-accumulate operation. Conventional vector processors with no interconnects can compute two FIR filters in parallel, but each filter is processed one tap at a time. Thus, for a single FIR or IIR filter, the cross-connected architecture operates twice as quickly as a conventional vector processor.
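The two-taps-at-a-time behavior can be illustrated with a small software model that compares one output sample of a FIR filter computed on a single MAC with the same sample computed on a cross-connected dual MAC. The sketch below is an assumption-laden illustration: the function names, the choice to place two coefficients in the x halves and two samples in the y halves, and the step counters are all hypothetical and are not taken from FIG. 1 or FIG. 2.

```python
def fir_single_mac(coeffs, samples):
    """One output sample on a single MAC: one tap per step."""
    acc, steps = 0, 0
    for c, s in zip(coeffs, samples):
        acc += c * s
        steps += 1
    return acc, steps


def fir_dual_mac(coeffs, samples):
    """One output sample on a cross-connected dual MAC: two taps per step.

    Assumed operand layout (illustrative only): xh, xl hold two coefficients
    and yh, yl hold two samples fetched as packed 16-bit halves.
    """
    if len(coeffs) != len(samples) or len(coeffs) % 2:
        raise ValueError("need equal-length inputs with an even number of taps")
    acc, steps = 0, 0
    for i in range(0, len(coeffs), 2):
        xh, xl = coeffs[i], coeffs[i + 1]
        yh, yl = samples[i], samples[i + 1]
        p0 = xh * yh          # multiplier M0
        p1 = xl * yl          # multiplier M1
        acc = acc + p0 + p1   # 3-input adder A1: accumulator + cross-connected p0 + p1
        steps += 1
    return acc, steps


if __name__ == "__main__":
    coeffs, samples = [1, 2, 3, 4], [5, 6, 7, 8]
    print(fir_single_mac(coeffs, samples))  # same sum, four steps
    print(fir_dual_mac(coeffs, samples))    # same sum, two steps
```

Both functions produce the same accumulated value; the dual-MAC model simply needs half as many steps, which is the factor-of-two speedup claimed for a single filter.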
FIG. 2 is a list of all the possible instruction commands for the cross-connected dual-MAC architecture of FIG. 1. The commands are divided into accumulate statements and product statements. Each statement represents either a single operation done on one MAC side or the other, or two operations done on both sides in parallel. When these commands are implemented as part of architected instructions, each will either be an accumulate statement (add), a product statement (multiply), or a combination of addition and multiplication.
As shown in FIG. 2, there are 12 possible accumulate combinations and 20 possible product combinations in the orthogonal dual-MAC instruction set. Thus, the total number of commands which can be architected in the command processor and encoded within the commands is 12*20+12+20=272. The architected cross-connections result in a combinatorial multiplication of the number of possible functions which can be encoded as architected commands.
An issue which arises with this architecture is that encoding 272 separate dual-MAC operations within a command code requires 9 bits, because 8 bits can distinguish only 256 codes. It is advantageous to reduce the number of bits required to encode dual-MAC instructions without impacting available functionality. This is especially true when the number of bits available to encode commands is limited and other commands must also be encoded within the same limited number of bits. In the specific dual-MAC processor at issue, only 7 bits have been dedicated to encoding commands for dual-MAC operations.
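The counting argument behind these figures can be checked with a short calculation. The sketch below uses only the numbers stated in the text (12 accumulate statements, 20 product statements, a 7-bit field); the function names are hypothetical and chosen for illustration.

```python
ACCUMULATE_STATEMENTS = 12  # accumulate combinations listed in FIG. 2
PRODUCT_STATEMENTS = 20     # product combinations listed in FIG. 2
AVAILABLE_BITS = 7          # bits dedicated to dual-MAC commands


def orthogonal_command_count(n_acc, n_prod):
    """Combined commands plus accumulate-only plus product-only commands."""
    return n_acc * n_prod + n_acc + n_prod


def bits_needed(n_commands):
    """Smallest field width that gives every command a distinct code."""
    return (n_commands - 1).bit_length()


if __name__ == "__main__":
    total = orthogonal_command_count(ACCUMULATE_STATEMENTS, PRODUCT_STATEMENTS)
    print(total, bits_needed(total), 2 ** AVAILABLE_BITS)
```

Under these figures the full orthogonal set needs a 9-bit field, while a 7-bit field offers only 128 distinct codes, so the orthogonal set cannot be encoded as it stands.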
According to the present invention, a near-orthogonal dual-MAC instruction set is provided which implements virtually the entire functionality of the orthogonal instruction set of 272 commands using only 65 commands. The reduced instruction set is achieved by eliminating instructions based on symmetry with respect to the result of the commands and by imposing simple restrictions related to items such as the order of data presentation by the programmer. Specific selections of commands are also determined by the double word aligned memory architecture which is associated with the dual-MAC architecture. The reduced instruction set architecture preserves the functionality and inherent parallelism of the command set and requires fewer command bits to implement than the full orthogonal set.