A multiply-accumulate (MAC) operation is common in signal processing and other algorithms. Because it occurs so frequently in such algorithms, many prior art microprocessors and digital signal processors (DSPs) include some form of direct instruction support for the multiply-accumulate operation. Typically, the CPU's instruction set includes a multiply-accumulate instruction, or multiply and add instructions that, together, can execute a MAC operation in a single system clock cycle. These instructions are executed by hardware circuits such as separate multiplier and adder circuits, or a combined multiply-add circuit.
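For illustration only, the MAC operation described above can be sketched in software as follows. The function names `mac` and `dot_product` are hypothetical and chosen for this sketch; they do not correspond to any particular instruction set.

```python
def mac(acc, x, y):
    """One multiply-accumulate step: return acc + x * y."""
    return acc + x * y


def dot_product(xs, ys):
    """Sum of element-wise products, one MAC per loop iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


if __name__ == "__main__":
    # 1*4 + 2*5 + 3*6
    print(dot_product([1, 2, 3], [4, 5, 6]))
```

A loop of this shape, with one multiply and one add per iteration, is the pattern that a hardware MAC instruction is intended to accelerate.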
Algorithms that use MAC operations typically consist of a loop over many iterations. The algorithm's performance can be improved by executing the MAC operations of multiple loop iterations at once. This property has motivated CPU designers to include instructions that execute multiple MAC operations per system clock cycle. An instruction executing multiple MAC operations per system clock cycle may be implemented in a number of ways. For example, hardware consisting of a number of multipliers and adders, or a number of multiply-add circuits, may be provided to execute multiple MAC operations per cycle. By providing multiple arithmetic circuits, the CPU can execute the simultaneous multiplies and adds needed to support multiple MAC operations in parallel.
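As a rough software analogy, executing the MAC operations of two loop iterations at once corresponds to unrolling the loop by two and keeping two independent partial sums. The sketch below assumes equal-length inputs; the function name `dot_product_dual` is hypothetical.

```python
def dot_product_dual(xs, ys):
    """Dot product with two independent MACs per loop iteration."""
    acc0 = 0  # partial sum over even-indexed elements
    acc1 = 0  # partial sum over odd-indexed elements
    n = len(xs)
    for i in range(0, n - 1, 2):
        # These two MACs do not depend on each other, so hardware
        # with two multiplier/adder pairs could perform them together.
        acc0 += xs[i] * ys[i]
        acc1 += xs[i + 1] * ys[i + 1]
    if n % 2:
        # Odd length: one element is left over after the paired iterations.
        acc0 += xs[-1] * ys[-1]
    return acc0 + acc1


if __name__ == "__main__":
    print(dot_product_dual([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

The two accumulators carry no dependency on each other inside the loop, which is what allows the two MAC operations to proceed in parallel; the partial sums are combined only once, after the loop.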
Microprocessor integrated circuits may include a plurality of multiplier-accumulator (MAC) units connected in parallel with each other. While this configuration provides the ability to perform multiple MAC operations within a single system clock cycle, it also consumes more real estate within the integrated circuit, and adversely affects the performance and power consumption of the integrated circuit due to the relatively long bus connections between multi-port memories, registers, and the multiple MAC units.
An example of a prior art CPU data path executing two MAC operations per cycle is depicted in FIG. 1. Each MAC unit defines a data path consisting of a register file of sixteen 40-bit registers, with a multiplier and a load/store/arithmetic unit attached to the register file. The multipliers each multiply two 16-bit operands to produce a 32-bit product. The multipliers can accept new operands and produce a new product every system clock cycle, but have a latency of two system clock cycles. The load/store/arithmetic units can perform a 40-bit accumulate (i.e., addition/subtraction) in a single system clock cycle. The multiple MAC units are identical to each other, and provide an effective throughput of two multiply-accumulates per system clock cycle. Performing a complete multiply-accumulate operation requires passing the operands through a multiplier by issuing a multiply instruction, and then through a load/store/arithmetic unit by issuing an add instruction. The multiply and add instructions are scheduled for execution so that the product of the multiply operation is not used by the add operation until the multiplier has finished generating the product.
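The operand and accumulator widths given above can be illustrated with a short sketch. It assumes signed two's-complement operands and wraparound on accumulator overflow; the text does not state either, so both are assumptions of this sketch, and the helper names are hypothetical.

```python
ACC_BITS = 40      # accumulator width, from the description above
OPERAND_BITS = 16  # multiplier operand width, from the description above


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def multiply16(a, b):
    """Multiply two signed 16-bit operands; the result fits in 32 bits."""
    return to_signed(a, OPERAND_BITS) * to_signed(b, OPERAND_BITS)


def accumulate40(acc, product):
    """Add a product to a 40-bit accumulator (wraparound is assumed here)."""
    return to_signed(acc + product, ACC_BITS)


if __name__ == "__main__":
    worst = multiply16(-32768, -32768)  # largest-magnitude 16x16 product
    acc = 0
    for _ in range(511):
        acc = accumulate40(acc, worst)
    print(worst, acc)
```

Under these assumptions, the eight accumulator bits beyond the 32-bit product width act as headroom: even the largest-magnitude product can be accumulated several hundred times before the 40-bit accumulator overflows.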
The prior art dual MAC data path is depicted in FIG. 1, and a timing diagram for that data path is depicted in FIG. 2. The timing diagram depicted in FIG. 2 represents the timing for one of the components of the data path of FIG. 1, with the timing diagram for the other component of the data path being substantially similar. In operation, the first two multiply operands are read from a register file (REG FILE A) during Cycle 1 on signal lines DI_M1S1 and DI_M1S2. The values of these first operands are determined by the data stored at the corresponding register addresses, e.g., register file A source 1 (REGS1A-1) and register file A source 2 (REGS2A-1). These first operands are communicated to multiplier M1, which begins a multiply operation on the two operands. In Cycle 2, a second set of operands is read from the register file (REGS1A-2 and REGS2A-2) and communicated to the multiplier M1, which begins a multiply operation on the second operands. At the same time, the multiplier M1 finishes its multiply operation on the first operands and generates a first output product PROD1-1, which is output on signal line PS_M1D. The first output product is communicated to register file A at the end of Cycle 2. During Cycle 3, the first product that was generated, PROD1-1, is read from register file A on signal line PS_L1S1 and communicated to the load/store/arithmetic unit L1 as a first operand. The second operand to be accumulated by L1 is the value denoted ACC1-1, which is read from register file A on signal line PS_L1S2. The sum of the accumulation operation performed by L1 on PROD1-1 and ACC1-1, designated as SUM1-1, is written to register file A at the end of Cycle 3 over signal line RA_L1D. Also during Cycle 3, a second product PROD1-2 is generated by the multiplier M1 and written to register file A. Similarly, third operands are read from register file A (REGS1A-3 and REGS2A-3) and communicated to the multiplier M1, which begins a multiply operation on the third operands.
During cycles 4, 5 and 6, successive products are accumulated by L1 and additional products are generated by M1. When finished, the two mirror components of the prior art MAC data path have each accumulated the sum of an independent sequence of products. If the sum of those two sequences is needed, an additional accumulation instruction is issued to add the two sums.
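The cycle-by-cycle behavior described above can be approximated by a small simulation. This is a simplified sketch, not a model of the actual circuit: it assumes the accumulator starts at zero, that one operand pair is issued per cycle, and that all names (`simulate_component`, `simulate_dual`) are hypothetical.

```python
MULT_LATENCY = 2  # cycles from operand read to product write-back (per the text)


def simulate_component(pairs):
    """Simulate one multiplier plus one accumulate unit.

    Returns (accumulated sum, number of cycles used).
    """
    acc = 0          # assumption: accumulator register starts at zero
    in_flight = []   # (write_back_cycle, product) still inside the multiplier
    reg_file = []    # products written back and waiting to be accumulated
    issued = 0
    cycle = 0
    while issued < len(pairs) or in_flight or reg_file:
        cycle += 1
        # Accumulate unit: uses a product written back in an earlier cycle.
        if reg_file:
            acc += reg_file.pop(0)
        # Multiplier output: products finishing now are written back at the
        # end of this cycle, so they become usable from the next cycle.
        reg_file.extend(p for ready, p in in_flight if ready == cycle)
        in_flight = [(ready, p) for ready, p in in_flight if ready != cycle]
        # Multiplier input: start a new multiply on the next operand pair.
        if issued < len(pairs):
            a, b = pairs[issued]
            in_flight.append((cycle + MULT_LATENCY - 1, a * b))
            issued += 1
    return acc, cycle


def simulate_dual(pairs_a, pairs_b):
    """Two mirror components run in parallel, then one extra add cycle."""
    sum_a, cycles_a = simulate_component(pairs_a)
    sum_b, cycles_b = simulate_component(pairs_b)
    return sum_a + sum_b, max(cycles_a, cycles_b) + 1


if __name__ == "__main__":
    print(simulate_component([(1, 2), (3, 4), (5, 6)]))
    print(simulate_dual([(1, 2), (3, 4)], [(5, 6), (7, 8)]))
```

In this model the first accumulation occurs in Cycle 3, matching the description of FIG. 2, so a component that processes N operand pairs finishes in N + 2 cycles, and combining the two independent sums costs one further cycle.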
It is common in CPU designs to increase the CPU clock frequency by pipelining instruction execution. The flow of instructions and their operands and results through the pipeline is controlled by the CPU's pipeline control logic. For CPUs that do not support a MAC operation, the duration of a pipeline stage (and therefore the clock frequency) is typically determined by the adder circuit or the delay to access memory. For CPUs that support MAC operations, the duration of a pipeline stage is often determined by the multiplier/adder/multiply-add circuit, i.e., by the hardware provided to perform the MAC operation. To overcome this limitation, prior art CPUs extend the pipeline by pipelining the multiplier/adder/multiply-add arithmetic circuits. Although the arithmetic circuits are pipelined with a fixed number of stages, pipelining still introduces significant complexity both in the design of the pipeline control logic and in writing a sequence of instructions to handle the latency of the pipeline. Ideally, the MAC operation should be executed with an arithmetic circuit that does not constrain the CPU's clock frequency and does not introduce complex latencies for the programmer to manage.
The prior art dual MAC data path has a number of disadvantages. Firstly, two multipliers and two adders are required. Secondly, the clock frequency of the dual MAC data path is restricted by the multiplier's delay, even though the multiplier is already pipelined once in an attempt to limit its impact on the system frequency. This pipelining in turn requires extra circuit area and power, and adds latency if the product is immediately re-used in a subsequent multiplication. Finally, the prior art dual MAC data path does not produce a single sum of all four products, and the data path has to be partitioned into mirror components to reduce the pressure on register file ports and bus loading. As a result, the data path does not directly sum a sequence of products in half the number of cycles, and an additional cycle is needed to add the final sums.
It is desirable to provide a MAC unit that overcomes the shortcomings of the prior art.