Multiply-accumulate (MAC) operations are required in processor cores. Multiply-accumulate is an important and expensive operation that is frequently used in digital signal processing and video/graphics applications. As a result, any reduction in the delay for performing this operation has a significant impact on various aspects of processor performance.
FIG. 1 is a schematic diagram illustrating a method of implementing a multiply-accumulate (MAC) instruction with three source operands. Each 32 bit source operand, Operand 1, Operand 2 and Operand 3, is held in a respective register, Reg 1, Reg 2, Reg 3. The first and second operands are supplied to a 32×32 bit multiplier 2, the outputs of which (in the form of 64 pairs of sum/carry terms) are supplied to a set of full adders in block 4 for the purposes of accumulation. The third operand, Operand 3, is also supplied to the full adder block 4. The full adder block adds the third operand to the multiplication result. The outputs from the full adder block 4 are supplied to an add 64 block 6 which combines the terms from the full adder block and generates a final result. The final result is placed in a result register Reg R. According to this schematic, two of the three operands are multiplied together, and the third operand is added to the result of the multiplication. All three source operands are held in registers (generally implemented as flip-flops), and the final result is stored in a register (flip-flops). Because three operands are required, all three are read from a register file at once, and the result is written back to the register file. This increases the size of the register file and complicates the wiring.
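By way of illustration only, the data flow of FIG. 1 can be modelled behaviourally in software. The following is a minimal sketch and not a description of any actual circuit: it assumes unsigned operands (the signedness is not stated above), and the function names `csa`, `multiply_carry_save` and `mac` are hypothetical names introduced for this example. The multiplier is modelled as leaving its result in redundant sum/carry form, the full adder block as a single 3:2 carry-save stage that folds in the third operand, and the add 64 block as the final carry-propagate addition.

```python
MASK64 = (1 << 64) - 1  # results are 64 bits wide, as in the add 64 block


def csa(a, b, c):
    """3:2 carry-save stage: returns (sum, carry) with
    sum + carry == a + b + c (mod 2**64)."""
    s = a ^ b ^ c                              # per-bit sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carries, shifted up one place
    return s & MASK64, cy & MASK64


def multiply_carry_save(op1, op2):
    """Model of multiplier 2 (assumed unsigned): accumulates the shifted
    partial products and leaves the product as a sum/carry pair."""
    s, c = 0, 0
    for i in range(32):
        if (op2 >> i) & 1:
            s, c = csa(s, c, (op1 << i) & MASK64)
    return s, c


def mac(op1, op2, op3):
    """Model of FIG. 1: op1 * op2 + op3 with 32 bit operands."""
    s, c = multiply_carry_save(op1, op2)   # multiplier 2
    s, c = csa(s, c, op3 & MASK64)         # full adder block 4 folds in Operand 3
    return (s + c) & MASK64                # add 64 block 6 resolves the carries


result = mac(3, 4, 5)
```

In this model the third operand costs only one extra carry-save stage ahead of the final addition; the expense discussed in this section lies in supplying that operand from a third register, not in the arithmetic itself.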
The cost of providing the third register Reg 3 for the third operand, with the corresponding flip-flops and wiring, renders a unit such as that illustrated in FIG. 1 for implementing a MAC instruction unduly expensive. Moreover, when the unit is used for a multiply (MUL) operation, the third register is redundant.
FIG. 2 illustrates a multiply unit which is readily available in many processor cores. The multiply unit comprises first and second registers for holding respective operands, a 32×32 multiplier block 2, an add 64 block 6 and a result register Reg R. Like numerals denote like parts as in FIG. 1. Operation of such a multiply unit is known in the art.
There exist units which can implement both MUL and MAC instructions, for example as described in the following.
United States Publication No. US2003/0172101 describes a multiply-accumulate unit (MAC) for use in a processor based system. The MAC includes control logic that allows it to implement multiply or multiply-accumulate instructions.
XP 001221317 (“A High Performance and Low-Power 32-bit Multiply-Accumulate Unit With Single-Instruction-Multiple-Data (SIMD) Feature”) describes a multiply-accumulate unit (MAC) that is capable of implementing multiply and multiply-accumulate instructions.
U.S. Pat. No. 5,436,860 describes a multiply-accumulate unit wherein the multiply-accumulate circuitry is reused to provide barrel shifter functionality.