The present invention is directed to a multiplier-accumulator (MAC) and, more particularly, to a virtual parallel multiplier-accumulator (VMAC) that processes more than or less than one MAC operation within a single system clock cycle.
A multiply-accumulate (MAC) operation is a common operation performed in signal processing and other algorithms. Because of its frequency of occurrence in such algorithms, many prior art microprocessors and digital signal processors (DSPs) include some form of direct instruction support for the multiply-accumulate operation. Typically, the CPU's instruction set includes a multiply-accumulate instruction, or multiply and add instructions that, together, can execute a MAC operation in a single system clock cycle. These instructions are executed by hardware circuits such as separate multiplier and adder circuits, or a combined multiply-add circuit.
Algorithms that use MAC operations typically consist of a loop over many iterations. The algorithm's performance can be improved by executing the MAC operations of multiple loop iterations at once. This property has motivated CPU designers to include instructions that execute multiple MAC operations per system clock cycle. An instruction executing multiple MAC operations per system clock cycle may be implemented in a number of ways. For example, hardware may be provided to execute multiple MAC operations per cycle consisting of a number of multipliers and adders or a number of multiply-add circuits. By providing multiple arithmetic circuits, the CPU can execute the simultaneous multiplies and adds needed to support multiple MAC operations in parallel.
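By way of illustration only, the loop structure described above may be sketched in software as follows. The sketch assumes a dot product as the representative algorithm; the function names mac, dot_single and dot_dual are hypothetical and do not correspond to any particular instruction set.

```python
def mac(x, y, z):
    """One multiply-accumulate operation: x * y + z."""
    return x * y + z


def dot_single(xs, ys):
    """One MAC operation per loop iteration, into a single accumulator."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(x, y, acc)
    return acc


def dot_dual(xs, ys):
    """Two MAC operations per loop iteration (illustrative assumption).

    Each iteration feeds two independent accumulators, mimicking hardware
    that executes two MAC operations at once. The two partial sums are
    combined by one extra addition at the end.
    """
    acc_a = 0
    acc_b = 0
    for i in range(0, len(xs) - 1, 2):
        acc_a = mac(xs[i], ys[i], acc_a)
        acc_b = mac(xs[i + 1], ys[i + 1], acc_b)
    if len(xs) % 2:
        # Odd-length input: one leftover element for the first accumulator.
        acc_a = mac(xs[-1], ys[-1], acc_a)
    return acc_a + acc_b
```

Both functions return the same value for the same inputs; the second merely halves the number of loop iterations at the cost of a second accumulator and a final addition.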
Microprocessor integrated circuits may include a plurality of multiplier-accumulator (MAC) units connected in parallel with each other. While this configuration provides the ability to perform multiple MAC operations within a single system clock cycle, it also consumes more real estate within the integrated circuit, and adversely affects the performance and power consumption of the integrated circuit due to the relatively long bus connections between multi-port memories, registers, and the multiple MAC units.
An example of a prior art CPU data path executing two MAC operations per cycle is depicted in FIG. 1. Each MAC unit defines a data path consisting of a register file of sixteen 40-bit registers, with a multiplier and a load/store/arithmetic unit attached to the register file. The multipliers each multiply two 16-bit operands to produce a 32-bit product. The multipliers can accept a new set of operands and produce a new product every system clock cycle, but have a latency of two system clock cycles. The load/store/arithmetic units can perform a 40-bit accumulate (i.e., addition/subtraction) in a single system clock cycle. The multiple MAC units are identical to each other, and provide an effective throughput of two multiply-accumulates per system clock cycle. Performing a complete multiply-accumulate operation requires passing the operands through a multiplier by issuing a multiply instruction, and then through a load/store/arithmetic unit by issuing an add instruction. The multiply and add instructions are scheduled for execution so that the product of the multiply operation is not used by the add operation until the multiplier has finished generating the product.
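The operand widths recited above may be modeled, as a rough sketch only, in the following manner. The text does not state whether the operands are signed, nor how the 40-bit accumulator behaves on overflow; the sketch assumes signed two's-complement values and wrap-around, and all function names are hypothetical.

```python
def wrap_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = (1 << bits) - 1
    value &= mask
    if value >> (bits - 1):
        return value - (1 << bits)
    return value


def multiply_16x16(a, b):
    """Multiply two 16-bit operands (assumed signed) into a 32-bit product."""
    a = wrap_signed(a, 16)
    b = wrap_signed(b, 16)
    # The largest magnitude, (-32768) * (-32768) = 2**30, fits in 32 bits.
    return a * b


def accumulate_40(acc, product):
    """Add a product to a 40-bit accumulator (wrap-around is an assumption)."""
    return wrap_signed(acc + product, 40)
```

Under these assumptions, the eight bits by which the accumulator exceeds the product width act as headroom, allowing many products to be summed before the accumulator can wrap.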
A prior art dual MAC data path is depicted in FIG. 1, and a timing diagram for that MAC is depicted in FIG. 2. The timing diagram depicted in FIG. 2 represents the timing for one of the components of the data path of FIG. 1, with the timing diagram for the other component of the data path being substantially similar. In operation, the first two multiply operands are read from a register file (REG FILE A) during Cycle 1 on signal lines DI_M1S1 and DI_M1S2. The values of these first operands are determined by the data stored at the corresponding register addresses, e.g., register file A source 1 (REGS1A-1) and register file A source 2 (REGS2A-1). These first operands are communicated to multiplier M1, which begins a multiply operation on the two operands. In Cycle 2, a second set of operands is read from the register file (REGS1A-2 and REGS2A-2) and communicated to the multiplier M1, which begins a multiply operation. At the same time, the multiplier M1 finishes its multiply operation on the first operands and generates a first output product PROD1-1 which is output on signal line PS_M1D. The first output product is communicated to register file A at the end of Cycle 2. During Cycle 3, the first product that was generated, PROD1-1, is read from register file A on signal line PS_L1S1 and communicated to the load/store/arithmetic unit L1 as a first operand. The second operand to be accumulated by L1 is the value denoted ACC1-1 and is read from register file A on signal line PS_L1S2. The sum of the accumulation operation performed by L1 on PROD1-1 and ACC1-1, designated as SUM1-1, is written to register file A at the end of Cycle 3 over signal line RA_L1D. Also, during this cycle, a second product PROD1-2 is generated by the multiplier M1 and written to register file A. Similarly, third operands are read from register file A (REGS1A-3 and REGS2A-3) and communicated to the multiplier M1, which begins a multiply operation on the third operands.
During cycles 4, 5 and 6, successive products are accumulated by L1 and additional products are generated by M1. When finished, the two mirror components of the prior art MAC data path have each accumulated the sum of an independent sequence of products. If the sum of those two sequences is needed, an additional accumulation instruction is issued to add the two sums.
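The cycle-by-cycle behavior described above may be approximated by the following behavioral sketch. It follows the described timing (operands read in one cycle, the product written to the register file at the end of the next cycle, and the accumulation performed in the cycle after that). The function names, the use of a Python variable in place of the accumulator register, and the even/odd distribution of operand pairs between the two mirror components are assumptions made only for illustration.

```python
def simulate_path(pairs):
    """Model one mirror component of the prior art data path.

    Returns (accumulated_sum, cycles_used).
    """
    acc = 0
    in_multiplier = None  # operand pair that entered M1 in the previous cycle
    in_regfile = None     # product written to the register file last cycle
    pending = list(pairs)
    cycle = 0
    while pending or in_multiplier is not None or in_regfile is not None:
        cycle += 1
        # L1 accumulates the product written at the end of the previous cycle.
        if in_regfile is not None:
            acc += in_regfile
        # M1 finishes the multiply it began in the previous cycle.
        new_product = None
        if in_multiplier is not None:
            new_product = in_multiplier[0] * in_multiplier[1]
        # The next operand pair is read and begins its multiply.
        in_multiplier = pending.pop(0) if pending else None
        # The finished product is written at the end of this cycle.
        in_regfile = new_product
    return acc, cycle


def simulate_dual(xs, ys):
    """Model both mirror components plus the extra accumulation cycle."""
    pairs = list(zip(xs, ys))
    sum_a, cycles_a = simulate_path(pairs[0::2])  # assumed split: even pairs
    sum_b, cycles_b = simulate_path(pairs[1::2])  # assumed split: odd pairs
    # One additional accumulate instruction adds the two independent sums.
    return sum_a + sum_b, max(cycles_a, cycles_b) + 1
```

In this model a single component needs two cycles beyond the number of products it accumulates, and the dual arrangement needs one further cycle to combine the two sums.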
It is common in CPU designs to increase the CPU clock frequency by processing instruction execution in a pipeline. The flow of instructions and their operands and results through the pipeline is controlled by the CPU's pipeline control logic. For CPUs that do not support a MAC operation, the duration of a pipeline stage (and therefore the clock frequency) is typically determined by the adder circuit or the delay to access memory. For CPUs that support MAC operations, the duration of a pipeline stage is often determined by the multiplier/adder/multiply-add circuit, i.e., by the hardware provided to perform the MAC operation. To overcome this limitation, prior art CPUs extend the pipeline by pipelining the multiplier/adder/multiply-add arithmetic circuits. Although the arithmetic circuits are pipelined with a fixed number of stages, pipelining still introduces significant complexity both in the design of the pipeline control logic and in writing a sequence of instructions to handle the latency of the pipeline. Ideally, the MAC operation should be executed with an arithmetic circuit that does not constrain the CPU's clock frequency and does not introduce complex latencies for the programmer to manage.
The prior art dual MAC data path has a number of disadvantages. Firstly, two multipliers and two adders are required. Secondly, the clock frequency of the dual MAC data path is restricted by the multiplier's delay; the multiplier already being pipelined once in an attempt to deal with its impact on the system frequency. However, this pipeline then requires extra circuit area, power and latency if the product is immediately re-used in a subsequent multiplication. Finally, the prior art dual MAC data path does not produce a single sum of all four products and the data path has to be partitioned into mirror components to reduce the pressure on register file ports and bus loading. However, this means that the data path does not directly sum a sequence of products in half the number of cycles, and an additional cycle is needed to add the final sums.
It is desirable to provide a MAC unit that overcomes the shortcomings of the prior art.
The present invention is directed to a virtual parallel multiplier-accumulator (VMAC) that can process N MAC operations within M system clock cycles, where N may or may not be equal to M and where a MAC operation is generally defined by the equation (x)*(y)+(z). The present invention also reduces the physical size of integrated circuits and electronic devices since one VMAC constructed in accordance with the present invention replaces N prior art MAC units.
The VMAC of the present invention consists of a Control-Wave Generator (CWG) and a Sequential-Computational-Stage MAC (SCS-MAC) comprised of a plurality of sequentially (i.e., serially) arranged computational stages. The CWG produces multiple sets of consecutive control signals within a single VMAC clock (VMCK) cycle, and with each rising edge of VMCK. The frequency of the VMCK may be different from or the same as the system or main clock (MCK), as a matter of design choice. The control signals generated by the CWG control the flow of data or operands through the SCS-MAC (i.e., through the VMAC), and are also used to clock output or result registers that may be connected to the inventive VMAC. A source register may be connected to the VMAC to provide input operand data to the SCS-MAC. The SCS-MAC performs a MAC operation as the operand data propagates through each computational stage of the SCS-MAC. The output from the VMAC may be latched into an output or result register for communication to the source register of the VMAC or to another electronic device or circuit.
Whereas prior art MAC units accept a maximum of one set of operands per clock cycle, the VMAC of the present invention can accept a new set of operand data within a single clock cycle. In fact, the VMAC of the present invention permits many new MAC operations to start within a single clock cycle, and permits many operands to be present in the sequential computational stages, with each computational stage executing a different phase of a MAC operation (e.g., partial sums, products, etc.). Thus, the VMAC of the present invention simultaneously performs different phases of a MAC operation on different sets of operand data, and produces one MAC result per time period approximately equal to the propagation delay through a single computational stage.
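The flow of operand data through sequentially arranged computational stages may be pictured with the following behavioral sketch. It is not a description of the actual circuit: the division of a MAC operation into three stages, the stage functions, the parameter waves_per_vmck (the number of control signals assumed to be issued per VMCK cycle), and the treatment of each control signal as advancing every operand set by one stage are all assumptions chosen for illustration.

```python
def stage_partial_products(item):
    """Hypothetical first stage: form two partial products of x * y."""
    x, y, z = item
    low = x * (y & 0xFF)   # x times the low byte of y
    high = x * (y >> 8)    # x times the remaining upper part of y
    return (low, high, z)


def stage_product(item):
    """Hypothetical second stage: combine the partial products."""
    low, high, z = item
    return (low + (high << 8), z)


def stage_accumulate(item):
    """Hypothetical third stage: add the accumulate operand z."""
    product, z = item
    return product + z


STAGES = [stage_partial_products, stage_product, stage_accumulate]


def run_vmac(operand_sets, waves_per_vmck):
    """Push (x, y, z) operand sets through the stages, one stage per wave.

    Returns (results, vmck_cycles), where each result is x * y + z.
    """
    slots = [None] * len(STAGES)  # slots[i] holds data at the input of stage i
    pending = list(operand_sets)
    results = []
    waves = 0
    while pending or any(slot is not None for slot in slots):
        waves += 1
        # A new operand set may enter the first stage on every wave.
        if pending and slots[0] is None:
            slots[0] = pending.pop(0)
        # Work from the last stage back so each item advances one stage only.
        for i in reversed(range(len(STAGES))):
            if slots[i] is None:
                continue
            value = STAGES[i](slots[i])
            slots[i] = None
            if i + 1 < len(STAGES):
                slots[i + 1] = value
            else:
                results.append(value)
    vmck_cycles = -(-waves // waves_per_vmck)  # ceiling division
    return results, vmck_cycles
```

In this model, once the stages are filled, one result emerges per wave, so that three operand sets complete in five waves; with four waves assumed per VMCK cycle, those three MAC operations finish within two VMCK cycles.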
The present invention is directed to a virtual parallel multiplier-accumulator (VMAC) responsive to a VMAC clock (VMCK) derived from a master clock (MCK). The VMAC is adapted for performing more than or less than one multiplier-accumulator (MAC) operation within an MCK cycle.
The present invention is also directed to a virtual parallel multiplier-accumulator (VMAC) responsive to a VMAC clock (VMCK) derived from a master clock (MCK), where the VMAC is adapted for performing more than or less than one multiplier-accumulator (MAC) operation within an MCK cycle. The VMAC of this embodiment comprises a control-wave generator (CWG) adapted for generating a plurality of control signals within a VMCK cycle. The VMAC further comprises a sequential-computational-stage MAC (SCS-MAC) adapted for receiving data from a source register and for receiving said plurality of control signals from the CWG. The SCS-MAC performs an operation on the data upon receipt of each of the plurality of control signals from the CWG.
The present invention is also directed to an integrated circuit including a virtual parallel multiplier-accumulator (VMAC) responsive to a VMAC clock (VMCK) derived from a master clock (MCK). The integrated circuit includes a VMAC that is adapted for performing more than one multiplier-accumulator (MAC) operation within an MCK cycle.
Other objects and features of the present invention will become apparent from the following detailed description, considered in conjunction with the accompanying drawing figures. It is to be understood, however, that the drawings, which are not to scale, are designed solely for the purpose of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims.