A MAC is a circuit structure that may be utilized by a microprocessor or other data processing device to sequentially multiply pairs of binary numbers and accumulate the resulting products. Although the microprocessor can usually perform such a multiply-accumulate operation with its own internal resources, it typically takes a relatively large number of instruction cycles to do so. By employing a MAC, the processor can perform the multiply-accumulate operations at a much higher rate, depending upon the architecture of the MAC.
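The multiply-accumulate operation described above can be sketched in software. The following is a minimal illustration only; the function name `mac` is ours and does not correspond to any particular device:

```python
def mac(pairs):
    """Sequentially multiply pairs of numbers and accumulate the products."""
    acc = 0
    for a, b in pairs:
        acc += a * b  # one multiply-accumulate step per pair of operands
    return acc

print(mac([(2, 3), (4, 5), (6, 7)]))  # 2*3 + 4*5 + 6*7 = 68
```

A dedicated MAC performs each `a * b` and the running addition in hardware, rather than spending separate instruction cycles on each step as a general-purpose processor would.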
Prior art MACs are typically designed as general-purpose tools that can handle common data formats and widths. Such devices necessarily provide this flexibility at the cost of processing efficiency. A disadvantage in using these general-purpose MAC devices is that they are not optimized to the unique characteristics of specific data applications. For example, FIG. 1 illustrates one specific data processing architecture that cannot be optimally handled by a general-purpose MAC. FIG. 1 represents four lanes of a single instruction, multiple data (SIMD) processing architecture. In each lane 10, a multiplier 12 multiplies two 16-bit numbers. The multiplier 12 represents the multiplication product with a 40-bit number (32 bits for the maximum product, plus 8 extra bits for subsequent summation overflow). An accumulator 14 receives the four 40-bit products and produces a single 40-bit number 16 that represents the sum of the four 40-bit products. In other words, the accumulator 14 adds across the four SIMD lanes to produce a single sum as an output. Although not shown, the output of the accumulator 14 may propagate to a registered accumulator that adds the current output of the accumulator 14 to the registered result of the previous N samples of the accumulator 14, and then registers the result for the next iteration.
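The four-lane datapath of FIG. 1 can be modeled functionally as follows. This is a hedged sketch, not the hardware itself: the name `simd4_multiply_accumulate` is illustrative, and a 40-bit mask stands in for the 40-bit accumulator width:

```python
MASK40 = (1 << 40) - 1  # 40-bit result width (32-bit max product + 8 overflow bits)

def simd4_multiply_accumulate(lhs, rhs):
    """Four SIMD lanes: multiply one 16-bit pair per lane, sum across lanes."""
    assert len(lhs) == len(rhs) == 4
    products = [a * b for a, b in zip(lhs, rhs)]  # one product per lane
    total = sum(products)                         # accumulator adds across lanes
    return total & MASK40                         # wrap to 40-bit two's complement

print(simd4_multiply_accumulate([1, 2, 3, 4], [10, 20, 30, 40]))  # 300
```

The masking step shows why the 8 extra bits matter: the sum of four 32-bit products can grow beyond 32 bits, and the registered accumulator mentioned above would grow it further over N samples.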
To increase processing efficiency, the 40-bit numbers in this example are represented in radix-16 format, such that a carry-save bit for every four operand bits provides arithmetic redundancy (the data fields are shown to be 40+c wide, where the “c” represents the carry-save bits that accompany the 40 operand bits). The increase in processing efficiency, however, comes at the cost of additional storage requirements due to the extra carry-save bits. Note that the carry-save bits are sparsely distributed across the full field of 40 operand bits, so that the carry-save word field is similar in size to the 40-bit operand. As used herein, the terms “sparse” and “sparsely distributed” refer to some number of information bits (e.g., carry bits) distributed across a data field of a given width, where the total number of information bits is less than the total number of possible bit positions within that given width. A prior art technique of adding the four 40-bit numbers is shown in block diagram form in FIG. 2. A first 4:2 compression adder block 18A receives the four 40-bit operands, and a second 4:2 compression adder block 18B receives the four sets of carry-save bits. Each of the first and second 4:2 blocks 18A and 18B produces a sum field and a carry field. A third 4:2 compression adder block 18C receives the two sum fields and the two carry fields from the first and second 4:2 blocks 18A and 18B.
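The 4:2 compression performed by each block in FIG. 2 can be modeled arithmetically as two cascaded 3:2 (full-adder array) stages. The sketch below is illustrative only: it uses full-width carry words rather than the sparse radix-16 carry-save bits described above, and the function names `csa` and `compress_4_2` are ours:

```python
def csa(a, b, c, width):
    """3:2 carry-save stage (full-adder array): a+b+c == s+t (mod 2**width)."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                              # bitwise sum
    t = (((a & b) | (a & c) | (b & c)) << 1) & mask     # majority carries, shifted
    return s, t

def compress_4_2(a, b, c, d, width=40):
    """4:2 compression: reduce four operands to a sum field and a carry field."""
    s1, t1 = csa(a, b, c, width)
    return csa(s1, t1, d, width)

s, t = compress_4_2(1, 2, 3, 4)
print((s + t) & ((1 << 40) - 1))  # 10, i.e. 1+2+3+4
```

The key property is that the sum and carry fields together preserve the total: no carry has to propagate across the full 40-bit width inside the compression blocks, which is the point of the redundant representation.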
One of the 4:2 compression adder blocks of FIG. 2 is shown in more detail in FIG. 3. As FIG. 3 shows, a 4:2 compression adder block is simply a cascade of two full adders, each of which is a 3:2 compression adder block. The delay through a 4:2 compression adder block is therefore equal to twice the delay through a full adder, and the delay through the adder tree configuration shown in FIG. 2 is equal to four times the delay through a full adder. Since the operands encounter this delay on each iteration of the MAC, any reduction in the delay would be significant when compounded over a large number of iterations through the adder tree.
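The FIG. 3 cascade can be expressed at the bit level as two full adders, which makes the two-full-adder delay explicit. This is an illustrative sketch (function names are ours); the final loop checks that the compressor conserves arithmetic weight over all 32 input combinations:

```python
def full_adder(a, b, cin):
    """1-bit full adder (a 3:2 compressor): returns (sum, carry)."""
    s = a ^ b ^ cin
    c = (a & b) | (a & cin) | (b & cin)
    return s, c

def compressor_4_2_bit(a, b, c, d, cin):
    """1-bit 4:2 compressor as a cascade of two full adders (per FIG. 3)."""
    s1, cout = full_adder(a, b, c)       # first full-adder delay
    s, carry = full_adder(s1, d, cin)    # second full-adder delay
    return s, carry, cout

# Weight conservation: a+b+c+d+cin == s + 2*(carry + cout) for all inputs.
for n in range(32):
    a, b, c, d, cin = [(n >> i) & 1 for i in range(5)]
    s, carry, cout = compressor_4_2_bit(a, b, c, d, cin)
    assert a + b + c + d + cin == s + 2 * (carry + cout)
print("ok")
```

Because the `s` output depends on `s1`, the second full adder cannot begin until the first finishes, which is exactly why the block's delay is twice that of a full adder.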