In DSP architectures, it is common to have a multiply-accumulate instruction which can accumulate multiple products without loss of accuracy. For example, if the DSP has 16-bit input operands, the product will usually be 32-bits, and the DSP accumulator may hold 40 or 48-bits so that a number of products can be summed without having to worry about overflow or carry out of the high-order accumulator bit position.
Accumulators long enough to accumulate a number of products are much less common in general purpose computer architectures. One reason for this is that most such general purpose computer architectures utilize multiple named registers instead of the single global accumulator.
In multi-precision or extended precision multiplication operations, it is often desirable to cumulate many products without a loss of accuracy. One typical operation is to multiply two 16-bit numbers to generate a single 32-bit number. Another typical operation is to multiply two 32-bit numbers and generate two 32-bit partial products: one instruction will generate the most significant 32-bits of the full product; and another instruction will generate the least significant 32-bits of the full product. In these cases, carries and overflows are potentially problematic since each addition can potentially result in a carry or an overflow. This situation is ameliorated by an add-extended operation, in which a carry-bit can be used and set in a single addition operation.
In Single-Input/Multiple-Data (SIMD) architectures such as MMX, and VMX, things are often much more complicated. For example, the MMX and VMX architectures have multiply-sum instructions which take two registers (V1, V2) each containing eight 16-bit elements and a register (V3) containing four 32-bit elements and returns a register (V4) which contains four 32-bit elements. Let VM:N represent each 16-bit (V1/V2) or 32-bit (V3/V4) element in the four registers, with "M" representing the register number, and "N" representing the element within the register, starting from the right (0). Such a multiply-sum instruction computes: ##EQU1##
This type of multiply/accumulate instruction can be very useful in certain applications, but there is an inherent difficulty in using the instruction if the full 16-bits of V1 and V2 are used since the sum of products V1:0*V2:0+V1:1*V2:1+V3:0 may overflow. Additionally, in multi-precision computations, a large number of products often need to be added or accumulated, and a SIMD sequence for doing this computation often requires three instructions: an add to produce a carry; an add modular for the result; and an add accumulating the carry. These three instructions cumulatively take three clock cycles on many architectures.
There are many reasons to perform multiple-precision arithmetic. One extremely important use for this type of arithmetic these days is in cryptography, such as in performing modular exponentials such as are required in RSA. In computing these modular exponentials, there is a need for many computations of the form A*B mod C, where C is fixed, and many different A and B are utilized. In common scenarios, A, B, and C are either 512-bit or 1024-bit numbers.
FIG. 1 is a diagram illustrating extended precision multiplication. An extended precision multiplier 12 consists of four 8-element groups. If each element is a 16-bit quantity, then each group is a 128-bit extended word, and the multiplier 12 is a 512-bit operand. The multiplier 12 is multiplied by one eight-element group representing a 128-bit multiplicand 14. Each element in the multiplier 12 and in the multiplicand 14 is labeled by an upper number indicating the group and a lower number representing the element within the group. The figure illustrates which elements are multiplied together to generate portions of the ultimate product 18. Sixteen sets of partial products 16 are shown with each partial product aligned according to its contribution to the entire product. Each partial product is labeled by a top row indicating the group number and element number of an element in the multiplier 12 and a bottom row indicating the group number and element number of an element in the multiplicand 14. The top, set of partial products, consists of the even numbered elements in the multiplier 12 multiplied by the first, low order element, of the multiplicand 14. Starting from the right, a least significant 32-bit partial product consists of the product of the multiplier 12 least significant element (00) multiplied by multiplicand 14 element (00). The next most significant 32-bit partial product consists of a 32-bit product of the second low order (02) multiplier 12 element multiplied by the low order multiplicand 14 element (00). This extends up through the high order multiplier 12 element (36) multiplied by the low order multiplicand 14 element (00). The next line down contains the odd-numbered elements from the multiplier 12 multiplied by the low order (00) multiplicand 14 offset by 16 bits to the left. Note should be made that this lines up with the second element in either the multiplicand 14 or the multiplier 12. The rightmost 32-bit is the 32-bit product of the (01) element of the multiplier 12 and the (00) element of the multiplicand 14. This extends up through the high order (37, 00) 32-bit product. Following this line, and equally indented, is the set of partial products formed by multiplying the even-numbered elements of the multiplier 12 by the second (01) multiplicand 14 element. This continues, with each pair of partial products being offset by 16-bits to the left over that of the previous lower order terms, until the final partial product which consists of products of the odd-numbered multiplier 12 elements multiplied by the high order (07) multiplicand 14 elements. The 16 partial products are then summed to generate the product 18. Since the multiplier 12 consists of eight 128-bit groups and the multiplicand 14 consists of one 128-bit group, the product 18 consists of five 128-bit groups. It can be seen, that the offset is provided in order to line up corresponding bits and elements for the summation.