The present invention relates to an instruction set and data paths of processors which perform fixed-point and floating-point multiply and add operations, and particularly processors which perform both multiply and add operations as a result of a single instruction.
A general-purpose processing system which performs multiply and add operations may allow these arithmetic operations to be performed at varying precision. High-precision operations generally consume greater circuit resources than low-precision operations. For example, in order to double the precision of a multiply operation, about four times as many circuits are required if the same performance is to be achieved.
A multiplier array which is capable of performing a multiply of two 64-bit operands, without reusing the array in sequential fashion, must generate the equivalent of 64², or 4096, bits of binary product (a 1-bit multiply is the same as a Boolean or binary "and" operation), and reduce the product bits in an array of binary adders which produces 128 bits of result. As a single binary adder (a full adder) takes in three inputs and produces two outputs, each adder reduces the number of bits by one, so the number of binary adders required for such an array can be computed as 64² − 128, or 3968.
There are well-known techniques for reducing the number of product bits, such as Booth encoding. There are also well-known techniques for performing the required add operations so as to minimize delay, such as the use of arrays of carry-save adders. These techniques can reduce the size of multiplier arrays and the delay of addition arrays; however, they do not appreciably change the relation between the size of the operand and the size of the multiplier and adder arrays.
Using the same arithmetic as before, a multiply of 32-bit operands generates the equivalent of 32², or 1024, bits of binary product, and uses 32² − 64, or 960, full adders to generate a 64-bit product. This is approximately one-fourth the resources required for a multiply of 64-bit operands.
Because the product of 32-bit operands is 64 bits, while the product of 64-bit operands is 128 bits, one can perform two 32-bit multiplies which produce two 64-bit products, giving a 128-bit result. As such, because the 32-bit product uses one-fourth the resources of the 64-bit product, these two 32-bit products use one-half the resources of the 64-bit product. Continuing this computation, four 16-bit products use one-quarter of the 64-bit multiplier resources, eight 8-bit products use one-eighth of the resources, and so forth.
Thus, while this technique produces results with the same number of bits as the 64-bit product, decreasing the symbol size results in a proportionately decreasing utilization of the multiplier and adder array resources. Clearly, a design that has sufficient resources for a 64-bit multiply will be under-utilized for multiplies on smaller symbols.
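By way of illustration only, the resource arithmetic of the preceding paragraphs may be checked with a short Python sketch. The function names full_adders and group_utilization are hypothetical and form no part of the disclosure, and the model deliberately ignores Booth encoding and carry-save arrangements.

```python
def full_adders(n: int) -> int:
    """Full adders needed to reduce n*n product bits to a 2n-bit result.

    Each full adder takes three bits in and gives two out, so it removes
    one bit; going from n*n bits to 2n bits takes n*n - 2n adders.
    """
    return n * n - 2 * n


def group_utilization(symbol_bits: int, result_bits: int = 128) -> float:
    """Fraction of a 64x64 array's product bits used when the result is
    filled with independent products of symbol_bits-wide symbols."""
    count = result_bits // (2 * symbol_bits)  # products that fit in the result
    return count * symbol_bits ** 2 / 64 ** 2


for bits in (64, 32, 16, 8):
    print(bits, full_adders(bits), group_utilization(bits))
```

For 64, 32, 16 and 8-bit symbols the utilization evaluates to 1, one-half, one-quarter and one-eighth respectively, matching the figures stated above.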
Accordingly, there exists a need for a method, instruction set and system in which a set of multiplier and adder circuit resources may be employed in a manner that increases the utilization of these resources for performing several multiply and add operations at once as a result of executing an instruction, and which also permits the expansion of the multiplier and adder circuit resources to an even higher level so as to further increase overall performance.
The present invention relates to a method, instruction, and system which improves the utilization of a multiplier and adder array for performing multiply and add operations at a lower precision than the full word size of the processor and particularly the multiplier and adder array.
In accordance with an exemplary embodiment of the present invention, a novel group-multiply-and-sum instruction is performed wherein operands which are the word size of the processor, for example, 128 bits, are divided into symbols where the symbols are 64, 32, 16, 8, 4, 2, or 1 bit. Multiplier and multiplicand symbols are then multiplied together, and the products are added together so as to produce a single scalar result. The instruction performs twice as many multiplies as a group-multiply-and-add instruction (as described in related U.S. patent application Ser. No. 08/516,036, hereinafter referred to as the parent application) of the same symbol size. The instruction also avoids fixed-point overflows, because in the current example, the 128-bit result is large enough to hold the sum.
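By way of illustration only, the group-multiply-and-sum operation may be modeled in software as follows. This Python sketch assumes unsigned symbols ordered from the least significant end of the operand; the names split_symbols and group_multiply_sum are hypothetical, and neither the symbol ordering nor the signedness is taken from the disclosure.

```python
def split_symbols(value: int, width: int, symbol_bits: int) -> list[int]:
    """Split a width-bit value into unsigned symbols, least significant first."""
    mask = (1 << symbol_bits) - 1
    return [(value >> shift) & mask for shift in range(0, width, symbol_bits)]


def group_multiply_sum(a: int, b: int, symbol_bits: int, width: int = 128) -> int:
    """Multiply corresponding symbols of a and b and sum all the products."""
    xs = split_symbols(a, width, symbol_bits)
    ys = split_symbols(b, width, symbol_bits)
    total = sum(x * y for x, y in zip(xs, ys))
    return total & ((1 << width) - 1)  # model a width-bit result register


# Three non-zero 16-bit symbols per operand: 1*4 + 2*5 + 3*6
a = 0x0003_0002_0001
b = 0x0006_0005_0004
print(group_multiply_sum(a, b, 16))  # 32
```

With 16-bit symbols, each of the eight products is below 2^32, so their sum is below 2^35 and fits in the 128-bit result with ample headroom.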
In another embodiment of the present invention, a novel group-multiply-and-sum-and-add instruction is performed, wherein two operands are divided into symbols and then multiplied together. All the products resulting therefrom are then added together, along with a third operand value, so as to produce a single scalar result. The instruction performs twice as many multiplies as a group-multiply-and-add instruction (as described in the parent application) of the same symbol size.
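By way of illustration only, the third operand may serve as a running accumulator, so that a long dot product is formed by chaining the operation across successive 128-bit words. The Python sketch below assumes unsigned symbols ordered from the least significant end; the names group_multiply_sum_add and dot_product are hypothetical, and the chaining pattern is an assumed usage rather than one stated in the disclosure.

```python
def split_symbols(value: int, width: int, symbol_bits: int) -> list[int]:
    """Split a width-bit value into unsigned symbols, least significant first."""
    mask = (1 << symbol_bits) - 1
    return [(value >> shift) & mask for shift in range(0, width, symbol_bits)]


def group_multiply_sum_add(a: int, b: int, c: int, symbol_bits: int,
                           width: int = 128) -> int:
    """Sum the products of corresponding symbols of a and b, then add c."""
    xs = split_symbols(a, width, symbol_bits)
    ys = split_symbols(b, width, symbol_bits)
    total = c + sum(x * y for x, y in zip(xs, ys))
    return total & ((1 << width) - 1)  # model a width-bit result register


def dot_product(words_a: list[int], words_b: list[int], symbol_bits: int) -> int:
    """Accumulate a dot product over several 128-bit words (assumed usage)."""
    acc = 0
    for a, b in zip(words_a, words_b):
        acc = group_multiply_sum_add(a, b, acc, symbol_bits)
    return acc


# Symbols (1, 2) and (3, 4) against all-ones symbols: 1 + 2 + 3 + 4
print(dot_product([0x0002_0001, 0x0004_0003], [0x0001_0001, 0x0001_0001], 16))  # 10
```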
In another embodiment of the present invention, a novel group-complex-multiply instruction is performed, wherein the 64-bit multiplier and multiplicand operands are divided into symbols. Alternate symbols are taken to represent real parts (a and c) and imaginary parts (b and d) of a complex value, and a computation (a+bi)*(c+di)=(ac-bd)+(bc+ad)i is performed. The instruction performs twice as many multiplies as a group-multiply instruction (as described in the parent application) of the same symbol size, and in the current embodiment generates a result which is a 128-bit value.
In another embodiment of the present invention, a novel group-complex-multiply-and-add instruction is performed, wherein two 64-bit operands are divided into complex-valued symbols, and a third 128-bit operand is divided into complex-valued symbols of twice the symbol size. The computation (a+bi)*(c+di)+(e+fi)=(ac-bd+e)+(bc+ad+f)i is performed. The result is a 128-bit value.
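By way of illustration only, the complex computations described above may be modeled as follows. This Python sketch assumes two's-complement signed symbols, with the real part in the lower symbol of each pair and results wrapped to twice the symbol size; these choices, and the names to_signed, split_signed, pack and group_complex_multiply_add, are assumptions and form no part of the disclosure. With the third operand left at zero, the sketch reduces to the group-complex-multiply computation.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned bits-wide field as two's complement."""
    return value - (1 << bits) if value >> (bits - 1) else value


def split_signed(value: int, width: int, symbol_bits: int) -> list[int]:
    """Split a width-bit value into signed symbols, least significant first."""
    mask = (1 << symbol_bits) - 1
    return [to_signed((value >> shift) & mask, symbol_bits)
            for shift in range(0, width, symbol_bits)]


def pack(symbols: list[int], symbol_bits: int) -> int:
    """Pack symbols (wrapped to symbol_bits each) into a single integer."""
    mask = (1 << symbol_bits) - 1
    out = 0
    for i, s in enumerate(symbols):
        out |= (s & mask) << (i * symbol_bits)
    return out


def group_complex_multiply_add(x: int, y: int, symbol_bits: int, z: int = 0) -> int:
    """Compute (a+bi)*(c+di)+(e+fi) for each complex pair of 64-bit x and y."""
    xs = split_signed(x, 64, symbol_bits)
    ys = split_signed(y, 64, symbol_bits)
    zs = split_signed(z, 128, 2 * symbol_bits)  # addend symbols are twice as wide
    out = []
    for k in range(0, len(xs), 2):
        a, b = xs[k], xs[k + 1]  # real part assumed in the lower symbol
        c, d = ys[k], ys[k + 1]
        e, f = zs[k], zs[k + 1]
        out.append(a * c - b * d + e)  # real part
        out.append(b * c + a * d + f)  # imaginary part
    return pack(out, 2 * symbol_bits)


x = pack([1, 2, 3, 4], 16)  # (1+2i), (3+4i)
y = pack([5, 6, 7, 8], 16)  # (5+6i), (7+8i)
z = pack([10, 20, 30, 40], 32)  # (10+20i), (30+40i)
print(split_signed(group_complex_multiply_add(x, y, 16), 128, 32))     # [-7, 16, -11, 52]
print(split_signed(group_complex_multiply_add(x, y, 16, z), 128, 32))  # [3, 36, 19, 92]
```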
In yet another embodiment of the present invention, a novel group-convolve instruction is performed, wherein all but one of the symbols of a 128-bit value are multiplied with symbols of a 64-bit value. Certain of these products are summed together to form a 64-bit-by-64-bit slice of a convolution. The result is a 128-bit value.
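By way of illustration only, one possible reading of the group-convolve computation is sketched below in Python. The summary does not state which symbol is left unused or how the products are indexed, so the sketch assumes unsigned symbols, assumes the most significant symbol of the 128-bit value is the unused one, and wraps each sum to twice the symbol size. The names split_symbols, pack and group_convolve are hypothetical.

```python
def split_symbols(value: int, width: int, symbol_bits: int) -> list[int]:
    """Split a width-bit value into unsigned symbols, least significant first."""
    mask = (1 << symbol_bits) - 1
    return [(value >> shift) & mask for shift in range(0, width, symbol_bits)]


def pack(symbols: list[int], symbol_bits: int) -> int:
    """Pack symbols (wrapped to symbol_bits each) into a single integer."""
    mask = (1 << symbol_bits) - 1
    out = 0
    for i, s in enumerate(symbols):
        out |= (s & mask) << (i * symbol_bits)
    return out


def group_convolve(a: int, b: int, symbol_bits: int) -> int:
    """Slice of the convolution of 128-bit a with 64-bit b (assumed indexing)."""
    xs = split_symbols(a, 128, symbol_bits)  # 2n symbols
    ys = split_symbols(b, 64, symbol_bits)   # n symbols
    n = len(ys)
    out = []
    for i in range(n):
        # Output i is convolution term i + n - 1; indices into xs run from
        # 0 to 2n - 2, so the top symbol xs[2n - 1] is never used.
        out.append(sum(xs[i + n - 1 - j] * ys[j] for j in range(n)))
    return pack(out, 2 * symbol_bits)  # n results of twice the symbol size


a = pack([1, 2, 3, 4, 5, 6, 7, 8], 16)
b = pack([1, 1, 1, 1], 16)
print(split_symbols(group_convolve(a, b, 16), 128, 32))  # [10, 14, 18, 22]
```

Under these assumptions, n symbols of the 64-bit value meet 2n - 1 symbols of the 128-bit value in n² multiplies, and the n double-width sums exactly fill the 128-bit result.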
As described in detail below, the present invention provides important advantages over the prior art. Most importantly, the present invention optimizes both system performance and overall power efficiency. The present invention performs a greater number of multiply operations and add operations in a single instruction without increasing the size of the result of this single instruction. The present invention arranges these operations in a manner which is advantageous both for the implementation of digital signal processing algorithms, as the instructions perform these operations with greater parallelism and greater avoidance of arithmetic overflow, and for the implementation of the multiplier itself, as these multipliers are formed from a partitioning of a single multiplier array, thereby overcoming significant disadvantages suffered by prior art devices as detailed above.
Additional advantages of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments, which exemplify the best mode of carrying out the invention.
The invention itself, together with further objects and advantages, can be better understood by reference to the following detailed description and the accompanying drawings.