Recent multimedia CPUs operate in parallel on 128 b vectors, partitioned into 8-16 b elements. Exemplars of these designs are described in Craig Hansen. MicroUnity's MediaProcessor Architecture. IEEE Micro, 16(4):34-41, August 1996, and Keith Diefendorff. Pentium III = Pentium II + SSE. Microprocessor Report, 13(3):1, 6-11, March 1999. These designs perform arithmetic operations, such as addition and multiplication, on values partitioned into vectors. The operations are performed by functional units in which the hardware employed to perform the operation (an adder for an add operation, or a multiplier for a multiply operation) is in turn partitioned so as to perform vector operations of the specified element size. Vector adds need only gate the carries between elements with an AND, but vector multiplies idle all but a single stripe, one element wide, through the product array. Thus, a 128 b×128 b multiplier, when performing a vector multiplication on 8 b operands, employs only the resources of an 8 b×128 b multiplier, leaving resources of the size of a 120 b×128 b multiplier idle or performing a mathematically trivial operation, such as multiplication by zero.
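The contrast between partitioned addition and partitioned multiplication can be illustrated with a short software model. The following is a minimal sketch, not a description of any cited design: the function names `partitioned_add` and `multiplier_utilization` are hypothetical, Python integers stand in for 128 b registers, and the utilization figure simply counts partial-product cells under the assumption that each element multiply occupies an element-by-element square of the array.

```python
WIDTH = 128  # vector width in bits, as in the text


def partitioned_add(a: int, b: int, elem_bits: int, width: int = WIDTH) -> int:
    """Add two packed vectors element-wise, blocking carries between elements.

    Software analogue of gating the carry chain at element boundaries:
    the top bit of every element is cleared before the wide add, so no
    carry can cross into the neighbouring element.
    """
    assert width % elem_bits == 0
    lanes = width // elem_bits
    # Mask with the most significant bit of every element set.
    high = sum(1 << (i * elem_bits + elem_bits - 1) for i in range(lanes))
    low = ((1 << width) - 1) ^ high
    # Wide add of the lower bits only; carries stop at each element's top bit.
    partial = (a & low) + (b & low)
    # Fold the top bits back in with XOR, discarding each element's carry-out.
    return partial ^ ((a ^ b) & high)


def multiplier_utilization(elem_bits: int, width: int = WIDTH) -> float:
    """Fraction of a width x width product array used by a vector multiply.

    Assumes (for illustration) that each of the width/elem_bits elements
    occupies an elem_bits x elem_bits block of partial-product cells.
    """
    lanes = width // elem_bits
    return lanes * elem_bits * elem_bits / (width * width)


if __name__ == "__main__":
    # Element 0 wraps around (0xFF + 0x01) without disturbing element 1.
    print(hex(partitioned_add(0x01FF, 0x0001, elem_bits=8)))
    print(multiplier_utilization(8))
```

Under these assumptions, the 8 b case uses 16 blocks of 8×8 cells, which equals the 8 b×128 b stripe described above, or one sixteenth of the full array.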
As the hardware resources for a multiplier capable of performing a 128 b×128 b multiplication are considerably larger than those of a 128 b+128 b adder, the lower utilization of the multiplier when performing vector multiplications of a smaller element size is of considerable concern in designing an efficient processor. While one approach to designing an efficient multiplier is to limit the size of the multiplier to a smaller strip that can perform vector multiplications only of small elements in a single pipeline flow, the present invention instead aims to make efficient use of a large 128 b×128 b multiplier array pipeline by performing a vector-matrix product.
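To see why a vector-matrix product can occupy the whole array, it may help to count the element multiplications involved. The sketch below is illustrative only: the names `vector_matrix_product` and `array_cells_used` are hypothetical, and the square 16×16 matrix of 8 b elements is an assumed shape chosen for the example, not a dimension stated in this section.

```python
def vector_matrix_product(vec, mat):
    """Return vec x mat, where mat is a list of rows and len(mat) == len(vec)."""
    n = len(vec)
    assert len(mat) == n and all(len(row) == len(mat[0]) for row in mat)
    cols = len(mat[0])
    # Each output element is the dot product of vec with one matrix column.
    return [sum(vec[i] * mat[i][j] for i in range(n)) for j in range(cols)]


def array_cells_used(n_elems: int, elem_bits: int) -> int:
    """Partial-product cells needed for an n x n matrix of elem_bits elements.

    Assumes every element multiply occupies elem_bits x elem_bits cells.
    """
    return n_elems * n_elems * elem_bits * elem_bits


if __name__ == "__main__":
    print(vector_matrix_product([1, 2], [[1, 2], [3, 4]]))
    # 16 x 16 element multiplies of 8 b x 8 b each, versus a 128 x 128 array.
    print(array_cells_used(16, 8), 128 * 128)
```

Under these assumptions, a 16-element vector of 8 b values multiplied by a 16×16 matrix requires 256 element multiplies, whose partial products together fill as many cells as a 128 b×128 b array provides, whereas the element-wise vector multiply uses only 16 of them.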