Recent multimedia CPUs operate in parallel on 128b vectors, partitioned into 8b to 16b elements. Exemplars of these designs are described in Craig Hansen, "MicroUnity's MediaProcessor Architecture," IEEE Micro, 16(4):34-41, August 1996, and Keith Diefendorff, "Pentium III = Pentium II + SSE," Microprocessor Report, 13(3):1, 6-11, March 1999. These designs perform arithmetic operations, such as addition and multiplication, on values partitioned into vectors. The operations are performed by functional units in which the hardware employed to perform the operation (an adder for an add operation, or a multiplier for a multiply operation) is in turn partitioned so as to perform vector operations of the specified element size. A vector add needs only to AND the carries between elements so that they do not propagate across element boundaries, but a vector multiply leaves idle all of the product array except a single stripe one element wide. Thus, a 128b×128b multiplier, when performing a vector multiplication on 8b operands, employs only the resources of an 8b×128b multiplier, leaving resources the size of a 120b×128b multiplier idle, or performing a mathematically trivial operation, such as multiplication by zero.
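The two points above, carry gating in a partitioned adder and the low utilization of a partitioned multiplier, can be illustrated with a minimal software sketch. It is a bit-level model written for illustration only, not a description of any cited hardware; the function names partitioned_add and multiplier_utilization are hypothetical, and the adder is modeled as a simple ripple-carry chain, which is an assumption.

```python
WIDTH = 128  # vector width in bits, as in the text


def partitioned_add(a, b, elem_bits, width=WIDTH):
    """Add two width-bit vectors element by element (illustrative model).

    A ripple-carry adder is modeled bit by bit. At each element boundary
    the incoming carry is ANDed with 0, so nothing propagates from one
    element into the next; each element wraps modulo 2**elem_bits.
    """
    result = 0
    carry = 0
    for bit in range(width):
        # keep is 0 at the first bit of every element, 1 elsewhere
        keep = 0 if bit % elem_bits == 0 else 1
        carry &= keep
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit
        carry = (x & y) | (carry & (x ^ y))
    return result


def multiplier_utilization(elem_bits, width=WIDTH):
    """Fraction of a width x width product array used by a vector multiply.

    Only a stripe elem_bits x width does useful work.
    """
    return (elem_bits * width) / (width * width)


# 0xFF + 0x01 wraps inside an 8b element and leaves the next element alone
print(hex(partitioned_add(0x01FF, 0x0101, 8)))   # 0x200
# With 16b elements the same carry is allowed to propagate
print(hex(partitioned_add(0x00FF, 0x0001, 16)))  # 0x100
print(multiplier_utilization(8))                 # 0.0625
```

With 8b elements the model shows one sixteenth of the product array in use, which matches the 8b×128b stripe of a 128b×128b multiplier described above.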
As the hardware resources for a multiplier capable of performing a 128b×128b multiplication are considerably larger than those of a 128b+128b adder, the lower utilization of the multiplier when performing vector multiplications of a smaller element size is of considerable concern in designing an efficient processor. While one approach to designing an efficient multiplier is to limit its size to a smaller strip that can perform vector multiplications only of small elements in a single pipeline flow, the present invention instead aims to make efficient use of a large 128b×128b multiplier array pipeline by performing a vector-matrix product.
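To make the contrast concrete, the following sketch counts element multiplications under an assumed shape: a 128b vector of n elements multiplied by an n×n matrix of elements of the same size. The text does not state the operand dimensions, so this shape, along with the names vector_matrix_product and element_multiplies, is a hypothetical chosen for illustration.

```python
def vector_matrix_product(vec, mat):
    """Return vec * mat, where vec has n elements and mat has n rows.

    Result element j is the sum over i of vec[i] * mat[i][j].
    """
    n = len(vec)
    assert len(mat) == n, "matrix must have one row per vector element"
    m = len(mat[0])
    assert all(len(row) == m for row in mat), "matrix rows must be equal length"
    return [sum(vec[i] * mat[i][j] for i in range(n)) for j in range(m)]


def element_multiplies(elem_bits, width=128, matrix=True):
    """Count element-by-element multiplies in one operation.

    Assumption: the vector holds width // elem_bits elements, and the
    matrix (if any) is square with that many rows and columns.
    """
    lanes = width // elem_bits
    return lanes * lanes if matrix else lanes


print(vector_matrix_product([1, 2], [[3, 4], [5, 6]]))  # [13, 16]
print(element_multiplies(8, matrix=False))  # 16 multiplies for a vector multiply
print(element_multiplies(8, matrix=True))   # 256 multiplies for a vector-matrix product
```

Under this assumption, a vector-matrix product on 8b elements performs 256 element multiplications, the same as the number of 8b×8b blocks in a 128b×128b product array (128×128 / (8×8) = 256), whereas a plain vector multiply performs only 16.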