1. Field of the Invention
The present invention relates to instructions for matrix operations in a microprocessor. More specifically, the present invention relates to instructions for matrix operations that operate on two-dimensional and three-dimensional representations of matrix data in a matrix processing engine.
2. Description of the Related Art
A Single Instruction, Multiple Data (SIMD) matrix processor can efficiently perform matrix-based computations by breaking large matrices up into smaller sub-matrices. Unfortunately, memory hierarchies usually support only accesses of contiguous bytes (a vector), rather than the 2-dimensional structured access required for a sub-matrix. The block4 and block4v instructions of this invention perform simultaneous rearrangement of data in four matrix registers, transforming the data between vector and matrix representations. This allows efficient conversion between the in-memory representation of an arbitrary A×B matrix and the sub-matrix size supported by the matrix processor. These conversion operations can also be applied to more general data shuffling problems such as FFT address reversal.
Many communications and signal-processing algorithms are based upon matrix computations such as matrix-matrix multiplication. These computations are most efficiently performed by partitioning arbitrarily-sized matrices into fixed-size sub-matrices, and then performing the computations using those sub-matrices as basic computation units.
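By way of illustration only, the partitioning described above can be sketched in software as a blocked matrix-matrix multiplication. The function name blocked_matmul and the constant T are hypothetical names chosen for this sketch, and the sketch assumes that every matrix dimension is a multiple of the sub-matrix size. It models the arithmetic only, not any particular hardware.

```python
T = 4  # sub-matrix edge length (assumption: 4x4 sub-matrices)


def blocked_matmul(a, b):
    """Multiply a (n x k) by b (k x m), one T x T sub-matrix at a time.

    Assumes n, k and m are all multiples of T.
    """
    n, k, m = len(a), len(b), len(b[0])
    assert n % T == 0 and k % T == 0 and m % T == 0
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, T):          # sub-matrix row of the result
        for j0 in range(0, m, T):      # sub-matrix column of the result
            for k0 in range(0, k, T):  # sub-matrix index along the shared dimension
                # One T x T by T x T multiply-accumulate: the basic computation unit.
                for i in range(i0, i0 + T):
                    for j in range(j0, j0 + T):
                        acc = c[i][j]
                        for p in range(k0, k0 + T):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Each pass through the innermost three loops touches exactly one sub-matrix of each operand, which is the unit of work that a fixed-size matrix processor would perform.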
A matrix processor with 16 identical processing elements can be arranged as a 2-dimensional array whose shape matches that of a 4×4 sub-matrix. The processing elements are connected with a row-and-column mesh network to directly perform matrix computations on the sub-matrices. Each processing element has a set of registers, with each register holding the sub-matrix element that corresponds to the position of the processing element in the row/column mesh. The individual register files, taken together, form a set of matrix registers, each holding a 4×4 sub-matrix.
In memory, an arbitrarily-sized A×B matrix comprises A rows of B contiguous elements (a vector of size B), with the address of each row beginning at an offset of B elements from the previous row. If this A×B matrix is partitioned into 4×4 sub-matrices, each sub-matrix comprises four rows of four contiguous elements (a vector of size 4), with the address of each row beginning at an offset of B elements from the previous row.
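The addressing described above can be sketched as follows. The helper names submatrix_row_offsets and gather_submatrix are hypothetical, offsets are counted in elements rather than bytes, and the matrix is assumed to be stored row by row in a flat list.

```python
def submatrix_row_offsets(B, block_row, block_col, tile=4):
    """Element offsets at which each row of one tile x tile sub-matrix begins.

    B is the number of elements per row of the full A x B matrix.
    block_row and block_col index the sub-matrix within the grid of sub-matrices.
    """
    first = (block_row * tile) * B + block_col * tile
    # Successive rows of the sub-matrix are B elements apart in memory.
    return [first + r * B for r in range(tile)]


def gather_submatrix(flat, B, block_row, block_col, tile=4):
    """Collect one sub-matrix as `tile` separate contiguous reads of `tile` elements."""
    return [flat[start:start + tile]
            for start in submatrix_row_offsets(B, block_row, block_col, tile)]
```

For an 8×8 matrix (B = 8), the sub-matrix at block position (1, 1) begins at element offsets 36, 44, 52 and 60. These are four separate contiguous runs, each B elements after the previous one.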
Since memory systems (including caches) are normally designed to transfer a contiguous set of bytes for each request, transferring a 4×4 sub-matrix directly between memory and a matrix register requires four independent memory operations. This requires either four sequential accesses or a multi-ported memory that can handle four requests simultaneously.
To reduce the number of independent memory transfers and improve performance, this invention transfers four vectors of length 16 (4×4) between memory and the four matrix registers. This invention then simultaneously rearranges the vector data in those four matrix registers into four 4×4 sub-matrices using the block4 and/or block4v instruction. The block4 and block4v instructions of this invention are found in the FASTMATH ADAPTIVE SIGNAL PROCESSOR matrix computing engine that is developed by Intrinsity, Inc., the assignee of this invention.
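The effect of such a rearrangement can be modeled in software as shown below. This is a sketch under stated assumptions and not a definition of the block4 or block4v instruction, whose exact semantics are not given in this section. It assumes that each of the four registers initially holds 16 contiguous elements from one of four consecutive matrix rows, and that after rearrangement register k holds, in row-major order, the 4×4 sub-matrix formed by columns 4k through 4k+3 of those rows. The function names vectors_to_blocks and blocks_to_vectors are hypothetical.

```python
def vectors_to_blocks(regs):
    """Rearrange four 16-element row vectors into four 4x4 sub-matrices.

    Assumed layout (illustrative only): regs[row][c] is column c of matrix row `row`;
    result[k][4 * row + col] is element (row, col) of sub-matrix k.
    """
    assert len(regs) == 4 and all(len(r) == 16 for r in regs)
    return [[regs[row][4 * k + col] for row in range(4) for col in range(4)]
            for k in range(4)]


def blocks_to_vectors(blocks):
    """Inverse rearrangement: four 4x4 sub-matrices back to four row vectors."""
    assert len(blocks) == 4 and all(len(b) == 16 for b in blocks)
    return [[blocks[k][4 * row + col] for k in range(4) for col in range(4)]
            for row in range(4)]
```

Under these assumptions, four contiguous 16-element transfers followed by one rearrangement deliver four complete sub-matrices, whereas loading the same four sub-matrices one at a time would take sixteen 4-element transfers.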