In mathematics, a matrix is a two-dimensional (2D) array of numerical elements. We extend this definition to any regular 2D array or collection of numbers, characters, ordered pairs, or simply binary values. This invention also includes a generalization to a 3D or higher-dimensional array or collection of binary values, referred to here as a matroid. This invention considers matroids to be stored as layers of 2D matrices in a stacked semiconductor device or, alternatively, as multiple 2D matrices along the third spatial dimension. As is commonly understood, a vector is a matrix with one column or one row.
Parallel processing of arithmetic vectors in the SIMD (Single Instruction, Multiple Data) paradigm is well-established prior art. It involves vectors of numbers stored in vector registers, such that one or more vector registers are used in a vector computation much as scalar numbers are used in a scalar computation. In prior art, a plurality of numbers may be stored as a vector in a register file or a memory such as those shown in FIGS. 1 and 2. They are read along one interface of bit-lines of the register file and presented to several identical computing elements. In prior art, a matrix may be stored using multiple vector registers, where each row of the matrix (row major) can be read at the interface as row data for computation. Alternatively, in prior art, the matrix is stored as one column per vector register (column major) and read one matrix column at a time for computation. In prior art, a matrix stored by its rows is not readable by its columns along its data interface. In such a case, only one row's worth of matrix elements can be directly accessed in any computation. Otherwise, a complicated transformation of the row-major (or column-major) matrix to its transpose is needed.
Prior art uses a register file or a multi-port RAM to store binary values, numbers, or characters as operands for computation. In prior art, a plurality of bits of a numerical value (i.e., a number) are stored in a single string of RAM cells forming a register in a register file. When accessed, all the bits of the register are addressed using a word-line and are available at the same time. Vector values are stored in longer registers, which hold a plurality of scalar binary values that are accessible using a common address and are available at the same time. As an example, in prior art, a 128-bit vector register can be divided to hold sixteen 8-bit byte values, eight 16-bit short integers, four 32-bit integers, or two 64-bit long integers.
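The lane subdivision described above can be illustrated with a minimal Python sketch. The 16-byte value `register`, the helper `lanes`, and the choice of little-endian unsigned lanes are assumptions made for illustration only; they do not describe any particular processor.

```python
import struct

# Hypothetical 128-bit vector register, modeled as 16 raw bytes (0x00 .. 0x0F).
register = bytes(range(16))

def lanes(reg: bytes, fmt_char: str) -> tuple:
    """Reinterpret the register bytes as equal-width, little-endian unsigned lanes."""
    width = struct.calcsize("<" + fmt_char)   # lane width in bytes
    count = len(reg) // width                 # number of lanes that fit
    return struct.unpack("<" + str(count) + fmt_char, reg)

as_bytes = lanes(register, "B")    # sixteen 8-bit values
as_shorts = lanes(register, "H")   # eight 16-bit values
as_ints = lanes(register, "I")     # four 32-bit values
as_longs = lanes(register, "Q")    # two 64-bit values
```

In every view the same 128 bits are addressed together and become available at once; only the interpretation of the lanes differs.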
The vector register file, or any register file in prior art, uses a set of word-lines to address its individual registers, and the values in the cells of the addressed register are read out onto an interface of bit-lines. This limits the prior art: a column vector of elements of a register file cannot be read out in a single operation so that computations can be performed directly on those elements collectively. It also means that a single simple assembly language instruction cannot be used to directly access and process a column of a matrix or array that is stored as row vectors in a vector register file. Analogously, it is not possible to use a set of simple assembly language instructions to directly access and process a row of numbers in a matrix stored by its columns in a vector register file. Such a limitation, for instance, does not allow direct computation of the product of a matrix with itself. Such a computation can be performed only after reading the matrix as a series of row vectors, using several program steps to obtain the column vectors by extracting individual column elements from each row vector, and then performing the multiplication. Alternatively, it requires a transformation of the matrix or the creation of a transposed copy of the matrix to carry out the product operation.
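The extra steps described above can be sketched in Python. This is an illustrative software model only: `register_file`, `read_row`, `gather_column`, and `square` are hypothetical names, and each inner list stands in for one vector register holding one matrix row.

```python
# Hypothetical row-major register file: one vector register per matrix row.
register_file = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

def read_row(regs, i):
    """A single word-line access yields a whole row."""
    return regs[i]

def gather_column(regs, j):
    """A column needs one register read per row, plus an element extract each time."""
    return [read_row(regs, i)[j] for i in range(len(regs))]

def square(regs):
    """Multiply the matrix by itself by first building a transposed copy."""
    n = len(regs)
    columns = [gather_column(regs, j) for j in range(n)]  # transposed copy
    return [
        [sum(a * b for a, b in zip(read_row(regs, i), columns[j])) for j in range(n)]
        for i in range(n)
    ]

result = square(register_file)
```

Here `gather_column` performs as many register reads as the matrix has rows, which is the overhead that row-only access imposes before the product can be formed.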
This invention describes a mechanism that eliminates the above-mentioned limitations on storing and accessing a matrix or array of numbers or binary-valued words in a processing unit, for computations that require accessing elements along both the rows and the columns of the matrices or arrays.
Systolic Array and Tile Architectures (Prior Art)
Systolic array architectures are Multiple Instruction, Multiple Data (MIMD) machines that use multiple processors arranged in a grid-like fashion. Each processor element of the systolic array performs computation on a matrix element and forwards the result to the succeeding element. Computations can proceed in several ways, as listed in the references. All of these architectures use some form of compute-and-forward process, moving the result from one storage and compute element to another. The storage is distributed across all the computing elements. These architectures often need a crossbar or other network-on-chip fabric. Systolic arrays are not simple memory arrays, and hence they are less dense and consume more power than memories. Further, they use an MIMD programming paradigm or a data-flow computing paradigm and may not be suitable for use inside regular processing units.
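As a rough illustration of the compute-and-forward process, the following Python sketch simulates one common arrangement, in which each processing element keeps a running sum while operands move one element to the right or downward on every step. The specific dataflow, the function name `systolic_matmul`, and the register names are assumptions chosen for illustration; the references describe several other variants.

```python
def systolic_matmul(A, B):
    """Simulate an n x n grid of processing elements computing A times B."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # per-element accumulator (distributed storage)
    a_reg = [[0] * n for _ in range(n)]  # operands travelling left to right
    b_reg = [[0] * n for _ in range(n)]  # operands travelling top to bottom

    for t in range(3 * n - 2):           # steps needed to drain the skewed inputs
        # Forward A operands one element to the right; feed the left edge (skewed by row).
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        # Forward B operands one element downward; feed the top edge (skewed by column).
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every element multiplies the operands it currently holds and accumulates.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each element in this model needs its own operand registers and accumulator, which reflects the distributed storage noted above.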
These limitations are addressed in the current invention as described below.