The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Matrix multiplication is widely used in many practical applications across various industries. For example, in the field of machine learning, matrix multiplication is used for solving systems of linear equations, for batch training of neural networks, etc.
Referring to FIG. 1, first matrix 100 is multiplied with second matrix 102 to derive product matrix 104. For the sake of clarity and ease of explanation, each matrix is depicted as a square matrix. However, the embodiments disclosed herein are not limited to square matrices.
Matrix multiplication typically involves multiplying each row of a matrix with each column of another matrix. For example, elements 106 correspond to the first row of first matrix 100, and elements 108 correspond to the first column of second matrix 102. Values of elements 106 are multiplied with values of elements 108, and the products are accumulated to derive the value of an element having a position in the first row and first column of product matrix 104. In other words, (1×17)+(2×21)+(3×25)+(4×29)=250.
Similarly, the values of elements 106 are multiplied with values of elements 110, which correspond to the second column of second matrix 102, and the products are accumulated to derive the value of an element having a position in the first row and second column of product matrix 104. In other words, (1×18)+(2×22)+(3×26)+(4×30)=260.
The aforementioned process can be expressed using the following pseudocode:
/* A represents a first multiplicand matrix */
/* B represents a second multiplicand matrix */
/* C represents a product matrix; each element of C is assumed to be initialized to zero */
/* M represents the total number of rows in matrix A */
/* N represents the total number of columns in matrix B */
/* P represents the total number of columns in matrix A */
/* P also represents the total number of rows in matrix B */
/* iterate over each row of A */
for (i = 0; i < M; i = i + 1) {
  /* iterate over each column of B */
  for (j = 0; j < N; j = j + 1) {
    /* iterate over each value in the current row of A and each value in the */
    /* current column of B */
    for (k = 0; k < P; k = k + 1) {
      /* compute a particular product matrix element based on accumulating */
      /* the product of the current value of A and the current value of B */
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
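By way of non-limiting illustration, the pseudocode above may be rendered as the following compilable C sketch. The function name matmul_naive and the array names fig1_a and fig1_b are hypothetical. Only the first row of first matrix 100 and the first two columns of second matrix 102 are stated in the description above; the remaining element values are assumed here solely so that the example is complete.

```c
#include <assert.h>
#include <stddef.h>

/* Dimensions matching the 4x4 matrices of FIG. 1. */
enum { M = 4, N = 4, P = 4 };

/* Computes c = a * b using the three nested loops of the pseudocode.
 * Each element of c is cleared before accumulation begins. */
void matmul_naive(const int a[M][P], const int b[P][N], int c[M][N])
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < M; i++) {          /* each row of a */
        for (size_t j = 0; j < N; j++) {      /* each column of b */
            c[i][j] = 0;
            for (size_t k = 0; k < P; k++) {  /* each row-column element pair */
                c[i][j] += a[i][k] * b[k][j]; /* multiply-accumulate */
            }
        }
    }
}

/* ASSUMPTION: rows 2 through 4 of fig1_a are illustrative values only. */
const int fig1_a[M][P] = {
    { 1,  2,  3,  4},
    { 5,  6,  7,  8},
    { 9, 10, 11, 12},
    {13, 14, 15, 16},
};

/* ASSUMPTION: columns 3 and 4 of fig1_b are illustrative values only. */
const int fig1_b[P][N] = {
    {17, 18, 19, 20},
    {21, 22, 23, 24},
    {25, 26, 27, 28},
    {29, 30, 31, 32},
};
```

Calling matmul_naive(fig1_a, fig1_b, c) on a 4x4 integer array c yields 250 for c[0][0] and 260 for c[0][1], in agreement with the sums worked out above.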
Notably, the pseudocode above involves three loops: an outer loop with two loops successively nested within it. Thus, the pseudocode employs O(n^3) executions of a multiply-accumulate operation, where n is the number of rows (or, equivalently, the number of columns) in each matrix when the matrices are square. As used herein, a multiply-accumulate operation, such as the operation in the innermost loop of the pseudocode above, is an operation that computes the product of two values and adds the product to the value in an accumulator register. Referring to FIG. 1, there are four rows in first matrix 100, there are four columns in second matrix 102, and there are four pairs of elements to be multiplied for each row-column combination. Thus, there are 4^3, or 64, executions of the multiply-accumulate operation for computing all of the elements of product matrix 104.
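The operation count may be checked with a short C sketch that walks the same three loops and increments a counter once per innermost iteration. The function name count_mac_ops is hypothetical and is used only for illustration.

```c
#include <assert.h>

/* Returns the number of multiply-accumulate operations that the
 * triple-nested loop performs when an (m x p) matrix is multiplied
 * with a (p x n) matrix. */
unsigned long count_mac_ops(unsigned m, unsigned n, unsigned p)
{
    unsigned long count = 0;
    for (unsigned i = 0; i < m; i++) {
        for (unsigned j = 0; j < n; j++) {
            for (unsigned k = 0; k < p; k++) {
                count++; /* one multiply-accumulate per innermost iteration */
            }
        }
    }
    /* The loop structure implies exactly m * n * p operations. */
    assert(count == (unsigned long)m * n * p);
    return count;
}
```

For the 4x4 matrices of FIG. 1, count_mac_ops(4, 4, 4) returns 64, that is, four multiply-accumulate operations for each of the sixteen elements of the product matrix.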
Other algorithms with lower complexity bounds exist. For example, Strassen's algorithm has a time complexity of approximately O(n^2.8). However, such algorithms are not as conducive to parallelization and/or require significant overheads when large matrices are involved.
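For context, the following C sketch shows a single Strassen step on 2x2 matrices of scalars, which uses seven multiplications instead of the eight required by the row-by-column method. Applying the same step recursively to matrix blocks is what yields the bound of O(n^log2(7)), approximately O(n^2.807). The function name strassen_2x2 is hypothetical, and the sketch is offered only as an illustration of the general technique, not as part of any approach described herein.

```c
#include <assert.h>
#include <stddef.h>

/* One Strassen step: c = a * b for 2x2 matrices using 7 multiplications. */
void strassen_2x2(const long a[2][2], const long b[2][2], long c[2][2])
{
    assert(a != NULL && b != NULL && c != NULL);

    /* The seven intermediate products. */
    const long m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    const long m2 = (a[1][0] + a[1][1]) * b[0][0];
    const long m3 = a[0][0] * (b[0][1] - b[1][1]);
    const long m4 = a[1][1] * (b[1][0] - b[0][0]);
    const long m5 = (a[0][0] + a[0][1]) * b[1][1];
    const long m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    const long m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    /* Recombine the products using additions and subtractions only. */
    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}
```

The saving of one multiplication comes at the cost of eighteen additions and subtractions per step, together with temporary storage for the intermediate products, which is one source of the overheads noted above.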
In addition to the number of computations performed, the running time for matrix multiplication is also dependent on the memory bandwidth achieved when fetching matrix elements from relatively high-latency memory, such as dynamic random-access memory (DRAM), into relatively low-latency memory, such as static random-access memory (SRAM) or register files, that feeds the units performing the computations.
To optimize matrix multiplication, computations are typically performed concurrently with memory transfers such that the respective running times for computations and for memory transfers overlap. For example, when multiply-accumulate operations are being performed for one set of element values, another set of element values may be prefetched into a register file. However, since the time complexity of performing the multiply-accumulate operations, O(n^3), is greater than the time complexity of performing the memory transfers, which is O(n^2) when each fetched element is reused, matrix multiplication optimized in this way is compute-bound.
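The overlap of computation with memory transfers may be loosely illustrated in software by the following C sketch, which requests element values a fixed distance ahead while accumulating the current products. The sketch substitutes a cache prefetch hint for the register-file prefetch described above: __builtin_prefetch is a GCC and Clang builtin that the hardware is free to ignore, and it is compiled out on other compilers. The function name dot_with_prefetch and the value of PREFETCH_DISTANCE are hypothetical and are not tuned for any particular machine.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: a look-ahead of 16 elements is an arbitrary illustrative value. */
#define PREFETCH_DISTANCE 16

#if defined(__GNUC__) || defined(__clang__)
/* Arguments: address, 0 = prefetch for read, 3 = high temporal locality. */
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)
#endif

/* Accumulates the products of row[k] and col[k] while hinting that the
 * elements PREFETCH_DISTANCE positions ahead will be needed soon. */
long dot_with_prefetch(const int *row, const int *col, size_t len)
{
    assert(row != NULL && col != NULL);
    long acc = 0;
    for (size_t k = 0; k < len; k++) {
        if (k + PREFETCH_DISTANCE < len) {
            PREFETCH_READ(&row[k + PREFETCH_DISTANCE]);
            PREFETCH_READ(&col[k + PREFETCH_DISTANCE]);
        }
        acc += (long)row[k] * col[k]; /* multiply-accumulate */
    }
    return acc;
}
```

The hint changes only when data is brought closer to the execution units, never the numerical result, so the function returns the same value with or without it.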
Some approaches for reducing the latency of performing multiple computations per cycle involve consuming a significant amount of additional power and are thus energy inefficient. Examples include using a fast processor clock and higher voltage, multiple execution units, and/or complex hardware logic to support dynamic and speculative instruction processing.
Some approaches involve achieving parallelism based on replicating units for performing the computations across multiple instances of the same instruction. Non-limiting examples of such an instruction include a single instruction, multiple data (SIMD) instruction for a central processing unit (CPU) or a single instruction, multiple thread (SIMT) instruction for a graphics processing unit (GPU). However, adding a full vector unit has the drawbacks of requiring a significant amount of additional power, requiring a new instruction set architecture (ISA) to program the vector unit, and requiring additional hardware that occupies a significant amount of additional area.
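As a non-limiting illustration of the SIMD style, the following C sketch computes one four-element row of a product matrix so that each multiply-accumulate statement operates on four lanes at once. It assumes the vector extension of GCC and Clang (the vector_size attribute); whether the statement is lowered to an actual SIMD instruction depends on the compiler and the target processor. The type name vec4i and the function name matmul_row_simd are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: GCC/Clang vector extension; four int lanes per vector. */
typedef int vec4i __attribute__((vector_size(4 * sizeof(int))));

/* Computes out[j] = sum over k of a_row[k] * b[k][j] for j = 0..3,
 * processing all four values of j in each vector operation. */
void matmul_row_simd(const int a_row[4], const int b[4][4], int out[4])
{
    assert(a_row != NULL && b != NULL && out != NULL);
    vec4i acc = {0, 0, 0, 0};
    for (size_t k = 0; k < 4; k++) {
        /* Broadcast one element of the row of A to every lane. */
        const vec4i a_bcast = {a_row[k], a_row[k], a_row[k], a_row[k]};
        /* Load row k of B, one element per lane. */
        const vec4i b_row = {b[k][0], b[k][1], b[k][2], b[k][3]};
        /* Four multiply-accumulate operations in one statement. */
        acc += a_bcast * b_row;
    }
    for (size_t j = 0; j < 4; j++) {
        out[j] = acc[j];
    }
}
```

With a_row holding the values 1 through 4 and the first two columns of b holding the values of elements 108 and elements 110, out[0] and out[1] are 250 and 260, matching the scalar computation, while the innermost loop over columns has been replaced by lane-wise arithmetic.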
Some approaches involve configurable hardware platforms, such as field-programmable gate arrays (FPGAs), systolic arrays, or specialized application-specific integrated circuits (ASICs), that are able to extract parallelism much more energy efficiently from hardware. However, such hardware platforms have a programming model that is difficult to use, which hinders deployment. For example, they would require custom toolchain support for design compilers, synthesis and timing closure, and/or place and route.
Thus, what is needed is an approach that does not suffer from the drawbacks of the aforementioned approaches.
While each of the drawing figures depicts a particular embodiment for purposes of depicting a clear example, other embodiments may omit, add to, reorder, and/or modify any of the elements shown in the drawing figures. For purposes of depicting clear examples, one or more figures may be described with reference to one or more other figures, but using the particular arrangement depicted in the one or more other figures is not required in other embodiments.