Modern computer processors are fundamentally integrated circuits designed to complete a logical task. One task that processors are particularly good at is performing arithmetic operations on numbers encoded in different formats (e.g., 8-bit integers, 32-bit integers, 32-bit floating-point values, etc.). However, most processors include logic for performing these arithmetic operations on scalar operands. For example, logic designed to perform an addition operation operates on two distinct operands, each operand encoding a particular value to be summed with the other operand. Arithmetic operations are not limited to scalar values, and many applications utilize arithmetic operations on vector or matrix inputs. One example of an arithmetic operation on vectors is the dot product operation. While calculating dot products is common in these applications (e.g., physics), modern processors typically do not have hardware designed into the circuit to perform these operations efficiently. Instead, the higher-level operation is reduced to a series of basic arithmetic operations on scalar values. For example, in the dot product operation, each vector operand includes a plurality of elements, and the dot product operation is performed by multiplying corresponding pairs of elements of the two input vectors to generate a plurality of partial products (i.e., intermediate results) and then summing the plurality of partial products. Each basic arithmetic operation can be performed in order using the hardware logic designed into the processor, and the intermediate results can be stored in a temporary memory store and re-used as operands of subsequent arithmetic operations.
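The reduction of a dot product to scalar operations described above can be illustrated with a minimal sketch. The function name `dot_product_scalar` and the use of a Python list as the temporary store for partial products are assumptions made for illustration only; they do not represent any particular processor or instruction set.

```python
def dot_product_scalar(a, b):
    """Compute a dot product using only scalar multiplies and adds.

    Hypothetical illustration: `partial_products` stands in for the
    temporary store that holds intermediate results.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of elements")

    # Step 1: multiply corresponding pairs of elements, one scalar
    # multiplication per pair, to generate the partial products.
    partial_products = []
    for x, y in zip(a, b):
        partial_products.append(x * y)

    # Step 2: sum the partial products, one scalar addition at a time.
    total = 0
    for p in partial_products:
        total = total + p
    return total


if __name__ == "__main__":
    print(dot_product_scalar([1, 2, 3], [4, 5, 6]))
```

For vectors of N elements, this sketch issues N scalar multiplications followed by N scalar additions, each performed in order.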
Conventional processors include one or more cores, where each core may include an arithmetic logic unit (ALU) and/or a floating-point unit for performing basic operations on integers and/or floating-point values. Conventional floating-point units may be designed to implement a fused multiply-accumulate (FMA) operation that multiplies two scalar operands and adds the intermediate result, along with an optional third scalar operand, to an accumulation register. A matrix multiply and accumulate (MMA) operation is the extension of the scalar FMA operation to matrix operands. In other words, the MMA operation multiplies two matrices together and, optionally, adds the resulting intermediate matrix to a third matrix operand. Fundamentally, an MMA operation can be reduced to a number of basic dot product operations summed into an accumulation register. Furthermore, each dot product operation can be further reduced to a series of FMA operations on pairs of scalar operands.
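The reduction of an MMA operation to dot products, and of each dot product to a series of scalar multiply-accumulate steps, can be sketched as follows. The names `fma_step` and `mma` and the nested-list matrix representation are hypothetical. Note also that the sketch computes `a * b + acc` with ordinary Python arithmetic, which for floating-point values rounds after the multiplication and again after the addition, whereas a hardware FMA typically rounds once.

```python
def fma_step(a, b, acc):
    """Model one scalar multiply-accumulate: returns a * b + acc.

    Assumption: plain Python arithmetic is used, so this is not a truly
    fused operation for floating-point inputs.
    """
    return a * b + acc


def mma(A, B, C):
    """Compute D = A x B + C using only scalar multiply-accumulate steps.

    A is M x K, B is K x N, and C is M x N, each given as a list of rows.
    """
    m, k, n = len(A), len(B), len(B[0])
    D = [[0] * n for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(n):      # each column of B
            # The accumulator starts from the third matrix operand, so the
            # dot product is summed directly onto C[i][j].
            acc = C[i][j]
            for p in range(k):  # one multiply-accumulate per element pair
                acc = fma_step(A[i][p], B[p][j], acc)
            D[i][j] = acc
    return D


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[1, 0], [0, 1]]
    print(mma(A, B, C))
```

Each output element requires K multiply-accumulate steps, so the full operation issues M x N x K scalar steps in this sketch.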
Conventional processors can implement matrix operations by breaking down the MMA operation into a series of dot product operations and addition operations, and each dot product operation can be further broken down into a series of FMA instructions on corresponding elements of a pair of vectors. However, this technique is inefficient because the MMA operation must be broken down into each of the basic arithmetic operations using scalar operands. Each basic arithmetic operation executed by the logic of the processor involves moving the scalar operands between the register file of the processor and the inputs to a datapath (i.e., the logic circuitry). Yet the fundamental concept of the matrix operation is that the same elements of the matrix are re-used in multiple dot product operations (e.g., the same row of a first matrix is used to generate multiple dot products corresponding with multiple columns of a second matrix). If each basic arithmetic operation requires data to be loaded from the register file to the input of the datapath before the arithmetic operation is executed, then each element of data of the input operands may be loaded from the register file to the datapath many times, which is an inefficient use of the register file bandwidth. While there may be techniques for improving the efficiency of the processor (e.g., having register files with multiple banks such that operands can be efficiently stored in separate banks and multiple operands can be loaded from the register file into the inputs of the datapath in a single clock cycle), typically, a datapath is not designed specifically with matrix operations in mind. Thus, there is a need for addressing these issues and/or other issues associated with the prior art.
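The repeated loading of the same operand elements can be made concrete with a simple counting model. The sketch below assumes, for illustration only, that every scalar multiply-accumulate loads one element of the first matrix and one element of the second matrix from the register file, and it ignores the accumulator and any caching or operand forwarding. The function name `count_operand_loads` is hypothetical.

```python
from collections import Counter


def count_operand_loads(m, k, n):
    """Count register-file loads for an (M x K) by (K x N) matrix multiply
    when it is reduced to scalar multiply-accumulate operations.

    Assumed model: each scalar operation loads exactly one element of the
    first matrix ("A") and one element of the second matrix ("B").
    """
    loads = Counter()
    for i in range(m):
        for j in range(n):
            for p in range(k):
                loads[("A", i, p)] += 1  # same row of A re-used for every column j
                loads[("B", p, j)] += 1  # same column of B re-used for every row i
    return loads


if __name__ == "__main__":
    loads = count_operand_loads(4, 4, 4)
    total_loads = sum(loads.values())
    unique_elements = len(loads)
    print(total_loads, unique_elements)
```

Under this model, multiplying two 4 x 4 matrices performs 128 operand loads for only 32 distinct input elements: each element of the first matrix is loaded N times and each element of the second matrix is loaded M times.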