Deep learning technology is at the core of artificial intelligence and plays an important role in advancing many applications. A deep learning algorithm is a typical compute-intensive algorithm. Matrix multiplication, the core part of the algorithm, is both compute-intensive and data-intensive. In scenarios requiring high computational efficiency, a matrix algorithm generally needs to be executed by a dedicated FPGA- or ASIC-based processor. The dedicated processor can provide a large number of customized computing and storage resources. Using suitable computing elements and a suitable storage structure in the part of the dedicated processor that executes the matrix multiplication algorithm greatly reduces the consumption of circuit resources and the design complexity, and improves the price-performance ratio and the energy efficiency ratio of a chip.
In the hardware architecture that an existing dedicated processor uses to execute a matrix multiplication algorithm, parallelism is generally exploited in the M and K dimensions when an M×N matrix is multiplied by an N×K matrix. However, the multiplicand matrix of the matrix multiplication operation involved in a deep learning algorithm often has a small number of rows, or even only one row, so M is small. Exploiting parallelism in the M dimension therefore leaves most of the computing elements allocated to that dimension idle, which makes the architecture poorly suited to general use. If parallelism is exploited only in the K dimension, the degree of parallelism is bounded by the value of K in the application, which limits the computing performance and results in a low utilization ratio of the computing resources.
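The utilization argument above can be illustrated with a minimal Python sketch. The function name `pe_utilization`, the lane counts, and the simple tiling model (the output matrix is covered by tiles of `lanes_m` × `lanes_k` computing elements, and elements falling outside the matrix sit idle) are assumptions made for illustration only; they do not describe any particular existing processor.

```python
import math


def pe_utilization(m, k, lanes_m, lanes_k):
    """Fraction of computing elements doing useful work when an
    (m x n) by (n x k) product is tiled onto a hypothetical array with
    lanes_m parallel lanes in the M dimension and lanes_k in the K dimension.

    n does not appear: every output element needs the same n
    multiply-accumulate steps, so it cancels out of the ratio.
    """
    tiles_m = math.ceil(m / lanes_m)   # passes needed to cover the M dimension
    tiles_k = math.ceil(k / lanes_k)   # passes needed to cover the K dimension
    useful = m * k                     # output elements actually required
    provisioned = tiles_m * lanes_m * tiles_k * lanes_k  # element slots occupied
    return useful / provisioned


# Assumed 16 x 16 array (M x K parallelism) with a single-row multiplicand:
# only one of the 16 M-lanes is ever busy.
single_row = pe_utilization(m=1, k=1024, lanes_m=16, lanes_k=16)

# Same array with a tall multiplicand: every lane is busy.
tall = pe_utilization(m=64, k=1024, lanes_m=16, lanes_k=16)

# Assumed 256-lane array with K-only parallelism and a small K:
# the parallelism actually usable is capped by K.
k_only = pe_utilization(m=1, k=64, lanes_m=1, lanes_k=256)

print(single_row, tall, k_only)
```

Under these assumptions, a single-row multiplicand keeps only 1 in 16 of the elements busy on the M×K array, and the K-only array is a quarter busy when K is 64. This matches the two limitations described above.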