Matrix operations, such as matrix multiplication and convolution, can be highly processor-intensive and memory-intensive, as they often involve complex computations on large, multi-dimensional matrix operands. Accordingly, the performance of complex matrix operations can be limited by processing and/or memory latency, and by the efficiency of the algorithms used to implement them. As matrix operations are increasingly utilized in a variety of applications (from graphics and image processing to machine learning and artificial intelligence) and with ever-growing data sets, the demand for high-performance processing of matrix operations is increasing.