Matrix operations, such as matrix multiplication and convolution, can be highly processor-intensive and memory-intensive, as they often involve complex computations on large, multi-dimensional matrix operands. Accordingly, the performance of complex matrix operations can be limited by processing and/or memory latency. As matrix operations are increasingly used in a variety of applications (from graphics and image processing to machine learning and artificial intelligence) and with ever-growing data sets, the demand for high-performance and flexible processing of matrix operations is increasing.