Convolutional neural nets (CNNs) are increasingly used in complex classification and recognition tasks, such as image classification, object recognition, and automatic speech recognition. Large-scale matrix multiplications are a key component in multi-dimensional tensor convolutions, which are the basic building block of the CNN. For this reason, special-purpose hardware architectures have been proposed to parallelize such matrix multiplications.
Multi-dimensional tensor convolutions are commonly decomposed into multiple outer product computations over pairs of two-dimensional matrices. The outer product (also known as the tensor product or Kronecker product) of two matrices A and B having elements {a_ij} and {b_ij}, respectively, is written as C = A⊗B, with elements
$$c_{ij} = \sum_{p=1}^{m} a_{ip}\, b_{pj}.$$

Typically, to compute each two-dimensional plane in an output tensor, multiple outer products of this sort are computed and summed together in a sequence of matrix multiply-accumulate operations of the form C_out = C_in + A⊗B.
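The multiply-accumulate sequence described above can be illustrated with a short software sketch. It follows the element formula exactly as written in the text (each output element is a sum over p of a_ip times b_pj) and adds the result to an incoming accumulator. The function names `multiply_accumulate` and `accumulate_plane` are hypothetical, chosen for illustration only. The code is a plain sequential reference and is not meant to represent the parallel hardware architectures mentioned above.

```python
def multiply_accumulate(c_in, a, b):
    """Return C_out = C_in + A (x) B, with (A (x) B)[i][j] = sum_p a[i][p] * b[p][j].

    Hypothetical helper: matrices are lists of rows; A is n x m, B is m x k,
    and C_in is n x k.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("number of columns of A must equal number of rows of B")
    if len(c_in) != rows or any(len(row) != cols for row in c_in):
        raise ValueError("C_in must have the same shape as the product")

    # Copy the accumulator so the caller's C_in is left unchanged.
    c_out = [row[:] for row in c_in]
    for i in range(rows):
        for j in range(cols):
            acc = c_out[i][j]
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            c_out[i][j] = acc
    return c_out


def accumulate_plane(pairs, rows, cols):
    """Sum the products of several (A, B) pairs into one output plane.

    Starts from an all-zero accumulator and applies multiply_accumulate
    once per pair, mirroring the sequence C_out = C_in + A (x) B.
    """
    plane = [[0] * cols for _ in range(rows)]
    for a, b in pairs:
        plane = multiply_accumulate(plane, a, b)
    return plane
```

For example, `accumulate_plane([(A1, B1), (A2, B2)], rows, cols)` yields one two-dimensional output plane from two matrix pairs. Each pass through the loop is one multiply-accumulate step, which is the operation a hardware accelerator would aim to parallelize.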