Multi-dimensional convolutions are a basic building block in many applications. For example, convolutional neural nets (CNNs) are being used increasingly in complex classification and recognition tasks, such as large-category image classification, object recognition, and automatic speech recognition.
In the convolutional layers of the CNN, a three-dimensional (3D) array of input data (commonly referred to as a 3D matrix or tensor) of dimensions M×N×D is convolved with a four-dimensional tensor made up of L kernels of dimensions j×k×D and stride S. Here M and N are the dimensions of the sampling space (also referred to as the X- and Y-dimensions), for example pixels of an image, while D (also referred to herein as the Z-dimension) is the number of input feature values given for each sample. Each 3D kernel is shifted in strides of size S across the input volume. Following each shift, every weight belonging to the 3D kernel is multiplied by each corresponding input element from the overlapping region of the 3D input array, and the products are summed to create an element of a 3D output array.
General-purpose processors are not capable of performing these computational tasks efficiently. For this reason, special-purpose hardware architectures have been proposed, with the aim of parallelizing the large numbers of matrix multiplications that are required by the CNN.