Field of the Invention
Embodiments of the present invention relate generally to computer processing and, more specifically, to performing multi-convolution operations in a parallel processing system.
Description of the Related Art
Convolutional Neural Networks (CNNs) are used to efficiently and reliably solve a wide range of classification problems. For example, CNNs are included in many image recognition, handwriting recognition, and speech translation algorithms. In operation, CNNs can substantially reduce error rates compared to many simpler machine learning techniques. However, the time required for CNNs to execute usually exceeds the time required for simpler machine learning techniques to execute. Consequently, time-sensitive applications may be structured to implement simpler machine learning techniques at the expense of producing inferior results.
The time required for a CNN to execute is dominated by the time required for the CNN to perform “multi-convolution” operations. A multi-convolution operation is a generalized form of a two-dimensional convolution operation between an image and a filter. The multi-convolution operation is oftentimes implemented using a direct calculation method or using Fast Fourier Transforms (FFTs). While direct calculation techniques and FFT-based techniques may enable some multi-convolution operations to be implemented more efficiently, such techniques normally are unable to cause multi-convolution operations to execute efficiently over the wide range of dimensions and additional parameters associated with standard CNNs.
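For purposes of illustration, the FFT-based approach referenced above rests on the convolution theorem: a convolution in the spatial domain corresponds to a pointwise multiplication in the frequency domain. The following is a minimal sketch for a single image plane and a single filter, with zero-padding to the full output size to avoid circular wrap-around; the function names are illustrative only, and NumPy is assumed for demonstration purposes.

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Full 2-D linear convolution via the convolution theorem."""
    H, W = image.shape
    R, S = kernel.shape
    shape = (H + R - 1, W + S - 1)   # pad to avoid circular wrap-around
    # Pointwise multiplication in the frequency domain, then inverse FFT.
    spectrum = np.fft.rfft2(image, shape) * np.fft.rfft2(kernel, shape)
    return np.fft.irfft2(spectrum, shape)

def direct_convolve2d(image, kernel):
    """Reference: direct full 2-D convolution by scatter-accumulation."""
    H, W = image.shape
    R, S = kernel.shape
    out = np.zeros((H + R - 1, W + S - 1))
    for i in range(H):
        for j in range(W):
            out[i:i + R, j:j + S] += image[i, j] * kernel
    return out
```

Note that the FFT sizes depend only on the image and filter dimensions, not on the stride; a stride greater than one must be applied by discarding outputs after the fact, which is one reason FFT-based techniques handle strided multi-convolutions poorly.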
More specifically, a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across four dimensions of an image batch and four dimensions of a filter stack. The four dimensions of the image batch include the image width, the image height, the number of color planes per image, and the number of images in the image batch. The four dimensions of the filter stack include the filter width, the filter height, the number of feature planes per filter, and the number of filters in the filter stack. Additional parameters may further customize the multi-convolution operations. For example, a horizontal filter stride and a vertical filter stride may reduce the overall computational load by decreasing the size of the subset of pixels involved in the convolution operation. Notably, the dimensions of the image batch and the filter stack as well as the additional parameters often vary between convolution layers.
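The direct calculation method over these eight dimensions and the two stride parameters may be sketched as follows. This is a simplified illustration that assumes unpadded convolutions and NumPy-style arrays; the function name and array layouts are chosen for exposition only.

```python
import numpy as np

def direct_multi_convolution(images, filters, stride_h=1, stride_w=1):
    """Direct (naive) multi-convolution sketch.

    images:  (N, C, H, W) -- batch size, color planes, height, width
    filters: (K, C, R, S) -- filter count, feature planes, height, width
    returns: (N, K, P, Q) -- P and Q are the output height and width
    """
    N, C, H, W = images.shape
    K, C2, R, S = filters.shape
    assert C == C2, "feature planes per filter must match color planes"
    P = (H - R) // stride_h + 1   # output height (no padding assumed)
    Q = (W - S) // stride_w + 1   # output width
    out = np.zeros((N, K, P, Q))
    for n in range(N):            # each image in the batch
        for k in range(K):        # each filter in the stack
            for p in range(P):
                for q in range(Q):
                    h0, w0 = p * stride_h, q * stride_w
                    out[n, k, p, q] = np.sum(
                        images[n, :, h0:h0 + R, w0:w0 + S] * filters[k])
    return out
```

As the nested loops suggest, larger strides reduce P and Q and therefore the total work, and the loop structure that performs best depends heavily on the relative sizes of the eight dimensions.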
Direct calculation techniques are typically tuned to optimize multi-convolution operations across a relatively small subset of dimensions and parameters. However, across other dimensions and parameters, the time required to execute direct calculation techniques usually exceeds the time required to execute simpler machine learning techniques. Consequently, the time required to execute many CNNs using direct calculation techniques is typically unacceptably long. The time required to execute many CNNs using FFT-based approaches also varies dramatically based on the values of the parameters. In particular, if the horizontal stride or the vertical stride associated with a multi-convolution operation is greater than one, then the time required to execute the multi-convolution operation using FFT-based techniques may be prohibitively long.
In one approach to reducing the time required to execute CNNs across a wide range of parameter values, a convolution engine “unrolls” the multi-convolution operations by replacing the conventional processing of each convolution layer with matrix-based operations. In operation, the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix. To reduce the performance degradation associated with fetching data from off-chip memory, the convolution engine stores the image matrix and the filter matrix in on-chip memory. Subsequently, the convolution engine performs matrix multiplication operations between the image matrix and the filter matrix. Notably, the dimensions of the image matrix and the filter matrix correlate to products of subsets of the independent parameters of the CNN instead of the individual parameters. Consequently, matrix-based techniques exhibit relatively uniform performance characteristics across the different input dimensions and parameters. Further, because many processing units include highly-tuned implementations of matrix multiplication functions, the time required to execute a CNN via the foregoing approach may be significantly less than the time required to execute the CNN using direct calculation or FFT-based techniques.
One drawback of matrix-based convolution engines is that, as part of converting the image batch to properly set up the matrix multiplication operations, the convolution engine has to copy the image data to multiple locations included in the image matrix. Consequently, the size of the image matrix may increase to the point where the available on-chip memory is completely consumed. For example, suppose that the image width were W, the image height were H, the number of color planes per image were C, and the number of images in the image batch were N. Further, suppose that the filter height were R, the filter width were S, and the dimensions of each of the output images were (P×Q). In such a scenario, the dimensions of the image matrix would be (N×P×Q)×(C×R×S). Notably, for many applications, the memory required to store the image matrix may exceed the available on-chip memory. Consequently, those applications are relegated to implementing either less efficient CNN techniques or less accurate machine learning techniques.
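The memory expansion described above can be quantified with a short calculation. The layer sizes below are hypothetical values chosen for illustration only; the (N×P×Q)×(C×R×S) image matrix formula is from the discussion above.

```python
# Hypothetical layer sizes, for illustration only.
N, C, H, W = 128, 3, 224, 224        # image batch dimensions
K, R, S = 64, 7, 7                   # filter stack dimensions
stride = 2
P = (H - R) // stride + 1            # output height = 109 (no padding)
Q = (W - S) // stride + 1            # output width  = 109
rows = N * P * Q                     # 128 * 109 * 109 = 1,520,768
cols = C * R * S                     # 3 * 7 * 7 = 147
bytes_matrix = rows * cols * 4       # 4-byte floats: ~894 MB
bytes_images = N * C * H * W * 4     # raw image batch: ~77 MB
print(bytes_matrix / bytes_images)   # -> ~11.6x expansion from copying
```

Even at a stride of two, the image matrix in this example is more than an order of magnitude larger than the raw image batch, illustrating how the copying can exhaust on-chip memory.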
As the foregoing illustrates, what is needed in the art is a more effective approach to performing multi-convolution operations.