Field of the Invention
Embodiments of the present invention relate generally to computer processing and, more specifically, to indirectly accessing sample data to perform multi-convolution operations in a parallel processing system.
Description of the Related Art
Convolutional Neural Networks (CNNs) are oftentimes used to efficiently and reliably solve a wide range of inference problems. For example, CNNs are included in many image recognition, handwriting recognition, and speech translation algorithms. In operation, CNNs can substantially reduce error rates compared to many simpler machine learning techniques. However, the time required for CNNs to execute usually exceeds the time required for simpler machine learning techniques to execute. Consequently, time-sensitive applications may be structured to implement simpler machine learning techniques at the expense of producing inferior results.
As a general matter, the time required for a CNN to execute is dominated by the time required for the CNN to perform “multi-convolution” operations. A multi-convolution operation is a generalized form of a multi-dimension convolution operation between sample data, such as an image, and a filter. The multi-convolution operation is oftentimes implemented using a stencil-based technique or using Fast Fourier Transforms (FFTs). While stencil-based techniques and FFT-based techniques may enable some multi-convolution operations to be implemented more efficiently, such techniques are normally unable to allow multi-convolution operations to execute efficiently over the full range of dimensions and additional parameters typically associated with standard CNNs.
In this regard, a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack. For example, for a four-dimensional CNN involving image samples, the sample data batch is a batch of images, and the four dimensions of the image batch include the image width, the image height, the number of color planes per image, and the number of images in the image batch. The four dimensions of the filter stack include the filter width, the filter height, the number of feature planes per filter, and the number of filters in the filter stack. Additional parameters may further customize the multi-convolution operations. For example, a horizontal filter stride and a vertical filter stride may reduce the overall computational load by decreasing the size of the subset of pixels involved in the convolution operation. Notably, the dimensions of the image batch and the filter stack as well as the additional parameters often vary between convolution layers.
Stencil-based techniques are typically tuned to optimize multi-convolution operations across a relatively small subset of dimensions and parameters. However, the time required to execute stencil-based techniques across other dimensions and parameters usually exceeds the time required to execute simpler machine learning techniques. Consequently, as alluded to above, the time required to execute many CNNs using stencil-based techniques is typically unacceptably long. As also alluded to above, the time required to execute many CNNs using FFT-based approaches varies dramatically based on the values of the parameters.
One approach to reducing the time required to execute CNNs across a wide range of parameter values incorporates the observation that convolution is a linear operator and therefore may be lowered onto matrix multiplication. Such an approach requires expanding the sample data into the required matrix form. More specifically, in such implementations, the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix. Subsequently, the convolution engine performs matrix multiplication operations between the image matrix and the filter matrix. Notably, the dimensions of the image matrix and the filter matrix correspond to products of subsets of the independent parameters of the CNN instead of the individual parameters. As a result, matrix-based techniques exhibit relatively uniform performance characteristics across the different input dimensions and parameters. Further, because libraries of code written for each of many types of processing units include optimized matrix multiplication routines, the time required to execute a CNN via the foregoing approach may be significantly less than the time required to execute the CNN using stencil-based or FFT-based techniques.
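The lowering described above can be illustrated with a minimal sketch for a single image. The function names, the `image[c][h][w]` and `filters[k][c][r][s]` layouts, unit stride, and the absence of padding are all assumptions made for illustration; the sketch uses the unflipped (cross-correlation) form common in CNNs and is not taken from any particular convolution engine.

```python
# Illustrative sketch only: layouts and names are assumptions.

def direct_conv(image, filters):
    """Stencil-style reference: image[c][h][w], filters[k][c][r][s] -> out[k][p][q]."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    R, S = len(filters[0][0]), len(filters[0][0][0])
    P, Q = H - R + 1, W - S + 1  # unit stride, no padding (assumed)
    return [[[sum(image[c][p + r][q + s] * f[c][r][s]
                  for c in range(C) for r in range(R) for s in range(S))
              for q in range(Q)] for p in range(P)] for f in filters]


def expand_image(image, R, S):
    """Build the (P*Q) x (C*R*S) image matrix; pixels are copied into several rows."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    P, Q = H - R + 1, W - S + 1
    return [[image[c][p + r][q + s]
             for c in range(C) for r in range(R) for s in range(S)]
            for p in range(P) for q in range(Q)]


def flatten_filters(filters):
    """Express the filter stack as a K x (C*R*S) filter matrix."""
    return [[w for plane in f for row in plane for w in row] for f in filters]


def matmul_conv(image, filters):
    """Same result as direct_conv, computed as one matrix product."""
    R, S = len(filters[0][0]), len(filters[0][0][0])
    H, W = len(image[0]), len(image[0][0])
    P, Q = H - R + 1, W - S + 1
    image_matrix = expand_image(image, R, S)
    filter_matrix = flatten_filters(filters)
    flat = [[sum(a * b for a, b in zip(row, filter_row)) for row in image_matrix]
            for filter_row in filter_matrix]
    # Fold each flat output row of length P*Q back into a P x Q output image.
    return [[flat_k[p * Q:(p + 1) * Q] for p in range(P)] for flat_k in flat]
```

In this sketch the inner product in `matmul_conv` stands in for an optimized matrix multiplication routine; only `expand_image` depends on the convolution geometry.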
One drawback to implementing such matrix-based operations in a convolution engine is that, as part of expanding the image batch to properly set up the matrix multiplication operations, the convolution engine has to copy the image data to multiple locations in the image matrix. Consequently, the size of the image matrix may increase to the point where the available memory is completely consumed. For example, suppose that the image width were W, the image height were H, the number of color planes per image were C, and the number of images in the image batch were N. Further, suppose that the dimensions of each filter were (R×S) and the dimensions of each of the output images were (P×Q). In such a scenario, the dimensions of the image matrix would be (N×P×Q)×(C×R×S). In many systems, the space needed to store image matrices of this size can exceed the available space in memory.
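The growth in storage can be estimated with a short calculation. The sketch below is illustrative only: it assumes that R is the filter height and S is the filter width, that no padding is applied, and that the output dimensions follow the usual strided formula; the function names and the example layer sizes are hypothetical.

```python
# Illustrative sketch: element counts only, under the assumptions stated above.

def output_dims(H, W, R, S, stride_v=1, stride_h=1):
    """Output image height P and width Q for an unpadded, strided convolution."""
    P = (H - R) // stride_v + 1
    Q = (W - S) // stride_h + 1
    return P, Q


def image_batch_elements(N, C, H, W):
    """Number of elements in the original image batch."""
    return N * C * H * W


def image_matrix_elements(N, C, H, W, R, S, stride_v=1, stride_h=1):
    """Number of elements in the expanded (N*P*Q) x (C*R*S) image matrix."""
    P, Q = output_dims(H, W, R, S, stride_v, stride_h)
    return (N * P * Q) * (C * R * S)


if __name__ == "__main__":
    # Hypothetical layer: 128 images, 64 planes, 56 x 56 pixels, 3 x 3 filters.
    batch = image_batch_elements(128, 64, 56, 56)
    matrix = image_matrix_elements(128, 64, 56, 56, 3, 3)
    print(batch, matrix, matrix / batch)
```

For this hypothetical layer the expanded image matrix holds a little over eight times as many elements as the image batch itself, because each pixel is replicated for nearly every filter position that covers it.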
In an effort to reduce memory use while executing a multi-convolution via an optimized matrix multiplication routine, a tile-based convolution engine can be implemented that configures a parallel processing pipeline to independently expand and process individual tiles of the image matrix. In such an approach, the parallel processing pipeline performs address calculations to expand each tile of the image matrix in shared memory on an as-needed basis. The parallel processing pipeline then performs matrix multiplication operations between the image tile and the filter stack. Because the image matrix is expanded directly into shared memory a tile at a time, the matrix is never stored in its entirety, and the amount of parallel processing memory used can be dramatically reduced compared to typical matrix-based convolution engines.
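The tile-at-a-time expansion can be sketched sequentially as follows. This is a simplified illustration, not a parallel implementation: the local list `tile` stands in for shared memory, and the function names, the `image[c][h][w]` layout, the row-wise tiling, unit stride, and the absence of padding are assumptions.

```python
# Sketch only: tile size, layout, and function names are illustrative assumptions.

def expand_tile(image, R, S, row_start, row_end):
    """Expand rows [row_start, row_end) of the image matrix for image[c][h][w]."""
    C, W = len(image), len(image[0][0])
    Q = W - S + 1  # output width for unit stride, no padding (assumed)
    tile = []
    for row in range(row_start, row_end):
        p, q = divmod(row, Q)  # index arithmetic that locates the source pixels
        tile.append([image[c][p + r][q + s]
                     for c in range(C) for r in range(R) for s in range(S)])
    return tile


def tiled_multiply(image, filter_matrix, R, S, tile_rows):
    """Multiply tile by tile; returns (output, peak expanded elements held at once)."""
    H, W = len(image[0]), len(image[0][0])
    total_rows = (H - R + 1) * (W - S + 1)
    out = [[0] * total_rows for _ in filter_matrix]
    peak = 0
    for start in range(0, total_rows, tile_rows):
        end = min(start + tile_rows, total_rows)
        tile = expand_tile(image, R, S, start, end)  # stands in for shared memory
        peak = max(peak, sum(len(row) for row in tile))
        for k, filter_row in enumerate(filter_matrix):
            for i, row in enumerate(tile):
                out[k][start + i] = sum(a * b for a, b in zip(row, filter_row))
    return out, peak
```

Calling `tiled_multiply` with a smaller `tile_rows` yields the same output while holding fewer expanded elements at any one time, at the cost of performing the index arithmetic in `expand_tile` for every tile.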
One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations. As a result, the benefits of the optimized matrix multiplication routine are not fully realized and the overall time to execute CNNs may be unacceptably long.
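To make the dependency chain concrete, the following sketch maps one element of the expanded image matrix back to an offset in the image batch. It assumes a contiguous layout ordered by image, color plane, row, and column, unit stride, and no padding; the function name and parameter order are hypothetical. Each division and remainder consumes the result of the one before it, which is the kind of serialized integer work described above.

```python
# Illustrative sketch: the layout and parameter names are assumptions.

def image_offset(row, col, C, H, W, R, S, P, Q):
    """Offset into a flat image batch for image-matrix element (row, col)."""
    n, pq = divmod(row, P * Q)   # which image in the batch
    p, q = divmod(pq, Q)         # output pixel; depends on pq from the previous step
    c, rs = divmod(col, R * S)   # which color plane
    r, s = divmod(rs, S)         # filter tap; depends on rs from the previous step
    h = p + r                    # source row (unit stride, no padding assumed)
    w = q + s                    # source column
    return ((n * C + c) * H + h) * W + w
```

For example, with C=3, H=4, W=5, R=2, S=3, P=3, and Q=3, the element at row 13 and column 11 resolves to image 1, plane 1, source row 2, and source column 3.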
More specifically, each loop iteration in a matrix multiplication is typically sized so that a certain number of floating point math operations covers the memory latency of the loads. For example, one implementation could have 100 math operations for 10 memory loads. Typically, those 10 memory loads execute relatively quickly and return as the 100 math operations are finishing. However, if each such memory operation requires 10 extra integer operations, each dependent on the previous operation with a 10 cycle latency, then the cost to generate the 10 addresses is 100 cycles, which matches the number of math operations before accounting for the memory latency to service those memory loads. If those memory loads take on average 10 cycles themselves, then loading memory takes 200 cycles versus 100 cycles to perform the floating point math operations, leaving 100 cycles in which no useful math is available to cover the memory latency and hurting overall efficiency.
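The arithmetic in this example can be reproduced with a small model. The model is one reading of the figures above and rests on assumptions that are not stated in the text: each math operation is taken to occupy one cycle, and the dependent integer chains for the different loads are taken to overlap, so that address generation costs the length of one chain. The function name and the result keys are hypothetical.

```python
# Illustrative cycle model: a simplification under the assumptions stated above.

def iteration_cycles(math_ops, loads, int_ops_per_address, int_latency, load_cycles):
    """Estimate per-iteration cycle counts for math, address generation, and loads."""
    math = math_ops                              # one cycle per math operation (assumed)
    address = int_ops_per_address * int_latency  # one dependent chain; chains overlap (assumed)
    load = loads * load_cycles                   # cycles to service the loads
    memory = address + load
    return {
        "math": math,
        "address": address,
        "memory": memory,
        "uncovered": max(0, memory - math),      # cycles with no math left to hide latency
    }


if __name__ == "__main__":
    print(iteration_cycles(100, 10, 10, 10, 10))
```

With the figures from the example (100 math operations, 10 loads, 10 integer operations per address, a 10 cycle integer latency, and 10 cycles per load), the model gives 200 memory cycles against 100 math cycles, leaving 100 cycles uncovered. Setting `int_ops_per_address` to zero brings the memory cycles back to 100, which the math operations fully cover.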
As the foregoing illustrates, what is needed in the art is a more effective approach to performing multi-convolution operations.