Convolution operations in convolutional neural networks (CNNs) for deep learning (DL) are computationally intensive and energy consuming, which makes them one of the obstacles to the proliferation of DL technology in commercial applications such as mobile devices and autonomous vehicles. Published results demonstrate that the performance of convolutions may be improved by using the Winograd theorem for convolutions, for both training and inference of a CNN. However, these results do not address two important aspects of implementing a CNN in commercial applications.
First, it is assumed that the transform of pre-trained kernels is carried out off-line and that the transformed kernels are stored in a random access memory (RAM) for inference. Memory fetches of weights are slow and consume much more energy than arithmetic operations, which is why pruning and compression of pre-trained weights are being researched in the DL community. In addition, a kernel transform may reduce the compression efficiency of the weights.
Second, the arithmetic precision of Winograd convolution is important when a CNN is implemented in a fixed-point format. The division of weights in kernel transforms, rounding, and rescaling may lead to a loss of arithmetic accuracy.
Thus, there is a need to address performance and accuracy associated with convolution.
A one-dimensional linear convolution may be expressed as y=g⊗d, where y is an output, g is a kernel, d is input data, and ⊗ is a convolution operation. CNN convolution kernels g may be small (e.g., 1×1, 3×3, 5×5, 7×7, etc.). According to the Winograd short convolution theorem, the linear convolution is equivalent to y=B{(Gg)⊙(Ad)}, where B, G, and A are matrices and ⊙ is an element-wise multiplication operation. G is applied to the kernel g to perform a kernel transform, A is applied to the data d to perform a data transform, and B is applied to the element-wise product to obtain the output.
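As a minimal sketch of the identity above, the following Python compares the transform-domain result with a direct computation for F(2,3) (two outputs, three-tap kernel). It assumes the commonly published F(2,3) transform matrices, which are not given in this text, and the sliding-window form of convolution used in CNNs (no kernel flip). The matrix names follow the convention above (G for the kernel, A for the data, B for the output); the function names are illustrative only.

```python
from fractions import Fraction

h = Fraction(1, 2)

# Assumed F(2,3) matrices, renamed to match the text:
# A transforms the data, G transforms the kernel, B transforms the output.
A = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [h, h, h], [h, -h, h], [0, 0, 1]]
B = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def winograd_f23(d, g):
    """Compute y = B{(Gg) ⊙ (Ad)} for 4 data samples and a 3-tap kernel."""
    U = matvec(G, g)                   # kernel transform
    V = matvec(A, d)                   # data transform
    P = [u * v for u, v in zip(U, V)]  # 4 element-wise multiplications
    return matvec(B, P)                # output transform


def direct_f23(d, g):
    """Sliding-window reference: 2 outputs, 3 multiplications each."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


d = [1, 2, 3, 4]
g = [2, -1, 3]
print(winograd_f23(d, g) == direct_f23(d, g))
```

In this sketch the element-wise product uses four multiplications, whereas the direct form uses six.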
Current technologies may be based on the Cook-Toom algorithm or the Winograd algorithm. These algorithms reduce the number of multiplications at the cost of an increased number of additions, which is advantageous because addition is computationally faster and less energy consuming than multiplication. Technologies based on the Cook-Toom algorithm and the Winograd algorithm are superior to direct convolution if the transformed weights are pre-calculated and stored for off-line inference use.
Arithmetic complexity may be further reduced for a two-dimensional (2D) convolution expressed as y=B{(GgG^T)⊙(AdA^T)}B^T, where the superscript T represents a matrix transpose operation. 2D convolution has recently been investigated and demonstrated to have advantages for CNN applications, where a kernel transform may be computed off-line and stored for later use in CNN inference applications. However, there are two drawbacks with such an implementation.
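The nested 2D form can be illustrated with a short Python sketch for F(2×2,3×3). As assumptions, it reuses the commonly published F(2,3) matrices (not given in this text), named as in the text, and uses the sliding-window form of convolution; the helper names are hypothetical.

```python
from fractions import Fraction

h = Fraction(1, 2)

# Assumed F(2,3) matrices: A (data), G (kernel), B (output).
A = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [h, h, h], [h, -h, h], [0, 0, 1]]
B = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def winograd_2x2_3x3(d, g):
    """Compute y = B{(G g G^T) ⊙ (A d A^T)}B^T for a 4x4 tile and a 3x3 kernel."""
    U = matmul(matmul(G, g), transpose(G))   # 4x4 transformed kernel
    V = matmul(matmul(A, d), transpose(A))   # 4x4 transformed data
    P = [[u * v for u, v in zip(ur, vr)] for ur, vr in zip(U, V)]  # 16 products
    return matmul(matmul(B, P), transpose(B))  # 2x2 output tile


def direct_2x2_3x3(d, g):
    """Sliding-window reference: 4 outputs, 9 multiplications each."""
    return [[sum(d[i + p][j + q] * g[p][q]
                 for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 2, 0], [0, -1, 3], [2, 1, 1]]
print(winograd_2x2_3x3(d, g) == direct_2x2_3x3(d, g))
```

Here the element-wise stage uses 16 multiplications for a 2×2 output tile, compared with 36 in the direct form. The transformed kernel U is 4×4 although the original kernel is 3×3.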
First, state-of-the-art CNN models may have many layers and a large number of weights, and fetching stored weights may lead to high latency and energy consumption. Pre-computing and storing transformed weights may work for a training stage that typically runs on a high-performance server, but may be impractical for CNN inference. In practice, weights are often pruned, quantized to eight or fewer bits in fixed-point format, and compressed to save storage and reduce the number of data fetches. For example, in a two-dimensional F(4×4,3×3) convolution, the original kernel size is 3×3 (9 weights), whereas the transformed kernel to be stored is 6×6 (36 values). Even though the number of computations is reduced, the stored kernel is four times as large.
Second, a kernel transform of weights in fixed-point format involves division, which may lead to a loss of arithmetic accuracy. For example, in an F(4×4,3×3) convolution, a kernel transform may require the division of the weights by 6, 12, and 24. Because such quotients are generally not exactly representable in a binary fixed-point format, they must be rounded, and the rounding error propagates into the convolution output.
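The effect of these divisions can be seen in a small Python sketch. It assumes the kernel-transform matrix G commonly published for F(4,3), which is not given in this text, and a hypothetical fixed-point format with 4 fractional bits. The weights and function names are illustrative only.

```python
from fractions import Fraction as Fr

# Assumed kernel-transform matrix for F(4,3); note the divisors 6, 12 and 24.
G = [
    [Fr(1, 4), 0, 0],
    [Fr(-1, 6), Fr(-1, 6), Fr(-1, 6)],
    [Fr(-1, 6), Fr(1, 6), Fr(-1, 6)],
    [Fr(1, 24), Fr(1, 12), Fr(1, 6)],
    [Fr(1, 24), Fr(-1, 12), Fr(1, 6)],
    [0, 0, Fr(1)],
]


def kernel_transform(g):
    """Exact (rational) transform Gg of a 3-tap integer kernel."""
    return [sum(c * w for c, w in zip(row, g)) for row in G]


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (hypothetical format)."""
    scale = 1 << frac_bits
    return Fr(round(x * scale), scale)


g = [5, -3, 6]                           # illustrative integer weights
exact = kernel_transform(g)              # 6 values from 3 weights
fixed = [quantize(x, 4) for x in exact]  # what fixed-point storage would hold
errors = [abs(e - f) for e, f in zip(exact, fixed)]

for e, f, err in zip(exact, fixed, errors):
    print(f"exact={e}  fixed={f}  error={err}")
```

Entries that involve division by 6 or 24 produce thirds, which no number of fractional bits can represent exactly, so the error in those entries does not vanish by widening the format.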
The choice of polynomials used with the Winograd algorithm, in particular their degree, length, and coefficients, determines the values of the elements of matrices A, B, and G. The Winograd algorithm requires that the polynomial moduli be pairwise co-prime.
The Winograd algorithm expresses a convolution as the polynomial product s(x)=g(x)d(x) % M(x), where s(x) is an output, g(x) is a kernel, d(x) is input data, % denotes modulo reduction, and M(x) is a modulus polynomial.
Polynomials m_i(x), i=0,...,K-1, are selected such that M(x)=Π_{i=0}^{K-1} m_i(x), all of the m_i(x) are pairwise co-prime, and the degree of M(x) is higher than the degree of s(x), where K is the number of the selected co-prime polynomials.
Kernels g_i(x) are calculated such that g_i(x)=g(x) % m_i(x).
Data d_i(x) are calculated such that d_i(x)=d(x) % m_i(x).
Outputs s_i(x) are calculated such that s_i(x)=g(x)d(x) % m_i(x)=g_i(x)d_i(x) % m_i(x).
Output s(x) is calculated using the Chinese Remainder Theorem (CRT) such that s(x)=Σ_{i=0}^{K-1} s_i(x)N_i(x)M_i(x) % M(x), where M_i(x)=M(x)/m_i(x) and N_i(x)M_i(x)+n_i(x)m_i(x)=1. N_i(x) and n_i(x) are calculated using the extended Euclidean algorithm, which is possible because M_i(x) and m_i(x) are co-prime, i.e., their Greatest Common Divisor (GCD) is 1.
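The residue and reconstruction steps above can be sketched in Python with exact rational polynomial arithmetic. The moduli x, x-1, x+1, and x-2, the example polynomials, and all helper names are assumptions chosen for illustration. Polynomials are stored as coefficient lists, lowest degree first.

```python
from fractions import Fraction


def trim(p):
    """Convert to Fractions and drop trailing zero coefficients."""
    p = [Fraction(c) for c in p]
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p


def add(a, b):
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return trim(x + y for x, y in zip(a, b))


def sub(a, b):
    return add(a, [-c for c in b])


def mul(a, b):
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return trim(out)


def divmod_poly(a, b):
    """Long division: return (q, r) with a = q*b + r, r zero or of lower degree than b."""
    r, b = trim(a), trim(b)
    q = [Fraction(0)] * max(len(r) - len(b) + 1, 1)
    while len(r) >= len(b) and r != [0]:
        shift = len(r) - len(b)
        coef = r[-1] / b[-1]
        q[shift] = coef
        for i, c in enumerate(b):
            r[shift + i] -= coef * c
        r = trim(r[:-1]) if len(r) > 1 else [Fraction(0)]
    return trim(q), r


def mod(a, m):
    return divmod_poly(a, m)[1]


def ext_gcd(a, b):
    """Extended Euclid: return (gcd, u, v) with u*a + v*b = gcd."""
    r0, r1 = trim(a), trim(b)
    u0, u1 = [Fraction(1)], [Fraction(0)]
    v0, v1 = [Fraction(0)], [Fraction(1)]
    while r1 != [0]:
        q, r = divmod_poly(r0, r1)
        r0, r1 = r1, r
        u0, u1 = u1, sub(u0, mul(q, u1))
        v0, v1 = v1, sub(v0, mul(q, v1))
    return r0, u0, v0


def crt(residues, moduli):
    """Reconstruct s(x) = sum_i s_i N_i M_i % M from residues s_i modulo m_i."""
    M = [Fraction(1)]
    for m in moduli:
        M = mul(M, m)
    s = [Fraction(0)]
    for s_i, m_i in zip(residues, moduli):
        M_i = divmod_poly(M, m_i)[0]
        gcd, N_i, _ = ext_gcd(M_i, m_i)    # N_i*M_i + n_i*m_i = gcd (a constant)
        N_i = [c / gcd[0] for c in N_i]    # normalise so the right-hand side is 1
        s = add(s, mul(mul(s_i, N_i), M_i))
    return mod(s, M)


g = [2, -1, 3]                               # g(x) = 2 - x + 3x^2
d = [1, 2]                                   # d(x) = 1 + 2x
moduli = [[0, 1], [-1, 1], [1, 1], [-2, 1]]  # x, x-1, x+1, x-2 (assumed)
residues = [mod(mul(mod(g, m), mod(d, m)), m) for m in moduli]
s = crt(residues, moduli)
print(s == mul(g, d))
```

With these linear moduli, each residue s_i(x) is a constant obtained from one multiplication, so four multiplications suffice for the degree-3 product in this sketch.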