Deep learning algorithms are currently being implemented in various machine learning applications, such as audio/video recognition and video summarization. These workloads currently run on a variety of hardware platforms, including central processing units (CPUs), graphics processing units (GPUs), and fixed-function hardware accelerators. These platforms typically perform various computations for Deep Learning Neural Network (DNN) topologies, the most common of which are three-dimensional (3D) convolution and general matrix multiplication.
There have been recent efforts to enable these workloads to be computed at lower precision while keeping accuracy within acceptable limits. When floating point real numbers are converted to lower precision, different quantization schemes may be followed. One popular scheme represents each floating point number as an unsigned 8-bit integer through a transformation involving scaling and offset addition. Consequently, complex convolutional neural network (CNN)/general matrix-matrix multiplication (GEMM) operations must be calculated on the quantized inputs. These operations are typically computed in software via multiple passes through the data.
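By way of illustration only, the scale-and-offset scheme and the resulting multi-pass GEMM can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular implementation: it assumes a per-tensor affine mapping in which a real value x is approximated by scale * (q - offset), and the names `quantize`, `dequantize`, and `quantized_gemm` are hypothetical helpers introduced here. The sketch shows why software typically needs several passes: besides the raw integer product, the offset correction requires the row sums of one operand and the column sums of the other.

```python
import numpy as np

def quantize(x):
    """Map a float array to uint8 using a scale and an offset (assumed scheme)."""
    qmax = 255
    # Extend the range to include zero so that 0.0 maps exactly to an integer.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / qmax if x_max > x_min else 1.0
    offset = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + offset, 0, qmax).astype(np.uint8)
    return q, scale, offset

def dequantize(q, scale, offset):
    """Recover approximate real values: x is roughly scale * (q - offset)."""
    return scale * (q.astype(np.int32) - offset)

def quantized_gemm(qa, a_off, qb, b_off):
    """Integer GEMM of (qa - a_off) @ (qb - b_off), expanded into separate passes.

    sum_k (qa[i,k] - a_off) * (qb[k,j] - b_off)
      = (qa @ qb)[i,j] - b_off * rowsum(qa)[i] - a_off * colsum(qb)[j]
        + K * a_off * b_off
    """
    qa32 = qa.astype(np.int32)
    qb32 = qb.astype(np.int32)
    k = qa.shape[1]
    raw = qa32 @ qb32                            # pass 1: raw integer product
    row_sums = qa32.sum(axis=1, keepdims=True)   # pass 2: row sums of A, shape (M, 1)
    col_sums = qb32.sum(axis=0, keepdims=True)   # pass 3: column sums of B, shape (1, N)
    return raw - b_off * row_sums - a_off * col_sums + k * a_off * b_off
```

Under the same assumptions, the real-valued product is then approximated by multiplying the integer accumulator by the two scales, for example `scale_a * scale_b * quantized_gemm(qa, a_off, qb, b_off)`.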