Neural networks are computationally intensive applications. Some large-scale neural networks, such as the VGG-16 convolutional neural network (CNN), require about 30 billion floating-point operations (30 GFLOPs) to classify a single image. A large portion of the computation is devoted to multiply and accumulate operations, which are used in computing dot products and scalar products, for example.
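As an illustration of why multiply and accumulate operations dominate, the following minimal sketch computes a dot product as a sequence of such operations and estimates the count for one convolutional layer. The function names are hypothetical, and the layer count assumes a stride of 1, an output of the stated size, and no bias term; it is not taken from any particular implementation.

```python
def dot_product_mac(weights, activations):
    """Compute a dot product as a sequence of multiply and accumulate steps.

    Returns the result together with the number of steps performed.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    acc = 0.0
    mac_count = 0
    for w, x in zip(weights, activations):
        acc += w * x  # one multiply followed by one accumulate
        mac_count += 1
    return acc, mac_count


def conv_layer_mac_count(out_h, out_w, out_channels, kernel_size, in_channels):
    """Estimate multiply and accumulate steps for one convolutional layer.

    Assumption: each output value is a dot product over a
    kernel_size x kernel_size x in_channels window; bias is ignored.
    """
    per_output = kernel_size * kernel_size * in_channels
    return out_h * out_w * out_channels * per_output


result, steps = dot_product_mac([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# A 3x3 convolution from 3 to 64 channels on a 224x224 output map
first_layer = conv_layer_mac_count(224, 224, 64, 3, 3)
```

Under these assumptions, a single 3x3 layer on a 224x224 image already needs tens of millions of multiply and accumulate steps, which suggests how the total for a deep network reaches the billions.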
Hardware accelerators have been used to reduce the computation time. Example hardware accelerators include application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), and special-purpose processors such as graphics processing units (GPUs). Though the performance improvement provided by hardware accelerators is considerable, so too is the increase in power consumption and data bandwidth requirements. Weights and input activations are often stored as 32-bit single-precision floating-point values, and the hardware accelerators perform multiply and accumulate (MAC) operations on 32-bit operands.
A number of approaches have been proposed for reducing the computational requirements of neural networks. In some approaches, the number of bits used to represent the weights and input activations is reduced, which reduces both computational and bandwidth requirements. However, these approaches may require layer-specific architectures and/or specific training procedures. Some prior approaches may also perform poorly in complex classification tasks.
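To make the idea of reducing the number of bits concrete, the sketch below applies generic symmetric uniform quantization to a list of weights. This is only one common scheme, chosen here for illustration; it is an assumption and not the method of any specific prior approach referred to above. The names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(values, num_bits):
    """Map floating-point values to signed integer codes of num_bits bits.

    Assumption: a single scale factor is shared by all values, chosen so
    that the largest magnitude maps to the largest representable code.
    """
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2")
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point values from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_symmetric(weights, num_bits=8)
approx = dequantize(codes, scale)
```

Storing 8-bit codes in place of 32-bit values cuts the storage and transfer size of the weights by a factor of four, at the cost of a rounding error of at most about half the scale factor per value.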