Artificial neural networks are computing systems with an architecture based on biological neural networks. Artificial neural networks can be trained using training data to learn how to perform a certain task, such as identifying or classifying physical objects, activities, or characters from images or videos. An artificial neural network, such as a deep neural network, may include multiple layers of processing nodes. Each processing node on a layer can perform computations on input data generated by processing nodes on the preceding layer to generate output data. For example, a processing node may perform a set of arithmetic operations, such as multiplications and additions, to generate an intermediate output, or perform post-processing operations on the intermediate output to generate a final output. Such a network may include thousands or more processing nodes and millions or more parameters.
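The per-node computation described above can be illustrated with a minimal sketch. The function names (`weighted_sum`, `relu`, `node_output`, `layer_output`) are hypothetical, and the choice of a rectified linear unit as the post-processing operation is an assumption made only for illustration; the description above does not specify a particular post-processing operation.

```python
from typing import List, Sequence


def weighted_sum(inputs: Sequence[float], weights: Sequence[float],
                 bias: float = 0.0) -> float:
    """Multiply each input by its weight and add the products (plus a bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def relu(value: float) -> float:
    """Assumed post-processing step: clamp negative values to zero."""
    return max(0.0, value)


def node_output(inputs: Sequence[float], weights: Sequence[float],
                bias: float = 0.0) -> float:
    """Intermediate output from multiply-and-add, then post-processing."""
    intermediate = weighted_sum(inputs, weights, bias)
    return relu(intermediate)


def layer_output(inputs: Sequence[float],
                 weight_rows: Sequence[Sequence[float]],
                 biases: Sequence[float]) -> List[float]:
    """Each node on the layer consumes the outputs of the preceding layer."""
    return [node_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical values: two outputs from a preceding layer feed two nodes.
previous_layer = [1.0, 2.0]
weights = [[0.5, -1.0], [0.25, 0.5]]
biases = [0.0, 0.5]
print(layer_output(previous_layer, weights, biases))  # [0.0, 1.75]
```

In this sketch the first node produces an intermediate output of -1.5, which the assumed post-processing step clamps to 0.0, while the second node produces 1.75 and passes it through unchanged.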
In general, a neural network may be developed, trained, and made available to many end users. The end users can then use the trained neural network, with or without changing the existing network, to perform various tasks, which may be referred to as the inference process. When a neural network is being built, the top priority may be to obtain a working and accurate network. Thus, floating point numbers and floating point arithmetic are generally used during training to preserve accuracy. The training process can be performed on a computing system that has sufficient memory space and computation power and, in many cases, may not require real-time performance and may take hours, days, or months. The inference process, however, may be performed using pre-trained networks on many computing devices with limited memory space and computation power, such as mobile devices or embedded devices. Thus, in many cases, accessing memory that stores the large floating point data and/or performing floating point computation (which may cause high power consumption) may become a bottleneck for the inference process.
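The memory pressure described above can be made concrete with a rough estimate. The following sketch assumes a hypothetical network with ten million parameters stored as 32-bit floating point values; the 8-bit figure is included purely as an assumed point of comparison and is not taken from the description above. The function name `parameter_storage_bytes` is likewise hypothetical.

```python
def parameter_storage_bytes(num_parameters: int, bits_per_parameter: int) -> int:
    """Return the bytes needed to store the parameters, rounded up."""
    if num_parameters < 0 or bits_per_parameter <= 0:
        raise ValueError("counts must be non-negative and widths positive")
    total_bits = num_parameters * bits_per_parameter
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical parameter count, for illustration only.
num_parameters = 10_000_000

# 32-bit floating point storage, as commonly used during training.
fp32_bytes = parameter_storage_bytes(num_parameters, 32)

# Assumed 8-bit representation, shown only for comparison.
int8_bytes = parameter_storage_bytes(num_parameters, 8)

print(fp32_bytes)  # 40000000
print(int8_bytes)  # 10000000
```

Under these assumptions the parameters alone occupy 40 megabytes in 32-bit floating point form, each of which may need to be read from memory during inference, which suggests why memory access can dominate on a device with limited memory space.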