To maintain high computational precision, computers generally perform mathematical computations on numbers either in a floating-point format or in a fixed-point format with a large number of bits (for example, 32 or 64 bits). To avoid loss of precision in matrix computations, for example, computing devices commonly accumulate multiplication results in a 32-bit floating-point format, in which each number is represented by a 24-bit mantissa (including a sign bit) and an eight-bit exponent. This format strains memory resources and computational logic, particularly in large-scale, repeated computations, such as the tensor computations that are used in training deep neural networks.
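By way of illustration only, the 32-bit layout described above can be inspected with a short Python sketch. It assumes the common IEEE 754 single-precision encoding, which the present description does not name explicitly, in which the 24 mantissa bits are stored as one sign bit plus 23 fraction bits. The function name float32_fields is hypothetical.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Split x, rounded to a 32-bit float, into (sign, exponent, fraction).

    Assumes the IEEE 754 single-precision layout: 1 sign bit,
    8 exponent bits (biased by 127), and 23 stored fraction bits.
    """
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still biased
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction


if __name__ == "__main__":
    for value in (1.0, -2.5, 0.1):
        print(value, float32_fields(value))
```

The sign bit and the 23 fraction bits together account for the 24 mantissa bits mentioned above, and the remaining eight bits hold the exponent.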
Some authors have suggested that numerical precision can be reduced in deep learning applications without causing a severe degradation of performance. (The term “numerical precision” is used in the present description and in the claims in its conventional sense, to refer to the number of bits that are used in representing a number.) For example, Gupta et al. describe a scheme for training deep networks using only 16-bit-wide fixed-point numbers in “Deep Learning with Limited Numerical Precision,” published as arXiv preprint arXiv:1502.02551v1 (2015). The authors use a stochastic rounding technique in converting numbers to the lower-precision format, and they demonstrate that little or no degradation of classification accuracy is incurred when this technique is used in training computations.
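The general idea of stochastic rounding to a fixed-point format can be sketched as follows. This is a minimal illustration and not the cited authors' implementation: the function name, the default split of eight fractional bits within a 16-bit signed word, and the saturation behavior are all assumptions made for the example. A value is rounded up with a probability equal to its distance above the lower representable neighbor, so that the rounded result equals the original value on average.

```python
import math
import random
from typing import Optional


def stochastic_round_fixed(x: float, frac_bits: int = 8, word_bits: int = 16,
                           rng: Optional[random.Random] = None) -> int:
    """Return x as a signed fixed-point integer, rounded stochastically.

    frac_bits and word_bits are illustrative assumptions; the result
    represents the value (returned integer) / 2**frac_bits.
    """
    if rng is None:
        rng = random.Random()
    scaled = x * (1 << frac_bits)
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the range of a signed word_bits-wide integer.
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, lower))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, frac_bits=0, rng=rng) for _ in range(10000)]
    # Each sample is 0 or 1; the average should be close to 0.3.
    print(sum(samples) / len(samples))
```

In contrast to round-to-nearest, which would map 0.3 to 0 every time in this example, the stochastic variant preserves small values in expectation, which is the property the cited work relies on.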