High dynamic range computing has recently become increasingly important in applications across various fields. Compared with fixed-point numbers of an equivalent bit length, floating-point numbers have a wider dynamic range and therefore receive more attention. Hereinafter, a bit length refers to a given number of bits used to represent a data group. As a result, floating-point number processors are often implemented as a part of a primary processor or as a coprocessor for the primary processor.
When a floating-point number is used to approximately represent a real number, its format usually includes a sign bit S, an exponent field E, and a mantissa field M, and can be expressed as {S, E, M}. The value of a floating-point number used in digital computing may be represented as {S, E, M} = (−1)^S × (1 + M) × 2^E, in which the base of the exponentiation is set to 2.
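The value formula above can be sketched numerically. In the snippet below, `decode_float` is a hypothetical helper name, and the parameter m is assumed to be the fractional value of the mantissa field:

```python
def decode_float(s, e, m):
    # Value of a triple {S, E, M} with base-2 exponentiation:
    # (-1)**S * (1 + M) * 2**E, where m is the fractional mantissa value.
    return (-1) ** s * (1 + m) * 2 ** e

# {S=0, E=2, M=0.5} represents (1 + 0.5) * 2**2 = 6.0
# {S=1, E=0, M=0.0} represents -1.0
```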
The accuracy of a floating-point number may be determined by the bit length of its mantissa field. That is, a floating-point number with a mantissa field of a greater bit length may represent a real number more accurately than a floating-point number with a mantissa field of a smaller bit length. Currently, general-purpose processors may be configured to perform addition, subtraction, multiplication, and division operations on floating-point numbers in compliance with the IEEE 754 standard. The 32-bit floating-point format of the IEEE 754 standard includes one sign bit, eight bits for the exponent field, and twenty-three bits for the mantissa field, in which a leading integer bit of the mantissa is implicit. The IEEE 754 standard further includes a 64-bit format that provides higher accuracy and a wider dynamic range.
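As a sketch of the 32-bit layout described above, the snippet below reinterprets a Python float as an IEEE 754 binary32 bit pattern and extracts the three fields (the function name is illustrative, not from the source):

```python
import struct

def fields_binary32(x):
    # Pack as big-endian binary32, then reinterpret the 32-bit pattern.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8-bit biased exponent (bias 127)
    mantissa = bits & 0x7FFFFF        # 23 explicit fraction bits
    return sign, exponent, mantissa

# 1.0 is stored as sign 0, biased exponent 127, mantissa 0;
# the leading integer bit of the mantissa is implicit.
```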
Regardless of the format, floating-point arithmetic requires more complicated computations than fixed-point arithmetic. Thus, a solution that speeds up floating-point computations while lowering power consumption has become a key issue in hardware design.
Conventionally, general-purpose processors are often configured to perform computations on floating-point numbers. However, general-purpose processors may be limited to floating-point numbers of a single format. A single format may be sufficient for some applications in terms of accuracy and dynamic range, but unnecessary for others. For example, neural network algorithms may not require high accuracy from floating-point numbers with an exponent field of a greater bit length, but may prefer high accuracy from floating-point numbers with a smaller exponent field (e.g., the weight variables and the gradients in a backward propagation algorithm). Moreover, activation functions such as the sigmoid function and the tanh function are non-linear, so input data with different levels of accuracy do not show a significant difference after being processed by the activation functions. Therefore, representing the input data in floating-point numbers of high accuracy may cause unnecessary power consumption.
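The point about accuracy and non-linear activations can be illustrated with a crude reduced-precision model, assumed here only for illustration: zeroing the low 16 mantissa bits of a binary32 value (a bfloat16-style truncation) barely changes the sigmoid output.

```python
import math
import struct

def truncate_mantissa(x):
    # Zero out the low 16 mantissa bits of the binary32 pattern,
    # leaving 7 explicit fraction bits (a bfloat16-style truncation).
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 4.7                       # input in sigmoid's saturated region
x_low = truncate_mantissa(x)  # reduced-accuracy version of the input
diff = abs(sigmoid(x) - sigmoid(x_low))  # negligible output difference
```

Because sigmoid saturates toward 1 for large inputs, the loss of mantissa bits in x is almost invisible after the activation, which is the motivation for using lower-accuracy formats where they suffice.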