One way to represent numerical values is in floating point format. Floating point numbers are widely used within the field of computing due to their ability to represent a wide range of values with relatively few bits. A floating point number may be represented using binary bits. A schematic illustration of a binary floating point number is shown in FIG. 1. The floating point number comprises a sign field S (101), mantissa field M (105) and exponent field E (103). The sign field comprises a single sign bit. If the floating point number is an unsigned number, it may not comprise a sign field. The exponent field E comprises a set of ew bits.
The mantissa has a bit width of mw bits and forms part of the significand. The significand is given by the value Y.M, where Y is the leading bit of the significand and can be either ‘1’ or ‘0’. When Y=1 the significand is referred to as ‘normal’, and when Y=0 the significand is referred to as ‘denormal’. For a given implementation, the value of the leading bit is typically fixed and can be referred to as an implicit bit. As such, only the mantissa M may be stored.
The sign bit denotes the sign of the floating point number. The exponent may either be a signed integer (typically in the range $-2^{ew-1} \le E \le 2^{ew-1}-1$, where ew is the number of exponent bits, or exponent width), or an unsigned integer (typically in the range $0 \le E \le 2^{ew}-1$). If the exponent is an unsigned integer, it is biased by an exponent bias B to give the ‘true’ exponent that determines the value of the floating point number.
Thus, in general, the value represented by a floating point number is given by $(-1)^{\mathrm{sign}} \times 2^{E-B} \times \mathrm{significand} = (-1)^{\mathrm{sign}} \times 2^{E-B} \times Y.M$.
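The value formula can be illustrated with a minimal sketch. The function name `decode_float`, its parameter names, and the example parameters (a 23-bit mantissa and a bias of 127, as used by the IEEE 754 single precision format) are assumptions chosen for illustration; the text above does not prescribe any particular format.

```python
def decode_float(sign, exp_field, mantissa, mw, bias, leading_bit=1):
    """Return (-1)^sign * 2^(E - B) * Y.M as a Python float.

    sign        -- the sign bit (0 or 1)
    exp_field   -- the unsigned value E held in the exponent field
    mantissa    -- the mw-bit mantissa field M, as an unsigned integer
    mw          -- the mantissa width in bits
    bias        -- the exponent bias B
    leading_bit -- the implicit bit Y (1 for normal, 0 for denormal)
    """
    # Y.M: the leading bit sits to the left of the radix point and the
    # mw mantissa bits sit to the right of it.
    significand = leading_bit + mantissa / (1 << mw)
    return (-1) ** sign * 2.0 ** (exp_field - bias) * significand


# Illustrative parameters only (assumed, not taken from the text above).
MW, BIAS = 23, 127
print(decode_float(0, 127, 0, MW, BIAS))        # significand 1.0, true exponent 0
print(decode_float(1, 128, 1 << 22, MW, BIAS))  # significand 1.5, true exponent 1, negative
```

With these assumed parameters, the first call evaluates 1.0 × 2^0 and the second evaluates −1.5 × 2^1.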
Biasing the exponent allows the ‘true’ exponent (i.e. E−B) to be a negative or a positive integer whilst limiting the value in the exponent field E to an integer greater than or equal to zero. This is advantageous because it enables the relative sizes of exponents for different floating point numbers to be compared more easily than if the value in the exponent field was a signed integer.
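The comparison advantage can be seen in a small sketch. It assumes a hypothetical 4-bit exponent field with a bias of 7, and it assumes that a signed exponent would be stored in two's complement form; neither assumption comes from the text above.

```python
EW = 4    # hypothetical exponent width in bits
BIAS = 7  # hypothetical exponent bias B


def biased_field(true_exp):
    """Exponent field contents when the true exponent is stored as E = true_exp + B."""
    return true_exp + BIAS


def twos_complement_field(true_exp):
    """Raw field contents if the true exponent were stored in two's complement."""
    return true_exp & ((1 << EW) - 1)


small, large = -3, 2  # two true exponents, with small < large

# Biased fields keep the ordering of the true exponents, so a plain
# unsigned comparison of the raw fields is sufficient.
print(biased_field(small) < biased_field(large))

# The raw two's complement fields (0b1101 and 0b0010) do not keep that
# ordering when compared as unsigned integers.
print(twos_complement_field(small) < twos_complement_field(large))
```

Under these assumptions an unsigned comparator orders the biased fields correctly, whereas the two's complement fields would need a sign-aware comparison.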
Another way to represent numerical values is in a fixed point representation. In a fixed point representation the radix point (the point between the integer and fractional bits of the number) is at a fixed position in the numeral. This is in contrast to a floating point number, in which the radix point can move, or “float”. A schematic illustration of a binary fixed point number is shown in FIG. 2. The fixed point number comprises an integer part 201 and a fractional part 203. The integer part and fractional part are separated by the radix point 205. The number of bits forming the integer part may be referred to as the integer width iw, and the number of bits forming the fractional part may be referred to as the fractional width fw. For a fixed point format with a given bitwidth, the number of bits used to represent the integer part (and therefore the fractional part) is fixed.
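As a minimal sketch of this representation, an unsigned fixed point number can be held as a plain integer whose lowest fw bits are the fractional part. The function names and the example widths (iw = 4, fw = 4) are assumptions for illustration only.

```python
def fixed_to_value(raw, fw):
    """Interpret the unsigned integer raw as a fixed point number with fw fractional bits."""
    return raw / (1 << fw)


def split_fixed(raw, fw):
    """Return the integer part and the fractional part bits of raw."""
    integer_part = raw >> fw                # bits to the left of the radix point
    fractional_part = raw & ((1 << fw) - 1)  # bits to the right of the radix point
    return integer_part, fractional_part


# Hypothetical 8-bit format with iw = 4 and fw = 4: the bits 0101.1100
raw = 0b01011100
print(split_fixed(raw, 4))     # integer part bits 0101, fractional part bits 1100
print(fixed_to_value(raw, 4))  # 5 + 12/16
```

Because the radix point never moves, two such numbers with the same fw can be added or subtracted as ordinary integers.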
It is often desirable to convert a number between floating point and fixed point formats. For example, it may be less complex to perform certain arithmetic operations, such as addition and subtraction, in hardware on fixed point numbers due to the fixed position of the radix point. An example of a circuit for converting a floating point number to a fixed point number is shown in FIG. 3. The circuit 300 comprises a shifter 301 coupled to a subtraction unit 303 and a negation unit 305. The floating point number 100 to be converted is shown for the purposes of illustration.
The negation unit is configured to receive the sign bit of the floating point number as a first input, and the mantissa of the floating point number as a second input. The negation unit is configured to take the complement of the mantissa and output the result if the sign bit indicates the floating point number represents a negative value.
If the sign bit indicates that the floating point number represents a positive value, the negation unit does not take the complement of the mantissa and instead outputs the mantissa unaltered.
The subtraction unit receives as an input the exponent E of the floating point number and subtracts from that value the exponent bias B. The value of the bias may be fixed, or pre-determined, so that the subtraction unit is configured to subtract a fixed value from the exponent value. The value of the subtraction (E−B) is then output to the shifter 301.
The shifter 301 receives as a first input the mantissa M from the negation unit 305 and as a second input the value (E−B) from the subtraction unit 303. The shifter is operable to shift the mantissa left or right by a number of bits equal to the magnitude of the value of (E−B) to convert the floating point number to fixed point format. In particular, the shifter is configured to left-shift the mantissa by a number of bits equal to the value of (E−B) when (E−B) is greater than zero, and right-shift the mantissa by a number of bits equal to the value of (B−E) when the value of (E−B) is less than zero. The shifted mantissa is then output as the fixed point number. The output fixed point number may then be rounded. Examples of rounding modes include round-to-zero (RTZ), round-to-negative-infinity (RTN), round-to-positive-infinity (RTP) and round-to-nearest-even (RNE).
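The data flow through the negation unit, subtraction unit and shifter can be modelled in software as a rough sketch. Several assumptions are made that are not stated in the text above: the function name `float_to_fixed` is hypothetical, the implicit leading bit is prepended so that the full significand Y.M is shifted, the negation is modelled as two's complement negation, and the output is treated as a signed integer with mw fractional bits.

```python
def float_to_fixed(sign, exp_field, mantissa, mw, bias, leading_bit=1):
    """Model of the negate / subtract / shift data flow.

    Returns a signed integer holding the fixed point result with mw
    fractional bits, i.e. the represented value is result / 2**mw.
    """
    # Significand Y.M held as an integer scaled by 2**mw (assumption:
    # the implicit bit is included before shifting).
    significand = (leading_bit << mw) | mantissa

    # Negation unit: pass the value through unaltered for a positive
    # number, negate it for a negative number (two's complement assumed).
    if sign:
        significand = -significand

    # Subtraction unit: remove the bias to obtain the true exponent.
    shift = exp_field - bias

    # Shifter: left-shift for a positive true exponent, right-shift for
    # a negative one. Python's >> on a negative integer rounds toward
    # negative infinity, so discarded bits behave like RTN in this model.
    if shift >= 0:
        return significand << shift
    return significand >> (-shift)


MW, BIAS = 4, 7  # hypothetical small format for illustration
print(float_to_fixed(0, 9, 0b1000, MW, BIAS) / (1 << MW))  # 1.5 * 2**2
print(float_to_fixed(0, 6, 0b0001, MW, BIAS) / (1 << MW))  # 1.0625 * 2**-1, low bit discarded
print(float_to_fixed(1, 6, 0b0001, MW, BIAS) / (1 << MW))  # negative case of the same input
```

In this model, negating before the right shift means the discarded bits round negative results toward negative infinity; other rounding modes would need an explicit rounding step after the shifter, which is not shown here.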
The critical path of the binary circuit that determines the speed at which the circuit can convert a floating point number to a fixed point number may depend on the rounding mode and on whether the floating point number is signed or unsigned. For example, for an unsigned floating point number, the subtraction of the bias from the exponent is in the critical path of the circuit; for signed negative floating point numbers using RNI, the negation of the mantissa is in the critical path.