The present invention relates in general to numeric converters and, more particularly, to a programmable floating point to fixed point converter.
Many modern electronic systems perform computations on digital signals. Applications include neural networks and digital signal processing (DSP). For example, microprocessors commonly solve numeric problems such as finding the roots of a polynomial, computing a sum of products, or performing multiply and accumulate operations, all of which can involve multiplication, division, addition, and subtraction. Numbers are generally represented in either fixed point notation or floating point notation. Fixed point notation is typically binary, with 2's complement used to represent negative numbers. A floating point number consists of a sign bit S, a mantissa M preceded by an implied leading one, and an exponent E applied to a base (generally 2), i.e. $(-1)^{S} \cdot (1.M) \cdot 2^{E}$. It is common to convert between floating point and fixed point notation. The mathematical operations can be performed on either the fixed point number or the floating point number.
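As an illustration of the sign, mantissa, and exponent representation described above, the following minimal sketch decodes a floating point number into S, E, and M and converts it to a 2's complement fixed point word. It assumes the IEEE 754 single-precision layout (8-bit exponent biased by 127, 23-bit mantissa), which the text does not specify. The names `decode_float32` and `float_to_fixed`, the `frac_bits` and `width` parameters, and the choice of truncation with saturation are hypothetical and serve only as an example, not as the converter of the invention.

```python
import struct


def decode_float32(x):
    """Return (sign, unbiased exponent, 23-bit mantissa field) of x.

    Assumption: IEEE 754 single precision; x is rounded to that format first.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127  # remove the assumed bias of 127
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def float_to_fixed(x, frac_bits=8, width=16):
    """Convert a normalized float to a width-bit 2's complement fixed point word.

    The magnitude is truncated toward zero and the result saturates at the
    limits of the fixed point range (illustrative choices).
    """
    sign, exponent, mantissa = decode_float32(x)
    significand = (1 << 23) | mantissa       # 1.M as an integer scaled by 2**23
    shift = exponent - 23 + frac_bits        # align binary point to frac_bits
    if shift >= 0:
        magnitude = significand << shift
    else:
        magnitude = significand >> -shift
    value = -magnitude if sign else magnitude
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    value = max(lowest, min(highest, value))  # saturate on overflow
    return value & ((1 << width) - 1)         # 2's complement bit pattern


if __name__ == "__main__":
    print(decode_float32(-6.5))
    print(hex(float_to_fixed(-6.5)))
```

With 8 fractional bits, the value -6.5 corresponds to the integer -1664, whose 16-bit 2's complement pattern is what `float_to_fixed` returns.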
In order to perform multiply and accumulate operations, first and second digital signals, say 16 bits each, are multiplied together to form a 32-bit product. The product term is added to the prior contents of an accumulator. Successive sets of digital signals are multiplied and accumulated in a similar manner. As the number of digital signals that are multiplied and accumulated increases, the size of the accumulator needed to store the results increases accordingly. The accumulator can be made, for example, with a ripple-carry adder having 40 bits of width. However, a ripple-carry adder becomes slower as bits are added because the carry must propagate sequentially from the least significant bit to the most significant bit. A 40-bit wide accumulator also consumes a large amount of integrated circuit (IC) die area. Thus, there is a practical limit on the number of bits that can be allocated for accumulation.
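The accumulator sizing concern can be made concrete with a short software model. The sketch below multiplies signed 16-bit operands and adds each product into a 40-bit 2's complement accumulator that wraps on overflow. The helper names `wrap_signed` and `mac` are hypothetical, and wrap-around (rather than saturation) is an assumption made for illustration.

```python
def wrap_signed(value, width):
    """Interpret the low `width` bits of value as a 2's complement number."""
    value &= (1 << width) - 1
    if value >> (width - 1):
        value -= 1 << width
    return value


def mac(pairs, acc_width=40):
    """Multiply 16-bit signed operand pairs and accumulate the products."""
    acc = 0
    for a, b in pairs:
        product = wrap_signed(a, 16) * wrap_signed(b, 16)  # fits in 32 bits
        acc = wrap_signed(acc + product, acc_width)        # wraps on overflow
    return acc


if __name__ == "__main__":
    worst = (-32768, -32768)     # largest-magnitude product, 2**30
    print(mac([worst] * 511))    # still representable in 40 bits
    print(mac([worst] * 512))    # exceeds 2**39 - 1 and wraps negative
```

Under these assumptions the largest product is 2^30, so a 40-bit accumulator holds 511 worst-case products and overflows on the 512th; guaranteeing more accumulations without overflow requires a wider accumulator.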
Another option for the accumulator is to use a carry-select adder, which computes the sum bits in parallel, typically in groups of four, both with and without an incoming carry. The actual carry bits control a group of multiplexers that select the correct result depending on whether or not a carry was present. Although the accumulation speed is greater in a carry-select adder because of the parallel processing, the size concerns on the IC die become an even greater problem with the additional multiplexers and redundant logic.
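The carry-select scheme can be sketched as a bit-level software model. Each 4-bit group is added twice, once assuming a carry-in of 0 and once assuming a carry-in of 1, and the actual carry from the previous group selects between the two results, as a multiplexer would. The function names `ripple_add` and `carry_select_add` are hypothetical, and the model assumes the width is a multiple of the group size. In hardware the groups are evaluated concurrently; the loop here only models the selection.

```python
def ripple_add(a, b, carry_in, width):
    """Add two width-bit values one bit at a time; return (sum, carry_out)."""
    total = 0
    carry = carry_in
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_select_add(a, b, width=40, group=4):
    """Model a carry-select adder; assumes width is a multiple of group."""
    mask = (1 << group) - 1
    result = 0
    carry = 0
    for low in range(0, width, group):
        ga = (a >> low) & mask
        gb = (b >> low) & mask
        sum0, carry0 = ripple_add(ga, gb, 0, group)  # result if no carry-in
        sum1, carry1 = ripple_add(ga, gb, 1, group)  # result if carry-in
        # The actual carry acts as the multiplexer select line.
        group_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= group_sum << low
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(123456789, 987654321))
    print(carry_select_add(0xFFFFFFFFFF, 1))
```

Every group needs two adders plus a multiplexer, which reflects the redundant logic and added die area noted above.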