Because calculations with floating-point numbers can require significant computing power, some computers include special hardware for performing floating-point arithmetic called floating point processors (FPP), math coprocessors, etc. However, low-cost digital signal processors, microprocessors and microcontrollers such as those used in disk drives do not have floating-point processors, so when needed, floating-point arithmetic is performed using fixed bit registers with software routines providing the additional capability lacking in the hardware. Techniques for performing these floating-point calculations in an efficient way using the hardware and software tools that are readily available for embedded systems are desirable.
In theory floating-point numbers do not have a fixed number of digits before and after the decimal point, thereby allowing for the representation of very small or very large numbers. However, digital computer representations of floating-point numbers are usually approximations because a limited number of bits of memory are allocated to encode the numbers. The more coarse the approximation the faster calculations can proceed, but there is, of course, a loss of precision. Managing these tradeoffs is one of the design problems for many systems.
Single-chip digital signal processors (DSPs) are specialized microprocessors designed for fast, real-time computations. One common feature of DSPs is the “multiply and accumulate” instruction, or MAC. This instruction multiplies two values and adds the product to the accumulator in a single clock cycle. Fixed-point processors use a modified form of integers for calculations. Numbers entered as real values are scaled by a power of two and then rounded or truncated to an integer. The processor considers the scale value n (from number*2^n) and uses this to determine the location of the fixed decimal point. For example, the number 1.75 could be represented as a 4-bit integer 7 (i.e. ‘0111’) with a scale of 2. The scale value of 2 means that the first two bits are for the value (and sign for 2's complement numbers) to the left of the decimal point, the third bit represents “0.5” and the fourth bit represents “0.25”. The scale value is a shift of the decimal point. A 4-bit number where the first two bits represent the integer portion and the second two bits represent the fraction is commonly referred to as a 2.2 format.
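By way of illustration only, the 2.2 example above can be sketched in Python as follows. The function names to_fixed, from_fixed and mac are hypothetical and chosen for this sketch; the range check and the use of rounding rather than truncation are assumptions, and the mac function models only the arithmetic of a multiply-accumulate, not its single-cycle timing.

```python
def to_fixed(value, frac_bits, total_bits):
    """Scale a real value by 2**frac_bits and round it to a signed integer."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if not lowest <= scaled <= highest:
        # Assumption: out-of-range values are rejected rather than saturated.
        raise OverflowError("value does not fit in the requested format")
    return scaled


def from_fixed(raw, frac_bits):
    """Recover the real value by undoing the 2**frac_bits scaling."""
    return raw / (1 << frac_bits)


def mac(accumulator, a, b):
    """Model of a multiply-accumulate: add the product a*b to the accumulator."""
    return accumulator + a * b


# 1.75 in a 2.2 format: 4 bits in total, scale value n = 2.
raw = to_fixed(1.75, frac_bits=2, total_bits=4)
print(raw, format(raw, "04b"))           # 7 0111

# The product of two 2.2 numbers carries 2 + 2 = 4 fraction bits.
acc = mac(0, raw, to_fixed(0.5, 2, 4))   # 7 * 2 = 14
print(from_fixed(acc, frac_bits=4))      # 0.875
```

Note that the accumulator in this sketch holds a result with twice as many fraction bits as the operands, so the result must be read with a scale value of 4 rather than 2.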
Other standard ways to represent numbers include representing floating-point numbers as an “exponent”, a “significand”, and a “sign bit”. The encoding of a floating-point number into a binary number can be done by normalizing the number, that is, by shifting the bits either left or right until the shifted result lies between 0.5 and 1 if the exponent is a power of 2. (If the exponent is a power of 16, the shifted result lies between 0.0625 (1/16) and 1.) A left-shift by one bit corresponds to multiplying by 2, and a right-shift by one bit corresponds to dividing by 2. The number of bit-positions shifted to normalize the number can be recorded as a signed integer. The negative of this integer (i.e., the number of bit-shifts required to recover the original number) can be defined as the exponent. Whether the right or left shift is assigned to the positive value is not significant. The normalized number between 0.5 and 1 is typically called the significand, because it contains the significant bits of the number. This floating-point encoding is analogous to scientific notation for decimal numbers. The word mantissa is often used as a synonym for significand.
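The normalization described above can be sketched, for positive numbers and a power-of-2 exponent, as follows. The function name normalize is hypothetical, and the choice of the half-open interval from 0.5 (inclusive) to 1 (exclusive) is an assumption made to match the convention of the Python standard library function math.frexp, against which the sketch is compared.

```python
import math


def normalize(x):
    """Return (significand, exponent) with 0.5 <= significand < 1
    and x == significand * 2**exponent, for x > 0."""
    if x <= 0:
        raise ValueError("this sketch handles positive numbers only")
    significand, exponent = x, 0
    while significand >= 1.0:
        significand /= 2.0   # one right shift; one more left shift needed to recover x
        exponent += 1
    while significand < 0.5:
        significand *= 2.0   # one left shift; one more right shift needed to recover x
        exponent -= 1
    return significand, exponent


print(normalize(6.0))        # (0.75, 3), since 6 = 0.75 * 2**3
print(normalize(0.15625))    # (0.625, -2), since 0.15625 = 0.625 * 2**-2
print(math.frexp(6.0))       # (0.75, 3), the library result for comparison
```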
An IEEE standard defines “Fp32” as a single precision floating-point format in which a floating-point number is represented by a sign bit, eight exponent bits, and 23 significand bits. The exponent is biased upward by 127 so that exponents in the range 2^−126 to 2^127 are represented using integers from 1 to 254. For “normal” numbers, the 23 significand bits are interpreted as the fractional portion of a 24-bit mantissa with an implied 1 as the integer portion.
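As a non-limiting illustration of the field layout just described, the following sketch splits a single precision value into its three fields and rebuilds a “normal” number from them. The function names decode_fp32 and rebuild_normal are hypothetical; the packing itself relies on the Python standard library struct module, whose “f” format code is a 4-byte IEEE single precision value.

```python
import struct


def decode_fp32(value):
    """Return (sign, biased_exponent, fraction) of value as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                        # 1 sign bit
    biased_exponent = (bits >> 23) & 0xFF    # 8 exponent bits
    fraction = bits & 0x7FFFFF               # 23 significand bits
    return sign, biased_exponent, fraction


def rebuild_normal(sign, biased_exponent, fraction):
    """Rebuild a normal number; biased exponents 0 and 255 are not handled."""
    if not 1 <= biased_exponent <= 254:
        raise ValueError("not a normal number")
    mantissa = 1.0 + fraction / (1 << 23)    # implied 1 as the integer portion
    return (-1.0) ** sign * mantissa * 2.0 ** (biased_exponent - 127)


print(decode_fp32(1.75))                         # (0, 127, 6291456)
print(rebuild_normal(*decode_fp32(-0.15625)))    # -0.15625
```

In the first call the biased exponent of 127 corresponds to 2^0, and the fraction 6291456 (0x600000) is 0.75 scaled by 2^23, giving a mantissa of 1.75.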
U.S. Pat. No. 7,225,216 to Wyland (issued May 29, 2007) describes a floating point multiply-accumulator that uses “mantissa logic” for combining a mantissa portion of floating point inputs and “exponent logic” coupled to the “mantissa logic.” The exponent logic adjusts the combination of an exponent portion of the floating point inputs by a predetermined value to produce a shift amount and allows pipeline stages in the mantissa logic, wherein an unnormalized floating point result is produced from the mantissa logic on each clock cycle.
Published application 2006/0195497 by Dobbek, et al. (Aug. 31, 2006) describes a shift process for a digital signal processor for shifting an operand to either the maximum or the minimum value, depending on a bit of the data input, when saturation occurs. A saturation detection circuit is combined with an arithmetic shifter and a final decision multiplexor. The final decision multiplexor receives the output from the arithmetic shifter and the saturated value from the saturation circuit. When saturation is detected by the saturation detection circuit, the final decision multiplexor selects the saturate minimum or the saturate maximum depending on whether the most significant bit of the data in equals one or zero, respectively.
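The general behavior of a saturating shift can be modeled in software as shown below. This is a behavioral sketch only and is not the circuit of the cited application; the function name saturating_shift_left and the 16-bit default width are assumptions introduced for illustration.

```python
def saturating_shift_left(value, shift, width=16):
    """Shift a signed value left, clamping to the format limits on overflow."""
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError("input does not fit in the given width")
    shifted = value << shift          # exact, since Python integers do not overflow
    if lowest <= shifted <= highest:
        return shifted                # no saturation: pass the shifter output through
    # Saturation: a negative input (most significant bit equal to one) selects
    # the minimum, a non-negative input (bit equal to zero) selects the maximum.
    return lowest if value < 0 else highest


print(saturating_shift_left(0x1000, 2))     # 16384, fits in 16 bits
print(saturating_shift_left(0x4000, 1))     # 32767, clamped to the maximum
print(saturating_shift_left(-0x4001, 1))    # -32768, clamped to the minimum
```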
In published application 20060294175 Koob, et al. (Dec. 28, 2006) describe a method of counting leading zeros or ones in a data word in a digital signal processor. During operation, the execution unit can receive a data word that has a width of N bits. The execution unit can sign extend the data word to a wider temporary data word. The temporary data word can be input to a counter to count the leading zeros within the temporary data word to get a result.
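A software model of counting leading zeros or ones through a sign-extended temporary word might look like the following. The sketch is an assumption about how such a count could be arranged and is not taken from the cited application: the function names, the 32-bit temporary width, and the final subtraction of the extension bits are all hypothetical.

```python
def sign_extend(word, width, wide_width):
    """Copy the sign bit of a width-bit word into the upper bits of a wider word."""
    mask = (1 << width) - 1
    word &= mask
    if word >> (width - 1):                          # sign bit is one
        word |= ((1 << wide_width) - 1) & ~mask      # fill the upper bits with ones
    return word


def count_leading_zeros(word, width):
    """Number of zero bits above the highest one bit in a width-bit word."""
    word &= (1 << width) - 1
    return width - word.bit_length()


def count_leading_ones(word, width):
    """Leading ones are the leading zeros of the inverted word."""
    return count_leading_zeros(~word & ((1 << width) - 1), width)


def leading_sign_bits(word, width, wide_width=32):
    """Count leading zeros (or ones, for a negative word) of a width-bit word
    by way of a sign-extended temporary word."""
    temp = sign_extend(word, width, wide_width)
    if temp >> (wide_width - 1):
        run = count_leading_ones(temp, wide_width)
    else:
        run = count_leading_zeros(temp, wide_width)
    return run - (wide_width - width)                # discard the extension bits


print(leading_sign_bits(0x0123, 16))    # 7 leading zeros
print(leading_sign_bits(0xFF80, 16))    # 9 leading ones
```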
In published application 0060200732 Dobbek, et al. (Sep. 7, 2006) describe a processor-based nested form polynomial engine. An instruction causes a processor to set coefficient and data address pointers for evaluating a polynomial, to load a coefficient and a data operand into a coefficient register and a data register, respectively, to multiply the contents of the coefficient register and the data register to produce a product, to add a next coefficient operand to the product to produce a sum, to provide the sum to an accumulator, and to repeat the loading, multiplying, adding and providing until evaluation of the polynomial is complete.
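Nested form evaluation of a polynomial, commonly known as Horner's method, can be sketched in software as follows. The sketch illustrates the multiply, add and accumulate sequence in general terms only and does not reproduce the instruction or registers of the cited application; the function name evaluate_nested and the highest-order-first coefficient ordering are assumptions.

```python
def evaluate_nested(coefficients, x):
    """Evaluate a polynomial in nested form.

    coefficients are given highest order first, so [2, 3, 4] stands for
    2*x**2 + 3*x + 4, evaluated as (2*x + 3)*x + 4.
    """
    if not coefficients:
        raise ValueError("at least one coefficient is required")
    accumulator = coefficients[0]
    for next_coefficient in coefficients[1:]:
        product = accumulator * x                   # multiply by the data operand
        accumulator = product + next_coefficient    # add the next coefficient
    return accumulator


print(evaluate_nested([2, 3, 4], 5))    # 69
```

Each pass through the loop uses one multiplication and one addition, so a polynomial of order N is evaluated with N multiplications and N additions.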
The development of computer programs for embedded systems involves the use of software programs such as compilers or assemblers that translate text instructions entered by the programmers into object code or machine language. In the following the term “compiler” will be used to encompass assemblers and similar programs. There are calculations and translations that can be made at “compile-time” while others can only be made at runtime. Because compile-time calculations are essentially done only once, efficiencies can be gained by performing calculations at compile-time rather than at runtime whenever possible.
There are many reasons to use vector and/or matrix operations in embedded control systems, but the large number of calculations required makes efficiency critical. Examples of vector and/or matrix operations include estimation of variables in mechanical systems, which involves translating a model of state variables into differential equations well known in the art such as
d(X)/dt=A*X+B*U
where X is a vector of states such as position, velocity and acceleration, A is a matrix which couples derivatives of states with the states, U is a vector of inputs such as current to a motor, and B is a matrix which couples inputs to the states. This state space can be converted from the form above into digital state space
X[k+1]=Phi*X[k]+Gamma*U[k]
where X[k+1] is the state prediction for the next sample, X[k] is the state for the current sample, Phi is the digitized coupling matrix that maps the state into the next state as a function of only X[k], and Gamma is the coupling matrix that maps the forcing function from the inputs U[k] into the next state X[k+1]. This example is well known in the art as a digital state space predictor. Another application is filtering, where the output of a filter is simply the dot product (vector*vector) of coefficients multiplied by a history of previous inputs and/or outputs. The well-known equation for the canonical form of a filter gives the output as
Y[n]=cy1*Y[n−1]+cy2*Y[n−2]+ . . . +cx0*X[n]+cx1*X[n−1]+cx2*X[n−2]
where Y[n−1] is the previous output and Y[n−2] is the output prior to Y[n−1], etc. Similarly, X[n] is the current input and X[n−1] is the previous input, etc. This operation can be built as the sum of two vector dot products, cy*Y+cx*X. The dot product is done using a circular vector so that new data is written over the oldest data. The pointer is updated to point to the now newest data, and the multiplication always starts with the youngest data entry and goes, in order, through the rest of the data entries. The index pointer to data used in the multiplication will wrap to the physical start of the vector when the end of the vector is reached. After the filter operation the pointer is updated so that the youngest-data pointer now points to where the oldest data was stored. The need to perform this operation efficiently is one of the motivations for the invention described herein.
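The circular vector dot product described above can be sketched in software as follows. This is a minimal sketch under stated assumptions and is not the implementation of the invention: the class name CircularVector, the function name filter_step, the decrementing write pointer, and the example coefficients are all hypothetical and chosen only so that reads run from the youngest entry toward the oldest with a wrap to the physical start of the vector.

```python
class CircularVector:
    """Fixed-length history in which new data overwrites the oldest data."""

    def __init__(self, length):
        self.data = [0.0] * length
        self.newest = 0                  # index of the youngest entry

    def push(self, sample):
        # The oldest entry sits just before the youngest one, so stepping the
        # pointer back by one (with wrap) lands on the slot to overwrite.
        self.newest = (self.newest - 1) % len(self.data)
        self.data[self.newest] = sample

    def dot(self, coefficients):
        # Start at the youngest entry and walk toward the oldest, wrapping to
        # the physical start of the vector when the end is reached.
        total = 0.0
        index = self.newest
        for coefficient in coefficients:
            total += coefficient * self.data[index]
            index = (index + 1) % len(self.data)
        return total


def filter_step(x_history, y_history, cx, cy, new_input):
    """One output of Y[n] = cy1*Y[n-1] + cy2*Y[n-2] + cx0*X[n] + cx1*X[n-1] + ..."""
    x_history.push(new_input)                          # X[n] replaces the oldest input
    output = x_history.dot(cx) + y_history.dot(cy)     # sum of two dot products
    y_history.push(output)                             # Y[n] becomes the youngest output
    return output


cx = [0.5, 0.25, 0.25]     # cx0, cx1, cx2 (example values only)
cy = [0.5, 0.25]           # cy1, cy2 (example values only)
x_hist, y_hist = CircularVector(len(cx)), CircularVector(len(cy))
outputs = [filter_step(x_hist, y_hist, cx, cy, u) for u in [1.0, 0.0, 0.0, 0.0]]
print(outputs)             # [0.5, 0.5, 0.625, 0.4375]
```

Because only the pointer moves, no data is copied when a new sample arrives; the cost of each update is one write and one index calculation regardless of the length of the history.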