1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to rounding the results of iterative calculations in microprocessors.
2. Description of the Related Art
Microprocessors are typically designed with a number of “execution units” that are each optimized to perform a particular set of functions or instructions. For example, one or more execution units within a microprocessor may be optimized to perform memory accesses, i.e., load and store operations. Other execution units may be optimized to perform general arithmetic and logic functions, e.g., shifts and compares. Many microprocessors also have specialized execution units configured to perform more complex arithmetic operations such as multiplication and reciprocal operations. These specialized execution units typically comprise hardware that is optimized to perform one or more particular arithmetic functions. In the case of multiplication, the optimized hardware is typically referred to as a “multiplier.”
In older microprocessors, multipliers were implemented using designs that conserved die space at the expense of arithmetic performance. Until recently, this was not a major problem because most applications, i.e., non-scientific applications such as word processors, did not frequently generate multiplication instructions. However, recent advances in computer technology and software are placing greater emphasis upon multiplier performance. For example, three-dimensional computer graphics, rendering, and multimedia applications all rely heavily upon a microprocessor's arithmetic capabilities, particularly multiplication and multiplication-related operations. As a result, in recent years microprocessor designers have favored performance-oriented designs that use more die space. Unfortunately, the increased die space needed for these high performance multipliers reduces the space available for other execution units within the microprocessor. Thus, a mechanism for increasing multiplier performance while conserving die space is needed.
The die space used by multipliers is of particular importance to microprocessor designers because many microprocessors, e.g., those configured to execute MMX™ (multimedia extension) or 3D graphics instructions, may use more than one multiplier. MMX and 3D graphics instructions are often implemented as “vectored” instructions. Vectored instructions have operands that are partitioned into separate sections, each of which is independently operated upon. For example, a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections. Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied. FIG. 1 illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication. To quickly execute vectored multiply instructions, many microprocessors use a number of multipliers in parallel. In order to conserve die space, a mechanism for reducing the number of multipliers in a microprocessor is desirable. Furthermore, a mechanism for reducing the amount of support hardware (e.g., bus lines) that may be required for each multiplier is also desirable.
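By way of illustration only, the difference between a scalar and a vectored multiply may be sketched in software as follows. The function names, the default 32-bit width, and the treatment of every section as unsigned are assumptions of this sketch and are not taken from FIG. 1.

```python
def scalar_multiply(a, b, width=32):
    """Multiply two unsigned operands of the given width as single values."""
    mask = (1 << width) - 1
    return (a & mask) * (b & mask)


def vector_multiply(a, b, width=32, lanes=2):
    """Multiply corresponding sections of two packed unsigned operands.

    Each operand is split into `lanes` equal sections. The returned list
    holds one product per section, least significant section first.
    """
    lane_bits = width // lanes
    lane_mask = (1 << lane_bits) - 1
    products = []
    for i in range(lanes):
        x = (a >> (i * lane_bits)) & lane_mask  # section i of the first operand
        y = (b >> (i * lane_bits)) & lane_mask  # section i of the second operand
        products.append(x * y)
    return products
```

With the operands 0x00030002 and 0x00050004, the vectored form multiplies 2 by 4 and 3 by 5 independently, whereas the scalar form treats each operand as one 32-bit value and produces cross terms between the sections.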
Another factor that may affect the number of multipliers used within a microprocessor is the microprocessor's ability to operate upon multiple data types. Most microprocessors must support multiple data types. For example, x86 compatible microprocessors execute instructions that are defined to operate upon an integer data type and instructions that are defined to operate upon floating point data types. Floating point data can represent numbers within a much larger range than integer data. For example, a 32-bit signed integer can represent the integers between −2^31 and 2^31−1 (using two's complement format). In contrast, a 32-bit (“single precision”) floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from 2^−126 to 2^127×(2−2^−23) in both positive and negative numbers. While both integer and floating point data types are capable of representing positive and negative values, integers are considered to be “signed” for multiplication purposes, while floating point numbers are considered to be “unsigned.” Integers are considered to be signed because they are stored in two's complement representation.
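The ranges quoted above may be checked numerically. The following is a minimal sketch; the constant names and the helper `single_from_bits` are hypothetical and serve only to compare the closed-form limits with the corresponding single precision bit patterns.

```python
import struct

# Limits of a 32-bit two's complement integer.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Limits of a normalized IEEE 754 single precision magnitude.
SINGLE_MIN_NORMAL = 2.0 ** -126
SINGLE_MAX = 2.0 ** 127 * (2 - 2.0 ** -23)


def single_from_bits(pattern):
    """Reinterpret a 32-bit pattern as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", pattern))[0]
```

Here 0x00800000 is the smallest normalized pattern and 0x7F7FFFFF the largest finite pattern.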
Turning now to FIG. 2A, an exemplary format for an 8-bit integer 100 is shown. As illustrated in the figure, negative integers are represented using the two's complement format 104. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant of one is then added to the least significant bit (LSB).
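The negation procedure just described (invert all bits, then add one at the LSB) may be sketched as follows. The function name and the default 8-bit width are assumptions chosen to match the 8-bit example.

```python
def negate_twos_complement(value, bits=8):
    """Negate a fixed-width integer by inverting its bits and adding one."""
    mask = (1 << bits) - 1
    ones_complement = ~value & mask      # invert every bit
    return (ones_complement + 1) & mask  # add one at the LSB, keep `bits` bits
```

For example, the 8-bit pattern for 5 (0x05) becomes 0xFB, which is the two's complement pattern for −5, and applying the function again returns 0x05.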
Turning now to FIG. 2B, an exemplary format for a 32-bit (single precision) floating point number is shown. A floating point number is represented by a significand, an exponent and a sign bit. The base for the floating point number is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is typically used. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. In order to save space, the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number. Additional information regarding floating point numbers and operations performed thereon may be obtained in IEEE Standard 754. Unlike the integer representation, two's complement format is not typically used in the floating point representation. Instead, sign and magnitude form is used. Thus, only the sign bit is changed when converting from a positive value 106 to a negative value 108. For this reason, many microprocessors use two multipliers, i.e., one for signed values (two's complement format) and another for unsigned values (sign and magnitude format). Thus, a mechanism for increasing floating point, integer, and vector multiplier performance while conserving die space is needed.
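The single precision layout may be illustrated with a short sketch that splits a value into its sign bit, exponent field, and stored fraction, and then rebuilds the value with the implied integer bit. The function names are hypothetical. The exponent bias of 127 comes from IEEE Standard 754 rather than from the text above, and only normalized values are handled.

```python
import struct


def decode_single(x):
    """Return (sign, exponent_field, fraction_field) of a single precision value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent
    fraction = bits & 0x7FFFFF      # 23 stored bits right of the radix point
    return sign, exponent, fraction


def value_of_normalized(sign, exponent, fraction):
    """Rebuild a normalized value; the integer bit (1) is implied, not stored."""
    significand = 1 + fraction / 2 ** 23
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)
```

Decoding 1.5 and −1.5 yields identical exponent and fraction fields; only the sign bit differs, which is the sign and magnitude behavior described above.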
The problems outlined above are in large part solved by a multiplier configured in accordance with the present invention. In one embodiment, the multiplier may perform signed and unsigned scalar and vector multiplication using the same hardware. The multiplier may receive either signed or unsigned operands in either scalar or packed vector format and accordingly output a signed or unsigned result that is either a scalar or a vector quantity. Advantageously, this embodiment may reduce the total number of multipliers needed within a microprocessor because it may be shared by execution units and perform both scalar and vector multiplication. This space savings may in turn allow designers to optimize the multiplier for speed without fear of using too much die space.
In another embodiment, speed may be increased by configuring the multiplier to perform fast rounding and normalization. This may be accomplished by configuring the multiplier to calculate two versions of an operand, e.g., an overflow version and a non-overflow version, in parallel and then select between the two versions.
In other embodiments, the multiplier may be further optimized to perform certain calculations such as evaluating constant powers of an operand (e.g., reciprocal or reciprocal square root operations). Iterative formulas may be used to recast these operations into multiplication operations. Iterative formulas generate intermediate products which are used in subsequent iterations to achieve greater accuracy. In some embodiments, the multiplier may be configured to store these intermediate products for future iterations. Advantageously, the multiplier may be configured to compress these intermediate products before storing them, which may further conserve die space.
In one embodiment, the multiplier may comprise a partial product generator, a selection logic unit, and an adder. The multiplier may also comprise a multiplicand input configured to receive a multiplicand operand (signed or unsigned), a multiplier input configured to receive a multiplier operand (also signed or unsigned), and a sign-in input. The sign-in input is configured to receive a sign-in signal indicative of whether the multiplier is to perform signed or unsigned multiplication. The partial product generator, which is coupled to the multiplicand input, is configured to generate a plurality of partial products based upon the multiplicand operand. The selection logic unit, which is coupled to the partial product generator and the multiplier input, is configured to select a number of partial products from the partial product generator based upon the multiplier operand. The adder, which is coupled to the selection logic unit, is configured to sum the selected partial products to form a final product. The final product, which may be signed or unsigned, may then be output to other parts of the microprocessor.
In addition, the multiplier may further comprise an “effective sign” calculation unit. In one embodiment, the calculation unit may comprise a pair of AND gates, each configured to receive the most significant bit of one operand and the sign-in signal. The output of each AND gate is used as the effective sign for that gate's operand. The effective sign may be appended to each operand for use as the operand's sign during the multiplication process. Advantageously, the effective sign may allow both unsigned operands and signed operands to be multiplied on the same hardware.
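The effective sign calculation may be modeled in software as shown below. This is a sketch under stated assumptions: the function names and the 8-bit default are hypothetical, and appending the effective sign is modeled by giving that extra bit the negative weight of a two's complement sign bit.

```python
def effective_sign(operand, sign_in, bits=8):
    """AND the operand's most significant bit with the sign-in signal."""
    msb = (operand >> (bits - 1)) & 1
    return msb & sign_in


def with_effective_sign(operand, sign_in, bits=8):
    """Value of the operand once the effective sign is appended above its MSB.

    The appended bit carries weight -(2**bits), so an operand with sign_in = 0
    keeps its unsigned value and one with sign_in = 1 takes its signed value.
    """
    magnitude = operand & ((1 << bits) - 1)
    return magnitude - (effective_sign(operand, sign_in, bits) << bits)
```

The pattern 0xFF is thus read as 255 when sign-in is deasserted and as −1 when it is asserted, so one datapath may serve both cases.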
A method for operating a multiplier within a microprocessor is also contemplated. In one embodiment, the method comprises receiving a multiplier operand, a multiplicand operand, and a sign-in signal from other functional units within the microprocessor. An effective sign bit for the multiplicand operand is generated from the sign-in signal and the most significant bit of the multiplicand operand. A plurality of partial products may then be calculated from the effective sign bit and the multiplicand operand. Next, a number of the partial products may be selected according to the multiplier operand. The partial products are then summed, and the results are output. In other embodiments, the steps may be performed in parallel or in a different order.
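The steps of this method may be sketched as follows. A simple radix-2 scheme (one shifted copy of the multiplicand per multiplier bit) is assumed purely for brevity; the text above does not specify how the partial products are encoded. The function name and the 8-bit default are likewise hypothetical.

```python
def multiply(multiplicand, multiplier, sign_in, bits=8):
    """Signed or unsigned multiply built from partial products."""
    mask = (1 << bits) - 1

    # Effective sign of each operand: MSB AND sign-in.
    s_mcand = ((multiplicand >> (bits - 1)) & 1) & sign_in
    s_mult = ((multiplier >> (bits - 1)) & 1) & sign_in

    # Multiplicand with its effective sign appended (negative weight).
    mcand = (multiplicand & mask) - (s_mcand << bits)
    # Multiplier bit pattern with its effective sign appended as bit `bits`.
    mult_bits = (multiplier & mask) | (s_mult << bits)

    # Generate one partial product per multiplier bit position.
    partial_products = [mcand << i for i in range(bits + 1)]
    # The appended sign position has negative weight in two's complement.
    partial_products[bits] = -partial_products[bits]

    # Select the partial products indicated by the multiplier operand.
    selected = [pp for i, pp in enumerate(partial_products)
                if (mult_bits >> i) & 1]

    # Sum the selected partial products to form the final product.
    return sum(selected)
```

Calling `multiply(0xFF, 0xFF, 1)` treats both operands as −1, while `multiply(0xFF, 0xFF, 0)` treats them as 255. The same generate, select, and sum steps serve both cases.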
In another embodiment, the multiplier may be capable of multiplying one pair of N-bit operands or two pairs of N/2-bit operands simultaneously. The multiplier may comprise a multiplier input and a multiplicand input, each configured to receive an operand comprising one N-bit value or two N/2-bit values. The multiplier may also comprise a partial product generator coupled to the multiplicand input, wherein the partial product generator is configured to generate a plurality of partial products based upon the value of the multiplicand operand. The multiplier may further comprise a selection logic unit coupled to the partial product generator and the multiplier input. The selection logic unit may be configured to select a plurality of partial products from the partial product generator based upon the value of the multiplier operand. An adder may be coupled to the selection logic unit to receive and sum the selected partial products to form a final product comprising either one 2N-bit value or two N-bit values. The multiplier may receive a vector_in signal indicating whether vector or scalar multiplication is to be performed.
A method for operating a multiplier capable of scalar and vector multiplication is also contemplated. The method may comprise receiving a multiplier operand, a multiplicand operand, and a vector_in signal as inputs from functional units within the microprocessor and then calculating a number of partial products from the multiplicand operand using inverters and shifting logic. Certain partial products may be selected according to the multiplier operand. The selected partial products may then be summed to generate a final product. The final product may be in scalar form if the vector_in signal is unasserted, and in vector form if the vector_in signal is asserted.
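One way a single partial product array might serve both modes is sketched below. The masking rule used in vector mode (each row keeps only the multiplicand half that belongs to the same section as the selecting multiplier bit) is an assumption of this sketch and is not stated in the text above. Operands are treated as unsigned, and the function name and N = 16 default are hypothetical.

```python
def multiply(multiplicand, multiplier, vector_in, n=16):
    """Scalar N-bit multiply, or two N/2-bit multiplies, on one shared array."""
    half = n // 2
    full_mask = (1 << n) - 1
    lo_mask = (1 << half) - 1
    hi_mask = full_mask & ~lo_mask

    total = 0
    for i in range(n):
        if not (multiplier >> i) & 1:
            continue  # this partial product is not selected
        if vector_in:
            # Keep only the multiplicand section matching multiplier bit i,
            # which removes the cross terms between the two sections.
            row = multiplicand & (lo_mask if i < half else hi_mask)
        else:
            row = multiplicand & full_mask
        total += row << i  # shifted partial product added into the sum
    return total
```

In vector mode the low product falls in bits 0 to N−1 and the high product in bits N to 2N−1, so the 2N-bit output carries two N-bit results without further shifting.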
In another embodiment, the multiplier may also be configured to calculate vector dot products and may comprise a multiplier input and a multiplicand input, each configured to receive a vector. A partial product generator may be coupled to the multiplicand input and may be configured to generate a plurality of partial products based upon one of the vectors. A first adder may be coupled to receive the partial products and sum them to generate vector component products for each pair of vector components. A second adder may be coupled to the first adder and may be configured to receive and sum the vector component products to form a sum value and a carry value. A third adder may be configured to receive the sum and carry values and one or more vector component products from the first adder. The third adder may be configured to output the sum of the sum and carry values (and any carry bits resulting from the summation of the one or more vector components) as a final result.
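The three-stage arrangement may be modeled as follows. In this sketch the first stage is stood in for by ordinary multiplication, the second stage is assumed to be a three-input carry-save adder that produces a sum value and a carry value, and the third stage adds those values to any remaining component products. The function names are hypothetical, and the vectors are assumed to hold at least three non-negative components.

```python
def carry_save_add(x, y, z):
    """Reduce three addends to a sum word and a carry word without propagation."""
    sum_value = x ^ y ^ z
    carry_value = ((x & y) | (x & z) | (y & z)) << 1
    return sum_value, carry_value


def dot_product(a, b):
    """Dot product of two equal-length vectors of non-negative integers."""
    # First stage: one component product per pair of vector components.
    products = [x * y for x, y in zip(a, b)]
    # Second stage: combine three component products into sum and carry values.
    sum_value, carry_value = carry_save_add(*products[:3])
    # Third stage: add sum, carry, and any remaining component products.
    return sum_value + carry_value + sum(products[3:])
```

For four-component vectors, the fourth component product bypasses the second stage and enters the third stage directly, consistent with the third adder receiving one or more vector component products from the first adder.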
In yet another embodiment, the multiplier may be configured to output the results in segments or portions. This may advantageously reduce the amount of interface logic and the number of bus lines needed to support the multiplier. Furthermore, the segments or portions may be rounded. In this embodiment, the multiplier may comprise a multiplier input, a multiplicand input, and a partial product generator. The generator is coupled to the multiplicand input and is configured to generate one or more partial products. An adder, coupled to the partial product generator and the multiplier input, may be configured to receive a number of the partial products. The adder may sum the partial products together with rounding constants to form a plurality of vector component products which are logically divided into portions. One or more of the portions may be rounded.
In another embodiment, the multiplier may be configured to round its outputs in a number of different modes. Thus, an apparatus and method for rounding and normalizing results within a multiplier is also contemplated. In one embodiment, the apparatus comprises an adder configured to receive a plurality of redundant-form components. The adder is configured to sum the redundant-form components to generate a first non-redundant-form result. The adder may also be configured to generate a second non-redundant-form result comprising the sum of the redundant-form components plus a constant. Two shifters are configured to receive the results. Both shifters may be controlled by the most significant bits of the results they receive. A multiplexer may be coupled to receive the output from the shifters and select one of them for output based upon the least significant bits in the first non-redundant-form result. By generating more than one version of the result (e.g., the result and the result plus a constant) in parallel, rounding may be accomplished in less time than previously required.
A multiplier configured to round and normalize products is also contemplated. In one embodiment, the multiplier may comprise two paths. Each path may comprise one or more adders, each configured to receive a redundant-form product and reduce it to a non-redundant form. The first path does so assuming no overflow will occur, while the second path does so assuming an overflow will occur. A multiplexer may be coupled to the outputs of the two paths, so as to select between the results from the first and second paths.
A method for rounding and normalizing results within a multiplier is also contemplated. In one embodiment, the method comprises multiplying a first operand and a second operand to form a plurality of redundant-form components. A rounding constant is generated and added to the redundant-form components in two different bit positions. The first position assumes an overflow will occur, while the second position assumes no overflow will occur. A particular set of bits is selected for output as the final result from either the first addition or the second addition.
Also contemplated is an apparatus for evaluating a constant power of an operand using a multiplier. In one embodiment, the apparatus comprises an initial estimate generator configured to receive the operand and output an initial estimate of the operand raised to the desired constant power. A multiplier may be coupled to receive the operand and the initial estimate, wherein the multiplier is configured to calculate the product of the initial estimate and the operand. A first plurality of inverters may be coupled to receive, invert, and normalize selected bits from the product to form a first approximation, wherein the first approximation assumes an overflow has occurred in the multiplier. A second plurality of inverters may be coupled to receive, invert, and normalize selected bits from the product to form a second approximation, wherein the second approximation assumes an overflow has not occurred in the multiplier. A multiplexer may be configured to select either the first or second approximations for output.
Also contemplated is a method for evaluating a constant power of an operand using a multiplier. In one embodiment, the method comprises determining an initial estimate of the operand raised to a first constant power. The operand and the initial estimate are then multiplied in the multiplier to form a first product. A normalized first intermediate approximation is calculated by performing a bit-wise inversion on the first product assuming an overflow occurred during the multiplying. A normalized second intermediate approximation is calculated by performing a bit-wise inversion on the first product assuming no overflow occurred during the multiplying. Finally, a set of bits is selected from either the first intermediate approximation or the second intermediate approximation to form a selected approximation that may be output or used in subsequent iterations to achieve a more accurate result.
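The role of the bit-wise inversion may be illustrated for the reciprocal case. The sketch below assumes a Newton-Raphson style update, X' = X × (2 − B × X), which this section does not itself name, and uses a hypothetical fixed-point format with F = 16 fractional bits. The inversion stands in for the subtraction from two and differs from it by one unit in the last place. The parallel overflow and non-overflow normalization paths are not modeled.

```python
F = 16                      # fractional bits (assumed format)
ONE = 1 << F                # fixed-point representation of 1.0
MASK = (1 << (F + 1)) - 1   # values in [0, 2) occupy F + 1 bits


def two_minus_by_inversion(v):
    """Approximate 2 - v by inverting the F + 1 bits of v (one ulp low)."""
    return ~v & MASK


def reciprocal(b, x0, iterations=2):
    """Refine an initial estimate x0 of 1/b; b is fixed-point in [1, 2)."""
    x = x0
    for _ in range(iterations):
        product = (b * x) >> F                    # close to 1.0 when x ~ 1/b
        correction = two_minus_by_inversion(product)
        x = (x * correction) >> F                 # improved estimate
    return x
```

Starting from an estimate of 0.66 for the reciprocal of 1.5, two iterations bring the result to within a few units in the last place of 2/3.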
An apparatus for rounding and normalizing a redundant-form value is also contemplated. In one embodiment, the apparatus may comprise two adders and a multiplexer. The first adder is configured to receive the redundant-form value and add a rounding constant to its guard bit position, thereby forming a first rounded result, wherein the guard bit position is selected assuming no overflow will occur. The second adder is similarly configured and performs the same addition assuming, however, that an overflow will occur. The multiplexer is configured to select either the first rounded result or the second rounded result based upon one or more of the most significant bits from the first and second rounded results. Performing the rounding in parallel may advantageously speed the process by allowing normalization to take place in parallel with the multiplexer's selection.
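The two-path arrangement may be sketched as follows for the product of two normalized p-bit significands. Several simplifications are assumed: the product is held as an ordinary integer rather than in redundant form, rounding is round-half-up on the magnitude, and the selection uses only the most significant bit of the first (non-overflow) sum. The function name and p = 8 default are hypothetical.

```python
def round_and_normalize(a, b, p=8):
    """Multiply two p-bit significands (MSB set) and round to p bits.

    Returns (significand, exponent_increment), where exponent_increment
    is 1 when the result had to be shifted right by one extra position.
    """
    product = a * b  # 2p bits; the leading one is at bit 2p-2 or 2p-1

    # First path: guard bit at p-2, assuming no overflow (leading one at 2p-2).
    no_overflow_sum = product + (1 << (p - 2))
    rounded_no_overflow = no_overflow_sum >> (p - 1)

    # Second path: guard bit at p-1, assuming overflow (leading one at 2p-1).
    overflow_sum = product + (1 << (p - 1))
    rounded_overflow = overflow_sum >> p

    # Select on the MSB of the first sum. It is set when the product
    # overflowed or when rounding carried the first path up to 2.0.
    if (no_overflow_sum >> (2 * p - 1)) & 1:
        return rounded_overflow, 1
    return rounded_no_overflow, 0
```

Both sums are available before the selection is made, so the shift implied by normalization does not wait on the outcome of the rounding addition. When the first path rounds up to exactly 2.0 (for example 181 × 181 with p = 8), the second path already holds the normalized value.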
A method for operating a multiplier that compresses intermediate results is also contemplated. In one embodiment, this method comprises calculating an intermediate product to a predetermined number of bits of accuracy. Next, a signaling bit is selected from the intermediate product. The signaling bit is equal to each of the N most significant bits of the intermediate product. Next, the intermediate product is compressed by replacing the N most significant bits of the intermediate product with the signaling bit. The compressed intermediate product is then stored into a storage register. During the next iteration, the storage register is read to determine the value of the compressed intermediate product. The compressed intermediate product may be expanded to form an expanded intermediate product by padding the compressed intermediate product with copies of the signaling bit.
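The compression and expansion steps may be sketched as follows. The function names, the `width` and `n` parameters, and the error raised when the N most significant bits are not all equal are assumptions of this sketch.

```python
def compress(value, width, n):
    """Replace the n most significant bits of a width-bit value with one bit."""
    low_bits = width - n
    top = value >> low_bits
    signal = top & 1
    # The signaling bit must equal each of the n most significant bits.
    if top != signal * ((1 << n) - 1):
        raise ValueError("the n most significant bits are not all equal")
    low = value & ((1 << low_bits) - 1)
    return (signal << low_bits) | low  # width - n + 1 bits in total


def expand(compressed, width, n):
    """Rebuild the width-bit value by padding with copies of the signaling bit."""
    low_bits = width - n
    signal = (compressed >> low_bits) & 1
    low = compressed & ((1 << low_bits) - 1)
    top = signal * ((1 << n) - 1)  # n copies of the signaling bit
    return (top << low_bits) | low
```

With a 16-bit intermediate product and N = 4, the value 0xF123 is stored as the 13-bit value 0x1123 and expands back to 0xF123, saving three bits in the storage register.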
A multiplier configured to perform iterative calculations and to compress intermediate products is also contemplated. In one embodiment, the multiplier comprises a multiplier input, a multiplicand input, and a partial product generator as described in previous embodiments. The multiplier also comprises a partial product array adder which is configured to receive and add a selected plurality of partial products to form an intermediate product. Compression logic may be coupled to the partial product array adder. The compression logic may comprise a wire shifter configured to replace a predetermined number of bits of the intermediate product with a single signal bit, which represents the information stored in the predetermined number of bits. The signal bit is selected so that it equals the value of each individual bit within the predetermined number of bits. A storage register may be coupled to receive and store the compressed intermediate product from the compression logic.
In another embodiment, the multiplier may be configured to add an adjustment constant to increase the frequency of exactly rounded results. In such an embodiment, the multiplier may comprise a multiplier input configured to receive a multiplier operand, a multiplicand input configured to receive a multiplicand operand, a partial product generator, and selection logic. In one embodiment, the partial product generator is coupled to the multiplicand input and configured to generate one or more partial products based upon the multiplicand operand. The selection logic may be coupled to the partial product generator and the multiplier input, wherein the selection logic is configured to select a plurality of partial products based upon the multiplier operand. A partial product array adder may be coupled to the selection logic, wherein the adder is configured to receive and sum a number of the partial products and an adjustment constant to form a product. The adjustment constant is selected to increase the frequency with which the result is exactly rounded.
A method for increasing the frequency of exactly rounded results is also contemplated. In one embodiment, the method comprises receiving an operand and determining an initial estimate of the result of an iterative calculation using the operand. The initial estimate and the operand are multiplied to generate an intermediate result. The multiplication is repeated a predetermined number of times, wherein the intermediate result is used in place of the initial estimate in subsequent iterations. The final repetition generates a final result, and an adjustment constant may be added to the final result, wherein the adjustment constant increases the probability that the final result will equal the exactly rounded result of the iterative calculation.