1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to vector multiplication and rounding within multiplication arithmetic units in microprocessors.
2. Description of the Related Art
Microprocessors are typically designed with a number of xe2x80x9cexecution unitsxe2x80x9d that are each optimized to perform a particular set of functions or instructions. For example, one or more execution units within a microprocessor may be optimized to perform memory accesses, i.e., load and store operations. Other execution units may be optimized to perform general arithmetic and logic functions, e.g., shifts and compares. Many microprocessors also have specialized execution units configured to perform more complex arithmetic operations such as multiplication and reciprocal operations. These specialized execution units typically comprise hardware that is optimized to perform one or more particular arithmetic functions. In the case of multiplication, the optimized hardware is typically referred to as a xe2x80x9cmultiplier.xe2x80x9d
In older microprocessors, multipliers were implemented using designs that conserved die space at the expense of arithmetic performance. Until recently, this was not a major problem because most applications, i.e., non-scientific applications such as word processors, did not frequently generate multiplication instructions. However, recent advances in computer technology and software are placing greater emphasis upon multiplier performance. For example, three dimensional computer graphics, rendering, and multimedia applications all rely heavily upon a microprocessor""s arithmetic capabilities, particularly multiplication and multiplication-related operations. As a result, in recent years microprocessor designers have favored performance-oriented designs that use more die space. Unfortunately, the increased die space needed for these high performance multipliers reduces the space available for other execution units within the microprocessor. Thus, a mechanism for increasing multiplier performance while conserving die space in needed.
The die space used by multipliers is of particular importance to microprocessor designers because many microprocessors, e.g., those configured to execute MMX(trademark) (multimedia extension) or 3D graphics instructions, may use more than one multiplier. MMX and 3D graphics instructions are often implemented as xe2x80x9cvectoredxe2x80x9d instructions. Vectored instructions have operands that are partitioned into separate sections, each of which is independently operated upon. For example, a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections. Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied. FIG. 1 illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication. To quickly execute vectored multiply instructions, many microprocessors use a number of multipliers in parallel. In order to conserve die space, a mechanism for reducing the number of multipliers in a microprocessor is desirable. Furthermore, a mechanism for reducing the amount of support hardware (e.g., bus lines) that may be required for each multiplier is also desirable.
Another factor that may affect the number of multipliers used within a microprocessor is the microprocessor""s ability to operate upon multiple data types. Most microprocessors must support multiple data types. For example, x86 compatible microprocessors must execute instructions that are defined to operate upon an integer data type and instructions that are defined to operate upon floating point data types. Floating point data can represent numbers within a much larger range than integer data. For example, a 32-bit signed integer can represent the integers between xe2x88x92231 and 231xe2x88x921 (using two""s complement format). In contrast, a 32-bit (xe2x80x9csingle precisionxe2x80x9d) floating point number as defined by the Institute of Electrical and Electronic Engineers (IEEE) Standard 754 has a range (in normalized format) from 2xe2x88x92126 to 2127xc3x97(2xe2x88x922xe2x88x9223) in both positive and negative numbers. While both integer and floating point data types are capable of representing positive and negative values, integers are considered to be xe2x80x9csignedxe2x80x9d for multiplication purposes, while floating point numbers are considered to be xe2x80x9cunsigned.xe2x80x9d Integers are considered to be signed because they are stored in two""s complement representation.
Turning now to FIG. 2A, an exemplary format for an 8-bit integer 100 is shown. As illustrated in the figure, negative integers are represented using the two""s complement format 104. To negate an integer, all bits are inverted to obtain the one""s complement format 102. A constant of one is then added to the least significant bit (LSB).
Turning now to FIG. 2B, an exemplary format for a 32-bit (single precision) floating point number is shown. A floating point number is represented by a significand, an exponent and a sign bit. The base for the floating point number is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is typically used. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. In order to save space, the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number. Additional information regarding floating point numbers and operations performed thereon may be obtained in IEEE Standard 754. Unlike the integer representation, two""s complement format is not typically used in the floating point representation. Instead, sign and magnitude form are used. Thus, only the sign bit is changed when converting from a positive value 106 to a negative value 108. For this reason, many microprocessors use two multipliers, i.e., one for signed values (two""s complement format) and another for unsigned values (sign and magnitude format). Thus, a mechanism for increasing floating point, integer, and vector multiplier performance while conserving die space is needed.
The problems outlined above are in large part solved by a multiplier configured in accordance with the present invention. In one embodiment, the multiplier may perform signed and unsigned scalar and vector multiplication using the same hardware. The multiplier may receive either signed or unsigned operands in either scalar or packed vector format and accordingly output a signed or unsigned result that is either a scalar or a vector quantity. Advantageously, this embodiment of the multiplier may reduce the total number of multipliers needed within a microprocessor because it may be shared by execution units and perform both scalar and vector multiplication. This space savings may in turn allow designers to optimize the multiplier for speed without fear of using too much die space.
In one embodiment, the multiplier comprises a partial product generator, a selection logic unit, and an adder. The multiplier may also comprise a multiplicand input configured to receive a multiplicand operand (signed or unsigned), a multiplier input configured to receive a multiplier operand (also signed or unsigned), and a sign-in input. The sign-in input is configured to receive a sign-in signal indicative of whether the multiplier is to perform signed or unsigned multiplication. The partial product generator, which is coupled to the multiplicand input, is configured to generate a plurality of partial products based upon the multiplicand operand. The selection logic unit, which is coupled to the partial product generator and the multiplier input, is configured to select a number of partial products from the partial product generator based upon the multiplier operand. The adder, which is coupled to the selection logic unit, is configured to sum the selected partial products to form a final product. The final product, which may be signed or unsigned, may then be output to other parts of the microprocessor.
In addition, the multiplier may further comprise an xe2x80x9ceffective signxe2x80x9d calculation unit. In one embodiment, the calculation unit may comprise a pair of AND gates, each configured to receive the most significant bit of one operand and the sign-in signal. The output of each AND is used as the effective sign for that gate""s operand. The effective sign may be appended to each operand for use as the operand""s sign during the multiplication process. Advantageously, the effective sign may allow both unsigned operands and signed operands to be multiplied on the same hardware.
A method for operating a multiplier within a microprocessor is also contemplated. In one embodiment, the method comprises receiving a multiplier operand, a multiplicand operand, and a sign-in signal from other functional units within the microprocessor. An effective sign bit for the multiplicand operand is generated from the sign-in signal and the most significant bit of the multiplicand operand. A plurality of partial products may then be calculated from the effective sign bit and the multiplicand operand. Next, a number of the partial products may be selected according to the multiplier operand. The partial products are then summed, and the results are output. In other embodiments, the steps may be performed in parallel or in a different order.
In another embodiment, the multiplier may be capable of multiplying one pair of N-bit operands or two pairs of N/2-bit operands simultaneously. The multiplier may comprise a multiplier input and a multiplicand input, each configured to receive an operand comprising one N-bit value or two N/2-bit values. The multiplier may also comprise a partial product generator coupled to the multiplicand input, wherein the partial product generator is configured to generate a plurality of partial products based upon the value of the multiplicand operand. The multiplier may further comprise a selection logic unit coupled to the partial product generator and the multiplier input. The selection logic unit may be configured to select a plurality of partial products from the partial product generator based upon the value of the multiplier operand. An adder may be coupled to the selection logic unit to receive and sum the selected partial products to form a final product comprising either one 2N-bit value or two N-bit values. The multiplier may receive a vector_in signal indicating whether vector or scalar multiplication is to be formed.
A method for operating a multiplier capable of scalar and vector multiplication is also contemplated. The method may comprise receiving a multiplier operand, a multiplicand operand, and a vector-in signal as inputs from functional units within the microprocessor and then calculating a number of partial products from the multiplicand operand using inverters and shifting logic. Certain partial products may be selected according to the multiplier operand. The selected partial products may then be summed to generate a final product. The final product may be in scalar form if the vector_in signal is unasserted, and in vector form if the vector_in signal is asserted.
In another embodiment, the multiplier may also be configured to calculate vector dot products and may comprise a multiplier input and a multiplicand input, each configured to receive a vector. A partial product generator may be coupled to the multiplicand input and may be configured to generate a plurality of partial products based upon one of the vectors. A first adder may be coupled to receive the partial products and sum them to generate vector component products for each pair of vector components. A second adder may be coupled to the first adder and may be configured to receive and sum the vector component products to form a sum value and a carry value. A third adder may be configured to receive the sum and carry values and one or more vector component products from the first adder. The third adder may be configured to output the sum of the sum and carry values (and any carry bits resulting from the summation of the one or more vector components) as a final result.
In yet another embodiment, the multiplier may be configured to output the results in segments or portions. This may advantageously reduce the amount of interface logic and the number of bus lines needed to support the multiplier. Furthermore, the segments or portions may be rounded. In this embodiment, the multiplier may comprise a multiplier input, a multiplicand input, and a partial product generator. The generator is coupled to the multiplicand input and is configured to generate one or more partial products. An adder, coupled to the partial product generator and the multiplier input, may be configured to receive a number of the partial products. The adder may sum the partial products together with rounding constants to form a plurality of vector component products which are logically divided into portions. One or more of the portions may be rounded.
In another embodiment, the multiplier may be configured to round its outputs in a number of different modes. Thus, an apparatus and method for rounding and normalizing results within a multiplier is also contemplated. In one embodiment, the apparatus comprises an adder configured to receive a plurality of redundant-form components. The adder is configured to sum the redundant-form components to generate a first non-redundant-form result. The adder may also be configured to generate a second non-redundant-form result comprising the sum of the redundant-form components plus a constant. Two shifters are configured to receive the results. Both shifters may be controlled by the most significant bits of the results they receive. A multiplexer may be coupled to receive the output from the shifters and select one of them for output based upon the least significant bits in the first non-redundant-form result. By generating more than version of the result (e.g., the result and the result plus a constant) in parallel, rounding may be accomplished in less time than previously required.
A multiplier configured to round and normalize products is also contemplated. In one embodiment, the multiplier may comprise two paths. Each path may comprise one or more adders, each configured to receive a redundant-form product and reduce it to a non-redundant form. The first path does so assuming no overflow will occur, while the second path does so assuming an overflow will occur. A multiplexer may be coupled to the outputs of the two paths, so as to select between the results from the first and second paths.
A method for rounding and normalizing results within a multiplier is also contemplated. In one embodiment, the method comprises multiplying a first operand and a second operand to form a plurality of redundant-form components. A rounding constant is generated and added to the redundant-form component in two different bit positions. The first position assumes an overflow will occur, while the second position assumes no overflow will occur. A particular set of bits are selected for output as the final result from either the first addition or the second addition.
An apparatus for rounding and normalizing a redundant-form value is also contemplated. In one embodiment, the apparatus may comprise two adders and a multiplexer. The first adder is configured to receive the redundant-form value and add a rounding constant to its guard bit position, thereby forming a first rounded result, wherein the guard bit position is selected assuming no overflow will occur. The second adder is similarly configured and performs the same addition assuming, however, that an overflow will occur. A multiplexer is configured to select either the first rounded result or the second rounded result based upon one or more of the most significant bits from the first and second rounded results. Performing the rounding in parallel may advantageously speed the process by allowing normalization to take place in parallel with the multiplexer""s selection.