1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to floating point units configured to perform division operations.
2. Description of the Related Art
Microprocessors are typically designed with a number of "execution units" that are each optimized to perform a particular set of functions or instructions. For example, one or more execution units within a microprocessor may be optimized to perform memory accesses, i.e., load and store operations. Other execution units may be optimized to perform general arithmetic and logic functions, e.g., shifts and compares. Many microprocessors also have specialized execution units configured to perform more complex arithmetic operations such as multiplication and reciprocal operations. These specialized execution units typically comprise hardware that is optimized to perform one or more particular arithmetic functions. In the case of multiplication, the optimized hardware is typically referred to as a "multiplier."
In older microprocessors, multipliers were implemented using designs that conserved die space at the expense of arithmetic performance. Until recently, this was not a major problem because most applications, i.e., non-scientific applications such as word processors, did not frequently generate multiplication instructions. However, recent advances in computer technology and software are placing greater emphasis upon multiplier performance. For example, three dimensional computer graphics, rendering, and multimedia applications all rely heavily upon a microprocessor's arithmetic capabilities, particularly multiplication and multiplication-related operations. As a result, in recent years microprocessor designers have favored performance-oriented designs that use more die space. Unfortunately, the increased die space needed for these high performance multipliers reduces the space available for other execution units within the microprocessor. Thus, a mechanism for increasing multiplier performance while conserving die space is needed.
The die space used by multipliers is of particular importance to microprocessor designers because many microprocessors, e.g., those configured to execute MMX™ (multimedia extension) or 3D graphics instructions, may use more than one multiplier. MMX and 3D graphics instructions are often implemented as "vectored" instructions. Vectored instructions have operands that are partitioned into separate sections, each of which is independently operated upon. For example, a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections. Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied. FIG. 1 illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication. To quickly execute vectored multiply instructions, many microprocessors use a number of multipliers in parallel.
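The partitioning described above can be sketched in software. The following is a minimal illustration only, assuming unsigned sections and a least-significant-section-first ordering; the function name `vector_multiply` and its parameters are hypothetical and do not come from the specification.

```python
def vector_multiply(a: int, b: int, lanes: int, total_bits: int = 32) -> list[int]:
    """Multiply corresponding sections of two packed operands independently.

    Assumption: sections are treated as unsigned and returned
    least significant section first.
    """
    width = total_bits // lanes          # bits per section, e.g. 16 or 8
    mask = (1 << width) - 1              # selects one section
    products = []
    for i in range(lanes):
        a_i = (a >> (i * width)) & mask  # i-th section of the first operand
        b_i = (b >> (i * width)) & mask  # i-th section of the second operand
        products.append(a_i * b_i)       # sections never interact
    return products


# Two 16-bit sections per 32-bit operand: (3, 2) times (5, 4)
print(vector_multiply(0x00030002, 0x00050004, lanes=2))  # [8, 15]
```

With `lanes=1` the same routine degenerates to a scalar multiplication of the full 32-bit operands, which is the contrast drawn in FIG. 1.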
Another factor that may affect the die space used by multipliers within a microprocessor is the microprocessor's ability to operate upon multiple data types. Most microprocessors must support multiple data types. For example, x86 compatible microprocessors must execute instructions that are defined to operate upon an integer data type and instructions that are defined to operate upon floating point data types. Floating point data can represent numbers within a much larger range than integer data. For example, a 32-bit signed integer can represent the integers between −2^31 and 2^31 − 1 (using two's complement format). In contrast, a 32-bit ("single precision") floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from 2^−126 to 2^127 × (2 − 2^−23) in both positive and negative numbers. While both integer and floating point data types are capable of representing positive and negative values, integers are considered to be "signed" for multiplication purposes, while floating point numbers are considered to be "unsigned." Integers are considered to be signed because they are stored in two's complement representation.
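The two ranges can be checked numerically. The sketch below is illustrative only; the constant names and the helper `float32_from_bits` are hypothetical, and the bit patterns used are the standard IEEE 754 single precision encodings of the smallest and largest normalized values.

```python
import struct

# Range of a 32-bit two's complement integer.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Normalized range of an IEEE 754 single precision number.
FLOAT32_MIN_NORMAL = 2.0 ** -126
FLOAT32_MAX = 2.0 ** 127 * (2 - 2.0 ** -23)


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Smallest normalized pattern: exponent field 1, fraction all zeros.
assert float32_from_bits(0x00800000) == FLOAT32_MIN_NORMAL
# Largest finite pattern: exponent field 254, fraction all ones.
assert float32_from_bits(0x7F7FFFFF) == FLOAT32_MAX

print(INT32_MIN, INT32_MAX)  # -2147483648 2147483647
```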
Turning now to FIG. 2A, an exemplary format for an 8-bit integer 100 is shown. As illustrated in the figure, negative integers are represented using the two's complement format 104. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant of one is then added to the least significant bit (LSB).
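The negation procedure just described (invert all bits, then add one) can be expressed as a short sketch for the 8-bit case. The function names are hypothetical and chosen for illustration.

```python
def ones_complement_8bit(value: int) -> int:
    """Invert all eight bits."""
    return ~value & 0xFF


def negate_8bit(value: int) -> int:
    """Two's complement negation: invert all bits, then add one at the LSB."""
    return (ones_complement_8bit(value) + 1) & 0xFF


print(format(negate_8bit(0b00000101), "08b"))  # 11111011, the encoding of -5
```

Applying `negate_8bit` twice returns the original bit pattern, as expected for a negation.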
Turning now to FIG. 2B, an exemplary format for a 32-bit (single precision) floating point number is shown. A floating point number is represented by a significand, an exponent and a sign bit. The base for the floating point number is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is typically used. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. In order to save space, the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number. Additional information regarding floating point numbers and operations performed thereon may be obtained in IEEE Standard 754. Unlike the integer representation, two's complement format is not typically used in the floating point representation. Instead, sign and magnitude form is used. Thus, only the sign bit is changed when converting from a positive value 106 to a negative value 108. For this reason, some microprocessors use two multipliers, i.e., one for signed values (two's complement format) and another for unsigned values (sign and magnitude format). This places further constraints upon the die space used by each multiplier.
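The field layout described above can be illustrated by unpacking a single precision value into its sign, exponent and stored fraction. This is a sketch under stated assumptions: the exponent bias of 127 and the 1/8/23-bit field widths come from IEEE Standard 754 rather than from the text above, the function names are hypothetical, and only normalized values are handled.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, stored fraction) of a single precision value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored significand bits
    return sign, exponent, fraction


def float32_value(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normalized value, restoring the implied integer bit."""
    significand = 1 + fraction / 2 ** 23   # integer bit is implied, not stored
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


print(float32_fields(6.5))   # (0, 129, 5242880)
print(float32_fields(-6.5))  # (1, 129, 5242880)
```

The two outputs differ only in the sign field, which reflects the sign and magnitude form: negating a value leaves the exponent and significand untouched.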
Another crucial factor that may affect the amount of die space allocated to a multiplier is the number of other functions that the multiplier is capable of executing. If a particular multiplier is capable of executing other types of instructions, e.g., division and square root functions, it may be allocated more die space because it alleviates the need for additional hardware, e.g., a dedicated division circuit.
For the reasons set forth above, a method for increasing multiplier performance and utility while conserving die space is needed.
The problems outlined above may in part be solved by a multiplier configured in accordance with the present invention. In one embodiment, the multiplier may be configured to execute divide-by-two operations and zero dividend operations using fewer multiplication iterations than normal division instructions. As used herein, normal division instructions are division instructions which do not have a zero dividend or integer power of two divisor. In another embodiment, the multiplier may also be configured to perform a back multiplication operation after multiplying the reciprocal of the divisor with the dividend.
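The idea of skipping multiplication iterations for special operands can be sketched as follows. This is an illustration under several assumptions and not the claimed hardware: the reciprocal is refined here with Newton-Raphson iteration from a linear seed (the text above does not name the iteration used), the back multiplication is used to form a residual that checks the quotient (one plausible purpose, also an assumption), and the function name `reciprocal_divide` and its return convention are hypothetical.

```python
import math


def reciprocal_divide(dividend: float, divisor: float, iterations: int = 4):
    """Return (quotient, residual, iterations_used).

    Zero dividends and power-of-two divisors bypass the iterative
    reciprocal refinement entirely.
    """
    if divisor == 0.0:
        raise ZeroDivisionError("divisor is zero")
    if dividend == 0.0:
        return 0.0, 0.0, 0                     # zero dividend (sign of zero ignored)

    mant, exp = math.frexp(abs(divisor))       # |divisor| = mant * 2**exp, 0.5 <= mant < 1
    sign = -1.0 if divisor < 0 else 1.0
    if mant == 0.5:
        # Power-of-two divisor: only an exponent adjustment is needed.
        return sign * math.ldexp(dividend, 1 - exp), 0.0, 0

    # Normal division: refine an estimate of 1/mant (assumed Newton-Raphson).
    x = 48.0 / 17.0 - (32.0 / 17.0) * mant     # linear seed for mant in [0.5, 1)
    for _ in range(iterations):
        x = x * (2.0 - mant * x)               # error roughly squares each pass
    reciprocal = sign * math.ldexp(x, -exp)

    quotient = dividend * reciprocal           # reciprocal times dividend
    residual = dividend - quotient * divisor   # back multiplication
    return quotient, residual, iterations


print(reciprocal_divide(12.0, 4.0))  # (3.0, 0.0, 0)
```

In this sketch a normal division such as `reciprocal_divide(1.0, 3.0)` reports four refinement passes, while the two special cases report none.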
In another embodiment, the multiplier is also configured to execute simple independent multiplication operations and complex iterative operations concurrently. The ability to perform iterative calculations advantageously allows the multiplier to perform calculations such as division and square root, thereby reducing die space constraints. The ability to concurrently execute these iterative instructions with multiplication instructions may improve the throughput of the multiplier while reducing the need for using more than one multiplier.
In one embodiment, the multiplier may comprise a plurality of pipeline stages, some of which are idle for particular clock cycles during the execution of a complex iterative operation. The multiplier may be configured to generate a control signal indicative of the occurrence of these idle clock cycles. The control signal may then be used to select and route independent simple multiplication instructions to the multiplier for execution during the idle clock cycles. In another embodiment, the multiplier may also be configured to concurrently execute two independent complex iterative calculations. The multiplier's availability during a particular clock cycle to perform a second instruction concurrently may be a function of the type of iterative operation being performed and the number of clock cycles between the particular clock cycle and the first clock cycle during which the multiplier began executing the first complex iterative operation. In some embodiments, the multiplier may be configured to store the intermediate products produced by the iterative calculations. Advantageously, some embodiments of the multiplier may be configured to compress these intermediate products before storing them, further conserving die space.
A method for executing independent multiplication and iterative instructions concurrently is also contemplated. In one embodiment the method comprises beginning execution of an iterative instruction in a pipelined multiplier, wherein the iterative instruction requires a first number of clock cycles to complete. A control signal is asserted during the first number of clock cycles if the multiplier will be available to begin execution of an independent multiplication instruction in a predetermined number of clock cycles. Upon detecting an asserted control signal, an independent multiplication instruction is dispatched to the multiplier. Execution of the independent multiplication instruction may begin and complete before the iterative instruction has completed executing.
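The method above can be modeled with a small cycle-level simulation. This is a simplified sketch, not the claimed circuit: it assumes each pass of the iterative instruction enters the first pipeline stage only after the previous pass has left the last stage, so the first stage is idle in between. The names `entry_slots` and `schedule`, and the parameters `passes`, `depth` and `lookahead`, are hypothetical.

```python
def entry_slots(passes: int, depth: int) -> set[int]:
    """Cycles in which the iterative instruction occupies the first stage.

    Assumption: pass k depends on the result of pass k-1, so it enters
    the pipeline `depth` cycles after the previous pass.
    """
    return {k * depth for k in range(passes)}


def schedule(passes: int, depth: int, pending: list[str], lookahead: int = 1) -> dict[int, str]:
    """Map entry cycle -> independent multiplication issued in that cycle."""
    busy = entry_slots(passes, depth)
    total_cycles = passes * depth        # cycles needed by the iterative instruction
    queue = list(pending)
    issued = {}
    for cycle in range(total_cycles):
        target = cycle + lookahead
        # Control signal: the first stage will be free `lookahead` cycles from now.
        signal = target < total_cycles and target not in busy
        if signal and queue:
            issued[target] = queue.pop(0)  # dispatch one independent multiply
    return issued


print(schedule(passes=3, depth=4, pending=["m0", "m1", "m2"]))
# {1: 'm0', 2: 'm1', 3: 'm2'}
```

In this example the three independent multiplications enter the pipeline in cycles 1 to 3 and, with a four-stage pipeline, finish well before the iterative instruction completes at cycle 12.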
In another embodiment, the multiplier may also be configured to perform signed and unsigned scalar and vector multiplication using the same hardware. The multiplier may also be configured to calculate vector dot products. The multiplier may receive either signed or unsigned operands in either scalar or packed vector format and accordingly output a signed or unsigned result that is either a scalar or a vector quantity. Advantageously, this embodiment may reduce the total number of multipliers needed within a microprocessor because it may be shared by execution units and perform both scalar and vector multiplication. This space savings may in turn allow designers to optimize the multiplier for speed without fear of using too much die space.
In yet another embodiment, the multiplier may be configured to output the results in segments or portions, which may be rounded. This may advantageously reduce the amount of interface logic and the number of bus lines that may be needed to support the multiplier.
In still another embodiment, the speed of the multiplier may be increased by configuring the multiplier to perform fast rounding and normalization. This may be accomplished by configuring the multiplier to calculate two versions of an operand, e.g., an overflow version and a non-overflow version, in parallel.
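One way to picture the overflow and non-overflow versions is as two candidate roundings of a product significand, with the correct one selected once the product is known. The sketch below rests on that interpretation, which is an assumption, as are the 24-bit significand width, the round-half-up rule, and the names `round_shift` and `multiply_significands`. In software the two candidates are computed one after the other; in hardware they could be formed side by side.

```python
SIG_BITS = 24  # significand width including the implied integer bit (assumed)


def round_shift(product: int, shift: int) -> int:
    """Drop `shift` low bits with round-half-up (a simplifying assumption)."""
    return (product + (1 << (shift - 1))) >> shift


def multiply_significands(a: int, b: int) -> tuple[int, int]:
    """Multiply two significands in [1, 2), each a 24-bit integer with its top bit set.

    Returns (rounded 24-bit significand, exponent adjustment).
    """
    product = a * b                                   # value in [1, 4): 47 or 48 bits
    no_overflow = round_shift(product, SIG_BITS - 1)  # valid if product < 2
    overflow = round_shift(product, SIG_BITS)         # valid if product >= 2
    if product >> (2 * SIG_BITS - 1):                 # top bit set: product >= 2
        sig, adjust = overflow, 1
    else:
        sig, adjust = no_overflow, 0
    if sig >> SIG_BITS:                               # rounding carried out of the top bit
        sig >>= 1
        adjust += 1
    return sig, adjust


# 1.5 * 1.5 = 2.25 = 1.125 * 2**1
print(multiply_significands(0xC00000, 0xC00000))  # (9437184, 1)
```

Because both candidates are available before the selection is made, normalization does not have to wait for a separate rounding step after the overflow decision.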