The present invention relates to performing integer multiply operations in computer systems. More specifically, the present invention relates to performing integer multiply operations using instructions that perform several smaller multiply operations in parallel.
In the art of computing, central processing units (CPUs) perform tasks by executing instructions that are part of an instruction set. Some of these instructions are dedicated to performing basic mathematical operations, including integer multiply operations.
The operations performed by instructions are implemented by logic gates on an integrated circuit (IC) die. The logic gates required to implement some operations, such as integer addition operations, tend to consume a relatively small area of the die. On the other hand, the logic gates required to implement integer multiply operations tend to consume a significantly larger area of the die. Accordingly, it is important to optimize the design of the circuits that perform integer multiply operations to minimize the die area consumed by these circuits.
CPUs typically have two types of functional units for performing mathematical operations. The first type of functional unit is the integer unit, which is responsible for performing integer (or alternatively, fixed-point) mathematical operations. The second type of functional unit is the floating-point unit, which is responsible for performing floating-point operations. The two functional units typically reside on distinct areas of the die, and each functional unit typically has access to its own register file. Separating the two functional units allows each unit to be optimized to perform the functions it supports.
Furthermore, there is typically little interaction between the integer and floating-point units, so there is little penalty incurred by separating the units.
Historically, integer multiplication has been considered important enough, from a performance perspective, to provide instructions in the instruction set that explicitly support integer multiply operations. However, integer multiplication has traditionally not been considered important enough to provide a full implementation of a 32-bit or 64-bit integer multiplier in the integer unit, especially in reduced instruction set computer (RISC) CPUs. As discussed above, such an integer multiplier unit consumes a large area on the die, and this die area can typically be better used to provide other functions.
One prior art technique for supporting integer multiply instructions is to provide a smaller integer multiplier (such as an 8-bit or 16-bit multiplier) in the integer unit. The smaller multiplier computes sums of smaller products to produce a 32-bit or 64-bit result, and uses multiple cycles to compute the result. This approach has the advantage of consuming a relatively small area on the die. However, the smaller multiplier is nonetheless only useful for performing integer multiply operations, and is relatively slow. One CPU that uses this approach is the MIPS® R3000® RISC processor, which is a product of MIPS Technologies, Inc.
Another prior art technique is to use the floating-point unit to perform integer multiply operations. Typically this approach requires that a data path be provided between the integer register file and the floating-point register file. To perform an integer multiply operation, the operands are transferred from the integer register file to the floating-point register file via the data path, a multiplier in the floating-point unit is used to perform the integer multiply operation using operands from and storing the result to the floating-point register file, and the result is transferred from the floating-point register file back to the integer register file. This approach is used by CPUs adhering to the PA-RISC architecture, which are products of the Hewlett-Packard Company, and CPUs adhering to the IA-64 architecture, which are products of Intel Corporation. The IA-64 architecture was developed jointly by Hewlett-Packard Company and Intel Corporation.
This approach has the advantage of using existing multiplier circuits in the floating-point unit, so little extra area on the die is required. Furthermore, floating-point units typically include full multiplier implementations capable of performing 32-bit or 64-bit multiply operations in relatively few clock cycles. However, this approach also has several disadvantages. First, since the integer and floating-point units are designed independently, each unit is optimized for its own operations and the data path between the two units is often not very fast. Second, floating-point registers, which could otherwise be used for other tasks, are needed for intermediate computation. Third, the floating-point unit typically consumes considerable power, and many modern processors power down the floating-point unit if a program does no real floating-point work. Thus, powering up the floating-point unit for an occasional integer multiply operation consumes significant power.
Code Segment A illustrates how an integer multiply operation is typically performed in a CPU adhering to the IA-64 architecture. In Code Segment A, the integers to be multiplied are stored in registers r32 and r33, and the result is placed in r34.
The instructions shown in Code Segment A are discussed in greater detail in the Intel® IA-64 Architecture Software Developer's Manual, Volume 3: Instruction Set Reference, Revision 1.1, which was published in July of 2000 and is hereby incorporated by reference. Furthermore, the latencies associated with these instructions on an Itanium™ CPU are discussed in the Itanium™ Processor Microarchitecture Reference for Software Optimization, which was published in August 2000 and is hereby incorporated by reference. The Itanium™ processor is the first CPU to adhere to the IA-64 architecture.
Returning to Code Segment A, at line 1 the instruction “setf.sig” is used to transfer the contents of general register 32 (r32) to the significand field of floating-point register 6 (f6). Similarly, at line 2 the contents of r33 are transferred to the significand field of f7. The “setf.sig” instructions of lines 1 and 2 can be issued during the same clock cycle, and have a latency of nine cycles. Accordingly, if the “xmpy.l” instruction of line 3 is scheduled closer than nine cycles from the “setf.sig” instructions, the pipeline will delay execution of the “xmpy.l” instruction until nine cycles have elapsed.
At line 3, the “xmpy.l” instruction treats the contents of the significand fields of f6 and f7 as signed integers, and multiplies the contents together to produce a full 128-bit signed result, with the least significant 64 bits of the result being stored in the significand field of f6. The “xmpy.l” instruction has a latency of eight cycles, so if the “getf.sig” instruction of line 4 is scheduled closer than eight cycles from the “xmpy.l” instruction, the pipeline will delay execution of the “getf.sig” instruction until eight cycles have elapsed.
Finally, the “getf.sig” instruction of line 4 transfers the significand field of f6 to r34. The “getf.sig” instruction has a latency of two cycles, after which the result of the multiply operation is available in r34.
Note that the integer multiply operation shown in Code Segment A has a total latency of 19 cycles, which is relatively slow. Although the integer multiply operation has a relatively long latency, many multiply operations can be pending in the pipeline, thereby allowing a multiplication result to be generated every few cycles.
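The 19-cycle figure follows from the dependency chain described above and can be checked with a small model. The following Python sketch is illustrative only: the `Instr` class and `simulate` function are hypothetical names, the operand syntax in the strings is reconstructed from the prose rather than copied from a listing, and the latencies are the nine, eight, and two cycles that add up to the stated total.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str           # descriptive text only; operand syntax is reconstructed from the prose
    latency: int        # cycles until the result can be consumed
    deps: tuple = ()    # indices of earlier instructions whose results are needed
    scheduled: int = 0  # cycle at which the compiler placed the instruction

def simulate(instrs):
    """Return (issue cycle of each instruction, cycle when the last result is ready)."""
    issue, ready = [], []
    for ins in instrs:
        start = ins.scheduled
        for d in ins.deps:
            # The pipeline stalls until every operand is available.
            start = max(start, ready[d])
        issue.append(start)
        ready.append(start + ins.latency)
    return issue, ready[-1]

CODE_SEGMENT_A = [
    Instr("setf.sig f6 = r32", 9),
    Instr("setf.sig f7 = r33", 9),                # issues in the same cycle as line 1
    Instr("xmpy.l f6 = f6, f7", 8, deps=(0, 1)),
    Instr("getf.sig r34 = f6", 2, deps=(2,)),
]

issue, total = simulate(CODE_SEGMENT_A)
print(issue, total)
```

Even though every instruction is nominally scheduled at cycle 0 in this sketch, the stalls push the multiply and the final transfer back to the cycles at which their operands become ready.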
This latency is not an issue for applications that perform many integer multiply operations in a sequence. In such applications, modulo scheduling allows the pipeline to be loaded with many multiply operations, thereby hiding the latency associated with any particular multiply operation. However, latency is an important issue for many other types of applications. For example, consider a database application that must perform a single integer multiply operation to calculate an index before data can be retrieved from an array in memory. In this example, the latency of the multiply operation is fully exposed and seriously impacts the speed at which data can be retrieved from the array in memory. Accordingly, the 19-cycle latency associated with integer multiply operations may prove to be a serious performance issue in Itanium™-optimized applications.
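The difference between exposed and hidden latency can be made concrete with a simple cost model. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and `issue_interval` (the number of cycles between the starts of successive independent multiply operations) is an assumed parameter, since the text says only that a result can be generated every few cycles.

```python
def total_cycles(n, latency=19, issue_interval=2):
    """Cycles to finish n independent multiplies that overlap in the pipeline.

    latency is the 19-cycle figure from the text; issue_interval is an
    assumed value chosen for illustration.
    """
    if n <= 0:
        return 0
    return latency + (n - 1) * issue_interval

def cycles_per_result(n, latency=19, issue_interval=2):
    """Average cost of one multiply when n of them are overlapped."""
    return total_cycles(n, latency, issue_interval) / n

# A single index calculation pays the full latency.
print(total_cycles(1))
# A long modulo-scheduled sequence approaches the issue interval per result.
print(cycles_per_result(100))
```

Under these assumptions the isolated multiply costs the full 19 cycles, while each multiply in a long overlapped sequence costs little more than the issue interval.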
The present invention is a method and apparatus for performing integer multiply operations from data stored in the integer register file using multi-media primitive instructions that operate on smaller operands. The present invention performs 32-bit or 64-bit integer multiply operations using multi-media parallel multiply instructions that perform several 16-bit multiply operations in parallel, along with several other multi-media primitive instructions. By using the multi-media instructions of an Itanium™ (or other IA-64 architecture) CPU, the present invention can perform a full 64-bit integer multiply operation with a latency of 11-14 clock cycles, and a full 32-bit integer multiply operation with a latency of 7-10 clock cycles. On an Itanium™ CPU, the present invention provides a latency improvement of up to 58% for a full 32-bit integer multiply operation, and up to a 37% improvement for a full 64-bit integer multiply operation, compared to the prior art method illustrated in Code Segment A above. By using multi-media primitive instructions, operands do not need to be transferred to the floating-point unit and the results do not need to be retrieved from the floating-point unit, thereby avoiding the combined 11-cycle latency of the “setf.sig” and “getf.sig” instructions.
The present invention performs a multiply operation on a 32-bit or 64-bit value by performing multiply operations on a series of smaller operands to form partial products, and adding the partial products together. Data manipulation instructions are used to reposition 16-bit segments of the 32-bit operands into positions that allow the multi-media parallel multiply instructions to compute partial products, and the partial products are then added together to form the result.
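The partial-product arithmetic described above can be illustrated in a few lines of Python. This is a minimal sketch of the underlying identity, not the instruction sequence of any embodiment: `split16`, `parallel_mul16`, and `mul32x32_to_64` are hypothetical helper names, and `parallel_mul16` is a generic stand-in for a parallel multiply rather than a model of any particular IA-64 instruction.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def split16(x, n):
    """Split x into n 16-bit segments, least significant segment first."""
    return [(x >> (16 * i)) & MASK16 for i in range(n)]

def parallel_mul16(a_lanes, b_lanes):
    """Multiply corresponding 16-bit lanes independently (stand-in for a parallel multiply)."""
    return [a * b for a, b in zip(a_lanes, b_lanes)]

def mul32x32_to_64(a, b):
    """Unsigned 32-bit by 32-bit multiply built from four 16-bit partial products."""
    a0, a1 = split16(a & MASK32, 2)
    b0, b1 = split16(b & MASK32, 2)
    # Reposition the segments so that one parallel multiply yields every partial product.
    p00, p01, p10, p11 = parallel_mul16([a0, a0, a1, a1], [b0, b1, b0, b1])
    # Weight each partial product by the position of its segments and add.
    return (p00 + ((p01 + p10) << 16) + (p11 << 32)) & MASK64

print(hex(mul32x32_to_64(0xFFFFFFFF, 0xFFFFFFFF)))
```

The lane arrangement `[a0, a0, a1, a1]` against `[b0, b1, b0, b1]` plays the role of the data manipulation step: each 16-bit segment is placed opposite every segment of the other operand before the multiply.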
Six embodiments of the present invention are disclosed. The first embodiment performs a 32-bit by 32-bit multiply operation that produces a 32-bit result. The second embodiment performs an unsigned 32-bit by 32-bit multiply operation that produces a 64-bit unsigned result. The third embodiment performs a signed 32-bit by 32-bit multiply operation that produces a signed 64-bit result. The fourth embodiment performs a 64-bit by 64-bit multiply operation that produces a 64-bit result. The fifth and sixth embodiments mirror the functionality of the first and third embodiments, respectively, but are somewhat more efficient when the input operands are produced by an integer instruction in the immediately preceding cycle.
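The result widths listed above determine which partial products are actually required. The Python sketch below illustrates the arithmetic only and makes no claim about how the embodiments are coded: all function names are hypothetical, and the sign correction shown for the signed case is a standard identity chosen for illustration, not necessarily the technique the third embodiment uses.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def segs(x, n):
    """Split x into n 16-bit segments, least significant first."""
    return [(x >> (16 * i)) & 0xFFFF for i in range(n)]

def mul32_low(a, b):
    """32x32 -> 32 bits: the high-by-high partial product never reaches the result."""
    a0, a1 = segs(a & MASK32, 2)
    b0, b1 = segs(b & MASK32, 2)
    return (a0 * b0 + ((a0 * b1 + a1 * b0) << 16)) & MASK32

def mul64_low(a, b):
    """64x64 -> 64 bits; also returns how many 16-bit partial products were used."""
    sa, sb = segs(a & MASK64, 4), segs(b & MASK64, 4)
    total, used = 0, 0
    for i in range(4):
        for j in range(4 - i):  # products with i + j >= 4 fall entirely above bit 63
            total += (sa[i] * sb[j]) << (16 * (i + j))
            used += 1
    return total & MASK64, used

def mul32_signed_to_64(a, b):
    """Signed 32x32 -> signed 64 bits, derived from the unsigned product (assumed identity)."""
    ua, ub = a & MASK32, b & MASK32
    u = ua * ub
    if a < 0:
        u -= ub << 32  # undo the 2**32 bias that a's two's-complement encoding adds
    if b < 0:
        u -= ua << 32
    u &= MASK64
    return u - (1 << 64) if u >> 63 else u

print(mul32_low(0xFFFFFFFF, 0xFFFFFFFF), mul64_low(MASK64, MASK64), mul32_signed_to_64(-3, 7))
```

In this sketch the truncated 32-bit result needs three of the four partial products, and the truncated 64-bit result needs ten of the sixteen.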
In every embodiment, the present invention achieves better latencies than the prior art method of performing integer multiply operations provided by the IA-64 architecture. Also, the prior art method and the method of the present invention are not mutually exclusive, and can be scheduled to execute concurrently. Therefore, the present invention has the ability to increase the integer multiplication bandwidth of an IA-64 CPU.
The present invention can also provide an improvement in power consumption, which is especially important in applications such as laptop computers. Typically, the floating-point unit uses a lot of power, and if a program does no real floating point work, IA-64 CPUs power down the floating-point unit. Therefore, powering the floating-point unit up for an occasional integer multiply operation consumes significant power. The present invention can perform a 32-bit or 64-bit multiply operation without powering up the floating-point unit.
The present invention does not use any circuits that are dedicated exclusively to 32-bit or 64-bit integer multiply operations. Since all of the circuits used by the present invention have other multi-media functions, these circuits are more generally useful and therefore provide a better balanced CPU design. Since minimizing die area is essential to achieving higher CPU clock frequencies, and therefore higher performance, it is always desirable to include as few circuits as possible, with each circuit providing as much functionality as possible.