In recent years, many advances have been made in the field of integrated circuits intended for use in data processing systems. One such advance has been the incorporation of dedicated floating-point arithmetic logic onto the same integrated circuit as the general purpose microprocessor typically serving as the central processing unit (CPU) for a personal computer or workstation. Prior to such incorporation, specific floating-point logic was incorporated into microprocessor-based computers by way of a separate microprocessor, generally referred to as a coprocessor. These so-called "math coprocessors" typically operated in a slaved fashion relative to the CPU, responding to instructions and operands forwarded thereto by the CPU. The incorporation of floating-point units (FPUs) on-chip with the microprocessor, primarily enabled through advances in manufacturing techniques, has improved overall system performance by giving the FPU access to other on-chip resources (register file, cache, etc.), and by way of the inherently improved ability to communicate operands and control signals with the on-chip FPU.
Multiple techniques are known for implementing on-chip FPUs in microprocessors. As one example, the well-known x86-architecture microprocessors typically treat floating-point units (whether on-chip or as a coprocessor) as a separate processor. Floating-point instructions have been executed in these architectures by the pipeline sending an entire floating-point instruction to the FPU for execution as a whole. Later x86-architecture microprocessors (e.g., the PENTIUM microprocessor available from Intel Corporation) include multiple pipelines, one of which is shared, up to and including one execution stage, by both integer and floating-point instructions. Microprocessors of the so-called "reduced instruction set", or RISC, type have implemented on-chip floating-point units differently from the complex instruction set (CISC) type microprocessors (e.g., of the x86-architecture), by implementing the FPU as an execution unit. RISC processors typically include a logic function known as a scheduler (also known as a completion unit) that controls each of several execution units, including integer arithmetic logic units (ALUs), load/store units, and the floating-point units, to execute single-cycle instructions; in this arrangement, the single-cycle RISC instructions are issued to and executed by the FPU in the same manner as integer instructions are issued to and executed by integer execution units.
In addition to the forwarding or scheduling of the floating-point instructions, each microprocessor having an on-chip FPU must deal with the problem of format conversion for the operands upon which the floating-point instructions are to be executed. As is fundamental in the art, floating-point data words and operands generally include a sign bit, an exponent value (which may include a bias), and a mantissa. However, several different precision levels are available, including, for example, single precision, double precision, and extended precision as defined according to the IEEE 754 floating-point standard. The FPU, or other circuitry within the microprocessor, must of course be able to deal with the reformatting of operands or data words among these various precision formats, considering that operands of various formats may be combined in a floating-point instruction or operation.
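By way of illustration only, the sign, exponent, and mantissa fields described above can be extracted in software as in the following minimal sketch. The function names `decode_single` and `decode_double` are hypothetical, and the field widths and biases (8-bit exponent with bias 127 for single precision, 11-bit exponent with bias 1023 for double precision) are those of the IEEE 754 binary formats; nothing here represents the circuitry of any particular microprocessor.

```python
import struct


def decode_single(x):
    """Split a value, stored as IEEE 754 single precision, into its fields."""
    # Reinterpret the 32-bit single-precision encoding as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF          # 23 stored mantissa bits
    return sign, exponent, fraction


def decode_double(x):
    """Split a value, stored as IEEE 754 double precision, into its fields."""
    # Reinterpret the 64-bit double-precision encoding as an unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 sign bit
    exponent = (bits >> 52) & 0x7FF     # 11 exponent bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # 52 stored mantissa bits
    return sign, exponent, fraction
```

For example, `decode_single(1.0)` yields `(0, 127, 0)`: a clear sign bit, a biased exponent equal to the bias itself, and an all-zero stored mantissa.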
Even though floating-point data words may be represented in thirty-two bits for single precision, sixty-four bits for double precision, and eighty bits for IEEE extended precision data words, typical FPUs are constructed to operate only upon operands in the highest precision level, to save chip area and complexity, with rounding and other circuitry converting the result into the appropriate precision level desired by the program. In these and other situations, on-chip conversion of the format of floating-point operands must be comprehended in the design of the microprocessor.
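The widening of a lower-precision operand into the highest precision level may be sketched as follows. This is an illustrative software model only, not a description of any actual conversion circuit. It assumes the eighty-bit layout used by x87-style FPUs (one sign bit, a 15-bit exponent biased by 16383, and a 64-bit significand with an explicit integer bit); the function name `single_to_extended` is hypothetical.

```python
def single_to_extended(bits32):
    """Widen a 32-bit single-precision pattern to (sign, exponent, significand).

    Assumed target layout: 15-bit exponent biased by 16383 and a 64-bit
    significand whose top bit is an explicit integer bit.
    """
    sign = bits32 >> 31
    exp8 = (bits32 >> 23) & 0xFF
    frac = bits32 & 0x7FFFFF

    if exp8 == 0xFF:
        # Infinity or NaN: all-ones exponent, payload moved to the top bits.
        return sign, 0x7FFF, (1 << 63) | (frac << 40)
    if exp8 == 0:
        if frac == 0:
            # Signed zero.
            return sign, 0, 0
        # Single-precision denormal: the wider exponent range allows it to be
        # normalized, so shift the leading one into the integer bit position.
        n = frac.bit_length()
        return sign, n - 150 + 16383, frac << (64 - n)
    # Normal number: rebias the exponent and make the implicit leading one
    # explicit.
    return sign, exp8 - 127 + 16383, (1 << 63) | (frac << 40)
```

For example, `single_to_extended(0x3F800000)`, the single-precision encoding of 1.0, returns `(0, 16383, 1 << 63)`. Because every single-precision value is exactly representable in the wider format, this direction of conversion requires no rounding; rounding arises only when a result is narrowed back to the precision desired by the program.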
Furthermore, conventional x86-architecture microprocessors utilize thirty-two bit internal load/store buses to communicate data from main memory or from on-chip cache memory. This limited bus width for memory accesses requires floating-point data retrieved from memory to undergo format conversion prior to execution of the floating-point instruction. In conventional microprocessors, multiple machine cycles are required for this format conversion, as a first machine cycle is required to register the input operand or operands, followed by reformatting and storing the reformatted results in the next one or more machine cycles. The execution of each floating-point instruction using a conventional FPU architecture therefore involves, in most cases, at least one additional machine cycle for the reformatting of floating-point operands retrieved from memory. This reformatting thus introduces an inefficiency into the implementation of an on-chip floating-point unit.
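The effect of the thirty-two bit bus width may be illustrated by the following simplified model, in which a sixty-four bit double-precision operand must be gathered from two thirty-two bit transfers before it can be presented as a single operand. The function name `load_double_over_32bit_bus`, the representation of memory as a list of thirty-two bit words, and the little-endian word order are all assumptions made for this sketch; the model counts bus transfers only and does not represent the timing of any actual microprocessor.

```python
import struct


def load_double_over_32bit_bus(words, index):
    """Assemble one double-precision operand from two 32-bit words.

    `words` is a hypothetical memory image of 32-bit unsigned integers,
    with the low-order word stored first (little-endian word order).
    Returns the assembled value and the number of 32-bit transfers used.
    """
    low = words[index]        # first 32-bit transfer
    high = words[index + 1]   # second 32-bit transfer
    bits = (high << 32) | low
    # Reinterpret the assembled 64-bit pattern as a double-precision value.
    (value,) = struct.unpack("<d", struct.pack("<Q", bits))
    return value, 2
```

For example, `load_double_over_32bit_bus([0x00000000, 0x3FF80000], 0)` returns `(1.5, 2)`, the operand 1.5 having been assembled from two transfers.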