Many recent advances in the field of microprocessor design and manufacture have resulted in the improved performance of modern microprocessors. These advances include, of course, improvements in manufacturing technology that have enabled faster and smaller transistors and thus faster switching and higher levels of integration. Advances have also been made in microprocessor architecture that have resulted in improved efficiency in the processing of sequences of instructions. These architectural advances include such features as pipelined instruction execution, and superscalar architectures that issue instructions in parallel for execution.
Another architectural advance resulting in improved microprocessor performance is the implementation of floating-point units (FPUs) on the same integrated circuit with the general purpose microprocessor. An on-chip FPU provides local, dedicated, high-performance circuitry for performing complex floating-point arithmetic operations, reducing the burden on the integer arithmetic logic units of the microprocessor. Typical on-chip FPUs include their own pipeline for execution of floating-point arithmetic instructions, typically with multiple execution stages, considering that floating-point add and multiply operations generally require multiple machine cycles to execute.
Both internal communication of operands between the on-chip FPU and the integer processor core, and also external communication of floating-point operands, typically follow the well-known IEEE 754 standard for single-precision, double-precision, and extended-precision floating-point formats. In normalized form, the IEEE standard implies a leading "1", so that the mantissa field of the floating-point data word holds only the fractional portion of the mantissa, with the exponent adjusted accordingly. Numbers having values outside of the exponent ranges specified by the IEEE standard (for a particular precision) are referred to as "denormalized" operands. For example, a number smaller than the minimum limit of the IEEE standard range may be expressed in denormalized form; in this case, a leading "0" is implied so that the mantissa expresses a fractional value less than one. The presence of a denormalized operand is generally indicated by a flag in a floating-point status word.
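The distinction between the implied leading "1" of a normalized operand and the implied leading "0" of a denormalized operand can be illustrated with a minimal sketch. The sketch assumes the IEEE 754 double-precision layout (1 sign bit, 11 exponent bits, 52 fraction bits) and handles finite values only; the function names decode_double, classify, and reconstruct are hypothetical and are used here for illustration.

```python
import struct


def decode_double(x):
    """Split an IEEE 754 double into its sign, biased exponent, and fraction fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, biased_exp, fraction


def classify(x):
    """Name the operand class indicated by the exponent and fraction fields."""
    _, biased_exp, fraction = decode_double(x)
    if biased_exp == 0:
        return "zero" if fraction == 0 else "denormalized"
    if biased_exp == 0x7FF:
        return "infinity" if fraction == 0 else "nan"
    return "normalized"


def reconstruct(x):
    """Rebuild a finite value from its fields, making the implied leading bit explicit."""
    sign, biased_exp, fraction = decode_double(x)
    if biased_exp == 0:
        # Denormalized (or zero): implied leading 0, exponent fixed at the minimum.
        leading, exponent = 0, -1022
    else:
        # Normalized: implied leading 1, exponent stored with a bias of 1023.
        leading, exponent = 1, biased_exp - 1023
    significand = leading + fraction / 2**52
    return (-1) ** sign * significand * 2.0**exponent
```

In this sketch, the smallest positive double (5e-324) classifies as "denormalized" and is rebuilt with a leading 0, while 1.5 classifies as "normalized" and is rebuilt with a leading 1.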
As is also well known in the art, conventional on-chip FPUs operate according to a fixed internal format that is at least as precise as the highest precision IEEE format used by the integer processor core. Often, the internal FPU format will be of higher precision to maintain highly precise intermediate floating-point results in the execution of multiple-pass instructions. It is therefore possible that the result of a floating-point instruction may have a value that requires format conversion prior to its communication to the integer core of the processor, such format conversion requiring additional operations by the FPU itself. If the FPU is of pipelined architecture, the floating-point instruction that generates such a result cannot be immediately completed because format conversion is necessary before writeback. In this case, successive instructions in the FPU pipeline, which are in intermediate stages of execution, may be completed before the preceding instruction. This out-of-order completion may cause certain conflicts in the operation of the FPU, as will now be described.
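The need for format conversion can be illustrated by analogy. The following sketch assumes, purely for illustration, that double precision stands in for the wider internal FPU format and single precision stands in for the external IEEE format; the actual internal format of any particular FPU is not specified here. The names MIN_NORMAL_SINGLE, needs_conversion, to_external, and single_exponent_field are hypothetical.

```python
import struct

# Smallest normalized magnitude in IEEE 754 single precision.
MIN_NORMAL_SINGLE = 2.0**-126


def needs_conversion(internal_result):
    """True if a nonzero internal result falls below the external normalized range."""
    return internal_result != 0.0 and abs(internal_result) < MIN_NORMAL_SINGLE


def to_external(internal_result):
    """Round an internal (double) result to the external (single) format."""
    return struct.unpack(">f", struct.pack(">f", internal_result))[0]


def single_exponent_field(x):
    """Return the 8-bit biased exponent field of x encoded as a single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF
```

A value such as 1e-40 is normalized in the wider format but must be expressed as a denormalized operand (exponent field of zero) in the narrower one, which is the kind of result that requires the additional conversion step.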
Referring now to FIG. 1, an example of a conventional FPU pipeline is illustrated. This FPU pipeline may be fully performed by the on-chip FPU, or may partially be performed by the integer pipelines as in the case of the PENTIUM microprocessor available from Intel Corporation. Queue stage 200 initiates the floating-point pipeline, followed by decode stage 202 and issue stage 204. Three execution stages 206, 208, 210 execute, in pipelined fashion, floating-point instructions that have been issued from issue stage 204. In the example of FIG. 1, three instructions n, n+1, n+2 are presently being executed in stages 206, 208, 210, with instruction n being the earliest instruction in the sequence; upon completion of an instruction, the results are forwarded to a writeback stage and a floating-point status word is updated. The writeback stage presents the FPU results to the integer portion of the microprocessor, in program order.
Conflicts may arise in the pipeline of FIG. 1 in the case where the results of instruction n would constitute a denormalized operand in the IEEE format, but where execution stage 210 presents these results in a non-IEEE format (e.g., in normalized fashion but with additional exponent bits). This condition requires the reformatting of the results by the on-chip FPU by way of another pass through the pipeline, beginning with issue stage 204 next issuing (instead of instruction n+3) an instruction corresponding to the reformatting of the results of instruction n. As such, instruction n is not completed in the current pass illustrated in FIG. 1. However, instructions n+1, n+2 may be able to complete, for example if their results turn out not to be denormalized operands. While the out-of-order completion of instructions n+1, n+2 provides performance efficiency, an error regarding the floating-point status word can be generated in this case. This is because the floating-point status word is updated upon completion of each of instructions n+1, n+2, and is then again updated upon completion of instruction n after its reformatting is complete. While writeback of the results would be done in program order (with the proper floating-point status word value for instruction n), the floating-point status word would not be rewritten by instructions n+1, n+2 because they have already been completed. Accordingly, the status word associated with these instructions would incorrectly indicate denormalized operands, in this example. In addition, if FPU resources (such as a non-pipelined microsequencer) are required for the reformatting or other completion, additional conflicts may arise if a later instruction (n+1 or n+2) also results in a denormalized operand or other condition in which the same resource is necessary.
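The status word conflict described above can be reproduced in a toy model. The sketch below assumes a greatly simplified status word consisting of a single denormal flag that is overwritten at each instruction completion, and assumes that any instruction with a denormalized result completes only after the other instructions in flight; the names Instr, completion_order, and final_status_word are hypothetical and do not correspond to any actual FPU implementation.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    denormal_result: bool


def completion_order(instrs):
    """Instructions needing a reformatting pass complete after the others in flight."""
    first_pass = [i for i in instrs if not i.denormal_result]
    second_pass = [i for i in instrs if i.denormal_result]
    return first_pass + second_pass


def final_status_word(completions):
    """Apply status word updates in the given completion order."""
    status = {"denormal": False}
    for instr in completions:
        # Simplifying assumption: each completion overwrites the flag.
        status["denormal"] = instr.denormal_result
    return status


# Program order: n yields a denormalized result; n+1 and n+2 do not.
program = [Instr("n", True), Instr("n+1", False), Instr("n+2", False)]

expected = final_status_word(program)                  # in-order completion
actual = final_status_word(completion_order(program))  # out-of-order completion
```

Under these assumptions, in-order completion leaves the denormal flag clear after instruction n+2, whereas out-of-order completion leaves it set by the late completion of instruction n, so that the status word no longer reflects the last instruction in program order.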
A conventional approach to addressing this problem would flush the entire FPU pipeline upon detecting a denormalized result from the third execution stage 210 for any instruction. This approach reduces the performance of the FPU, however, by precluding the out-of-order completion of the later-issued instructions (e.g., instructions n+1, n+2 in FIG. 1).
By way of further background, a conventional technique used to effect out-of-order completion in pipelined architectures maintains precise exceptions by way of a reorder buffer. The reorder buffer stores the results of every operation until all prior operations are completed. This approach thus introduces overhead operations into the execution of every instruction, and may also require the implementation of a relatively large buffer in the integrated circuit.
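The behavior of such a reorder buffer can be sketched as follows. This is a minimal software model under stated assumptions, not a description of any particular hardware design: entries are keyed by a hypothetical instruction name, and the class name ReorderBuffer and its methods issue and complete are chosen here for illustration only.

```python
from collections import OrderedDict

_NOT_DONE = object()  # sentinel marking an entry whose result is not yet available


class ReorderBuffer:
    """Holds each result until all earlier-issued operations have completed."""

    def __init__(self):
        self._pending = OrderedDict()  # instruction name -> result, in program order
        self.retired = []              # (name, result) pairs, in program order

    def issue(self, name):
        """Allocate an entry at issue time, preserving program order."""
        self._pending[name] = _NOT_DONE

    def complete(self, name, result):
        """Record a result, then retire every completed entry at the head."""
        self._pending[name] = result
        while self._pending:
            head = next(iter(self._pending))
            if self._pending[head] is _NOT_DONE:
                break  # an earlier operation is still outstanding
            self.retired.append((head, self._pending.pop(head)))


rob = ReorderBuffer()
for name in ("n", "n+1", "n+2"):
    rob.issue(name)

rob.complete("n+1", 2.0)  # held: instruction n has not completed
rob.complete("n+2", 3.0)  # held as well
rob.complete("n", 1.0)    # releases all three in program order
```

The sketch also makes the overhead noted above visible: every instruction passes through the buffer, including those that never complete out of order, and the buffer must be large enough to hold every result that is waiting on an earlier operation.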