Modern microprocessors typically have multiple functional units, such as an integer unit (IU), a floating-point unit (FPU) and a vector arithmetic unit, such as an MMX unit (MXU), for executing integer, floating-point, and multi-media instructions, respectively. Each of the functional units is a pipeline of multiple stages, each of which executes a portion of an instruction or operation as it moves through the stage.
The FPU and MXU typically require more clocks to execute an instruction than the IU, because floating-point and multi-media instructions typically involve lengthier arithmetic computations. This difference can cause the IU pipeline to stall in some situations, such as when the FPU or MXU is not yet ready to receive another instruction or operation. It can also cause inefficiency in the data cache, since the data cache may have to stall when the FPU or MXU is not ready to receive data that the cache is ready to provide. To solve these problems, an instruction and data queue may be added to the FPU or MXU to receive instructions and their associated data, allowing the IU and data cache to continue running.
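The effect of such a queue can be illustrated with a minimal simulation sketch. It is not taken from any particular microprocessor; the function name count_iu_stalls and the parameters fpu_clocks and queue_depth are hypothetical, and the model assumes that the IU tries to hand off one instruction per clock and that the slower unit executes one instruction at a time.

```python
from collections import deque


def count_iu_stalls(num_instructions, fpu_clocks, queue_depth):
    """Count clocks in which the IU has an instruction ready but cannot hand it off.

    Assumed model: one loop iteration is one clock, the IU offers one
    instruction per clock, and the slower unit needs fpu_clocks clocks each.
    """
    queue = deque()
    fpu_remaining = 0  # clocks left on the instruction the slow unit is executing
    issued = 0
    stalls = 0
    while issued < num_instructions:
        # Slow-unit side: advance the current instruction, then start a queued one.
        if fpu_remaining > 0:
            fpu_remaining -= 1
        if fpu_remaining == 0 and queue:
            queue.popleft()
            fpu_remaining = fpu_clocks
        # IU side: hand off to an idle unit, else to the queue, else stall.
        if fpu_remaining == 0:
            fpu_remaining = fpu_clocks
            issued += 1
        elif len(queue) < queue_depth:
            queue.append(issued)
            issued += 1
        else:
            stalls += 1
    return stalls


if __name__ == "__main__":
    print(count_iu_stalls(4, 3, 0))  # no queue
    print(count_iu_stalls(4, 3, 4))  # four-entry queue
```

Under these assumptions, four back-to-back instructions sent to a three-clock unit stall the IU for six clocks when there is no queue, and for none when a four-entry queue is present.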
Microprocessors include the notion of a user-visible state of the microprocessor. For example, in x86 architecture processors, the user-visible state includes the user-visible register file, which includes general purpose registers such as the EAX register, registers associated with functional units, such as the floating-point registers, and other registers such as the flags register.
An instruction is not permitted to update the user-visible state, e.g., to modify the register file, until it has reached a point of completion defined by the processor architecture. This is because certain events or conditions may occur that invalidate the instruction. That is, certain events or conditions may occur such that the processor should stop executing the instruction, and in particular, should not update the user-visible state of the processor. For example, the instruction may have been speculatively fetched and entered into the execution stream based on a branch instruction prediction. If the processor later determines that the branch was mispredicted, the instructions following the branch that were speculatively executed must be invalidated and not allowed to update the user-visible state of the processor, even though they may have been partially completed by the various stages of the functional unit pipelines. Another example of an invalidating event is an exception, such as a page fault exception, general protection exception, or invalid opcode exception. Additionally, an instruction may simply be invalid to begin with in the pipeline. The most common cause of this condition is a stall or bubble caused by a miss in the instruction cache resulting in pipeline stages being void of valid instructions.
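The rule that invalidated instructions must not update the user-visible state can be sketched as follows for the branch misprediction case. This is an illustrative model only; the function name retire, the tuple layout of in_flight, and the parameter mispredicted_branch_index are assumptions introduced here, and a branch is represented as an entry with no destination register.

```python
def retire(in_flight, mispredicted_branch_index=None):
    """Return the register updates made by a program-ordered list of instructions.

    Each entry is (destination_register_or_None, value). Entries younger than
    a mispredicted branch are treated as invalid and never update the state.
    """
    regs = {}
    for index, (dest, value) in enumerate(in_flight):
        if mispredicted_branch_index is not None and index > mispredicted_branch_index:
            break  # speculatively fetched after the mispredicted branch: invalidated
        if dest is not None:
            regs[dest] = value
    return regs


if __name__ == "__main__":
    program = [("EAX", 1), (None, 0), ("EBX", 2)]  # the middle entry is the branch
    print(retire(program))
    print(retire(program, mispredicted_branch_index=1))
```

With a correct prediction both EAX and EBX are updated; when the branch at index 1 is reported as mispredicted, only the EAX update made before the branch reaches the user-visible state.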
When a functional unit such as an FPU is ready to finish executing an instruction, the FPU needs to update the user-visible state of the processor based on the particular instruction being executed. In order to update the user-visible state, the FPU must know that the instruction is still valid, i.e., that the instruction is authorized to update the user-visible state of the processor. In order to ensure that the instruction is still valid, a conventional microprocessor places the queue at the end of the functional unit that performs the instruction validation function.
For example, in microprocessor 100 of FIG. 1, an integer pipeline 104 is the functional unit that performs the instruction validation function. That is, invalidating conditions, such as branch mispredictions or exceptions, are reported to the integer pipeline 104, which keeps track of whether an instruction or operation is valid based on the conditions reported to it. The conventional microprocessor 100 of FIG. 1 places the queue 106 at the end of the integer pipeline 104, as shown. Hence, an instruction 102 must proceed through the integer pipeline 104 before being placed into the FPU queue 106. By placing the queue 106 at the end of the integer pipeline 104, it is guaranteed that no events or conditions can occur to invalidate the instruction or operation once it reaches the end of the integer pipeline 104. Therefore, the instruction is guaranteed to be valid once it reaches the end of the pipeline 104 and enters into the FPU queue 106.
However, there is a disadvantage to placing the queue at the end of the functional unit that performs the validation function. By requiring the instruction to proceed to the end of the validating functional unit pipeline before entering the queue, the instruction incurs the additional, and potentially unnecessary, latency of passing through the bottom stages of the validating functional unit pipeline before entering the queue. That is, the other functional unit may be capable of receiving the instruction and beginning to execute it at a stage of the validating functional unit pipeline well before the end of the pipeline. For example, the data cache may have already provided the data needed by the other functional unit, such as an FPU, at a stage in the middle of the validating functional unit pipeline. Hence, the clock cycles required for the instruction to pass through the remaining validating functional unit pipeline stages constitute an unnecessary latency.
An example where the additional latency is problematic is the case of an MXU that provides integer multiplication facilities for the integer unit. Because the MXU includes an integer multiplier for executing MXU multiply instructions, the integer multiplier in the integer unit could be eliminated to reduce the microprocessor circuit die size, and the integer multiply instructions could be executed instead by the MXU integer multiplier. However, because integer multiply instructions are relatively frequent in program instruction sequences, the additional latency to an integer multiply introduced by placing the MXU queue at the end of the integer unit pipeline may be intolerable.
The additional latency can be avoided by placing the MXU queue at an earlier stage of the integer pipeline. However, by placing the MXU queue architecturally at a stage after which the instruction could still be invalidated, the MXU is no longer guaranteed that the instruction is still valid once it enters the MXU queue. That is, because the MXU queue is located before the end of the integer pipeline, an invalidating condition may occur while the instruction is in the MXU queue or while the MXU is executing the instruction after having received the instruction from the queue. As an instruction proceeds through the IU and MXU pipelines, its progress in the two pipelines is no longer in lock step. Consequently, the MXU does not know whether it can update the user-visible state, since, for example, the integer unit may have invalidated the instruction during any interval of latency in the MXU queue.
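The hazard can be made concrete with a small sketch. It is a simplified model under stated assumptions, not a description of any actual design: the names Instruction and run, the six-stage integer pipeline, and the EAX destination are hypothetical, and the receiving unit is assumed to rely only on the validity it observed when the instruction entered its queue.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    dest: str
    value: int
    valid: bool = True  # status as tracked by the integer pipeline


def run(queue_stage, invalidate_stage, total_stages=6):
    """Return the register file after one instruction passes through the model.

    queue_stage is the integer pipeline stage at which the instruction enters
    the other unit's queue; invalidate_stage is the stage at which an
    invalidating event is reported (None if no such event occurs).
    """
    regs = {"EAX": 0}
    insn = Instruction("EAX", 42)
    valid_at_enqueue = None  # the only validity information the queue captures
    for stage in range(1, total_stages + 1):
        if stage == invalidate_stage:
            insn.valid = False
        if stage == queue_stage:
            valid_at_enqueue = insn.valid
    # The receiving unit updates state based solely on what it saw at enqueue.
    if valid_at_enqueue:
        regs[insn.dest] = insn.value
    return regs


if __name__ == "__main__":
    print(run(queue_stage=6, invalidate_stage=4))  # queue at end of pipeline
    print(run(queue_stage=3, invalidate_stage=4))  # queue before the invalidation
```

With the queue at the final stage, an invalidation at stage 4 is seen before the instruction is queued, and EAX is left unchanged. With the queue at stage 3, the same invalidation arrives after the instruction has been queued, and in this model EAX is updated even though the instruction is no longer valid.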
Therefore, a mechanism is needed for maintaining coherency of instruction status between functional units in the presence of the misalignment introduced by the functional unit queues.