A processor having more than one execution unit may employ out-of-order techniques in order to use the execution units in an efficient manner. A macroinstruction in a system memory, when processed by the processor, is decoded into one or more micro-operations (“u-ops”). Each u-op is to be executed by an out-of-order subsystem of the processor. The out-of-order subsystem enables more than one u-op to be executed at the same time, although the u-ops may be executed in a different order than the order in which they were received by the out-of-order subsystem. A processor having an out-of-order subsystem may include a set of architectural registers for storing execution results of u-ops in the order in which the u-ops were received by the out-of-order subsystem (storing the execution result of a u-op in an architectural register is called “retiring” the u-op). The out-of-order subsystem may include a set of temporary registers for storing execution results until such time as those results may be stored in the architectural registers.
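The relationship between out-of-order execution and in-order retirement can be illustrated with a minimal sketch. It assumes u-ops are numbered 0, 1, 2, ... in the order received by the out-of-order subsystem; the function name `retire_in_order` and the set that stands in for the temporary registers are hypothetical and are not taken from any particular processor.

```python
def retire_in_order(completion_order):
    """For each u-op that finishes executing, list the u-ops retired as a result.

    completion_order: u-op numbers (0, 1, 2, ...) in the order they finish
    executing, which may differ from the order in which they were received.
    """
    done = set()          # finished u-ops whose results sit in temporary storage
    next_to_retire = 0    # oldest u-op not yet retired
    retired_log = []
    for uop in completion_order:
        done.add(uop)
        batch = []
        # Retire strictly in program order: only the oldest pending u-op
        # may move its result to the architectural registers.
        while next_to_retire in done:
            batch.append(next_to_retire)
            next_to_retire += 1
        retired_log.append(batch)
    return retired_log


# u-op 2 finishes first but cannot retire until u-ops 0 and 1 have retired.
print(retire_in_order([2, 0, 1, 3]))
```

In this sketch a u-op that finishes early simply waits in temporary storage, which corresponds to the role of the temporary registers described above.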
One of the architectural registers may be a 32-bit register, where, for example, the entire 32-bit register may be referred to as EAX, the lowest 16 bits of the register may be referred to as AX, the lowest 8 bits of the register may be referred to as AL, and the second-lowest 8 bits of the register may be referred to as AH. An exemplary sequence of u-ops may be as follows:
        (1) mov EAX←11223344 (hex)
        (2) mov AL←CC (hex)
        (3) mov AH←BB (hex)
        (4) read EAX
This sequence of four u-ops refers to a single 32-bit register, and therefore the expected and correct result of the read u-op is the value 1122BBCC (hex).
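The aliasing of EAX, AX, AL and AH onto one 32-bit register may be easier to follow as a small software model. The following is only an illustrative sketch: the class `Register32` and its method names are hypothetical, and the model applies the four u-ops in program order to show why 1122BBCC (hex) is the expected result.

```python
class Register32:
    """Toy model of one 32-bit register with EAX/AX/AH/AL views."""

    def __init__(self):
        self.value = 0

    def write_eax(self, v):
        # Replace all 32 bits.
        self.value = v & 0xFFFFFFFF

    def write_ax(self, v):
        # Replace the lowest 16 bits, keep the upper 16 bits.
        self.value = (self.value & 0xFFFF0000) | (v & 0xFFFF)

    def write_al(self, v):
        # Replace the lowest 8 bits only.
        self.value = (self.value & 0xFFFFFF00) | (v & 0xFF)

    def write_ah(self, v):
        # Replace the second-lowest 8 bits only.
        self.value = (self.value & 0xFFFF00FF) | ((v & 0xFF) << 8)

    def read_eax(self):
        return self.value


def run_example():
    reg = Register32()
    reg.write_eax(0x11223344)  # (1) mov EAX <- 11223344
    reg.write_al(0xCC)         # (2) mov AL  <- CC
    reg.write_ah(0xBB)         # (3) mov AH  <- BB
    return reg.read_eax()      # (4) read EAX


print(hex(run_example()))
```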
Since the first, second and third u-ops are independent, they may be executed out of order. Therefore the out-of-order subsystem may use a temporary 32-bit register r1 to store the 32-bit result 11223344 of the first u-op, and may use the lowest 8 bits of a temporary 32-bit register r2 to store the 8-bit result CC of the second u-op, and may use the second-lowest 8 bits of a temporary 32-bit register r3 to store the 8-bit result BB of the third u-op.
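To see why none of r1, r2 and r3 alone holds the correct value of EAX, consider the following sketch. The contents of the unused bits of r2 and r3 are an assumption made for illustration (arbitrary filler values), and the function `merge_partials` is hypothetical: it shows which bits of each temporary register are meaningful and is not a description of a mechanism in the processor.

```python
def merge_partials(r1, r2, r3):
    """Combine the meaningful bits of the three temporary registers.

    r1: full 32-bit result of u-op (1)
    r2: only bits 7..0 are meaningful (result of u-op (2), AL)
    r3: only bits 15..8 are meaningful (result of u-op (3), AH)
    """
    return (r1 & 0xFFFF0000) | (r3 & 0x0000FF00) | (r2 & 0x000000FF)


r1 = 0x11223344
r2 = 0xDEADBECC  # upper 24 bits are arbitrary filler (assumption)
r3 = 0xDEADBBEF  # all bits except 15..8 are arbitrary filler (assumption)

expected = 0x1122BBCC
# No single temporary register yields the accurate value for EAX.
print([hex(r) for r in (r1, r2, r3) if r == expected])
print(hex(merge_partials(r1, r2, r3)))
```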
The processor will not execute the fourth u-op until all of the first three u-ops have been retired, because until such time, there is no register in the processor that will yield the accurate value for EAX. While the processor is waiting for the first three u-ops to be retired, the dispatching of u-ops that would have been received by the out-of-order subsystem following the fourth u-op is postponed, thus reducing the performance of the processor.
It should be noted that this postponement or stall of the dispatching may occur for any sequence of u-ops including a write to a partial register followed by a read of a larger register (for example writing to AL/AH/AX and reading EAX or writing to AL/AH and reading AX). It should also be noted that the sequence of u-ops may include unrelated u-ops interspersed between the write u-op(s) to the partial register(s) and the read u-op(s) of the larger register(s).
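The condition that triggers the stall can be sketched as a scan over a u-op sequence. This is a simplified illustration under stated assumptions: u-ops are represented as hypothetical `(operation, register)` pairs, only one register family (EAX/AX/AH/AL) is modeled, retirement timing is ignored, and a write to a larger register is assumed to supersede earlier writes to the partial registers it contains.

```python
# Registers strictly contained in each register (assumed single family).
CONTAINS = {
    "EAX": {"AX", "AH", "AL"},
    "AX": {"AH", "AL"},
    "AH": set(),
    "AL": set(),
}


def find_partial_register_stalls(uops):
    """Return indices of read u-ops that follow a write to a smaller,
    contained register and would therefore be postponed."""
    pending = set()   # registers written so far and not superseded
    stalls = []
    for i, (op, reg) in enumerate(uops):
        contained = CONTAINS.get(reg, set())
        if op == "write":
            # A wider write supersedes earlier writes to its sub-registers.
            pending -= contained
            pending.add(reg)
        elif op == "read" and pending & contained:
            stalls.append(i)
    return stalls


example = [("write", "EAX"), ("write", "AL"), ("write", "AH"), ("read", "EAX")]
print(find_partial_register_stalls(example))

# Unrelated u-ops between the partial write and the wider read do not
# remove the condition.
interspersed = [("write", "AL"), ("write", "EBX"), ("read", "EBX"), ("read", "AX")]
print(find_partial_register_stalls(interspersed))
```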
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity.