1. Field of the Invention
The present invention relates in general to the field of out-of-order execution microprocessors, and more specifically to out-of-order execution microprocessors whose architectures permit instructions to address variable size operands.
2. Description of the Related Art
Microprocessors include an architectural set of registers that store instruction operands. For example, the highly popular IA-32 Intel(R) architecture (also commonly referred to as the x86 architecture) includes many registers in its register set, including eight 32-bit general purpose registers, referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP. An instruction may specify registers of the architectural register set to be source registers that store source operands upon which the instruction operates to generate a result. An instruction may also specify a register of the architectural register set to be a destination register that is to receive the result of the instruction. For example, an x86 ADD EAX, EBX instruction adds the contents of the EAX and EBX registers to generate a 32-bit result (i.e., a sum) and places the 32-bit result in the EAX register. Thus, the ADD EAX, EBX instruction specifies the EAX and EBX registers as storing source operands and specifies the EAX register for receiving the result of the addition of the source operands.
Superscalar out-of-order execution processors include multiple execution units that can execute respective instructions in parallel during each clock cycle. In particular, independent instructions can execute out-of-order with respect to one another. However, an instruction that consumes, or depends upon, the result of a prior instruction in the program (a consumer instruction or dependent instruction) must wait to execute until the prior instruction (the supplier instruction) has generated its result and made the result available to the consumer instruction. The processor includes an instruction buffer (or perhaps multiple instruction buffers, such as one per execution unit) that stores instructions waiting to be executed by the execution units. The source operands specified by a consumer instruction are often the result of a supplier instruction that has already been executed by an execution unit, is currently being executed by an execution unit, or has yet to be executed by an execution unit. The instructions wait in the instruction buffer to be issued to an execution unit until the source operands are available. When an execution unit generates a result, it writes the result to a non-architecturally-visible register in a renamed register set. However, because the instructions must be retired to the architected register set in program order (due, for example, to the possibility of a branch misprediction or exception condition), it may be several clock cycles until the result is retired from the renamed register set to the architected register set. A superscalar out-of-order execution microprocessor typically improves performance by issuing instructions to an execution unit as soon as possible.
In particular, the microprocessor typically forwards supplier instruction results directly to the execution unit that will execute a consumer instruction, which may enable the consumer instruction to execute several clock cycles sooner than it otherwise would if it waited to receive the source operands from the architected register set. The microprocessor determines when a source operand is available by comparing tags output by the execution units identifying the supplier instruction whose result is being output by each execution unit with a dependency tag that identifies the supplier instruction of the source operand.
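The tag comparison described above can be sketched in software as follows. This is only an illustrative model, not a description of any particular circuit: the class name `WaitingInstruction`, the method names, and the integer tag values are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class WaitingInstruction:
    """Hypothetical instruction-buffer entry (illustrative names only)."""
    name: str
    # One dependency tag per source operand. None means the operand is
    # already available, so no supplier result is awaited for it.
    dependency_tags: List[Optional[int]]

    def snoop(self, result_tags: Set[int]) -> None:
        """Compare each outstanding dependency tag with the tags that the
        execution units output this cycle, and clear any tag that matches."""
        self.dependency_tags = [
            None if tag in result_tags else tag
            for tag in self.dependency_tags
        ]

    def ready(self) -> bool:
        """The instruction may issue once no dependency tag is outstanding."""
        return all(tag is None for tag in self.dependency_tags)


# Assumed scenario: the EAX source awaits the supplier tagged 7, while the
# EDX source is already available.
consumer = WaitingInstruction("ADD EDX, EAX", dependency_tags=[7, None])

consumer.snoop({3, 9})                 # unrelated results are output
waiting_after_first_cycle = not consumer.ready()

consumer.snoop({7})                    # the awaited supplier result appears
ready_after_second_cycle = consumer.ready()
```

In hardware each membership test in `snoop` corresponds to one comparator per dependency tag per execution unit, which is why the comparator count grows with the number of buffer entries, source operands, and execution units.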
Some processor architectures allow instructions to write a result that is smaller than the default operand size of the architected register set of the microprocessor, which is 32 bits in the x86 architecture, for example. That is, the instructions write only to a subset of the bits of an individual register in the architectural register set. For example, in the IA-32 architecture, the lower 16 bits of the eight general purpose registers map directly to the register set found in the Intel 8086 and 80286 processors and can be referenced with the names AX, BX, CX, DX, BP, SI, DI, and SP. Each of the lower two bytes of the EAX, EBX, ECX, and EDX registers can be referenced by the names AH, BH, CH, and DH (high bytes) and AL, BL, CL, and DL (low bytes). This enables older programs written for the older processors in the x86 family to execute on current IA-32 processors. For example, an x86 ADD AL, BL instruction adds the contents of the AL and BL registers to generate an 8-bit result (i.e., a sum) and places the 8-bit result in the AL register. Thus, the ADD AL, BL instruction specifies the lower 8-bit subsets of the 32-bit EAX and EBX registers as storing source operands and specifies the lower 8-bit subset of the 32-bit EAX register for receiving the result of the addition of the source operands.
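The way the AL, AH, and AX names alias portions of the 32-bit EAX register can be illustrated with a small software model. The table `FIELDS` and the helper names `read_field` and `write_field` are assumptions made for this sketch, and only the EAX family of names is modeled.

```python
# (mask, shift) for each name that aliases part of EAX. This table is an
# illustrative assumption covering only the EAX family of names.
FIELDS = {
    "EAX": (0xFFFFFFFF, 0),   # all 32 bits
    "AX": (0xFFFF, 0),        # lower 16 bits
    "AH": (0xFF, 8),          # bits 15:8
    "AL": (0xFF, 0),          # bits 7:0
}


def read_field(eax, name):
    """Return the value of the named portion of a 32-bit EAX value."""
    mask, shift = FIELDS[name]
    return (eax >> shift) & mask


def write_field(eax, name, value):
    """Return EAX with only the named portion replaced by value.

    Bits outside the named portion are preserved, and the value is
    truncated to the width of the portion (any carry out is dropped).
    """
    mask, shift = FIELDS[name]
    cleared = eax & ~(mask << shift) & 0xFFFFFFFF
    return cleared | ((value & mask) << shift)


# ADD AL, BL with assumed values EAX = 0x123456F0 and BL = 0x20:
# 0xF0 + 0x20 = 0x110, which truncates to 0x10 in the 8-bit AL.
eax = 0x123456F0
bl = 0x20
eax_after = write_field(eax, "AL", read_field(eax, "AL") + bl)
```

Here `eax_after` keeps its upper 24 bits unchanged, which is the property that gives rise to the multiple-supplier dependency discussed next.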
A side effect of allowing a supplier instruction to write a result to a subset of an architected register, i.e., to write a result that is smaller than the default operand size of an architected register, is that a subsequent consumer instruction that specifies the full default size architected register as a source operand now depends upon the result of two or more different instructions. For example, consider the following sequence of instructions of a program:
(1) ADD EAX, EBX ← supplier instruction
(2) ADD AL, CL ← supplier instruction
(3) ADD EDX, EAX ← consumer instruction
According to this instruction sequence, the ADD EDX, EAX consumer instruction specifies the 32-bit EAX register as a source operand. The lower 8 bits of the source operand are the result of the ADD AL, CL supplier instruction, whereas the upper 24 bits of the source operand are the upper 24 bits of the result of the ADD EAX, EBX supplier instruction. Thus, the consumer instruction must wait in the instruction buffer to be issued until the results of both supplier instructions are available to the execution unit that will execute the consumer instruction. There are at least two known ways to ensure that the consumer instruction waits in the instruction buffer to issue until all the results of the instructions upon which it depends are available.
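Before turning to those two approaches, the following worked example traces instructions (1) through (3) with concrete numbers to show which supplier contributes which bits. The initial register values and the helper names `add32` and `add_low8` are assumptions chosen only for illustration.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """32-bit add with wraparound, as in ADD EAX, EBX."""
    return (a + b) & MASK32


def add_low8(reg, value):
    """8-bit add into the low byte of reg, as in ADD AL, CL.

    The upper 24 bits of reg are left untouched.
    """
    low = ((reg & 0xFF) + (value & 0xFF)) & 0xFF
    return (reg & 0xFFFFFF00) | low


# Assumed initial register contents.
eax, ebx, ecx, edx = 0x11223344, 0x01010101, 0x000000FF, 0x00000001

eax_after_1 = add32(eax, ebx)             # (1) ADD EAX, EBX
eax_after_2 = add_low8(eax_after_1, ecx)  # (2) ADD AL, CL
edx_after_3 = add32(edx, eax_after_2)     # (3) ADD EDX, EAX

# The EAX value that instruction (3) consumes takes its upper 24 bits from
# instruction (1) and its lower 8 bits from instruction (2).
upper_from_1 = eax_after_2 >> 8 == eax_after_1 >> 8
low_from_2 = eax_after_2 & 0xFF
```

With these assumed inputs, neither supplier result alone equals the value that instruction (3) needs, so the consumer cannot execute correctly using only one of them.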
According to a first solution, referred to herein as serialization, the microprocessor waits to issue the consumer instruction until all the active instructions in the microprocessor that are older than the consumer instruction have retired their results to the architected register set. This guarantees that the supplier instructions on which the consumer instruction depends have updated the architectural register set such that the execution unit executing the consumer instruction will receive the proper source operand value. In the example above, this guarantees that the consumer instruction will receive the 8-bit result of the ADD AL, CL supplier instruction and the upper 24 bits of the result of the ADD EAX, EBX supplier instruction. However, this is not a high performance solution. It forfeits well-known performance advantages of superscalar out-of-order execution, because the consumer instruction must wait longer to execute than it would if the renamed register set and/or the execution units generating the supplier instruction results were allowed to immediately provide their results to the execution unit that will execute the consumer instruction.
According to a second solution, the microprocessor may include additional circuitry to determine immediately when both the supplier instruction results are available. Although this solution may yield higher performance than the first solution, an undesirable consequence of the second solution is that it requires more circuitry to implement than the first solution. In particular, even the first solution requires a large number of comparator circuits to compare the tags identifying instructions completed by the execution units with the source operand tags of the instructions waiting in the instruction buffer. Specifically, the first solution requires at least N comparators, where N is the product of: (1) the number of instructions that may be waiting in the instruction buffer; (2) the number of possible source operands specified by each instruction waiting in the instruction buffer; and (3) the number of execution units that may provide an instruction result each clock cycle. In contrast, the second solution requires at least double the number of comparators required by the first solution. This is because, as illustrated above, two different execution units may each provide a result that forms a portion of a given source operand of a waiting consumer instruction. That is, the second solution must replicate the N comparators required by the first solution once for each different portion of a source operand that may come from a different supplier instruction, which in the example above is two portions, thus requiring 2N comparators.
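The comparator arithmetic above can be written out as a short calculation. The function name `comparator_count`, its parameter names, and the example machine parameters (32 buffer entries, 2 source operands, 4 execution units) are assumptions made for illustration and do not describe any specific microprocessor.

```python
def comparator_count(buffer_entries, sources_per_instruction,
                     result_units, portions_per_source=1):
    """Lower bound on the number of tag comparators.

    N = buffer_entries * sources_per_instruction * result_units, multiplied
    by the number of separately tracked portions of each source operand.
    """
    factors = (buffer_entries, sources_per_instruction,
               result_units, portions_per_source)
    if any(f < 1 for f in factors):
        raise ValueError("all parameters must be positive")
    return (buffer_entries * sources_per_instruction
            * result_units * portions_per_source)


# Hypothetical machine: 32 buffer entries, 2 sources, 4 execution units.
n_first = comparator_count(32, 2, 4)                          # N
n_second = comparator_count(32, 2, 4, portions_per_source=2)  # 2N
```

Because the count is a product, raising any one parameter scales the total proportionally, and the per-portion multiplier applies on top of that.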
Furthermore, in the x86 architecture, the second solution requires triple the number of comparators required by the first solution because instruction sequence scenarios may occur in which a source operand is dependent upon three prior instructions, as illustrated by the following code sequence:
(4) ADD EAX, EBX ← supplier instruction
(5) ADD AH, DH ← supplier instruction
(6) ADD AL, CL ← supplier instruction
(7) ADD EDX, EAX ← consumer instruction
According to this instruction sequence, the ADD EDX, EAX consumer instruction specifies the 32-bit EAX register as a source operand. The lower 8 bits of the source operand are the result of the ADD AL, CL supplier instruction, the next 8 bits of the source operand are the result of the ADD AH, DH supplier instruction, and the upper 16 bits of the source operand are the upper 16 bits of the result of the ADD EAX, EBX supplier instruction. Thus, the consumer instruction must wait in the instruction buffer to be issued until all three of the supplier instruction results are available to the execution unit that will execute the consumer instruction. Furthermore, the control logic that receives the comparator outputs and determines whether all the source operand components are available is larger in the second solution than the first due to the increase in the number of comparator outputs. Finally, even more circuitry than the additional comparators and control logic is required with the second solution because byte enables that specify which portion of the default operand size a small result is destined for must be stored and routed to the relevant control logic.
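One way to see how a single 32-bit source operand comes to depend on up to three suppliers is to track, for each byte of the register, the most recent instruction that wrote it. The sketch below does this in software. The `LANES` table, the function names, and the use of the listing numbers as instruction identifiers are assumptions for illustration only, loosely analogous to the byte enables mentioned above.

```python
# Byte lanes of EAX written by each destination name (lane 0 = bits 7:0).
# Illustrative assumption covering only the EAX family of names.
LANES = {
    "EAX": (0, 1, 2, 3),
    "AX": (0, 1),
    "AH": (1,),
    "AL": (0,),
}


def last_writers(writes):
    """Return, per byte lane, the id of the last instruction to write it.

    writes is a list of (instruction_id, destination_name) pairs in
    program order. A lane never written stays None.
    """
    writers = [None] * 4
    for instruction_id, destination in writes:
        for lane in LANES[destination]:
            writers[lane] = instruction_id
    return writers


def suppliers_of_full_read(writes):
    """Distinct suppliers that a later read of all of EAX depends upon."""
    return {w for w in last_writers(writes) if w is not None}


# Sequence (4) through (6): three distinct suppliers for the EAX source.
three_way = suppliers_of_full_read([(4, "EAX"), (5, "AH"), (6, "AL")])

# Sequence (1) and (2): two distinct suppliers for the EAX source.
two_way = suppliers_of_full_read([(1, "EAX"), (2, "AL")])
```

In this model the upper two lanes always share a writer, since no name addresses them separately, which is consistent with three being the maximum number of suppliers in the example above.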
A negative consequence of the fact that the second solution requires more circuitry to implement than the first solution is that the additional circuitry takes up a larger amount of semiconductor die area and consumes more power than the first solution. As discussed above, the amount of additional circuitry required is a function of the number of instructions that may be waiting in the instruction buffer, the number of possible source operands specified by each instruction waiting in the instruction buffer, and the number of execution units that may provide an instruction result each clock cycle. Although the number of possible source operands specified by each instruction for a given architecture is fixed, the trend in modern microprocessors is toward increasing the number of execution units to increase the number of instructions per second that the microprocessor may execute. Furthermore, the trend is toward increasing the number of entries in the instruction buffer in order to increase instruction look-ahead, which improves instruction level parallelism and takes advantage of the larger number of execution units. Thus, as these parameters of a microprocessor increase, the amount of die area and power consumed by the additional circuitry may be substantial.
Therefore, what is needed is a solution that does not stifle the performance gains of an out-of-order superscalar microprocessor design, but that is also efficient in terms of die area and power consumption.