In an Out-Of-Order (“OOO”) microprocessor, instructions are allowed to issue and execute out of program order, provided that data dependence constraints are preserved. Because instructions may finish in an arbitrary order, they cannot be allowed to modify the architectural register file as they finish; doing so would make it difficult to restore the architectural register values accurately in the event of an exception or an interrupt. Hence, every instruction that enters the pipeline is provided a temporary entry in a physical register file where it can save its result. The temporary entries in the physical register file are eventually written into the architectural register file in program order when the instructions “retire.”
The write back module in a conventional OOO micro-architecture first writes the results of instructions executed out of order to the physical register file in the re-order buffer (ROB). The ROB keeps track of the program order in which instructions entered the pipeline and, for each of these instructions, maintains temporary register storage in the physical register file. When the oldest instructions in the ROB produce a valid result, those instructions can be safely committed. That is, the results of those instructions can be made permanent since there is no earlier instruction that can raise a misprediction or an exception that may undo the effect of those instructions. When instructions are ready to be committed, the ROB moves the corresponding values in the physical register file to the architectural register file so the instructions can retire. Therefore, through the ROB's in-order commit process, the results in the architectural register file are made permanent and architecturally visible.
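The in-order commit behavior described above can be illustrated with a minimal sketch. The class `ReorderBuffer`, its method names, and the register names `r1` and `r2` are hypothetical and chosen only for illustration; the sketch merges the ROB and its temporary storage into one structure and does not model any particular hardware.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB (hypothetical model): entries are kept in program order and
    results reach the architectural register file only from the head."""

    def __init__(self):
        self.entries = deque()   # oldest instruction sits at the left
        self.arch_regs = {}      # architectural register file

    def allocate(self, dest_reg):
        # Each instruction entering the pipeline gets temporary storage.
        entry = {"dest": dest_reg, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_back(self, entry, value):
        # Out-of-order completion updates only the temporary entry.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # Retire from the head while the oldest instruction has a result.
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired


rob = ReorderBuffer()
older = rob.allocate("r1")
younger = rob.allocate("r2")

rob.write_back(younger, 20)     # the younger instruction finishes first
first_commit = rob.commit()     # nothing retires: the head is not done

rob.write_back(older, 10)
second_commit = rob.commit()    # both retire, in program order
```

In this sketch the younger result stays in temporary storage until the older instruction completes, which is the property that later allows a full physical register file to block progress.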
Certain conventional OOO processor designs make use of distributed physical register files. Distributing the physical register file into two or more units reduces the area and routing typically required for a single unit. In a distributed design, execution ports are tied to discrete physical register file units. For example, FIG. 1 illustrates a block diagram for a conventional distributed design with two wings, wherein the physical register file is distributed over the two wings. As shown in FIG. 1, Wing 0 110 comprises Execution Unit 0 106, Architectural Register File 0 104, and Physical Register File 0 108, while Wing 1 130 comprises Execution Unit 1 122, Architectural Register File 1 120, and Physical Register File 1 124.
As instructions are issued from the Issue Queue (not shown) within Scheduler 172, they are executed by an execution unit within one of the wings. The instruction then writes its output to a respective physical register file and, as the instruction retires, its register destination is moved to a respective architectural register file. For example, an instruction that is executed in Execution Unit 0 106 writes its register output to Physical Register File 0 108. As the instruction retires, its register destination is moved to Architectural Register File 0 104.
FIG. 2 illustrates a block diagram showing a static mapping technique used in conventional distributed designs for pairing select ports in a scheduler with execution ports in an execution unit of an OOO processor. The scheduler in a conventional OOO processor selects and dispatches multiple instructions per cycle with static ordering. For example, in FIG. 2, scheduler 272 can select 4 instructions through select port 0 204, select port 1 206, select port 2 220, and select port 3 222 based on age order. Accordingly, the oldest or highest-priority instruction will be selected by select port 0 204. The select ports shown in FIG. 2 and other figures also comprise select logic used to pick the appropriate instructions. It should be noted that while the example of FIG. 2 selects only 4 instructions, a typical scheduler can have any number of select ports to select instructions.
The select ports of the conventional OOO processor illustrated in FIG. 2 are tied to specific execution ports. Accordingly, select port 0 204 is tied to execution unit port 0 208, select port 1 206 is tied to execution unit port 2 212, select port 2 220 is tied to execution unit port 1 210, and select port 3 222 is tied to execution unit port 3 214. Execution unit port 0 208 and execution unit port 1 210 are part of Execution Unit 0 232, which writes its output to Physical Register File 0 281, which in turn drains into Architectural Register File 0 280. Similarly, execution unit port 2 212 and execution unit port 3 214 write their output to Physical Register File 1 284, which in turn drains into Architectural Register File 1 283.
As shown in FIG. 2, select port 1 206 connects to execution unit port 2 212 in Wing 1 230, while select port 2 220 connects to execution unit port 1 210 in Wing 0 210. The ports are cross-linked as shown in FIG. 2 in order to balance the load between Wing 0 and Wing 1. If, for example, there are only two ready instructions in a given cycle and they are selected by the first two select ports, select port 0 204 and select port 1 206, then instead of routing both instructions to Execution Unit 0 232, the load between Wing 0 and Wing 1 is balanced by sending one instruction to execution unit port 0 and the other instruction to execution unit port 2. Select port 0 will typically pick the oldest or highest-priority instruction and the remaining ports will pick instructions in order of decreasing age.
Due to timing constraints in conventional complex scheduler design, select units and execution ports are statically mapped as shown in FIG. 2. In the illustrated design, for example, select port 0 204 is tied to execution unit port 0 208.
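The static, cross-linked pairing of FIG. 2 can be sketched as a pair of fixed lookup tables. The table names, the `dispatch` function, and the instruction labels are hypothetical; only the port-to-port and port-to-wing pairings are taken from the description of FIG. 2.

```python
# Select port -> execution unit port, as described for FIG. 2.
SELECT_TO_EXEC_PORT = {0: 0, 1: 2, 2: 1, 3: 3}

# Execution unit port -> wing (ports 0 and 1 in Wing 0, ports 2 and 3 in Wing 1).
EXEC_PORT_TO_WING = {0: 0, 1: 0, 2: 1, 3: 1}


def dispatch(ready_oldest_first):
    """Assign ready instructions to select ports in age order and return
    (instruction, execution port, wing) for each dispatched instruction."""
    dispatched = []
    num_ports = len(SELECT_TO_EXEC_PORT)
    for select_port, instr in enumerate(ready_oldest_first[:num_ports]):
        exec_port = SELECT_TO_EXEC_PORT[select_port]
        dispatched.append((instr, exec_port, EXEC_PORT_TO_WING[exec_port]))
    return dispatched


# Two ready instructions land in different wings because of the cross-link.
two_ready = dispatch(["oldest", "second_oldest"])
```

Because the tables never change, the oldest ready instruction in this sketch always goes to execution unit port 0 in Wing 0, regardless of the state of that wing.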
However, a problem arises in conventional designs such as the one illustrated in FIG. 2 when Physical Register File 0 281 is full while the oldest instruction in the scheduler is yet to be executed. Although an OOO machine executes instructions out of order, it must retire them in program order. Because the oldest instruction has not yet executed, Scheduler 272 cannot retire any younger instructions and, therefore, Physical Register File 0 281 cannot be drained to Architectural Register File 0 280. Meanwhile, Scheduler 272 cannot dispatch the oldest instruction because the select unit used to pick the oldest instruction is statically tied to execution unit port 0 208 in Wing 0 210.
Under the best circumstances, this results in inefficiency and degrades performance; in the worst circumstances, it may cause a deadlock. If a deadlock results, a flush of the entire pipeline is required to recover. Instructions are accordingly flushed out of Scheduler 272 and Physical Register File 0 281 and dispatched again. However, the same deadlock condition could arise again when instructions are re-dispatched and, therefore, statically mapping select ports and execution ports can be problematic.
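The circular wait described above can be made concrete with a small sketch. The function name, its parameters, and the cycle limit are hypothetical, and the model is simplified on the assumption that every occupied entry in Physical Register File 0 belongs to a younger instruction that has executed but cannot yet retire.

```python
def cycles_until_oldest_dispatches(prf0_capacity, prf0_held_by_younger,
                                   max_cycles=8):
    """Return the cycle in which the oldest instruction can dispatch to
    Wing 0, or None if no progress is made within max_cycles."""
    free_entries = prf0_capacity - prf0_held_by_younger
    for cycle in range(max_cycles):
        if free_entries > 0:
            # A destination entry is available, so the oldest instruction
            # dispatches through its statically assigned port in Wing 0.
            return cycle
        # The register file is full. Entries are freed only by retirement,
        # and retirement is in order: no younger result can drain until
        # the oldest instruction has executed, so nothing is freed.
        retired_this_cycle = 0
        free_entries += retired_this_cycle
    return None


progress = cycles_until_oldest_dispatches(prf0_capacity=4,
                                          prf0_held_by_younger=3)
stalled = cycles_until_oldest_dispatches(prf0_capacity=4,
                                         prf0_held_by_younger=4)
```

With one free entry the oldest instruction dispatches immediately; with none, the sketch never makes progress, which corresponds to the case that requires a pipeline flush.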
One conventional technique that has been used to address the problem of deadlock is illustrated in FIG. 3. FIG. 3 illustrates a block diagram showing a technique used in conventional distributed designs for pairing select ports in a scheduler with execution ports in an execution unit of an OOO processor wherein the scheduler is split into two blocks. The scheduler in FIG. 3 is split into Scheduler Block A 372 and Scheduler Block B 373. Select port 0 304 and select port 1 306 select and dispatch instructions from Scheduler Block A to Execution Unit 0 306, while select port 2 320 and select port 3 322 select and dispatch instructions from Scheduler Block B to Execution Unit 1 307. Execution Unit 0 306 writes its output to Physical Register File 0 308, which in turn drains its output to Architectural Register File 0 304. Similarly, Execution Unit 1 307 writes its output to Physical Register File 1 324, which in turn drains its output to Architectural Register File 1 320.
The design illustrated in FIG. 3 addresses the problem of deadlock because each of the scheduler blocks has the same number of entries as the corresponding physical register file. For example, Scheduler Block A 372 has the same number of entries as Physical Register File 0 308. Therefore, there is no way that a physical register file will be full if there are still undispatched entries in the corresponding scheduler block. Stated differently, the physical register file will always have room for any undispatched instructions from a corresponding scheduler block.
Even though deadlock is prevented, the design in FIG. 3 unfortunately still results in some inefficiency. For example, if Scheduler Block A 372 has 5 ready instructions for its select ports to choose from while Scheduler Block B 373 has only 1 ready instruction, the dispatch rate is reduced. Select port 0 304 and select port 1 306 can pick only 2 of the 5 ready instructions for dispatch at a time. Meanwhile, the select ports in Scheduler Block B 373 are not fully utilized because there is only 1 ready instruction in Scheduler Block B. Accordingly, executing all 5 ready instructions in Scheduler Block A 372 requires an extra cycle compared with the case in which the instructions could have been distributed evenly between the two blocks. Hence, the dispatch rate suffers.
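The extra cycle in this example follows from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that each block has 2 select ports, that no further instructions become ready while the listed ones dispatch, and, for the comparison case, that any port could pick from either block.

```python
from math import ceil


def cycles_split(ready_per_block, ports_per_block=2):
    """Cycles needed when each block dispatches only through its own ports."""
    return max(ceil(ready / ports_per_block) for ready in ready_per_block)


def cycles_evenly_distributed(ready_per_block, ports_per_block=2):
    """Cycles needed if the same instructions were spread over all ports."""
    total_ports = ports_per_block * len(ready_per_block)
    return ceil(sum(ready_per_block) / total_ports)


ready = [5, 1]                              # Block A and Block B
split_cycles = cycles_split(ready)          # ceil(5 / 2) = 3
even_cycles = cycles_evenly_distributed(ready)   # ceil(6 / 4) = 2
```

Under these assumptions the split scheduler needs 3 cycles where an even distribution would need 2, which is the one-cycle penalty noted above.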
Conventional processor techniques of tying scheduler select ports statically to execution units, therefore, are problematic because they can either result in deadlock or load imbalance between different units of the physical register file.