A scheduler is an important component of a processor, and can significantly impact the performance and frequency of a processor core. Performance can be measured in terms of IPC (instructions per cycle), and the achievable frequency can be limited by the critical path through the scheduler. FIG. 1 illustrates a structure of a conventional scheduler 100. As seen, the scheduler 100 includes an instruction silo 110, a wakeup block 120, and a picker block 130. Both the wakeup block 120 and the picker block 130 are typically implemented as matrices.
The scheduler 100 can be viewed as a part of the processor core where instructions wait to be dispatched into execution lanes. The following operations—allocation, picking, and wakeup—are associated with the scheduler 100. Before an instruction can be executed, it is initially allocated into the instruction silo 110. Multiple instructions can be allocated. That is, the instruction silo 110 can hold some number of instructions, and every issued instruction is allocated a row.
A primary purpose of the wakeup block 120 is to identify instructions that are ready for execution. An instruction can be ready when all operands of that instruction are available. A ready instruction can bid for grants, i.e., bid for execution privileges. Multiple ready instructions can bid simultaneously, e.g., in the same cycle. The picker block 130 picks or selects one or more ready instructions and grants permission(s) to the selected instruction(s) for dispatch to execution units (e.g., adder, multiplier, shifter, etc.) for execution.
Note that in FIG. 1, there is a “scheduler loop” between the wakeup block 120 and the picker block 130. The scheduler loop includes bids in the direction from the wakeup block 120 to the picker block 130 and grants in the direction from the picker block 130 to the wakeup block 120. The scheduler loop is due to dependencies among the instructions. For example, consider two instructions X=A+B and Y=X+C. In this instance, the second instruction Y=X+C is dependent on the first instruction X=A+B since the operand X of the second instruction is provided by the execution of the first instruction.
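The dependency in this example can be sketched in a few lines of Python. This is only an illustration: the `Instr` tuple and the `depends_on` helper are hypothetical names introduced here, and each instruction is assumed to have a single destination and a tuple of source operands.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a tuple of source operands.
Instr = namedtuple("Instr", ["dest", "srcs"])


def depends_on(consumer, producer):
    """Return True if the consumer reads the value the producer writes."""
    return producer.dest in consumer.srcs


first = Instr(dest="X", srcs=("A", "B"))   # X = A + B
second = Instr(dest="Y", srcs=("X", "C"))  # Y = X + C

# The second instruction cannot bid until the first has been granted.
second_waits_on_first = depends_on(second, first)
```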
FIG. 2 illustrates an example of a picker block 130 implemented as a matrix. A job of the picker block 130 is to arbitrate among bidding instructions. One criterion for arbitration can be the age of the instruction. For example, older instructions may be prioritized over newer instructions. In this implementation, it is assumed that instructions are divided into groups. This means that priorities should be maintained between the groups. In the picker matrix, each group owns an arbitration column through which it kills grants to instructions newer than that group. The picking operation can be split into two parts: inter-group picking and intra-group picking. In the inter-group picking, the oldest group is picked. In the intra-group picking, the oldest instruction within the group is picked.
In FIG. 2, inter-group arbitration is illustrated. In this figure, it is assumed that instructions I0, I1 and I2 respectively belong to instruction groups (or simply “groups”) G0, G1 and G2. It is also assumed that G2 is older than G1, which is older than G0. In other words, the age priority among these groups can be expressed as G2 > G1 > G0. In the picker matrix, every instruction has a row and every group has a column. Each cell of the picker matrix holds an age bit that indicates whether the instruction assigned to the cell's row is newer (value 1) or not newer (value 0) than the group assigned to the cell's column. For convenience, each cell will be addressed as a (row, column) pair. Since G2 is older than both G1 and G0, the age bits of cells (I0, G2) and (I1, G2) are both “1”. Also, since G1 is older than G0, the age bit of cell (I0, G1) is “1”.
Now assume that in a cycle, both I0 and I2 are bidding (indicated through thick lines). The columns of corresponding groups G0 and G2 are also asserted (also indicated as thickened lines) since I0 and I2 belong to groups G0 and G2. These column lines are also referred to as “conflict” lines. Since I0 belongs to G0 that is newer than G2 (age bit of cell (I0, G2) is 1), G2 kills the grant to I0 (as indicated by “X” at the I0 and G2 intersection). On the other hand, the grant to I2 is not killed.
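The inter-group arbitration just described can be modeled with a small Python sketch. It assumes the age bits given for FIG. 2 (all other cells are taken to be 0), and the names `AGE`, `GROUP_OF` and `inter_group_grants` are hypothetical; the actual picker is a circuit, not software.

```python
# Hypothetical model of the picker matrix of FIG. 2.
GROUP_OF = {"I0": "G0", "I1": "G1", "I2": "G2"}

# AGE[(row, column)] == 1 means the row's instruction is newer than the
# column's group. Age priority assumed here: G2 > G1 > G0.
AGE = {
    ("I0", "G0"): 0, ("I0", "G1"): 1, ("I0", "G2"): 1,
    ("I1", "G0"): 0, ("I1", "G1"): 0, ("I1", "G2"): 1,
    ("I2", "G0"): 0, ("I2", "G1"): 0, ("I2", "G2"): 0,
}


def inter_group_grants(bidding):
    """Return the bidding instructions whose grants survive arbitration."""
    # A bidding instruction asserts the conflict line (column) of its group.
    conflict = {GROUP_OF[instr] for instr in bidding}
    grants = set()
    for instr in bidding:
        # An asserted column kills the grant of every newer instruction.
        killed = any(AGE[(instr, group)] for group in conflict)
        if not killed:
            grants.add(instr)
    return grants


# I0 and I2 bid in the same cycle; G2 kills the grant to I0.
surviving = inter_group_grants({"I0", "I2"})
```

With I0 and I2 bidding, only I2 keeps its grant, which matches the “X” at the (I0, G2) intersection.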
As mentioned, FIG. 2 illustrates inter-group arbitration in which the oldest group is selected. But recall that there is also intra-group arbitration for each group of instructions. Once the oldest group is selected, the intra-group picker selects the oldest instruction within the oldest group.
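The two-stage selection can be summarized as follows. This is a simplified sketch that picks a single instruction per cycle and represents age with hypothetical integer fields (`group_age`, `instr_age`, larger meaning older); the matrix picker encodes age as bits rather than numbers.

```python
def pick_oldest(bidders):
    """Pick one bidder: oldest group first, then oldest instruction in it."""
    if not bidders:
        return None
    # Inter-group picking: find the oldest group among the bidders.
    oldest_group_age = max(b["group_age"] for b in bidders)
    in_oldest_group = [b for b in bidders if b["group_age"] == oldest_group_age]
    # Intra-group picking: find the oldest instruction within that group.
    return max(in_oldest_group, key=lambda b: b["instr_age"])["name"]


# Hypothetical bidders; larger age values mean older.
bidders = [
    {"name": "a", "group_age": 1, "instr_age": 9},
    {"name": "b", "group_age": 2, "instr_age": 3},
    {"name": "c", "group_age": 2, "instr_age": 5},
]
picked = pick_oldest(bidders)
```

Note that instruction "a" loses even though its own age value is the largest, because its group is not the oldest one bidding.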
FIG. 3 illustrates a circuit detailing a cell of the picker block 130 of FIG. 2. In FIG. 3, it is assumed that a signal is set/reset (asserted/not asserted) when the voltage of the signal line is high/low. During a cycle, the grant lines of instructions are subjected to two phases: precharge and evaluation. During the precharge phase, the grant line for each instruction is precharged to high. During the evaluation phase, each instruction is evaluated. If the evaluation determines that the instruction should not be granted, the grant line is discharged. In other words, if the instruction grant evaluates to false, the grant line is pulled down. FIG. 3 shows multiple paths through which a grant line is evaluated. As seen, there are three pull-down conditions for the grant of an instruction:
    the instruction is not ready (~ready_clk signal);
    the instruction does not belong to the oldest group (conflict_in signal); and
    the instruction is not the oldest instruction within the group (kill_in signal).
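The precharge and evaluation behavior of a grant line can be expressed as a boolean function. The sketch below is a logical abstraction only: the parameter names follow the signal names above, while the function name `evaluate_grant` is hypothetical, and no timing or electrical behavior is modeled.

```python
def evaluate_grant(ready, conflict_in, kill_in):
    """Model one cycle of a grant line: precharge high, then evaluate."""
    grant = True  # precharge phase: the grant line starts high
    # Evaluation phase: any one pull-down path discharges the line.
    if not ready:      # instruction is not ready (~ready_clk)
        grant = False
    if conflict_in:    # instruction is not in the oldest group
        grant = False
    if kill_in:        # instruction is not the oldest within its group
        grant = False
    return grant


# The grant stays high only when no pull-down condition holds.
granted = evaluate_grant(ready=True, conflict_in=False, kill_in=False)
```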
In the wakeup block 120, a granted instruction can wake up its dependents, and a dependent instruction can bid in the next cycle. FIG. 4 illustrates an example of this dependency wakeup approach in which the wakeup block 120 is implemented as a matrix with instructions I0, I1 and I2, with the critical path shaded (explained further below). This wakeup matrix includes one row and one column for each instruction. Each cell holds a dependency bit that indicates whether the instruction assigned to the cell's row is dependent on the instruction assigned to the cell's column. For convenience, each cell will be addressed as a (row, column) pair. In this instance, it is assumed that I1 is dependent on I2, as illustrated by the dependency bit of cell (I1, I2) being set.
In FIG. 4, it is also assumed that the grant line for I0 is asserted, i.e., the picker (not shown) has granted I0 for dispatch as indicated by the blocking of the “I0 grant” row. Each granted instruction announces or broadcasts that it has been granted, as indicated by the blocking of the “I0 broadcast” column. For each instruction dependent on the granted instruction, the dependency is cleared, which is indicated by the transition of the bit cell value from “1” to “0”. Effectively, the broadcast resets all dependency bits along that column. In this figure, the dependency bits of cells (I1, I0) and (I2, I0) are cleared. When all dependency bits are cleared for an instruction, that instruction is ready and bids for execution in the next cycle. For example, note that I2 has no remaining dependencies after the grant to I0. This means that I2 can bid in the next cycle, as indicated by the blocking of the “I2 bid” row via the dependency bit of cell (I2, I0). On the other hand, I1 is not yet ready since its dependency on I2 remains even after the grant to I0. Therefore, I1 will not bid in the next cycle.
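The wakeup step of FIG. 4 can be sketched as a column clear over a dependency matrix. The bit values below follow the description above (I1 depends on I0 and I2; I2 depends on I0); the nested-dictionary layout and the name `broadcast_grant` are assumptions made for illustration.

```python
def broadcast_grant(dep, granted):
    """Clear the granted columns; return the rows that just became ready."""
    newly_ready = []
    for row, bits in dep.items():
        if row in granted:
            continue  # a granted instruction is being dispatched
        was_waiting = any(bits.values())
        for col in granted:
            bits[col] = 0  # the broadcast resets the bit in this column
        if was_waiting and not any(bits.values()):
            newly_ready.append(row)  # all dependencies cleared: bid next cycle
    return newly_ready


# dep[row][col] == 1 means the row instruction depends on the column one.
dep = {
    "I0": {"I0": 0, "I1": 0, "I2": 0},
    "I1": {"I0": 1, "I1": 0, "I2": 1},
    "I2": {"I0": 1, "I1": 0, "I2": 0},
}
bidders_next_cycle = broadcast_grant(dep, {"I0"})
```

After the grant to I0, only I2 becomes ready; I1 still has its (I1, I2) bit set and has to wait until I2 is granted in a later cycle.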
For performance, it is desirable to have a single scheduler with a large window size. Unfortunately, a big scheduler may not be able to meet the frequency requirements, and may thus limit the frequency of the processor core. A conventional solution to this problem is to divide instructions among multiple single-instruction pickers as illustrated in FIG. 5. In this figure, the scheduler 500 includes a wakeup block 520, multiple picker blocks 530 (530-1, 530-2, 530-3, 530-4), and four execution units (XUs) 540 (540-1, 540-2, 540-3, 540-4). Each picker block 530 selects a single instruction per cycle for execution in its corresponding execution unit 540. Such a conventional solution with plural single-instruction pickers requires that instructions be divided up among the picker blocks 530, with each picker block 530 picking at most one instruction for execution per cycle. Then theoretically, as many as four instructions can be dispatched for execution in the same cycle.
Unfortunately, such division of instructions can cause problems due to fragmentation. This can lead to uneven distribution of ready instructions among the picker blocks 530. For example, the picker block 530-1 may have two ready instructions, and the picker block 530-2 may have no ready instructions. This means that between the picker blocks 530-1 and 530-2, only one instruction can be dispatched (from the picker block 530-1) for execution during a cycle. In such a scenario, there can be ready instructions that are not dispatched even though there are free execution lanes.
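The fragmentation effect can be quantified with a short calculation. The sketch below assumes one execution lane per picker block and at most one grant per picker block per cycle, as in FIG. 5; the function name `dispatch_per_cycle` and the input format are hypothetical.

```python
def dispatch_per_cycle(ready_counts):
    """Return (dispatched, stranded, idle_lanes) for one cycle.

    ready_counts[i] is the number of ready instructions held by picker i.
    """
    # Each single-instruction picker grants at most one instruction.
    per_picker = [min(count, 1) for count in ready_counts]
    dispatched = sum(per_picker)
    stranded = sum(ready_counts) - dispatched  # ready but not dispatched
    idle_lanes = per_picker.count(0)           # lanes left without work
    return dispatched, stranded, idle_lanes


# Picker 530-1 holds two ready instructions, picker 530-2 holds none.
result = dispatch_per_cycle([2, 0])
```

For this example one instruction is dispatched, one ready instruction is left waiting, and one lane stays idle in the same cycle.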