The performance of contemporary general-purpose superscalar processors, with in-order fetch and out-of-order execution, is limited by how well they utilize instruction level parallelism (ILP), which characterizes the inherent parallelism of a program algorithm. One of the obstacles to making better use of ILP is the sequential nature of most program code, and the corresponding in-order nature of the instruction fetching. In addition to relying on out-of-order dispatch and execution capabilities to make use of ILP, certain processors also rely on deeper pipelines, as pipelining allows more instructions to be in flight at once and thereby makes further use of ILP.
In such processors, in order to achieve desired performance goals (in terms of instructions per cycle (IPC)), some pieces of logic, which may be referred to as critical loops, must evaluate in a single execution cycle; otherwise, they could become obstacles to the deeper pipelining described above. One such critical loop includes instruction scheduling logic, which may be made up of wakeup logic and select logic.
Wakeup logic, which includes tracking data dependencies and checking whether source operands needed by instructions are available, determines when instructions are ready to be sent for execution. Select logic determines, based on some policy, which of these ready instructions should be sent for execution. Select logic may only be applicable when there are more ready instructions than available execution resources. Because an instruction cannot be “qualified as ready” by the wakeup logic until all the instructions on which it depends have been selected and sent for execution, the wakeup logic and the select logic form a critical loop for performance. In addition, the select logic needs to select the right instructions to schedule to the execution units, so that it moves instructions forward on the critical path (i.e., wakes up the dependent instructions). Thus, for every processor with out-of-order instruction execution, achieving the desired performance depends on select logic that (a) determines which instruction to select first when multiple ready instructions compete for a single execution resource, and (b) fits into the timing budget of the single-cycle scheduling loop.
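The interaction between wakeup logic and select logic described above can be illustrated with a minimal software sketch. This is not a hardware design and is not taken from any particular processor. It assumes that every instruction has a one-cycle latency, so that a dependent instruction can be woken up in the cycle after its producers are selected. The names `Instr`, `wakeup`, `select`, `schedule`, and `demo_window` are hypothetical and introduced only for illustration, and the placeholder select policy simply takes ready instructions in buffer order.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    srcs: tuple = ()      # names of the producer instructions this one depends on
    issued: bool = False  # set once the select logic has sent it for execution


def wakeup(window, issued_names):
    """Wakeup: an instruction is ready once all of its producers were selected."""
    return [i for i in window
            if not i.issued and all(s in issued_names for s in i.srcs)]


def select(ready, num_units):
    """Select: choose at most num_units ready instructions.

    Placeholder policy (an assumption): take them in buffer order.
    """
    return ready[:num_units]


def schedule(window, num_units):
    """Run the wakeup/select loop once per cycle; return the names issued per cycle."""
    issued_names = set()
    cycles = []
    while any(not i.issued for i in window):
        ready = wakeup(window, issued_names)
        if not ready:
            raise ValueError("no instruction can become ready (bad dependency)")
        chosen = select(ready, num_units)
        for instr in chosen:
            instr.issued = True
        # Results of this cycle's selection feed the next cycle's wakeup:
        # this feedback is the single-cycle scheduling loop.
        issued_names.update(instr.name for instr in chosen)
        cycles.append([instr.name for instr in chosen])
    return cycles


def demo_window():
    """Two independent chains (a -> b, c -> d) that join in e."""
    return [Instr("a"), Instr("b", ("a",)), Instr("c"),
            Instr("d", ("c",)), Instr("e", ("b", "d"))]
```

With two execution units, `schedule(demo_window(), 2)` issues `a` and `c` together, then `b` and `d`, then `e`, so the select step never has to choose between candidates. With a single unit, more instructions are ready than there are resources in the first two cycles, and the select policy determines which dependent instructions can be woken up next.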
The criticality of a single-cycle scheduling loop and the importance of the right select logic are also relevant for a multi-strand out-of-order processor, which implements an out-of-order fetching technique (i.e., it is capable of fetching instructions out-of-order from different strands of a multi-strand program representation generated by the compiler). Thus, unlike some conventional processors, which fetch an already ordered sequence of instructions and allocate them to the waiting buffer in-order, a multi-strand out-of-order processor is not aware of the program order of the instructions within a strand with respect to instructions from other strands also allocated in the waiting buffer.
There are many scheduling policies currently used by the select logic in conventional out-of-order processors, such as age-based policies, location-based policies, round robin policies, compiler-aided priority policies, split scheduling window approaches, and select-free scheduling. These conventional policies, however, have significant limitations when used with a multi-strand out-of-order processor. For example, an age-based policy that schedules instructions for execution based on when they were allocated is not applicable, since the instructions are allocated out-of-order and allocation order therefore no longer reflects program order. Location-based policies and round robin policies, which prioritize instructions based on their location in the waiting buffer, have lower hardware costs but suffer from decreased performance. Select-free scheduling, which removes the select logic from the critical path by pipelining select logic into a 1-cycle wakeup loop and a 2-cycle select loop, increases the clock frequency, but at the cost of compromising IPC.
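Three of the conventional policies named above can be sketched as simple selection functions over the set of ready instructions. This is an illustrative sketch only. The `Entry` fields `slot`, `alloc_seq`, and `prog_seq`, and all function names, are assumptions introduced here. In particular, `prog_seq` stands for the true program order, which, per the discussion above, a multi-strand out-of-order processor does not know across strands; it appears only so that the mismatch can be observed.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    slot: int       # position of the entry in the waiting buffer
    alloc_seq: int  # order in which the entry was allocated to the buffer
    prog_seq: int   # true program order (not visible to the select logic)


def select_age_based(ready):
    """Age-based: pick the entry that was allocated earliest."""
    return min(ready, key=lambda e: e.alloc_seq)


def select_location_based(ready):
    """Location-based: pick the entry in the lowest-numbered buffer slot."""
    return min(ready, key=lambda e: e.slot)


def make_round_robin(num_slots):
    """Round robin: rotate priority, starting just after the last chosen slot."""
    last = -1

    def select(ready):
        nonlocal last
        # Distance from the slot following the previously chosen one, wrapping.
        chosen = min(ready, key=lambda e: (e.slot - last - 1) % num_slots)
        last = chosen.slot
        return chosen

    return select


# Entries allocated out-of-order, as could happen when fetching from
# different strands: allocation order does not match program order.
READY = [
    Entry("x", slot=2, alloc_seq=0, prog_seq=7),
    Entry("y", slot=0, alloc_seq=1, prog_seq=3),
    Entry("z", slot=3, alloc_seq=2, prog_seq=5),
]
```

For the `READY` entries, `select_age_based` returns `x`, the earliest allocated entry, even though `y` comes first in program order. This illustrates why allocation age stops being a meaningful priority once instructions are allocated out-of-order. The location-based and round robin functions need only the slot number, which is consistent with their lower hardware cost, but they take no account of which instruction lies on the critical path.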
Thus, a select logic scheduling policy is needed that can make use of ILP to achieve higher performance in terms of IPC while still meeting the timing requirements of the critical single-cycle scheduling loop, without growing the complexity of the select logic in a multi-strand out-of-order processor.