In computer processors, achieving a wide “execution width” (the maximum number of instructions that can be dispatched per cycle) requires efficient support for a very large “instruction scheduling window” (conceptually defined as the range from the oldest instruction which has been executed but not yet been retired to the youngest instruction that is being considered for execution).
The performance of general-purpose superscalar processors, with in-order fetch and out-of-order execution, is limited by under-utilization of instruction level parallelism (ILP), which characterizes the inherent parallelism of a program algorithm. Superscalar processors rely heavily on out-of-order (OOO) dispatch/execution to exploit ILP. Because program code is naturally sequential and instructions are fetched and decoded in order in most superscalar machines, these machines must do three things to allow OOO dispatch: track data dependencies, use wakeup logic to check whether the source operands of each instruction are ready/available, and, only after the source operands are available, dispatch instructions OOO to execution units.
In most superscalar processors, after instructions are fetched and decoded in the processor's “Front End”, they enter the instruction scheduling window, where they are allocated buffer resources such as a re-order buffer (ROB), reservation stations (RSs, also referred to as waiting buffers), load buffers and store buffers. The scheduler is where the OOO characteristics (dynamic scheduling) of superscalar machines are achieved. Three pieces of logic are needed to perform dynamic scheduling: rename logic, wakeup/tag comparison logic, and schedule logic.
After instructions have been renamed (e.g., using a register alias table to logically map architectural or logical registers to physical registers), they wait in an RS for their source operands to become available. Each RS entry contains information about an instruction's sources, such as the physical register identifier (tag) for the source, whether the source operand is ready, and the number of cycles it takes the producer of the source's value to execute. A producer is an instruction that resolves a dependency involving a register, thereby allowing issue of a consumer instruction that uses the register as a source operand.
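The renaming step and the contents of an RS entry described above can be sketched as follows. This is a minimal illustration, not an implementation taken from any particular processor: the class names `Source`, `RSEntry` and `Renamer`, the `pending` table, and the policy of handing out physical registers in increasing order are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Source:
    tag: int               # physical register identifier (tag) of the source
    ready: bool            # True if the operand value is already available
    producer_latency: int  # cycles the producer takes to execute (0 if no producer in flight)


@dataclass
class RSEntry:
    opcode: str
    dest_tag: int          # physical register that will hold the result
    sources: List[Source]
    valid: bool = True


class Renamer:
    """Hypothetical register alias table (RAT): logical register -> physical register."""

    def __init__(self, num_logical: int) -> None:
        # Assumption: logical register i starts out mapped to physical register i.
        self.rat: Dict[int, int] = {r: r for r in range(num_logical)}
        # Physical tags whose producer has not yet executed -> producer latency.
        self.pending: Dict[int, int] = {}
        self.next_free = num_logical

    def rename(self, opcode: str, dest: int, srcs: List[int], latency: int) -> RSEntry:
        # Read the source mappings BEFORE remapping the destination.
        sources = []
        for s in srcs:
            tag = self.rat[s]
            sources.append(Source(tag, tag not in self.pending, self.pending.get(tag, 0)))
        # Allocate a fresh physical register for the destination.
        new_tag = self.next_free
        self.next_free += 1
        self.rat[dest] = new_tag
        self.pending[new_tag] = latency
        return RSEntry(opcode, new_tag, sources)


renamer = Renamer(num_logical=8)
add = renamer.rename("ADD", dest=1, srcs=[2, 3], latency=1)
mul = renamer.rename("MUL", dest=4, srcs=[5, 6], latency=3)
sub = renamer.rename("SUB", dest=7, srcs=[1, 4], latency=1)
print(sub)
```

In this sketch the SUB entry ends up with two sources that are not ready, tagged with the physical registers allocated to ADD and MUL, which is the state in which an instruction waits in the RS.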
Since instructions may be dispatched OOO from the RS, register true dependencies such as read-after-write (RAW) must be detected and resolved. The wakeup logic (or tag comparison logic) checks for such dependencies and is responsible for waking up the instructions that are waiting in the RS for their source operands to become available. Each RS entry is allocated wakeup logic that wakes up the instruction stored in it. This tag comparison is usually implemented using content addressable memory (CAM) or techniques like dependency tracking matrices. Each instruction waiting in the RS will usually have two source operands, both of which need to be available for the instruction to be woken up (i.e., made ready to be considered for scheduling).
FIG. 1 illustrates a data flow graph. SUB instruction 10 is dependent on its parent instructions (ADD 12 and MUL 14) for its source operands, i.e., it consumes the values produced by its parents. Hence, when it is allocated in the RS, the SUB instruction 10 will have to wait for its source operands to become available (ADD 12 and MUL 14 will have to produce their results first). Producer instructions can include both single-cycle instructions (e.g., ADD and SUB) and multi-cycle instructions (e.g., MUL and DIV). The producer instructions may also be consumers (ADD 12 is a consumer of NOT 16 and DIV 17; MUL 14 is a consumer of NOT 18 and XOR 19). Typically, when an instruction is dispatched (sent for execution), it will broadcast its destination tag on a “destination tag bus” (in FIG. 1, when ADD 12 and MUL 14 are dispatched, their respective destination tags will be broadcast).
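The dependencies of FIG. 1 can be written down as a small graph to see how long each consumer must wait. The sketch below is only illustrative: the latency values (in particular 3 cycles for MUL and 8 for DIV) are assumed, execution units are assumed to be unlimited, and a consumer is assumed to be dispatchable in the cycle in which the last of its producers' results becomes available.

```python
# Assumed execution latencies in cycles; only "single cycle" vs "multi-cycle"
# comes from the text, the specific numbers are hypothetical.
LATENCY = {"NOT": 1, "XOR": 1, "ADD": 1, "SUB": 1, "MUL": 3, "DIV": 8}

# Parent (producer) instructions of each instruction in FIG. 1.
PARENTS = {
    "NOT 16": [],
    "DIV 17": [],
    "NOT 18": [],
    "XOR 19": [],
    "ADD 12": ["NOT 16", "DIV 17"],
    "MUL 14": ["NOT 18", "XOR 19"],
    "SUB 10": ["ADD 12", "MUL 14"],
}


def earliest_dispatch(parents, latency):
    """Earliest dispatch cycle per instruction under the stated assumptions."""
    cycle = {}

    def visit(node):
        if node not in cycle:
            # A consumer can dispatch once every producer has finished executing.
            cycle[node] = max(
                (visit(p) + latency[p.split()[0]] for p in parents[node]),
                default=0,  # instructions with no producers can dispatch at cycle 0
            )
        return cycle[node]

    for node in parents:
        visit(node)
    return cycle


schedule = earliest_dispatch(PARENTS, LATENCY)
for name, cyc in sorted(schedule.items(), key=lambda kv: kv[1]):
    print(f"{name}: cycle {cyc}")
```

With these assumed latencies, MUL 14 can dispatch long before ADD 12, yet SUB 10 still has to wait for the slower of its two parents, which is why the RS entry tracks readiness per source.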
FIG. 2 shows an example of wakeup logic for the source operands of one consumer residing in the reservation station in a superscalar processor. The wakeup logic includes a destination tag bus 40 that transmits the broadcast tags to a comparison logic unit 30. Comparators 34 in the comparison logic unit 30 compare the broadcast destination tags with the source operand tags of a consumer (e.g., source operand tag 25 and source operand tag 27 of a consumer instruction in an RS entry 20) and indicate if there is a match. Once both source tags are matched, the instruction is considered ready and an “instruction ready” signal is output. A valid bit 22 indicates whether the contents of RS entry 20 are valid.
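A software model of the per-entry wakeup logic of FIG. 2 might look like the following. It is a behavioral sketch only: the class name `WakeupEntry`, the `snoop` method, and the representation of the destination tag bus as a list of tags broadcast in one cycle are assumptions, and real hardware performs all comparisons in parallel rather than in a loop.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WakeupEntry:
    valid: bool                       # models valid bit 22
    src_tags: Tuple[int, int]         # models source operand tags 25 and 27
    src_ready: List[bool] = field(default_factory=lambda: [False, False])

    def snoop(self, broadcast_tags: List[int]) -> None:
        """Compare each source tag with every tag on the destination tag bus."""
        for i, tag in enumerate(self.src_tags):
            # One comparator per (source operand, broadcast tag) pair.
            if any(tag == b for b in broadcast_tags):
                self.src_ready[i] = True   # a match latches the ready bit

    @property
    def instruction_ready(self) -> bool:
        # "Instruction ready" is asserted only when the entry is valid
        # and both source tags have been matched.
        return self.valid and all(self.src_ready)


entry = WakeupEntry(valid=True, src_tags=(8, 9))
entry.snoop([8])        # first producer broadcasts its destination tag
print(entry.instruction_ready)
entry.snoop([3, 9])     # second producer broadcasts alongside an unrelated tag
print(entry.instruction_ready)
```

Note that the number of comparisons in `snoop` is the number of source operands times the number of tags broadcast per cycle, and that this work is repeated for every RS entry.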
In today's superscalar architectures, the size of the instruction scheduling window directly or indirectly affects the size of hardware structures like RS (as well as ROB, register file, and load/store buffers). These hardware resources tend to scale linearly with the size of the instruction scheduling window. Also, there is an important empirical relationship between the instruction scheduling window size and sustainable execution width, which can be expressed as follows: W ~ X^2 to X^4, where W is the size of the instruction scheduling window and X is the sustainable execution width. Thus, the instruction scheduling window size scales at least quadratically with respect to execution width (i.e., in order to double execution width, the instruction scheduling window must be increased by a factor of 4 to 16, which means the size of the hardware structures like RS must also be increased by a factor of 4 to 16). Accordingly, a significant drawback of the approach in FIG. 2 is that the amount of wakeup logic hardware required scales at least quadratically with respect to execution width.
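The factor of 4 to 16 follows directly from the empirical relationship: if W is proportional to X raised to a power k between 2 and 4, then scaling X by a factor f scales W by f to the power k. A short sketch of that arithmetic, in which the function name `window_growth` is only illustrative:

```python
def window_growth(width_factor: float, exponent: float) -> float:
    """Factor by which the scheduling window W must grow when the execution
    width X grows by `width_factor`, assuming W is proportional to X**exponent."""
    return width_factor ** exponent


# Doubling the execution width at both ends of the empirical range.
for k in (2, 4):
    print(f"W ~ X^{k}: doubling X multiplies W by {window_growth(2, k):g}")
```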
Additionally, in most superscalar processors, the wakeup logic works on all the entries in the RS. The schedule logic also works on all the RS entries and, based on a ready bit set by the wakeup logic, selects possible candidates (ready instructions) for dispatch along an execution port to an execution unit. Because each RS entry requires comparison logic hardware for waking up the instruction residing in the entry, the wakeup logic hardware will also grow at least quadratically with respect to execution width. This quadratic increase not only enlarges the physical area of the instruction scheduling hardware, but also has severe clock frequency and power implications, significantly limits the ability to increase execution width, and slows processor performance if the area/timing/power issues are solved at the cost of performance (e.g., by applying microarchitecture logic and/or algorithms that are not optimized for performance to address these issues).
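The selection step performed by the schedule logic can be sketched as a scan over the ready bits of all RS entries. The text does not specify a selection policy, so the oldest-first ordering used here, the `alloc_seq` allocation sequence numbers, and the function name `select_for_dispatch` are assumptions made for illustration.

```python
from typing import List


def select_for_dispatch(ready_bits: List[bool], alloc_seq: List[int], num_ports: int) -> List[int]:
    """Return the indices of up to `num_ports` ready RS entries.

    ready_bits[i] is the ready bit set by the wakeup logic for entry i.
    alloc_seq[i] is an assumed allocation sequence number (smaller = older).
    Oldest-first selection is an assumed policy, not one stated in the text.
    """
    # The schedule logic examines every RS entry, ready or not.
    candidates = [i for i, ready in enumerate(ready_bits) if ready]
    candidates.sort(key=lambda i: alloc_seq[i])
    return candidates[:num_ports]


# Four RS entries, three of them ready, two execution ports.
print(select_for_dispatch([True, False, True, True], [5, 1, 2, 9], num_ports=2))
```

Because both the wakeup comparison and this selection scan touch every RS entry, their cost grows with the size of the RS, and therefore with the size of the instruction scheduling window.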
Processor architectures like TLS (thread-level speculation) and DE (disjoint eager execution) use out-of-order fetch techniques to enlarge the instruction window by splitting program code into multiple threads of execution fetched out of order, but use wakeup logic similar to that used in superscalar processors, and therefore also suffer from quadratic scaling of wakeup logic hardware with respect to execution width. Other architectures, such as those used in various multiscalar processors (e.g., Pinot), mitigate the quadratic growth of wakeup logic by splitting execution resources into multiple processing elements connected in a ring structure. Execution width is increased by increasing the number of processing elements, without increasing ring interconnect bandwidth, leading to a linear growth in the wakeup logic. However, this approach is subject to ring bandwidth limits and also increases the latency with which operands are delivered between instructions executing on different processing elements.
Accordingly, a need exists for more efficient wakeup logic methods and corresponding hardware.