A primary function of a processor is to perform a stream of operations, such as a stream of computer instructions. Some processors are designed to completely perform one operation in the stream before beginning to perform the next operation. With these "in-order" processors, the result of one operation is correctly used by later operations that "depend" on it. Consider the following instructions:
Load memory-1.fwdarw.register-X PA1 Add register-X register-Y.fwdarw.register-Z. The first instruction loads the content of memory-1 into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The second instruction is a "child" operation that depends on the first instruction, or "parent" operation. If the result of the first instruction is not stored in register-X before the second instruction is executed, an incorrect result will be stored in register-Z. Note that a single operation may have more than one parent, more than one child, and may be both a parent and a child with respect to different operations.
To improve a processor's performance, operations can be performed "out-of-order." For example, if data for one instruction in a stream is not ready at a particular time, the processor may execute another instruction later in the stream. In this case, a "scheduler" can schedule instructions so that a child instruction will not be performed before its parent instruction. This improves processor performance because the processor does not remain idle until the first instruction's data is ready.
Computer instructions are not the only operations that have such dependencies. For example, memory operations may be scheduled so that information is stored into a memory location before information is read from that memory location by a later operation. Other examples include scheduling operations based on limited execution resources, memory resources, register resources, slot availability or bus availability. By way of example, the scheduling of micro-operations, also known as ".mu.ops" or "uops," will be used herein to describe known scheduling techniques.
FIG. 1 is an overview of a known system for processing instructions and uops. The system includes an instruction fetch and decode engine 110 that decodes an instruction stream into a series of in-order ops that represent the data flow of the instruction stream. The instructions can be decoded, for example, into uops with two logical sources and one logical destination. The uops are "issued" from the instruction fetch and decode engine 110 to a renaming and allocation unit 120. If a processor has only a limited number of physical registers, the renaming and allocation unit 120 maps logical register references to physical register references.
The uops are then sent to a scheduler 130, which stores several pending uops and selects from this group, or "queue," the uop or uops that will be performed next. The scheduler 130 selects uops such that a child uop will not be performed before its parent uop. That is, the scheduler 130 decides if every source register used by a uop is ready to be used. If all of the uop's sources are ready, and if execution resources are available, the uop is sent, or "dispatched," to a execution resource 140 where the operation is performed. Thus, uops are dispatched based on data flow constraints and resource availability, not the original ordering of the stream.
Known schedulers are typically based on the "Tomasulo" scheduler. FIG. 2, a block diagram of such a Tomasulo scheduler, shows two issued uops, Add1 and Add2, that have been received by a scheduler 200. Each uop has two sources and a destination. Add1 sums the contents of register 1 (r1) with the contents of r2. The result is stored in r3. Add2 sums the contents of r3 with the contents of r2 and stores the result in r4. As can be seen, Add2 depends on, and is the child of, Add1 . The scheduler 200 includes a ten-bit scoreboard 210 that is used to keep track of which registers are ready. Each bit represents a register, and, for example, a "0" indicates that the register is not ready while a "1" indicates that the register is ready. If Add1 has not been executed, the bit associated with r3 in the scoreboard 210 is set to "0," indicating that r3 is not ready.
An active scheduler 220 uses the scoreboard 210 to determine if a uop is ready for dispatch. For example, the active 220 scheduler looks at the bits associated with r3 and r2 when considering Add2. If the scoreboard 210 reflects that both sources are ready, the active scheduler 220 dispatches the uop for execution. If either source is not available, the uop is not dispatched. After the uop is executed, the scoreboard 210 is updated to reflect that 4 is now ready.
FIG. 3 illustrates circuitry associated with a Tomasulo scheduler. When a uop is written, or allocated, into the Tomasulo scheduler, its sources are read from the scoreboard 210. If the scoreboard 210 indicates that the sources are ready, the uop is ready to schedule. Sources that are ready in the scoreboard 210 are marked ready in the scheduler. Sources that are not ready will monitor the result bus. The value of a pending uop's source register 310 is matched against the value of completed uops on the destination, or result, bus using a group of compares 320. The outputs from the group of compares 320 are input to a wide OR 330, and the output of the wide OR is stored as a ready bit 340 for the first source. Similar logic (not shown in FIG. 3) is performed to generate a ready bit for the pending uop's second source. When all of the pending uop's sources are ready, as determined by the output of the logic gate 350, the uop is ready for dispatch. This logic is repeated for each pending uop, such as entries 1 to n. If multiple uops are ready to dispatch, priority logic 360 determines which uop will be dispatched. A lookup is performed to determine the destination register 370 of the dispatching uop, and this value is driven on a result bus.
The Tomasulo scheduler uses a "tight" scheduling loop as shown in FIG. 4. For each pending uop, the scheduler monitors the result bus and compares the destination of executed uops with the pending uop's sources at 410. Next, the scheduler performs ready determination logic 420 to determine the dispatch readiness of the pending uop. For every source used by the pending uop, the results of the comparison performed at 410 are ORed at 430. The results for each source are then ANDed at 440. Only if every source is ready does the scheduler determine that the uop is ready for dispatch.
Several uops may be ready for dispatch at one time. If more than one uop is ready, prioritization is performed at 450 to determine which of the ready uops should be dispatched first. Finally, the pending uop is dispatched at 460. When a uop is dispatched, the scheduler repeats the actions described above, resulting in the tight scheduling loop that determines when pending uops are ready for execution.
There are a number of disadvantages, however, to known scheduling techniques. For example, the basic motivation for increasing clock frequencies is to reduce instruction latency. Suppose that a part of a program contains a sequence of N instructions, I.sub.1, I.sub.2, . . . , I.sub.N. This part of the program may also contain any other instructions. Suppose also that each instruction requires, as an input, the result of the previous instruction. Such a program cannot be executed in less time than T=L.sub.1 +L.sub.2 +. . . +L.sub.N, where L.sub.n is the latency of instruction I.sub.n, even if the processor was capable of executing a very large number of instructions in parallel. Hence, the only way to execute the program faster is to reduce the latencies of the instructions.
Moreover, when each uop in a stream is dependent on the previous uop, a scheduler must perform one full iteration of the tight scheduling loop for each dispatched uop. This becomes the minimum "latency" of each uop. The latency of a uop may be defined as the time from when its input operands are ready until its result is ready to be used by another uop. Additionally, the speed of an instruction through the multi-stage system shown in FIG. 1 is limited by the speed of the slowest unit, or "weakest link," in the chain.
The speed of a processor in uops-per-second, or S, can be expressed as S=P/L, where P is the average parallelism and L is the average uop latency in seconds. A key advantage of a scheduler is that it increases the value P, which improves the processor's performance. However, an execution unit is typically able to execute a common uop, such as an add, with a latency that is less than the latency of the tight scheduling loop. Therefore, the use of the scheduler also increases the value of L, which limits the processor's performance.
For example, comparing the destination register of dispatched uops to all sources of all pending uops may take a long time, such as from 4 to 6 gate operations. This--in addition to the ready determination logic which may take 1 or 2 gates, the prioritization which may take another 1 or 2 gates and destination lookup of 2 or 3 gates--results in a tight loop that takes from 8 to 13 gate operations. Moreover, the scheduler may have to monitor a number of different result buses, which increases the amount of comparing that must be performed. The growing number of registers used in processors, as well as the increasing frequencies of processor operation, make the current system of scheduling operations impractical.