The present invention relates to the scheduling of operations in a processor. More particularly, the present invention relates to a method and apparatus for scheduling operations using a dependency matrix.
A primary function of a processor is to perform a stream of operations, such as a stream of computer instructions. Some processors are designed to completely perform one operation in the stream before beginning to perform the next operation. With these "in-order" processors, the result of one operation is correctly used by later operations that "depend" on it. Consider the following instructions:
Load memory-1 → register-X
Add register-X register-Y → register-Z.
The first instruction loads the content of memory-1 into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The second instruction is a "child" operation that depends on the first instruction, or "parent" operation. If the result of the first instruction is not stored in register-X before the second instruction is executed, an incorrect result will be stored in register-Z. Note that a single operation may have more than one parent, more than one child, and may be both a parent and a child with respect to different operations.
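By way of illustration only, the parent and child relationship described above can be sketched in Python. The register and memory contents, and the helper names `load`, `add`, `run_in_order`, and `run_child_first`, are hypothetical and are not taken from any actual processor.

```python
# Hypothetical initial state; the values are chosen only for illustration.
MEMORY = {"memory-1": 5}
REGISTERS = {"X": 0, "Y": 7, "Z": 0}


def load(regs, mem, address, dest):
    """Load memory[address] -> register dest."""
    regs[dest] = mem[address]


def add(regs, src_a, src_b, dest):
    """Add register src_a and register src_b -> register dest."""
    regs[dest] = regs[src_a] + regs[src_b]


def run_in_order():
    """Parent (Load) completes before child (Add): register-Z is correct."""
    regs = dict(REGISTERS)
    load(regs, MEMORY, "memory-1", "X")
    add(regs, "X", "Y", "Z")
    return regs["Z"]


def run_child_first():
    """Child (Add) runs before its parent (Load): it reads a stale register-X."""
    regs = dict(REGISTERS)
    add(regs, "X", "Y", "Z")
    load(regs, MEMORY, "memory-1", "X")
    return regs["Z"]
```

With these assumed values, `run_in_order()` yields 12, while `run_child_first()` yields 7 because the Add used the stale content of register-X.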
To improve a processor's performance, operations can be performed "out-of-order." For example, if data for one instruction in a stream is not ready at a particular time, the processor may execute another instruction later in the stream. In this case, a "scheduler" can schedule instructions so that a child instruction will not be performed before its parent instruction. This improves processor performance because the processor does not remain idle until the first instruction's data is ready.
Computer instructions are not the only operations that have such dependencies. For example, memory operations may be scheduled so that information is stored into a memory location before information is read from that memory location by a later operation. Other examples include scheduling operations based on limited execution resources, memory resources, register resources, slot availability or bus availability. By way of example, the scheduling of micro-operations, also known as "μops" or "uops," will be used herein to describe known scheduling techniques.
FIG. 1 is an overview of a known system for processing instructions and uops. The system includes an instruction fetch and decode engine 110 that decodes an instruction stream into a series of in-order ops that represent the data flow of the instruction stream. The instructions can be decoded, for example, into uops with two logical sources and one logical destination. The uops are "issued" from the instruction fetch and decode engine 110 to a renaming and allocation unit 120. If a processor has only a limited number of physical registers, the renaming and allocation unit 120 maps logical register references to physical register references.
The uops are then sent to a scheduler 130, which stores several pending uops and selects from this group, or "queue," the uop or uops that will be performed next. The scheduler 130 selects uops such that a child uop will not be performed before its parent uop. That is, the scheduler 130 decides if every source register used by a uop is ready to be used. If all of the uop's sources are ready, and if execution resources are available, the uop is sent, or "dispatched," to an execution resource 140 where the operation is performed. Thus, uops are dispatched based on data flow constraints and resource availability, not the original ordering of the stream.
Known schedulers are typically based on the "Tomasulo" scheduler. FIG. 2, a block diagram of such a Tomasulo scheduler, shows two issued uops, Add1 and Add2, that have been received by a scheduler 200. Each uop has two sources and a destination. Add1 sums the contents of register 1 (r1) with the contents of r2. The result is stored in r3. Add2 sums the contents of r3 with the contents of r2 and stores the result in r4. As can be seen, Add2 depends on, and is the child of, Add1. The scheduler 200 includes a ten-bit scoreboard 210 that is used to keep track of which registers are ready. Each bit represents a register, and, for example, a "0" indicates that the register is not ready while a "1" indicates that the register is ready. If Add1 has not been executed, the bit associated with r3 in the scoreboard 210 is set to "0," indicating that r3 is not ready.
An active scheduler 220 uses the scoreboard 210 to determine if a uop is ready for dispatch. For example, the active scheduler 220 looks at the bits associated with r3 and r2 when considering Add2. If the scoreboard 210 reflects that both sources are ready, the active scheduler 220 dispatches the uop for execution. If either source is not available, the uop is not dispatched. After the uop is executed, the scoreboard 210 is updated to reflect that r4 is now ready.
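The scoreboard check described above can be sketched, by way of example only, as follows. The class `Scoreboard`, the function `can_dispatch`, and the dictionary layout used for Add1 and Add2 are illustrative assumptions and do not describe actual circuitry.

```python
NUM_REGISTERS = 10  # ten-bit scoreboard, as in the FIG. 2 example


class Scoreboard:
    """One ready bit per register: 0 = not ready, 1 = ready."""

    def __init__(self, ready_registers):
        self.bits = [0] * NUM_REGISTERS
        for reg in ready_registers:
            self.bits[reg] = 1

    def is_ready(self, reg):
        return self.bits[reg] == 1

    def mark_ready(self, reg):
        self.bits[reg] = 1


def can_dispatch(scoreboard, uop):
    """A uop may be dispatched only if every one of its sources is ready."""
    return all(scoreboard.is_ready(src) for src in uop["sources"])


# Add1: r1 + r2 -> r3, and Add2: r3 + r2 -> r4 (register numbers from the text).
ADD1 = {"name": "Add1", "sources": (1, 2), "dest": 3}
ADD2 = {"name": "Add2", "sources": (3, 2), "dest": 4}
```

Assuming r1 and r2 start out ready, `can_dispatch` accepts Add1 and rejects Add2 until `mark_ready(3)` records that Add1 has produced r3.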
FIG. 3 illustrates circuitry associated with a Tomasulo scheduler. When a uop is written, or allocated, into the Tomasulo scheduler, its sources are read from the scoreboard 210. If the scoreboard 210 indicates that the sources are ready, the uop is ready to schedule. Sources that are ready in the scoreboard 210 are marked ready in the scheduler. Sources that are not ready will monitor the result bus. The value of a pending uop's source register 310 is matched against the value of completed uops on the destination, or result, bus using a group of compares 320. The outputs from the group of compares 320 are input to a wide OR 330, and the output of the wide OR is stored as a ready bit 340 for the first source. Similar logic (not shown in FIG. 3) is performed to generate a ready bit for the pending uop's second source. When all of the pending uop's sources are ready, as determined by the output of the logic gate 350, the uop is ready for dispatch. This logic is repeated for each pending uop, such as entries 1 to n. If multiple uops are ready to dispatch, priority logic 360 determines which uop will be dispatched. A lookup is performed to determine the destination register 370 of the dispatching uop, and this value is driven on a result bus.
The Tomasulo scheduler uses a "tight" scheduling loop as shown in FIG. 4. For each pending uop, the scheduler monitors the result bus and compares the destination of executed uops with the pending uop's sources at 410. Next, the scheduler performs ready determination logic 420 to determine the dispatch readiness of the pending uop. For every source used by the pending uop, the results of the comparison performed at 410 are ORed at 430. The results for each source are then ANDed at 440. Only if every source is ready does the scheduler determine that the uop is ready for dispatch.
Several uops may be ready for dispatch at one time. If more than one uop is ready, prioritization is performed at 450 to determine which of the ready uops should be dispatched first. Finally, the pending uop is dispatched at 460. When a uop is dispatched, the scheduler repeats the actions described above, resulting in the tight scheduling loop that determines when pending uops are ready for execution.
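The tight scheduling loop of FIG. 4 can be modeled in software, purely as a sketch, as shown below. Several simplifications are assumed and are not stated in the description: one uop is dispatched per iteration, the destination of a dispatched uop appears on the result bus in the next iteration, and prioritization picks the oldest ready uop. The function name `scheduling_loop` and the tuple layout are hypothetical.

```python
def scheduling_loop(pending, initially_ready):
    """Return the dispatch order for `pending`, a list of
    (name, sources, dest) tuples given in program order."""
    # Ready bits per source, initialized from the scoreboard at allocation.
    ready_bits = {
        name: [src in initially_ready for src in sources]
        for name, sources, _ in pending
    }
    waiting = list(pending)
    result_bus = []  # destination tags driven by the previous dispatch
    order = []

    while waiting:
        # 410 and 430: compare each source with every tag on the result
        # bus and OR the matches into that source's (sticky) ready bit.
        for name, sources, _ in waiting:
            for i, src in enumerate(sources):
                matched = any(src == tag for tag in result_bus)
                ready_bits[name][i] = ready_bits[name][i] or matched

        # 440: a uop is ready only if all of its source ready bits are set.
        ready = [uop for uop in waiting if all(ready_bits[uop[0]])]
        if not ready:
            raise RuntimeError("no pending uop can become ready")

        # 450: prioritization (assumed here: oldest ready uop first).
        chosen = ready[0]

        # 460: dispatch, then drive the destination onto the result bus.
        waiting.remove(chosen)
        order.append(chosen[0])
        result_bus = [chosen[2]]

    return order
```

In this model a chain of dependent uops needs one full pass through the compare, OR, AND, prioritize, and dispatch steps per uop, which is the behavior the description attributes to the tight loop.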
There are a number of disadvantages, however, to known scheduling techniques. For example, the basic motivation for increasing clock frequencies is to reduce instruction latency. Suppose that a part of a program contains a sequence of N instructions, I_1, I_2, . . . , I_N. This part of the program may also contain any other instructions. Suppose also that each instruction requires, as an input, the result of the previous instruction. Such a program cannot be executed in less time than T = L_1 + L_2 + . . . + L_N, where L_n is the latency of instruction I_n, even if the processor were capable of executing a very large number of instructions in parallel. Hence, the only way to execute the program faster is to reduce the latencies of the instructions.
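The lower bound on execution time can be checked with a short sketch that assumes an idealized processor with unlimited execution resources, so that each instruction starts as soon as all of its parents have finished. The function name `min_execution_time` and the latency values used with it are hypothetical.

```python
def min_execution_time(latencies, parents):
    """Earliest completion time with unlimited parallelism.

    latencies[i] is the latency of instruction i, and parents[i] lists the
    indices of earlier instructions whose results instruction i needs.
    """
    finish = []
    for i, latency in enumerate(latencies):
        # An instruction cannot start before all of its parents finish.
        start = max((finish[p] for p in parents[i]), default=0)
        finish.append(start + latency)
    return max(finish, default=0)


# Three instructions with assumed latencies of 1, 3, and 2 time units.
LATENCIES = [1, 3, 2]
CHAIN = [[], [0], [1]]        # each instruction depends on the previous one
INDEPENDENT = [[], [], []]    # no dependencies at all
```

For the dependent chain the result is 1 + 3 + 2 = 6 time units regardless of how many instructions could run in parallel, whereas the independent case finishes in 3 time units, the longest single latency.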
Moreover, when each uop in a stream is dependent on the previous uop, a scheduler must perform one full iteration of the tight scheduling loop for each dispatched uop. This becomes the minimum "latency" of each uop. The latency of a uop may be defined as the time from when its input operands are ready until its result is ready to be used by another uop. Additionally, the speed of an instruction through the multi-stage system shown in FIG. 1 is limited by the speed of the slowest unit, or "weakest link," in the chain.
The speed of a processor in uops-per-second, or S, can be expressed as S=P/L, where P is the average parallelism and L is the average uop latency in seconds. A key advantage of a scheduler is that it increases the value P, which improves the processor's performance. However, an execution unit is typically able to execute a common uop, such as an add, with a latency that is less than the latency of the tight scheduling loop. Therefore, the use of the scheduler also increases the value of L, which limits the processor's performance.
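The trade-off expressed by S=P/L can be illustrated numerically. The parallelism and latency figures below are hypothetical and are chosen only to show that a gain in P can be partly offset by a longer L; they are not measurements of any processor.

```python
def uops_per_second(parallelism, latency_seconds):
    """S = P / L, with P the average parallelism and L the average
    uop latency in seconds."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return parallelism / latency_seconds


# Hypothetical figures: without a scheduler, one uop at a time at 1 ns each;
# with a scheduler, three uops in flight but 2 ns per uop because the
# tight scheduling loop is slower than the execution unit.
WITHOUT_SCHEDULER = {"parallelism": 1.0, "latency_seconds": 1.0e-9}
WITH_SCHEDULER = {"parallelism": 3.0, "latency_seconds": 2.0e-9}


def speedup(before, after):
    """Ratio of S after a change to S before it."""
    return uops_per_second(**after) / uops_per_second(**before)
```

Under these assumed figures the scheduler triples P but doubles L, so `speedup(WITHOUT_SCHEDULER, WITH_SCHEDULER)` is about 1.5 rather than 3.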
For example, comparing the destination register of dispatched uops to all sources of all pending uops may take a long time, such as from 4 to 6 gate operations. Together with the ready determination logic, which may take 1 or 2 gates, the prioritization, which may take another 1 or 2 gates, and the destination lookup of 2 or 3 gates, this results in a tight loop that takes from 8 to 13 gate operations. Moreover, the scheduler may have to monitor a number of different result buses, which increases the amount of comparing that must be performed. The growing number of registers used in processors, as well as the increasing frequencies of processor operation, makes the current system of scheduling operations impractical.
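The total of 8 to 13 gate operations follows from summing the per-stage ranges given above, as the following sketch confirms. The stage labels and the names `TIGHT_LOOP_STAGES` and `tight_loop_gate_range` are illustrative.

```python
# (minimum, maximum) gate operations per stage, as stated in the text.
TIGHT_LOOP_STAGES = {
    "compare": (4, 6),
    "ready determination": (1, 2),
    "prioritization": (1, 2),
    "destination lookup": (2, 3),
}


def tight_loop_gate_range(stages):
    """Sum the per-stage minimums and maximums over the whole loop."""
    minimum = sum(low for low, _ in stages.values())
    maximum = sum(high for _, high in stages.values())
    return minimum, maximum
```

Here `tight_loop_gate_range(TIGHT_LOOP_STAGES)` returns `(8, 13)`, matching the range stated for the loop.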
In accordance with an embodiment of the present invention, a jump operation to be scheduled in a processor is received. It is determined if previous jump operations in the stream have not been dispatched for execution, and the received jump operation is scheduled after all previous jump operations in the stream have been dispatched for execution.
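The ordering rule for jump operations can be sketched in software as follows. This is only one possible model and is not presented as the claimed apparatus: the class `JumpOrderingQueue`, its method names, and the use of a set per pending jump to record earlier undispatched jumps are all assumptions made for illustration.

```python
class JumpOrderingQueue:
    """Tracks, for each pending jump, the earlier jumps in the stream
    that have not yet been dispatched for execution."""

    def __init__(self):
        # jump id -> set of earlier jump ids still awaiting dispatch
        self.waiting_on = {}

    def receive(self, jump_id):
        # Every jump still pending when this one arrives is earlier in
        # the stream, so the new jump must wait for all of them.
        self.waiting_on[jump_id] = set(self.waiting_on)

    def is_schedulable(self, jump_id):
        # Schedulable only once no earlier jump remains undispatched.
        return not self.waiting_on[jump_id]

    def dispatch(self, jump_id):
        if not self.is_schedulable(jump_id):
            raise ValueError("an earlier jump has not been dispatched")
        del self.waiting_on[jump_id]
        # Clear this jump from the wait sets of all later jumps.
        for blockers in self.waiting_on.values():
            blockers.discard(jump_id)
```

In this model, if jumps j1, j2, and j3 are received in that order, only j1 is schedulable at first; j2 becomes schedulable once j1 is dispatched, and j3 once j2 is dispatched.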