1. Field of the Invention
The present invention relates generally to scheduling instructions for execution within a digital processor and, more particularly, to ensuring that the processor separates instructions by an appropriate number of clock cycles to allow for the proper handling of exceptions and hazards.
2. Description of the Related Art
Designers of modern processors and computers have increased the operating speeds and efficiency of their designs through a variety of methods and structures. These methods and structures have in part focused on modifying traditional microprocessors that implement "in-order" instruction pipelines. In-order processors usually fetch an instruction stream from the system memory, execute the instruction stream in a sequential program order, and dispatch loads and stores in the sequential program order.
The instructions making up the program code in digital processors typically fall into two broad categories: producer instructions and consumer instructions. Producer instructions, as the term implies, produce results during their execution that consumer instructions in turn use when they execute. In other words, consumer instructions rely on the results of the producer instructions as part of their execution process. For example, suppose a program code instructs the processor to add two numbers, A and B, and to store the sum in location C. The add instruction is a producer instruction; it produces the sum of the numbers A and B and makes that sum available to the store instruction. The store instruction, on the other hand, is a consumer instruction because it relies on the result of the add instruction, i.e., the sum of A and B. In-order processing of the instruction stream ensures that each consumer instruction in the instruction stream observes stores from the producer instructions in the same order. For the same reason, however, the throughput of in-order processors has an inherent limitation.
In more sophisticated modern processors and similar hardware structures, the designers have increased instruction throughput by executing instructions either in parallel or "out-of-order" with respect to the original program or instruction sequence. Out-of-order processors differ from in-order, sequential processors in that they do not execute all instructions in the same order as that of the original instruction sequence. Rather, out-of-order processors employ a variety of techniques aimed at increasing overall efficiency and throughput of the processor by rearranging the order of execution of the original instruction sequence. Out-of-order processors as described here are generally well understood in the prior art.
Frequently, one can improve the operation of an out-of-order processor by making the order of execution of the instructions more flexible. Auxiliary hardware structures and special techniques for treating the instruction sequence help to achieve more flexibility in the instruction execution order. These structures and techniques include: (1) instruction fetchers that employ branch prediction, (2) parallel decoders, (3) large reorder buffers, (4) collision prediction, (5) dependency determination, and (6) renaming of destination registers. Moreover, the structures and techniques for treating the incoming instruction sequence may affect the processor's throughput by synergistic interactions with one another.
Generally, two types of out-of-order processors exist. A first type of out-of-order processor employs a single execution unit. This type of out-of-order processor improves the performance of the execution unit by exploiting the delays associated with the other processing units. For example, this type of processor seeks to keep the execution unit operating during delays inherent in a cache memory's retrieval of data from the main memory.
A second type of out-of-order processor employs multiple execution units. This type of out-of-order processor uses techniques that enhance performance by keeping the execution units operating concurrently as much as possible. The concurrent execution of instructions by multiple execution units improves the processor's overall ability to execute instructions in a flexible manner. Flexible execution of instructions may in turn improve the overall performance of the processor and the computer system.
Out-of-order processors typically include a scheduler that has the responsibility of scheduling instructions for execution within the processor's execution unit (or units). Generally, an out-of-order processor fetches an instruction stream from memory and executes ready instructions in the stream ahead of earlier instructions that are not ready. Out-of-order execution of the instructions improves processor performance because the instruction execution pipeline of the processor does not stall while assembling source data (i.e., operands) for a non-ready instruction.
As a part of scheduling instructions, the scheduler determines when an instruction is ready for execution. A ready instruction is typically an instruction that has fully assembled source data and sufficient available execution resources of the appropriate variety (e.g., integer execution unit or floating-point execution unit). A typical out-of-order processor contains several different types of execution units. Each execution unit may generally execute only certain types of instruction. An instruction does not become ready before it has an appropriate execution unit available to it.
Availability of data resources (i.e., operands) also affects when an instruction becomes ready for execution. Some instructions operate on one or more operands. Those instructions will not be ready for execution until their operands become available. Put another way, those instructions typically have a dependency on at least one earlier instruction and cannot execute until the earlier instruction (or instructions) has executed. For example, suppose a first instruction calculates or stores results that a second instruction uses as operands during its execution. In such a scenario, the second instruction cannot begin to execute until the first instruction has executed. Thus, the dependency of the second instruction on the results of the first instruction gives rise to a data dependency of the second instruction on the first instruction.
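The two readiness conditions described above (available operands and an available execution unit of the appropriate type) can be illustrated with a minimal sketch. The names `Instruction`, `is_ready`, `completed`, and `available_units`, as well as the unit labels "int" and "fp", are hypothetical and chosen only for illustration; they do not describe the logic of any particular processor.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    name: str
    unit_type: str                                   # hypothetical label, e.g. "int" or "fp"
    depends_on: list = field(default_factory=list)   # names of producer instructions


def is_ready(instr, completed, available_units):
    """Return True when every producer has executed and a matching unit is free."""
    operands_ready = all(p in completed for p in instr.depends_on)
    unit_free = available_units.get(instr.unit_type, 0) > 0
    return operands_ready and unit_free


# The add/store pair from the earlier example: the store consumes the add's result.
add = Instruction("add", "int")
store = Instruction("store", "int", depends_on=["add"])
```

In this sketch the store is not ready while `completed` is empty, and becomes ready once "add" appears in `completed` and at least one "int" unit is free.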
A processor that implements an out-of-order instruction execution pipeline generates out-of-order result data because it executes the instructions in the instruction stream out-of-order. Although out-of-order execution of instructions provides out-of-order processors with their performance edge over in-order processors, an out-of-order processor must reorder the results of the instructions in the same order as the original program code. Thus, an out-of-order processor may implement a reorder register file or buffer to impose the original program order on the result data after it has executed the instructions out-of-order.
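The reordering step can be sketched as follows, under the simplifying assumption that an instruction may retire only when it and every earlier instruction in program order have finished executing. The function name `retire_in_order` and the instruction labels are hypothetical.

```python
def retire_in_order(program_order, completion_order):
    """Return a list of (completed_instruction, retired_batch) pairs.

    Instructions complete in `completion_order`, but each batch of retired
    instructions always follows `program_order`.
    """
    done = set()
    next_idx = 0          # index of the oldest instruction not yet retired
    log = []
    for instr in completion_order:
        done.add(instr)
        batch = []
        # Retire every instruction at the head of program order that has completed.
        while next_idx < len(program_order) and program_order[next_idx] in done:
            batch.append(program_order[next_idx])
            next_idx += 1
        log.append((instr, batch))
    return log


log = retire_in_order(["I0", "I1", "I2"], ["I2", "I0", "I1"])
```

Here I2 completes first but retires last: nothing retires when I2 completes, I0 retires when it completes, and I1 and I2 retire together once I1 completes.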
Turning now to the drawings, FIG. 1 illustrates a portion of a traditional out-of-order processor 10. A memory and memory interface unit 12 stores the program code and data and provides the processor 10 with access to the memory. The memory and memory interface unit 12 may also include one or more high-speed data and code cache memories (not shown). The cache memories typically contain the contents of recently accessed locations of the memory. Rather than accessing the main memory repeatedly, the processor 10 can retrieve the contents of recently accessed locations from the cache memory. Because of their relatively high speed of operation, the cache memories reduce the delays associated with accessing the slower system main memory (also contained in the memory and interface unit 12).
An instruction fetch unit 14 includes a fetcher (not shown) that retrieves a sequence of instructions from the memory and memory interface unit 12. The instruction fetch unit 14 sends the retrieved instruction sequence to one or more instruction decode and rename units 16. With the aid of microcode read-only memories (not shown), instruction decoders in the instruction decode and rename unit 16 translate the complex instructions, such as macro-operations (macro-ops), into simpler, hardware-executable micro-operations (micro-ops).
To achieve high efficiency and throughput, the out-of-order processor 10 should ordinarily have the capability of executing the program instructions in any order that keeps the execution unit (or units) 20 continuously busy. For example, executing a second instruction in the original program sequence before a first instruction may enable the processor 10 to avoid a period of inactivity for one of the execution units 20. Instruction dependencies, however, may make the results artificially dependent on the execution order and may interfere with the use of out-of-order execution as a means of improving the efficiency of the processor 10.
To avoid those undesirable dependencies and the resulting loss in processor performance, the processor 10 renames logical destinations to physical destinations as part of reordering the micro-ops for out-of-order execution. To achieve that end, the decoders send a sequence of micro-ops to a renamer (not shown) that resides in the instruction decode and rename unit 16.
The renaming process avoids artificial dependencies created by write-after-write and write-after-read hazards. To understand an example of a write-after-write dependency, assume that two instructions in the instruction stream both write to a register named EAX. Executing the first instruction results in a first value for writing to register EAX. Without register renaming, execution of the second instruction will result in the overwriting of the first value generated by the first instruction, making that value unavailable to any later instructions that require it.
Register renaming allows the removal of the dependency between the first instruction and the second instruction by changing the logical destination EAX of the two instructions to two different physical registers in the processor. As long as the processor 10 has a sufficiently large number of physical registers available, it can execute the two instructions either in-order or out-of-order because the instructions will write their results to two different physical registers. The renamer (not shown), residing in the instruction decode and rename unit 16 of the processor 10, keeps track of this operation.
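The write-after-write example above can be sketched in a few lines. This is an illustrative simplification, not the renamer of processor 10: the function `rename`, the physical register names `P0`, `P1`, ..., and the free-list policy are assumptions, and the sketch never returns physical registers to the free list.

```python
def rename(instructions, num_physical):
    """Rename logical destinations to fresh physical registers.

    `instructions` is a list of (dest, sources) tuples using logical names.
    Returns a list of (physical_dest, physical_sources) tuples.
    """
    mapping = {}                                   # logical name -> current physical name
    free = [f"P{i}" for i in range(num_physical)]  # hypothetical free list
    renamed = []
    for dest, sources in instructions:
        # Sources read the most recent mapping; unmapped names are left unchanged.
        phys_sources = [mapping.get(s, s) for s in sources]
        if not free:
            raise RuntimeError("out of physical registers")
        phys_dest = free.pop(0)
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed


program = [
    ("EAX", ["EBX"]),   # first write to EAX
    ("ECX", ["EAX"]),   # reads the first EAX value
    ("EAX", ["EDX"]),   # second write to EAX
]
renamed = rename(program, num_physical=4)
```

After renaming, the two writes to EAX target P0 and P2 respectively, and the intervening read refers to P0, so the second write can no longer overwrite a value that a later instruction still needs.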
The renamer reassigns additional physical registers (not shown) to replace the destination registers designated in the micro-ops (i.e., the registers designated for storing the results of various operations). The renamer may also record in a dependency table (not shown) data on the dependencies between the micro-ops of the instruction sequence and on the reassignment of additional physical registers. The renamer may send the micro-ops and their renamed registers to both a reorder buffer (not shown) and to a scheduler 18.
The scheduler 18 has the responsibility of scheduling instructions for execution within the execution units 20. The instructions received from the instruction decode and rename unit 16 form a pool of instructions that the scheduler 18 may assign for execution to one or more execution units 20. To achieve higher efficiency and throughput, the scheduler 18 in an out-of-order processor may assign instructions for execution in a different order than the original order of the instruction sequence fetched by the instruction fetch unit 14 from the memory and memory interface unit 12. To ensure data integrity and to avoid processor failure, the scheduler 18 does not ordinarily assign dependent (i.e., consumer) instructions for execution before the instructions on which they depend (i.e., the producer instructions).
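The scheduling behavior described above can be approximated by a toy model. It assumes, purely for illustration, that every instruction has a latency of one clock cycle and that all execution units are interchangeable; the function `schedule` and its parameters are hypothetical.

```python
def schedule(pool, deps, units=1):
    """Assign an issue cycle to each instruction in the pool.

    `pool` lists instruction names in program order. `deps` maps a consumer
    name to the set of its producers. A consumer may issue only in a cycle
    after all of its producers have issued (latency of one cycle assumed).
    """
    issued_at = {}
    remaining = list(pool)
    cycle = 0
    while remaining:
        ready = [
            i for i in remaining
            if all(p in issued_at and issued_at[p] < cycle for p in deps.get(i, ()))
        ]
        if not ready:
            raise ValueError("unsatisfiable dependency in pool")
        for instr in ready[:units]:      # oldest ready instructions first
            issued_at[instr] = cycle
            remaining.remove(instr)
        cycle += 1
    return issued_at


# "add" consumes the result of "load"; "mul" is independent of both.
result = schedule(["load", "add", "mul"], {"add": {"load"}}, units=2)
```

With two units, "mul" issues in cycle 0 alongside "load", ahead of the earlier "add", which waits until cycle 1 for its producer.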
The scheduler 18 may also consult the data dependency table (not shown) residing in the instruction decode and rename unit 16 to determine the instruction dependency information and assignment information on logical and additional physical registers. Based on the consultation, the scheduler 18 may update the dependency and assignment information residing in the dependency table (not shown).
Typically, the scheduler 18 keeps the various items of information pertaining to the pool of instructions in a table, where each entry in the table corresponds to one of the instructions in the pool of instructions. FIG. 4A shows an example of such a table for a pool of N instructions. The table entries typically include the ordinary latency (i.e., the number of clock cycles required for the execution of each instruction) of the instructions.
Turning back to FIG. 1, a reorder and retirement unit 22 receives executed instructions from the execution units 20 and reorders the instructions in the same order as the instruction order of the original program code. The reorder and retirement unit 22 and the execution units 20 send the results of the execution of the instructions to the register file unit 24, as appropriate. The register file unit 24 sends the results of the executed instructions to the memory and memory interface unit 12 for storage in the system main memory, as necessary. Those skilled in the art will appreciate that the above operations are well understood in the prior art.
Out-of-order processors often require a minimum number of clock cycles between the execution of certain types of instruction. One reason for this requirement lies in the latency of the instructions (i.e., the time it takes for a given instruction to execute). FIG. 2A illustrates the execution timeline for two consecutive instructions A and B. Instruction A in FIG. 2A is a producer instruction with a latency of two clock cycles. Instruction B is a consumer instruction that uses the results of the execution of instruction A.
To ensure proper operation, the processor must guarantee a minimum of two clock cycles (the latency of the producer instruction) between the start of execution of instruction A and the start of execution of instruction B. If instruction B begins execution before the expiration of at least two clock cycles after instruction A starts to execute, processor failure will result. FIG. 2A shows a situation where instruction B follows instruction A before the expiration of two clock cycles, thus causing incorrect results and processor failure.
Processors typically use some means for tracking the latency of the producers to ensure that no consumer instruction executes before the expiration of a given number of clock cycles after the execution of the corresponding producer instructions ends. Referring back to FIG. 1, the processor 10 may use a latency counter (not shown) for each instruction in the scheduler 18. When it issues producer instruction A (with a latency of two clock cycles) for execution, the scheduler 18 loads the latency counter of the consumer instruction with the latency of instruction A. The latency counter counts down as instruction A executes. The countdown expires when instruction A finishes execution. The scheduler 18 issues instruction B, the consumer instruction, only upon the expiration of the countdown. In this manner, the processor ensures a number of clock cycles equal to at least the latency of the producer instruction between the execution of the producer and consumer instructions. FIG. 2B illustrates a situation where the processor inserts a proper number of clock cycles between the executions of instructions A and B, thus ensuring proper operation of the pipeline.
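The countdown mechanism described above can be sketched as follows. The class `LatencyCounter` and the helper `cycles_until_consumer_issue` are hypothetical names used only for illustration; the sketch models one counter per consumer and one decrement per clock cycle.

```python
class LatencyCounter:
    """Per-consumer countdown loaded with the producer's latency."""

    def __init__(self):
        self.remaining = 0

    def load(self, latency):
        self.remaining = latency

    def tick(self):
        # One clock cycle elapses.
        if self.remaining > 0:
            self.remaining -= 1

    @property
    def expired(self):
        return self.remaining == 0


def cycles_until_consumer_issue(producer_latency):
    """Count the clock cycles between issuing the producer and issuing the consumer."""
    counter = LatencyCounter()
    counter.load(producer_latency)    # loaded when the producer issues
    cycles = 0
    while not counter.expired:
        counter.tick()
        cycles += 1
    return cycles
```

For producer instruction A with a latency of two clock cycles, `cycles_until_consumer_issue(2)` returns 2, matching the separation between instructions A and B that the text requires.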
Situations arising during the typical operation of the processor often require the processor to insert a larger number of clock cycles between producer and consumer instructions than the minimum necessary for proper execution of the instructions. For example, hazards and exceptions make desirable the ability of the processor to selectively and dynamically insert additional clock cycles between the execution of two instructions. A hazard exists when succeeding instructions reference the same storage location because doing so raises the possibility of incorrect operation. For example, a read-after-write (i.e., a read instruction following a write instruction, where both instructions reference the same storage location) operation gives rise to a hazard. To guarantee data integrity and proper operation, the processor must ensure that the consumer read instruction reads the value at the referenced location only after the producer write instruction has written the value to that location. Moreover, processor architecture limitations sometimes make necessary the insertion of additional clock cycles between the two instructions involved.
Exceptions typically arise during floating-point operations. Floating-point producer instructions sometimes produce results that following consumer instructions cannot properly use as operands. For example, a producer instruction that divides a finite number by zero, or divides zero by zero, causes an exception and the result of the instruction cannot serve as a proper operand to a consumer instruction. As those skilled in the art know, exceptions arise from anomalous floating-point operation results known as denormals, not-a-number values, and infinities. A floating-point instruction that produces any one of those results causes an exception.
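As a software-level illustration of the anomalous result classes mentioned above, the following sketch classifies a double-precision value. The function `classify` is hypothetical, and the sketch assumes IEEE 754 double-precision floats, as used by Python's `float` on common platforms. Note that Python itself raises `ZeroDivisionError` on `1.0 / 0.0` rather than returning an infinity, so the special values are constructed directly.

```python
import math
import sys


def classify(x):
    """Classify a float as not-a-number, infinity, denormal, or normal."""
    if math.isnan(x):
        return "not-a-number"
    if math.isinf(x):
        return "infinity"
    # sys.float_info.min is the smallest positive normalized double;
    # nonzero magnitudes below it are denormal (subnormal) values.
    if x != 0.0 and abs(x) < sys.float_info.min:
        return "denormal"
    return "normal"
```

A hardware floating-point unit would detect these cases in its own logic; the sketch only shows which results fall into each class.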
When a floating-point instruction causes an exception, the processor requires additional time to handle the exception (e.g., by executing an exception-handling routine). If the floating-point instruction that causes the exception is a producer instruction, the consumer instructions should not use the results of that instruction because doing so will cause the consumer instructions to generate incorrect results, leading to processor failure. Instead, the consumer instructions should wait for the processor to perform the proper exception handling.
To illustrate with an example, suppose that a consumer store instruction follows a producer floating-point instruction, and that both instructions have a latency of two clock cycles. The store instruction simply stores the results of the floating-point instruction at a specified storage location. If the store instruction begins to execute 2 clock cycles after the floating-point instruction starts execution, a processor failure will result if the floating-point instruction causes an exception. Because only 2 clock cycles separate the floating-point instruction from the store instruction, the store instruction will start to execute as soon as the floating-point instruction finishes execution. As a result, the processor will lack sufficient time to handle the exception. FIG. 2C illustrates such a situation.
The processor, however, can avoid functional failure if it has the capability of dynamically inserting additional clock cycles for exception handling between the floating-point instruction and the store instruction. In the example above, suppose further that the processor needs 2 additional clock cycles to properly handle an exception. If the scheduler schedules the store instruction for execution 4 clock cycles after the floating-point instruction begins to execute, the processor will have sufficient time for exception handling. Once it starts to execute, the floating-point instruction will take 2 clock cycles (i.e., its normal latency) to execute. If the floating-point instruction causes an exception, the processor will have available 2 additional clock cycles for exception handling before the store instruction starts to execute. In other words, by inserting 4, instead of 2, clock cycles between the two instructions, the processor can ensure that it has sufficient time for exception handling. FIG. 2D illustrates such a situation.
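The arithmetic of this example can be checked with a short sketch. The function names `earliest_consumer_start` and `can_handle_exception` and their parameters are hypothetical; cycle numbers are counted from the cycle in which the producer starts to execute.

```python
def earliest_consumer_start(producer_start, latency, extra=0):
    """Cycle at which the consumer may start, with optional extra separation."""
    return producer_start + latency + extra


def can_handle_exception(producer_start, latency, consumer_start, handling_cycles):
    """True if enough cycles remain after the producer finishes to handle an exception."""
    slack = consumer_start - (producer_start + latency)
    return slack >= handling_cycles


# Floating-point producer with latency 2; exception handling needs 2 more cycles.
tight = earliest_consumer_start(0, 2)             # separation of 2 cycles
padded = earliest_consumer_start(0, 2, extra=2)   # separation of 4 cycles
```

With a separation of 2 cycles the store starts as soon as the floating-point instruction finishes, leaving no slack; with a separation of 4 cycles the slack equals the 2 cycles needed for exception handling.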
The above discussion makes clear a need for dynamically and selectively adding latency between producer and consumer instructions for exception handling, hazard handling, or both. As a simple brute-force solution, the processor could add additional clock cycles to all producer instructions. By adding the additional clock cycles, however, the processor would in effect artificially increase the latency of all producer instructions by the number of clock cycles that it needs for proper exception or hazard handling.
As a variant of the simple brute-force technique, the processor could artificially increase the latency of only particular producer instructions. For example, the processor could impose an additional number of clock cycles between a floating-point operation and a store operation. Unfortunately, both alternatives have the disadvantage of penalizing the affected instructions and therefore decreasing the overall efficiency and throughput of the processor.
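The cost of the two brute-force alternatives can be compared with a rough sketch. The function `added_cycles`, the policy labels, and the sample workload are all hypothetical. The "on_exception" policy is an idealized baseline that adds cycles only where an exception actually occurs; it is included for comparison and is not a description of any particular mechanism.

```python
def added_cycles(producers, extra, policy):
    """Total extra cycles added under a given padding policy.

    `producers` is a list of (kind, raises_exception) tuples, where kind is
    a hypothetical label such as "int" or "fp".
    """
    if policy == "all":
        # Brute force: pad every producer instruction.
        return extra * len(producers)
    if policy == "fp_only":
        # Variant: pad only floating-point producers.
        return extra * sum(1 for kind, _ in producers if kind == "fp")
    if policy == "on_exception":
        # Idealized baseline: pad only producers that actually raise an exception.
        return extra * sum(1 for _, raises in producers if raises)
    raise ValueError(f"unknown policy: {policy}")


workload = [("int", False), ("fp", False), ("fp", True), ("int", False)]
```

For this sample workload with 2 extra cycles per padded instruction, padding every producer adds 8 cycles, padding only floating-point producers adds 4, and the idealized baseline adds 2, which illustrates why both brute-force alternatives penalize instructions that never needed the additional time.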
The present invention is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth in the above discussion.