1. Technical Field
The present invention relates generally to dynamic scheduling of instructions in a pipelined processor having multiple execution (functional) units, and more specifically relates to such dynamic scheduling in processors in which the multiple execution units have varying latencies.
2. Description of Related Art
Pipelining is a commonly employed technique for enhancing the performance of a computer architecture. The simplest form of pipelining, known as linear pipelining, functions like an assembly line. In a linear pipeline, a series of processing stages are connected in cascade. Instructions move from one processing stage to the next in sequence, with each stage in the pipeline operating on a different instruction. For example, in a typical linear pipeline, there may be fetch, decode, execute and writeback stages. The writeback stage will be performing the writeback portion of processing an instruction, while the execute stage is executing the next instruction, the decode stage is decoding the next instruction after that, and so forth. Pipelining speeds the execution of a processor, because it allows multiple instructions to be processed (at least in part) at the same time.
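The overlap described above can be made concrete with a short simulation. The following is an illustrative sketch only (it does not appear in the original text); the instruction names and the four stage labels are assumptions drawn from the example in the preceding paragraph.

```python
# Cycle-by-cycle trace of an ideal four-stage linear pipeline.
STAGES = ["fetch", "decode", "execute", "writeback"]

def pipeline_trace(instructions):
    """Return {cycle: {stage: instruction}} for an ideal linear pipeline."""
    trace = {}
    total_cycles = len(instructions) + len(STAGES) - 1
    for cycle in range(total_cycles):
        active = {}
        for i, instr in enumerate(instructions):
            stage_index = cycle - i  # instruction i enters the pipe at cycle i
            if 0 <= stage_index < len(STAGES):
                active[STAGES[stage_index]] = instr
        trace[cycle] = active
    return trace

trace = pipeline_trace(["I1", "I2", "I3"])
# By cycle 2, three instructions are in flight at once:
print(trace[2])  # {'execute': 'I1', 'decode': 'I2', 'fetch': 'I3'}
```

Note that once the pipeline is full, one instruction would leave the writeback stage every cycle, which is the source of the speedup described above.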
Nonlinear pipelined processors provide similar increases in speed, but are more complex. For example, in a nonlinear pipelined processor, there may be multiple execution units. An execution unit (also called a functional unit) is a portion of a processor that performs one or more operations on data. For example, an arithmetic logic unit, which can perform such operations as addition or subtraction, may be an execution unit. Many pipelined processors employ multiple execution units in "execute" stages of the pipeline. In this way multiple instructions can not only be processed in the pipeline at once, but these instructions can also be executed at once.
If one envisions a pipeline as a highway, then an execution stage that employs multiple execution units is like a portion of the highway that widens from a single lane to multiple lanes. Just as cars on a multi-lane highway can travel side-by-side, instructions may travel side-by-side through multiple execution units. After the instructions pass through the execute stage of the pipeline, the instructions will typically reenter a single pipeline in some particular order. This situation is analogous to cars merging into a single lane as the highway narrows.
These multiple execution units may all be the same. For example, there may be five identical arithmetic logic units employed. Typically, however, the execution units are different and perform different types of operations. For example, two of the units may be arithmetic logic units, while another two of the units are floating-point math units. Different units executing different instructions may take differing amounts of time to complete the execution of each of the instructions. For example, a floating-point math unit performing a division operation will typically take longer to complete its operation than will an arithmetic logic unit performing a logical exclusive-or. The time it takes for an execution unit to perform its task is called the "latency" of the execution unit. When an execution unit has performed the task, the execution unit is said to have "finished" the instruction.
Having multiple execution units performing different operations with different latencies, however, allows some instructions to complete their execution before others. Returning to the highway analogy, this situation is like one car passing another. By having multiple execution units, simpler instructions move through the pipeline more quickly than more complicated instructions. This increases the overall throughput of the processor.
The situation is not quite that simple, however. Typically, the instructions in a given computer program are not independent of one another. Although instructions can often be reordered for faster execution, there are limits to the ability of a processor to rearrange instructions. A limit typically arises when an instruction must use the result of a previous instruction. For example, the operations "Z=X+Y" and "X=Z-6" will yield different results if executed in different orders, since the result of whichever is executed first is used to calculate the result of the one that is executed second. This is analogous to a family that is traveling along a highway in two cars, where one of the drivers is following the other. If the following car passes the leading car, the following car will not know where to go and may get lost.
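The dependence between the two operations above can be demonstrated directly. The following sketch is illustrative only; the starting values for X, Y, and Z are assumptions chosen for the demonstration.

```python
# The two dependent operations from the text, evaluated in both orders.
def run_in_order(x, y, z, z_first):
    """Execute Z=X+Y and X=Z-6 in the chosen order; return (x, z)."""
    if z_first:
        z = x + y   # Z = X + Y executes first
        x = z - 6   # X = Z - 6 then reads the new Z
    else:
        x = z - 6   # X = Z - 6 executes first, reading the old Z
        z = x + y   # Z = X + Y then reads the new X
    return x, z

print(run_in_order(10, 5, 0, z_first=True))   # (9, 15)
print(run_in_order(10, 5, 0, z_first=False))  # (-6, -1)
```

Because the two orderings produce different results, a dynamic scheduler must not allow the second instruction to execute before the first has produced its result.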
Just as it is sometimes necessary for one car in a caravan to wait for another, in a pipelined processor it is sometimes necessary to require one instruction to wait until another instruction has completed its processing before being allowed to execute. This is what is known as "dynamic instruction scheduling."
There are many schemes used for dynamic scheduling, such as Tomasulo's Algorithm and scoreboarding, which were developed for historical computer systems, the IBM 360/91 computer and the CDC 6600 computer, respectively. Both of these schemes rely on identifying when results of previous instructions become available. These schemes are described in more detail in Kai Hwang, Advanced Computer Architecture, McGraw-Hill (1993), Ch. 6, pp. 288-91.
Once an instruction has been executed by an execution unit (i.e., once the execution unit has finished the instruction), the results must be collected and placed in the pipeline to be passed on to subsequent stages. Collecting the results and processing them through subsequent stages is called "completing" the instruction. Returning to the highway analogy, the instructions must "merge into a single lane" to keep going through the pipeline. One of ordinary skill will note the dichotomy between "finishing" and "completing." An instruction is finished (or an instruction finishes) when the execution unit has performed the instruction; an instruction is completed (or an instruction completes) when the result of the finished instruction is passed to the next stage in the pipeline.
As anyone who has done much driving on an urban expressway will attest, "completing" an instruction is a non-trivial task. The objective is to maximize throughput of instructions while maintaining a valid ordering of the instructions. Therefore, a need exists for a simple, yet effective means of producing this ordering.
The present invention provides a method, processor architecture, computer program product, and data processing system for determining when an instruction in a pipelined processor should be completed. A "finish pipe" is established with stages numbered 1-N, where N is the longest latency of any execution unit, measured in clock cycles. As each instruction is issued to an execution unit, an entry for that instruction is placed within the finish pipe. The entry may contain such information as the latency of the execution unit to which the instruction has been issued and an identification of that execution unit. Each clock cycle, the entries in the finish pipe are advanced by one stage. When an entry has reached the stage corresponding to the latency of its associated execution unit, it becomes mature.
Each clock cycle, the finish pipe is scanned starting at stage N and working backward toward stage 1 to find the entry having the highest-numbered stage of any entry in the finish pipe. If that entry is mature, the entry is removed from the finish pipe and the instruction associated with that entry is allowed to complete. If the entry is not mature, the entry simply advances along with the other entries and the pipe is rescanned in the next cycle.
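The finish-pipe scheme described above can be sketched in simulation. This is an illustrative sketch only, not the claimed hardware: the field names, the one-completion-per-cycle behavior, and the within-cycle ordering of issue, advance, and scan are assumptions of this sketch.

```python
class FinishPipe:
    """Sketch of the finish pipe: stages 1-N, where N is the longest
    execution-unit latency in clock cycles."""

    def __init__(self, n):
        self.n = n
        self.entries = []

    def issue(self, unit, latency):
        # A new entry enters the pipe; it reaches stage 1 on the next cycle.
        self.entries.append({"unit": unit, "latency": latency, "stage": 0})

    def tick(self):
        """One clock cycle: advance every entry one stage, then scan from
        stage N backward for the entry at the highest-numbered stage.
        Complete that entry only if it is mature (its stage has reached
        the latency of its execution unit)."""
        for e in self.entries:
            e["stage"] = min(e["stage"] + 1, self.n)
        if not self.entries:
            return None
        top = max(self.entries, key=lambda e: e["stage"])
        if top["stage"] >= top["latency"]:   # mature: allow to complete
            self.entries.remove(top)
            return top["unit"]
        return None                          # not mature: rescan next cycle

fp = FinishPipe(4)
fp.issue("FPU", 3)       # issued to a unit with a 3-cycle latency
print(fp.tick())         # None  (stage 1, not yet mature)
print(fp.tick())         # None  (stage 2, not yet mature)
print(fp.tick())         # FPU   (stage 3 reaches the latency; completes)
```

Note how an immature entry at the highest stage holds up completion, as the last paragraph describes: younger, already-mature entries behind it simply wait until the pipe is rescanned in a later cycle.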