1. Field of the Invention
The present invention relates to a technique for dynamic scheduling of a pipelined processor for parallel execution of multiple processes, and more particularly, to a pipelined processing system which executes multiple processes in parallel and arbitrates access to shared resources on a phase of the system clock so as to eliminate the need for complex resource allocation algorithms.
2. Description of the Prior Art
Since the processing speed, and hence the performance, of data processing systems roughly doubles every year or so, engineers are constantly searching for new ways to improve the processing speeds of their systems in order to remain competitive. A typical way to improve processing speed, and hence to shorten execution time, has been to utilize multiprocessing techniques in which a plurality of processors are operated in parallel. For example, several general purpose processors may be loosely coupled in parallel such that jobs which are to be performed by the respective processors are assigned on a process basis and executed in parallel. An example of a general purpose machine having loosely coupled parallel processors which process data in this manner is the VAX 785. Even faster execution may be obtained by placing several special purpose processors in parallel. Special purpose processors arranged in this manner have improved efficiency because they may be tightly coupled to one another to perform a limited special purpose job such as geometric transformation of input coordinates or polygon rendering in a computer graphics system. Parallel special purpose processors may also be interleaved for pipelined processing applications as described by Hannah in U.S. Pat. No. 4,789,927 issued Dec. 6, 1988.
Another technique for providing faster execution in a data processing system is to speed up the execution on each processor. In other words, a typical way to improve the processing speed is to increase the frequency of the system clock. Increasing the frequency of the system clock improves performance nearly linearly for typical data processing systems by reducing the cycle time. However, data processing systems can only function as rapidly as their hardware and control process permits, and, as a result, there are limits as to how much the frequency of the system clock may be increased.
Yet another technique for improving execution time in a data processing system is pipelining. Pipelining is an implementation technique in which multiple instructions are simultaneously overlapped in execution. The work to be done for an instruction is broken into smaller pieces, and each step in the pipeline completes one of those pieces, so that processing each instruction piece takes only a fraction of the time required to process the entire instruction. Each of these steps is called a pipe stage or pipe segment and is implemented on what is referred to herein as a "pipelined processing circuit". The pipelined processing circuits are connected one to the next to form a pipeline in which the instructions enter at one end, are processed through the respective pipelined processing circuits, and exit at the other end. As known to those skilled in the art, if the pipelined processing circuits process the data at approximately the same speed, the speedup from such pipelining approaches the number of pipe stages. For this reason, pipelining is the key implementation technique used to make fast central processing units (CPUs) and also is the subject of the present invention.
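The speedup observation above can be illustrated with a short sketch. This is a hypothetical idealized model, not taken from any prior art system: with k stages of roughly equal latency, the first instruction occupies the pipeline for k cycles and each subsequent instruction completes one cycle after its predecessor, so the speedup over unpipelined execution approaches k.

```python
def pipelined_cycles(num_instructions, num_stages):
    # The first instruction takes num_stages cycles to drain through the
    # pipe; each later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)

def speedup(num_instructions, num_stages):
    # Unpipelined execution processes one instruction at a time,
    # spending num_stages cycles on each.
    unpipelined = num_instructions * num_stages
    return unpipelined / pipelined_cycles(num_instructions, num_stages)
```

For 1,000 instructions on a five-stage pipeline, the speedup is 5000/1004, just under the stage count of five, matching the observation that the speedup from pipelining approaches the number of pipe stages.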
The throughput of a pipeline is determined by how often an instruction exits the pipeline. Because the pipe stages are connected together, all the stages must be ready to proceed at the same time. The time required to move an instruction a single step down the pipeline is referred to as a machine cycle, where the length of a machine cycle is determined by the time required for the slowest pipe stage because all stages proceed at the same time. The machine cycle is typically one clock cycle, but may be two or more. The clock may also have multiple phases. Moreover, by making the pipe stages approximately equal in length, pipelining yields a reduction in the average execution time per instruction by decreasing the clock cycle time of the pipelined machine. Thus, pipelines can be used both to decrease the clock cycle time and to maintain a low number of clock cycles per instruction (CPI).
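The relationship between stage latency and machine cycle described above can be sketched as follows; the stage names and nanosecond latencies are invented purely for illustration:

```python
# Assumed per-stage latencies in nanoseconds (illustrative values only).
stage_latency_ns = {"fetch": 8, "decode": 6, "execute": 10,
                    "memory": 9, "writeback": 5}

# All stages advance together, so the machine cycle is set by the
# slowest stage, not by the average stage.
cycle_time_ns = max(stage_latency_ns.values())

# An unpipelined machine would spend the sum of all stage latencies on
# each instruction; the pipeline completes one instruction per machine
# cycle in steady state.
unpipelined_ns_per_instr = sum(stage_latency_ns.values())
```

With these assumed numbers, one instruction completes every 10 ns in steady state instead of every 38 ns, and balancing the stages more evenly would shrink the machine cycle further.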
Execution time in a pipelined system thus may be reduced if the pipe stages are made substantially equal in length and the frequency of the system clock is maximized. However, in a pipelined processing system the true measure of the effectiveness of increasing the frequency of the system clock is the number of clock cycles per instruction (CPI) necessary to process a particular instruction. As will be described below, several techniques have been proposed for minimizing the CPI of a pipelined processing system so as to improve the processing efficiency of the system. Such techniques employ different combinations of parallel processing and fast processor clocks. However, such techniques have typically been based on the needs of a general purpose processor and have not taken advantage of the characteristics of special purpose processors.
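The role of CPI in overall execution time can be made concrete with the standard performance equation, in which execution time is the product of instruction count, clock cycles per instruction, and clock cycle time. The figures in the comments are invented for illustration:

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    # Execution time = instructions x cycles-per-instruction
    #                  x seconds-per-cycle (i.e., / clock frequency).
    return instruction_count * cpi / clock_hz

# E.g., a billion instructions at a CPI of 2.0 on a 1 GHz clock take
# twice as long as the same work at a CPI of 1.0.
```

This is why raising the clock frequency alone is an incomplete measure: the gain is realized only if the CPI does not rise in proportion.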
Pipelined architectures introduce several problems which must be overcome by a designer if the processing efficiency of a processing system is to be actually improved by pipelining. For example, there are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining and may be classified as one of three types: structural, data and control hazards. Structural hazards arise from resource conflicts when the underlying hardware cannot support all possible combinations of instructions in simultaneous overlapped execution, while data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards arise from the pipelining of branches and other instructions that change the program counter. Such hazards prevent instructions from executing in their designated clock cycles because they make it necessary to stall the pipeline to eliminate the hazard. Such stalls have a significant adverse effect on performance, for in a pipelined machine there are multiple instructions under execution at once. In other words, a stall in a pipelined machine often requires that some instructions be allowed to proceed while others are delayed. Typically, when an instruction is stalled, all instructions later in the pipeline than the stalled instruction are also stalled. On the other hand, instructions earlier than the stalled instruction can continue, but no new instructions are fetched during the stall. The instructions thus may not complete in the desired order, thereby creating further problems. Accordingly, much design effort in a pipelined processing system is devoted to preventing stalls by preventing such hazards, or at least to dealing with these hazards as they develop.
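The performance cost of stalls described above can be sketched numerically with an idealized one-bubble-per-stall model (an assumption for illustration): every stall cycle delays the stalled instruction and everything behind it, raising the effective CPI above the ideal of one.

```python
def total_cycles(num_instructions, num_stages, stall_cycles):
    # Ideal pipeline time plus one extra cycle for each stall,
    # since a stall delays all instructions behind it by one cycle.
    return num_stages + (num_instructions - 1) + stall_cycles

def effective_cpi(num_instructions, num_stages, stall_cycles):
    return total_cycles(num_instructions, num_stages,
                        stall_cycles) / num_instructions
```

In this model, 300 stall cycles over 1,000 instructions on a five-stage pipeline raise the effective CPI from roughly 1.0 to about 1.3, which is why hazard avoidance receives so much design effort.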
As known to those skilled in the art, when a processing system is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. As noted above, a structural hazard results if some combination of instructions cannot be accommodated due to resource conflicts. The most common instances of structural hazards arise when some functional unit is not fully pipelined. Then a sequence of instructions that all use that functional unit cannot be sequentially initiated in the pipeline. Another common way that structural hazards appear is when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute. When a sequence of instructions encounters a structural hazard, the pipeline will stall one of the instructions until the required unit is available. Unfortunately, removal of all such structural hazards is unrealistic, for to do so would substantially increase the cost of the processing system and increase the latency of the pipeline. Accordingly, the typical approach has been to account for all structural hazards and take the necessary steps to minimize their effects. Such techniques will be described in more detail below.
As noted above, a major effect of pipelining is to change the relative timing of instructions by overlapping their execution. However, by overlapping the execution of instructions in this manner, data and control hazards are introduced. Data hazards occur when the order of access to operands is changed by the pipeline versus the normal order encountered by sequentially executing instructions. For example, if a second instruction of two adjacent pipelined instructions has a source that is the destination of the first instruction, precautions must be taken to ensure that the second instruction does not access the destination of the first instruction before it has been updated. Unless precautions are taken to prevent such data hazards, the second instruction will read the wrong value and try to use it. Such unpredictable behavior is of course unacceptable.
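The read-after-write situation described above can be sketched with an invented two-field instruction encoding, a destination register plus a tuple of source registers; nothing here is taken from any particular instruction set:

```python
def raw_hazard(first, second):
    # first and second are (dest, srcs) pairs. A read-after-write hazard
    # exists when the second instruction reads the first instruction's
    # destination before that destination has been written back.
    dest, _ = first
    _, srcs = second
    return dest in srcs

add_instr = ("r1", ("r2", "r3"))   # r1 <- r2 + r3
sub_instr = ("r4", ("r1", "r5"))   # r4 <- r1 - r5: reads r1 too early
```

A pipeline must either stall the second instruction or forward the result; otherwise the second instruction reads a stale value of r1.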
Generally, a data hazard is created whenever there is a dependence between instructions which are close enough that the overlap caused by pipelining would change the order of access to an operand. For example, in the special case of a rendering processor in a computer graphics system, the vertex data and rendering command can create data hazards. Data hazards also may result when a pair of instructions create a dependence by writing and reading the same memory location. For example, cache misses could cause the memory references to get out of order if the processor were allowed to continue working on later instructions while an earlier instruction that missed the cache was accessing memory. Accordingly, when a cache miss is encountered, the entire pipeline must be stalled, effectively making the instruction that contained the miss run for multiple clock cycles. However, stalls may be partially avoided by rearranging the code sequence to eliminate the hazard causing the stall. Such techniques are called pipeline scheduling or instruction scheduling and have been widely used by those skilled in the art. Such pipeline scheduling typically is quite complex but has been effectively used to keep the CPI on the order of one.
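The instruction-scheduling idea mentioned above, rearranging code so that an independent instruction fills the stall slot between a dependent pair, can be sketched as follows. The one-cycle stall model and the (dest, srcs) instruction encoding are assumptions made purely for illustration:

```python
def adjacent_raw_stalls(seq):
    # Count one-cycle stalls: in this simplified model a stall occurs
    # whenever an instruction reads the destination of the instruction
    # immediately before it.
    return sum(1 for a, b in zip(seq, seq[1:]) if a[0] in b[1])

program = [("r1", ()),        # load r1
           ("r2", ("r1",)),   # uses r1 in the very next slot: stall
           ("r3", ())]        # independent of the pair above

# Scheduling hoists the independent instruction between the dependent
# pair, so the former stall cycle now does useful work.
rescheduled = [program[0], program[2], program[1]]
```

The rescheduled sequence performs the same work with no stall, which is the essence of how pipeline scheduling keeps the CPI on the order of one.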
Control hazards can cause an even greater performance loss for a pipeline than data hazards. For example, when a branch is executed, it may or may not change the program counter to something other than its current value plus the length of an instruction. If an instruction is a taken branch, then the program counter is normally not changed until the end of the memory cycle, after the completion of the address calculation and comparison. This means stalling the pipeline for the instruction decode, the execute and the memory access cycles, at the end of which the new program counter is known and the proper instruction can be fetched. This effect is called a control or branch hazard and can be addressed by a technique known by those skilled in the art as branch prediction. Although branch prediction is quite simple, it cannot by itself reduce the CPI to less than one.
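One common concrete form of the branch prediction mentioned above is a two-bit saturating counter, sketched below. The four-state encoding is a conventional textbook choice, not something taken from this document:

```python
class TwoBitPredictor:
    # States 0-1 predict not taken; states 2-3 predict taken. Two
    # consecutive mispredictions are needed to flip a strong prediction,
    # so a single anomalous branch outcome does not disturb a stable
    # pattern such as a loop-closing branch.
    def __init__(self):
        self.state = 0  # start in "strongly not taken"

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturate at the ends of the 0..3 range.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
```

On a correct prediction the pipeline simply continues; on a misprediction the fetched instructions are squashed and the branch penalty is paid, which is why prediction alone cannot push the CPI below one.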
The above and other problems and features of pipelining have been discussed in detail by Patterson and Hennessy in Chapter 6 of a text entitled Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, Calif., 1990, pp. 250-349, the contents of which are incorporated herein by reference. For example, Patterson et al. teach that data dependencies may be minimized by using software to schedule the instructions to minimize stalls. Such an approach is called static scheduling. On the other hand, a technique known as dynamic scheduling may be used, whereby the hardware rearranges the instruction execution to reduce the stalls. Unfortunately, the advantages which are gained by dynamic scheduling have heretofore come at a significant cost in increased hardware complexity.
For example, scoreboarding is a sophisticated prior art technique for dynamically scheduling around hazards by allowing instructions to execute out of order when there are sufficient resources and no data dependencies. In particular, a scoreboard is used to separate the process of issuing an instruction into two parts, namely, checking the structural hazards and waiting for the absence of a data hazard. Structural hazards can be checked when an instruction is issued; however, if the instructions are to begin execution as soon as their data operands are available, the pipeline will have to perform out of order execution. Scoreboarding makes this possible.
The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle when there are no structural hazards by executing an instruction as early as possible. Thus, when an instruction at the front of an input queue is stalled, other instructions can be issued and executed if they do not depend on any active or stalled instruction. The scoreboard takes full responsibility for instruction issue and execution, including all hazard detection. Taking advantage of out of order execution requires multiple instructions to be in their execution stage simultaneously. This can be achieved with either multiple functional units or with pipelined functional units. Accordingly, a scoreboard acts as a means for resource allocation which checks for hazards and then allocates the instructions accordingly.
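The issue-side checks a scoreboard performs can be caricatured in a few lines. This is a drastic simplification of real scoreboarding, with invented names: it models only the structural hazard on a functional unit and a write conflict on the destination register, omitting operand tracking and result-bus allocation.

```python
class ScoreboardSketch:
    def __init__(self, units):
        self.free_units = set(units)   # functional units not in use
        self.pending_writes = set()    # destination registers in flight

    def can_issue(self, unit, dest):
        # Structural hazard: requested unit is busy.
        # Write conflict: destination already has a pending write.
        return unit in self.free_units and dest not in self.pending_writes

    def issue(self, unit, dest):
        if not self.can_issue(unit, dest):
            return False               # instruction must wait
        self.free_units.discard(unit)
        self.pending_writes.add(dest)
        return True

    def complete(self, unit, dest):
        # Completion frees the unit and retires the pending write.
        self.free_units.add(unit)
        self.pending_writes.discard(dest)
```

Even this toy version hints at the hardware cost: the real scoreboard must additionally track every source operand and allocate the limited data buses, which is the complexity the following paragraphs describe.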
FIG. 1 illustrates a pipelined processing system utilizing a scoreboard. As shown, a plurality of registers 100 are provided which are accessed by respective pipelined processing circuits 102-108 to perform pipelined processing functions on the data stored therein. A scoreboard 110 is provided for receiving every instruction and constructing a picture of the data dependencies of the respective instructions. This picture is then used by scoreboard 110 to determine when an input instruction can read its operands and begin execution. If scoreboard 110 decides the instruction cannot execute immediately, it monitors every change in the hardware and decides when the instruction can execute. Scoreboard 110 also controls when an instruction can write its result into its destination register. Thus, all hazard detection and resolution is centralized in the scoreboard 110.
Scoreboard 110 also controls the instruction progression from one step to the next by communicating with the functional units 102-108. However, since there is only a limited number of source operand and result buses to the registers 100, scoreboard 110 must guarantee that the functional units allowed to proceed do not require more than the number of data buses available. In other words, the data buses are treated by the scoreboard as resources which must be allocated. This added complexity often causes the scoreboard 110 to have about as much logic as one of the functional units and, on average, about four times as many data buses as would be required if the pipeline only executed instructions in order. Such complexity is undesirable, and it is desired that relatively simple techniques be developed for efficient pipelined processing. In particular, an alternative to the use of scoreboards for special purpose pipelined processing systems is desired.
In addition to branch prediction and static and dynamic pipeline scheduling, one other prior art technique for improving execution on a single processor merits discussion here. A technique known as superscalar allows the CPI to be decreased to less than one. Since the CPI cannot be reduced below one if only one instruction is issued every clock cycle, superscalar is a technique whereby multiple instructions are issued in a clock cycle for parallel execution. This allows the instruction execution rate to exceed the clock rate. Typical superscalar pipelined processing systems issue a few instructions in a single clock cycle. However, if the instructions in the instruction stream are dependent or do not meet certain criteria, only the first instruction in the sequence will be issued since a hazard has been detected.
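The dual-issue behavior described above can be sketched with an idealized two-wide issue model, reusing an invented (dest, srcs) instruction encoding; real superscalar issue logic checks far more conditions than this:

```python
def cycles_dual_issue(instrs):
    # Issue up to two instructions per cycle; if the second of a pair
    # reads the first one's destination (a detected hazard), only the
    # first instruction of the pair issues that cycle.
    cycles, i = 0, 0
    while i < len(instrs):
        cycles += 1
        pair_ok = i + 1 < len(instrs) and instrs[i][0] not in instrs[i + 1][1]
        i += 2 if pair_ok else 1
    return cycles

# Eight mutually independent instructions: no hazards anywhere.
independent = [("r%d" % k, ()) for k in range(8)]
```

Eight independent instructions issue in four cycles, a CPI of 0.5, while each dependence in the stream forces a single-issue cycle and pushes the CPI back toward one.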
Superscalar systems are also extremely complicated and require additional hardware. However, the need for additional hardware may be minimized if the instructions use different register sets and different functional units. Moreover, any contentions may be treated as structural hazards and overcome by delaying the issuance of one of the instructions which causes the contention. The contention can also be eliminated by adding several additional bypass paths. However, these solutions make the superscalar system just that much more complicated.
Superscalar pipelines also suffer from other problems which limit their effectiveness. For example, in a superscalar pipeline the result of a load instruction cannot be used on the same clock cycle or on the next clock cycle. As a result, the next three instructions cannot use the load result without stalling. Accordingly, to effectively exploit the parallelism available in a superscalar pipeline, more ambitious compiler-scheduling or hardware implemented scoreboarding techniques, as well as more complex instruction decoding, must be implemented. Such techniques unduly complicate the processing and are generally undesirable except in the most sophisticated general purpose pipelined processing systems.
Accordingly, although branch prediction, pipeline scheduling and superscalar techniques as described above have effectively lowered the CPI in prior art pipelined processing systems, this improvement has come with great costs in hardware complexity. Moreover, such techniques are generally based on the needs of a general purpose machine. Thus, in order to lower the CPI for a special purpose pipelined processing system, such as a computer graphics system, other techniques for reducing the CPI are desired which have the benefits of the systems described above yet are much simpler and hence easier to implement in a special purpose pipelined architecture. The present invention relates to such a technique.