1. Field of the Invention
The method and apparatus of the present invention relate to dynamic floating point register management techniques in a superscalar computer system. More particularly, the method and apparatus of the present invention relate to reducing processing stalls due to hardware limitations.
2. Art Background
The quest for speed and efficiency of processing in computing systems has led to two architectures that allow faster processing speed by implementing simultaneous execution of multiple instructions: pipelined architectures and superscalar architectures.
When first introduced, microprocessors typically included a central processing unit (CPU) that executed only one instruction at a time. Pipelined architectures were developed based on the observation that any given instruction consists of a given number of execution steps. By breaking the execution of instructions into a number of stages, each stage corresponding to an execution step, instruction throughput is increased because a new instruction is allowed to begin execution before a previous instruction has completed. Therefore, throughput becomes a function of how long it takes to complete a single execution step. Optimally, each stage completes execution in one clock period of the system clock.
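The throughput argument above can be illustrated with a short calculation. This is a minimal sketch, assuming an idealized pipeline in which every stage takes exactly one clock period and no stalls occur; the function names are hypothetical and are not taken from the specification.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    """Clock periods needed when each instruction finishes all of its
    execution steps before the next instruction may begin."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    """Clock periods needed when a new instruction enters the pipeline
    every clock period (idealized: one period per stage, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages periods; each later
    # instruction completes one period after its predecessor.
    return num_stages + (num_instructions - 1)


print(cycles_unpipelined(100, 4))  # 400
print(cycles_pipelined(100, 4))    # 103
```

Under these assumptions the pipelined machine approaches one completed instruction per clock period as the instruction count grows, which is why throughput becomes a function of the time taken by a single execution step.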
Practitioners in the art have found that even greater speed and efficiency can be achieved through the use of superpipelining. In superpipelining, each pipeline is broken down into smaller stages than previously allowed for in a typical pipelining implementation. This type of architecture is also known as extended pipelining. Having a greater number of pipeline stages allows for a corresponding increase in the number of instructions that may be processed in the pipeline at any given time.
By contrast, superscalar architectures provide for multiple instruction launch into multiple instruction execution units. Thus, unlike the earliest processors, superscalar processors allow for parallel instruction execution in two or more instruction execution units. The number of instructions that may be processed is increased with the number of parallel instruction execution units that are implemented in the architecture. Furthermore, it is typical that superscalar architectures employ some form of pipelining to maximize instruction throughput.
Superscalar and superpipelining techniques may be implemented in the same architecture. This combination of superscalarity and superpipelining results in an additional increase in instruction parallelism and instruction throughput.
The design of floating point processing units (FPU) in computer systems having high instruction parallelism is relatively straightforward. The desire for high throughput and low latency is satisfied through careful design of the FPU and the critical data paths. However, there are a number of limitations that occur due to the complexity introduced by the combination of superscalar and superpipelined techniques within a floating point architecture.
First, the relative complexity of different floating point operations results in pipelines of different lengths. Because multiple instruction launch allows instructions that are ordered sequentially in the program order to be executed simultaneously, there is the chance that an instruction of low complexity that occurs later in the program order will complete before a more complex instruction that is positioned earlier in the program order completes. This is called "out-of-order completion". If the instruction that has completed out of order is also allowed to alter the state of the machine before the more complex instruction completes, "out-of-order retirement" occurs. Out-of-order retirement is contrary to maintaining the appearance of sequential operation. In order to reflect the sequential ordering of instructions in the program order, when an instruction is referred to herein, it will be related to its position in the program order through the use of two terms. An "older instruction" is an instruction that is positioned earlier in the program order than the current instruction. Similarly, a "younger instruction" is an instruction that is positioned later in the program order than the current instruction. In order to further clarify the sequential nature of the program order, the program order will be referred to as the "instruction sequence".
The problem of differing pipeline lengths leads to the question of how to handle instructions that generate exceptions. Generally, floating point exceptions may be handled in either a precise or imprecise manner. Precise exception handling allows the programmer to know exactly where the error has occurred, and further allows the programmer to correct the error and continue processing without having to abort the program because the appearance of sequential execution of instructions is preserved. Imprecise exception handling, on the other hand, provides only a minimum of information to the programmer, none of which is guaranteed to be correct. This may require aborting the program. Thus, in some applications, precise exception handling is preferred.
Exception handling in the context of uneven pipeline lengths may be explored in reference to the following instruction sequence:
(1) fpdiv f.sub.2, f.sub.1, f.sub.7
(2) fpstore f.sub.7
(3) fpadd f.sub.0, f.sub.1, f.sub.2
(4) fpadd f.sub.3, f.sub.2, f.sub.4
(5) fpadd f.sub.2, f.sub.5, f.sub.6
Line 1 represents a floating point divide operation having f.sub.2 and f.sub.1 as operands and f.sub.7 as the destination of the result. The next instruction in the sequence is a floating point store operation where the contents of f.sub.7 will be stored to main memory. Typically, the execution of a floating point store operation is deferred until a time when there are no outstanding older instructions. Alternatively, the floating point store operation may be executed such that the contents of the referenced register are read, but the act of storing the result to memory is deferred until there are no older outstanding instructions. Thus, realistically, in a superscalar processor the floating point divide instruction of line 1 and the floating point add of line 3 are likely to be executed simultaneously. It should be noted that the divide operation is much more complex than the add operation and requires more clock periods to complete. Therefore a longer execution pipeline is required for the divide operation. For purposes of illustration, the divide pipeline has four stages and the add pipeline is a single stage pipeline.
In an architecture that does not address the different pipeline lengths, the add operations of lines 3, 4 and 5 will complete and change the state of the machine before the divide operation is completed. If an exception is generated in the divide operation after the third instruction has altered the state of f.sub.2, the program may have to be aborted because the state of the machine prior to the exception is no longer known. Thus, uneven pipeline length can result in imprecise exception handling.
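The hazard described above can be sketched in a few lines. The stage counts (four for the divide, one for the add) come from the illustration; the launch cycles, the class name `Instr`, and the helper functions are assumptions made only for this example, and registers are written `f2` for f.sub.2. The deferred fpstore of line 2 is omitted.

```python
from dataclasses import dataclass

LATENCY = {"fpdiv": 4, "fpadd": 1}  # stage counts from the illustration


@dataclass
class Instr:
    line: int        # position in the instruction sequence
    op: str
    sources: tuple   # operand registers
    dest: str        # destination register
    launch: int      # hypothetical launch cycle (an assumption)

    @property
    def done(self) -> int:
        """Cycle at which the result is written with no ordering control."""
        return self.launch + LATENCY[self.op]


program = [
    Instr(1, "fpdiv", ("f2", "f1"), "f7", launch=0),
    Instr(3, "fpadd", ("f0", "f1"), "f2", launch=0),
    Instr(4, "fpadd", ("f3", "f2"), "f4", launch=1),
    Instr(5, "fpadd", ("f2", "f5"), "f6", launch=2),
]


def retired_before(program, line):
    """Lines of younger instructions that alter state before `line` completes."""
    older = next(i for i in program if i.line == line)
    return [i.line for i in program if i.line > line and i.done < older.done]


def clobbered_sources(program, line):
    """Operands of `line` overwritten by younger instructions that finish first."""
    older = next(i for i in program if i.line == line)
    early = [i for i in program if i.line > line and i.done < older.done]
    return {i.dest for i in early if i.dest in older.sources}


print(retired_before(program, 1))     # [3, 4, 5]
print(clobbered_sources(program, 1))  # {'f2'}
```

With these assumed launch cycles, all three adds change machine state before the divide finishes, and the add of line 3 overwrites an operand of the divide, so the pre-exception state cannot be recovered if the divide faults.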
One prior art method that provides for precise exception handling is known as even-length pipelining. In order to prevent out-of-order retirement of instructions, delay elements, each equal to one pipeline stage, are added to the shorter pipelines so that the total length of every pipeline equals that of the longest pipeline in the system. After completion of an instruction, a temporary result is propagated through each delay stage. If an older instruction generates an exception, the temporary result is discarded and the state of the machine is unchanged. If the older instruction completes successfully, the younger instruction is allowed to retire and the now final result is written to the register file.
However, even-length pipelining artificially increases the latency in the system, resulting in a decrease of instruction throughput. For example, the instructions of lines 4 and 5 must wait until the fpdiv instruction of line 1 has completed before the result from instruction line 3 can be used. Thus, instructions that could be in the pipeline must wait outside the pipeline until the artificial latency introduced by the even length pipelining is resolved. It can therefore be seen that, although preventing out-of-order retirement of instructions is important to providing precise exception handling, out-of-order completion is also desirable in order to increase instruction throughput.
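Even-length pipelining and the latency it introduces can be sketched as follows. This is a simplified model, assuming the four-stage divide and single-stage add of the illustration, one clock period per stage, and no result forwarding; all function names are hypothetical.

```python
LATENCY = {"fpdiv": 4, "fpadd": 1}   # stage counts from the illustration
PIPE_LENGTH = max(LATENCY.values())  # every pipeline is padded to this length


def delay_stages(op: str) -> int:
    """Number of delay elements appended to the pipeline for `op`."""
    return PIPE_LENGTH - LATENCY[op]


def retire_cycle(op: str, launch: int) -> int:
    """Cycle at which the final result is written to the register file."""
    return launch + LATENCY[op] + delay_stages(op)


def committed_writes(in_flight, faulting_line=None):
    """Register writes that survive, given (line, dest, value) temporaries.

    Temporary results of the faulting instruction and of every younger
    instruction are discarded, so machine state stays precise.
    """
    writes = {}
    for line, dest, value in sorted(in_flight):
        if faulting_line is not None and line >= faulting_line:
            break
        writes[dest] = value
    return writes


print(delay_stages("fpadd"))     # 3
print(retire_cycle("fpadd", 0))  # 4
print(retire_cycle("fpdiv", 0))  # 4
```

In this model the add of line 3 produces its value after one clock period but does not retire until cycle 4, so a dependent instruction such as line 4 cannot read f2 from the register file any earlier. That gap is the artificial latency described above.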
Another known prior art method allows for both precise exception handling and the pipelining of pending instructions. It should be noted that in the first prior art method the result of instruction 3 will be known several stages before it is written to the register file. The alternate method, a superscalar superpipelined design, capitalizes on this fact by providing a selector circuit at the beginning of each execution pipeline. The result of every stage of every execution pipeline is latched in a temporary result register and fed into the selection logic. The selection logic is then used to select between the output ports of the register file and the result of every execution stage of every pipeline. This permits an instruction that depends upon the result of an older instruction to be launched into an execution pipeline as soon as the temporary result of the older instruction is available. Therefore, launching of the younger instruction does not need to be delayed until the final result of the older instruction has been written to the register file, resulting in increased throughput.
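The behavior of such a selector can be sketched in software. This is only an illustrative model, assuming the temporary result registers are presented as a list ordered from oldest to youngest; the function name `select_operand` and the data layout are hypothetical and do not describe any particular circuit.

```python
def select_operand(reg, register_file, temp_results):
    """Return the value an instruction should read for register `reg`.

    temp_results lists (dest, value) pairs held in temporary result
    registers, ordered oldest to youngest. The youngest matching
    temporary result wins; otherwise the register file output is used.
    """
    for dest, value in reversed(temp_results):
        if dest == reg:
            return value
    return register_file[reg]


register_file = {"f2": 1.0, "f3": 5.0}
temp_results = [("f2", 9.0)]  # result of line 3, not yet retired

print(select_operand("f2", register_file, temp_results))  # 9.0
print(select_operand("f3", register_file, temp_results))  # 5.0
```

Here the dependent add of line 4 would receive the new value of f2 directly from the temporary result register rather than waiting for it to be written to the register file.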
There are several problems associated with the introduction of temporary result registers into the pipelines of the superscalar processor. As the number of pipelines and the number of stages per pipeline is increased, so too must the number of inputs to the selector circuits at the beginning of each execution pipeline increase. This increase in complexity results in wasted area on an integrated circuit chip. Additionally, because the temporary registers and selection logic are on the critical data path of the pipeline, the pipeline length is increased, resulting in increased latency.
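The growth in selector complexity can be estimated with a simple count. This sketch assumes each operand selector receives one register file output port plus the latched result of every stage of every pipeline, as described above; the function name and the example pipeline depths other than the four-stage divide and single-stage add are hypothetical.

```python
def selector_inputs(stages_per_pipeline, register_file_ports=1):
    """Inputs to one operand selector: the register file port(s) plus
    one temporary result register per stage of every pipeline."""
    return register_file_ports + sum(stages_per_pipeline)


print(selector_inputs([4, 1]))        # 6  (divide and add pipelines)
print(selector_inputs([8, 8, 8, 8]))  # 33 (four hypothetical 8-stage pipelines)
```

Because a selector of this kind is needed for every operand at the head of every pipeline, the total multiplexing hardware grows with both the number of pipelines and their depth.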
Another limitation with respect to superscalar superpipelined floating point architectures is that the instruction parallelism of a system may be underutilized if there is an insufficient number of floating point registers. This is because an artificial dependency is produced when two instructions use the same destination register. For example, in the above instruction sequence, the second instruction must wait until the first instruction and all instructions that depend on the result of the first instruction are resolved before it may overwrite the destination register. In typical prior art systems, such as the SPARC™ architecture of Sun Microsystems, Inc., Mountain View, Calif., the maximum number of available floating point registers is defined in the instruction set architecture. For example, the SPARC™ instruction set provides sixteen double-precision floating point registers which are physically implemented in a register file. Although this problem is somewhat alleviated by the superpipeline structure, which allows for the simultaneous use of registers within the pipeline, the number of user addressable floating point registers remains a practical limitation in implementation.
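An artificial dependency of this kind can be detected by scanning the destination registers of an instruction sequence. The sketch below uses a hypothetical three-instruction sequence, not the sequence given earlier, and the function name is an assumption made for illustration.

```python
def output_dependencies(program):
    """Return (older_line, younger_line) pairs that write the same
    destination register. `program` is a list of (line, dest) tuples
    in program order."""
    last_writer = {}
    pairs = []
    for line, dest in program:
        if dest in last_writer:
            pairs.append((last_writer[dest], line))
        last_writer[dest] = line
    return pairs


# Hypothetical sequence: lines 1 and 3 both write f7.
print(output_dependencies([(1, "f7"), (2, "f4"), (3, "f7")]))  # [(1, 3)]
```

The two instructions in each reported pair share no data, yet the younger one must still wait for the older one, and for every reader of the older result, before it may overwrite the register. With only a small number of user addressable registers, such pairs become more frequent.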