In spite of the numerous improvements in the operating speed of computers, there is always a need for computers that operate at higher effective throughput. Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One class of improvements obtained through parallel processing reduces the delays due to the latency time associated with the computer instructions. For the purposes of this discussion, the latency time is defined as the delay between the initiation of an instruction and the time the instruction is actually executed.
Consider an instruction which references data stored in a specified register. This instruction may require 5 machine cycles to execute. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the contents of the register are fetched. In the fourth cycle, the instruction is actually executed, and in the fifth cycle, data is written back to the appropriate location. If one were to wait until the instruction execution is completed, only one instruction would be executed every 5 machine cycles.
The effects of the latency time are reduced in pipelined processors by initiating the processing of a second instruction before the actual execution of the first instruction is completed. In the above example, 5 instructions would be in various stages of processing at any given time. The processor would include 5 processing units working in parallel, each unit carrying out one of the 5 tasks involved in executing an instruction. While the data for the oldest instruction is being written back to memory or a register, the next to the oldest instruction would be executed by the execution hardware. At the same time, the register contents needed for the instruction to be executed next would be retrieved by the register hardware, and so on.
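The throughput difference described above can be illustrated with a short calculation. The following is only a sketch under idealized assumptions (no stalls, no branches, one machine cycle per stage); the function names are hypothetical and chosen for illustration.

```python
STAGES = 5  # fetch, decode, register read, execute, write back (from the example above)

def cycles_unpipelined(num_instructions, stages=STAGES):
    """Each instruction runs through all stages before the next one starts."""
    return num_instructions * stages

def cycles_pipelined(num_instructions, stages=STAGES):
    """Idealized pipeline: the first instruction needs `stages` cycles to
    complete, and each later instruction completes one cycle after the
    previous one."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)

if __name__ == "__main__":
    n = 100
    print("unpipelined:", cycles_unpipelined(n), "cycles")
    print("pipelined:  ", cycles_pipelined(n), "cycles")
```

For long instruction sequences the pipelined count approaches one instruction per machine cycle, whereas the unpipelined count remains at 5 cycles per instruction.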
In principle, pipelined processors can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Unfortunately, computer programs include branch instructions which interrupt the instruction flow. Consider the instruction sequence:
if x=0 then goto newstart
y=z
goto somewhere
newstart: y=k
The first instruction is translated to a branch instruction based on the contents of x. By the time this instruction reaches the execution unit in the pipeline, several additional instructions will have entered the pipeline. However, the computer has no method to determine whether the instruction following the branch should be "y=z" or "y=k" until it actually executes the branch. Thus it is impossible to determine which instructions should be loaded in the pipeline after the branch instruction. Usually, one of the two branch outcomes is assumed to be the correct branch outcome, and the instructions corresponding to the chosen branch outcome are then loaded into the pipeline.
If the wrong sequence of instructions is loaded, the computer must be stalled for a time sufficient to empty and refill the pipeline. Thus, if the instructions corresponding to "y=z" were loaded after the branch instruction and x=0, then the pipeline must be flushed and the instructions corresponding to "y=k" loaded for execution. This delays the execution of the program by a time determined by the number of stages in the pipeline.
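The cost of loading the wrong instruction sequence can be estimated with a simple model. The sketch below assumes the 5-stage pipeline of the earlier example, assumes that a branch is resolved in the fourth (execute) stage so that three wrongly loaded instructions must be discarded, and assumes a fixed guess for every branch. The function name and parameters are hypothetical.

```python
def cycles_with_flushes(num_instructions, branch_outcomes,
                        predicted_taken=False, stages=5, resolve_stage=4):
    """Estimate total cycles when every branch is guessed the same way.

    branch_outcomes: list of booleans, True if the branch was actually taken.
    Assumption: each wrong guess costs (resolve_stage - 1) cycles, i.e. the
    number of younger instructions that entered the pipeline before the
    branch reached the execute stage.
    """
    if num_instructions == 0:
        return 0
    penalty = resolve_stage - 1
    wrong_guesses = sum(1 for taken in branch_outcomes
                        if taken != predicted_taken)
    ideal = stages + (num_instructions - 1)  # no-stall pipelined cycle count
    return ideal + wrong_guesses * penalty

if __name__ == "__main__":
    # 100 instructions containing three branches, two of which are taken
    # although the pipeline was filled as if they were not taken.
    print(cycles_with_flushes(100, [True, False, True]))
```

Under this model the penalty per wrong guess grows with the depth of the pipeline ahead of the execute stage, consistent with the statement that the delay is determined by the number of stages.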
Prior art systems have attempted to reduce these delays by predicting the outcome of the branch instruction and then loading the instructions corresponding to the predicted outcome. However, there is no prediction scheme which is 100% accurate. Hence, delays are still encountered.
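To illustrate why prediction cannot eliminate these delays, the following sketch measures the accuracy of one well-known prediction scheme, a 2-bit saturating counter. This scheme is chosen here only as an example; the text above does not identify any particular prediction method, and the function name and initial counter state are assumptions.

```python
def prediction_accuracy(outcomes, initial_counter=1):
    """Fraction of branch outcomes correctly predicted by a 2-bit counter.

    Counter values 0 and 1 predict "not taken"; values 2 and 3 predict
    "taken". The counter moves up on a taken branch and down on a
    not-taken branch, saturating at 0 and 3.
    """
    counter = initial_counter
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

if __name__ == "__main__":
    # A loop-closing branch: taken 9 times, then not taken, run twice.
    loop_branch = ([True] * 9 + [False]) * 2
    print("loop branch accuracy:", prediction_accuracy(loop_branch))
    # A branch that alternates between taken and not taken.
    alternating = [True, False] * 10
    print("alternating accuracy:", prediction_accuracy(alternating))
```

Even on the regular loop pattern the predictor guesses wrongly at each loop exit, and on the alternating pattern it guesses wrongly every time, so pipeline flushes still occur.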
Branch instructions also cause memory related latency delays. Most modern computer systems utilize cache sub-systems to improve the effective access time to the computer's main memory. The cache consists of a high speed associative memory which stores the most recently used instructions and data. When the processor requests the contents of a particular memory location, the cache processor intercepts the request and checks the cache memory to determine if the requested information is in the cache. If the requested information is in the cache, it is returned to the processor with minimal delay. If, however, the requested information is only in the main memory, the processor is stalled while the cache retrieves the information. Since main memory is significantly slower than the cache, such cache "misses" introduce significant delays.
Branch instructions often result in cache misses. A branch often causes the computer to continue operation at a memory location that is far from that of the branch instruction. Caches store the most recently used information and information that is close to this information in the main memory. Hence, if the branch is to a distant location, it is unlikely that the next instruction is in the cache.
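The relationship between branch distance and cache misses can be seen in a toy model. The sketch below assumes a small fully associative cache with least-recently-used replacement that loads a whole line of neighboring addresses on each miss; the class name TinyCache, the line size, and the number of lines are hypothetical values chosen for illustration and do not describe any particular cache.

```python
from collections import OrderedDict

class TinyCache:
    """Toy fully associative cache with LRU replacement (illustrative only)."""

    def __init__(self, num_lines=4, line_size=8):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # line number -> True, oldest first

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        line = address // self.line_size
        if line in self.lines:
            self.lines.move_to_end(line)    # mark as most recently used
            return True
        self.lines[line] = True             # load the whole line on a miss
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        return False

if __name__ == "__main__":
    cache = TinyCache()
    print(cache.access(100))   # first access to this line: miss
    print(cache.access(101))   # next sequential address, same line: hit
    print(cache.access(4096))  # distant branch target: miss
```

In this model the instruction that follows sequentially is already in the cache because its line was loaded with its neighbor, whereas the distant branch target is not.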
Another problem encountered with prior art systems for dealing with branch instructions is the inability of these systems to use data generated in processing a first branch instruction to reduce the processing needed for a second branch instruction. A conditional branch instruction may be viewed as comprising three linked instructions. The first instruction computes the target address, i.e., the address of the next instruction to be executed if the branch is taken. The second instruction performs the comparison, the outcome of which determines if the branch is to be taken. The third instruction is the actual transfer of control.
In many cases, a number of branch instructions having the same target address will be present in the program. Hence, in principle, a significant amount of processing time could be saved if the results of the target address calculation from the first instruction could be used in the remaining instructions. Prior art computer architectures do not provide an effective method for accomplishing this; hence, the target address is recomputed for each branch. Similarly, the comparison calculation may determine the outcome of several branches.
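The potential saving can be illustrated by modeling a conditional branch as its three parts (target address computation, comparison, and transfer of control) and counting how often the target address is computed. The following is only a sketch: the representation of a branch as a (base, offset, x) tuple, the "taken when x equals 0" condition, and the function name are assumptions made for illustration, not a description of any actual architecture.

```python
def run_branches(branches, reuse_targets):
    """Process branches and count target address computations.

    branches: list of (base, offset, x) tuples; the branch is taken when x == 0.
    reuse_targets: if True, a target computed for an earlier branch is reused.
    """
    saved_targets = {}
    target_computations = 0
    taken_targets = []
    for base, offset, x in branches:
        key = (base, offset)
        # Part 1: compute (or reuse) the target address.
        if reuse_targets and key in saved_targets:
            target = saved_targets[key]
        else:
            target = base + offset
            target_computations += 1
            saved_targets[key] = target
        # Part 2: perform the comparison that decides the branch outcome.
        taken = (x == 0)
        # Part 3: transfer control (recorded here rather than performed).
        if taken:
            taken_targets.append(target)
    return target_computations, taken_targets

if __name__ == "__main__":
    # Three branches share one target address; a fourth has its own.
    branches = [(1000, 24, 0), (1000, 24, 5), (1000, 24, 0), (2000, 8, 0)]
    print(run_branches(branches, reuse_targets=False))
    print(run_branches(branches, reuse_targets=True))
```

With reuse enabled, the shared target address is computed once instead of three times, while the sequence of control transfers is unchanged.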
Finally, prior art systems only provide a means for executing the computations corresponding to one branch instruction at any given time. One important strategy in reducing the effects of latency times involves moving instructions within the instruction sequence. For example, if the compiler knows that a load operation has a latency delay, the compiler can move other instructions in the instruction sequence so that these instructions are being executed during the latency period. This strategy reduces the effects of the latency delay. Unfortunately, the compiler's ability to fill in these latency delays by performing computations needed for branch instructions is limited by the inability to complete the entire branch computation. For example, prior systems do not provide an effective means for separating the target address computation from the comparison operation to allow the target address to be computed out of order. At most, prior art systems can work on one branch instruction at a time, and if the information for that branch instruction is not available, the branch cannot be computed ahead of time.
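The benefit of moving instructions into a latency period can be shown with a small timing model. The sketch below assumes a processor that issues at most one instruction per cycle, in program order, and that delays an instruction until its source operands are ready; the 3-cycle load latency, the instruction tuples, and the function name are hypothetical values chosen for illustration.

```python
def total_cycles(program):
    """Cycle at which the last result becomes available.

    program: list of (dest, sources, latency) tuples, issued in order.
    An instruction issues no earlier than the cycle after the previous
    issue, and no earlier than the cycle in which all its sources are ready.
    Sources never written by the program are treated as ready at cycle 0.
    """
    ready = {}        # register name -> cycle its value becomes available
    next_issue = 0    # earliest cycle the next instruction may issue
    finish = 0
    for dest, sources, latency in program:
        issue = max([next_issue] + [ready.get(s, 0) for s in sources])
        ready[dest] = issue + latency
        next_issue = issue + 1
        finish = max(finish, ready[dest])
    return finish

load = ("a", [], 3)          # load with an assumed 3-cycle latency
use  = ("b", ["a"], 1)       # depends on the loaded value
ind1 = ("c", ["d", "e"], 1)  # independent of the load
ind2 = ("f", ["g", "h"], 1)  # independent of the load

original  = [load, use, ind1, ind2]  # the dependent instruction waits on the load
reordered = [load, ind1, ind2, use]  # independent work fills the latency period

if __name__ == "__main__":
    print("original: ", total_cycles(original), "cycles")
    print("reordered:", total_cycles(reordered), "cycles")
```

In the reordered sequence the two independent instructions execute while the load is outstanding, so the dependent instruction no longer waits. A branch computation can be moved in this way only if its parts can be issued separately, which is the limitation noted above.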
Broadly, it is the object of the present invention to provide an improved computer architecture and instruction set for executing branch instructions.
It is a further object of the present invention to provide a computer architecture which reduces the delays encountered in prior art systems due to the need to flush the pipeline when an unexpected branch outcome occurs.
It is a still further object of the present invention to provide a computer architecture which reduces memory latency times introduced by the execution of branch instructions.
It is yet another object of the present invention to provide a computer architecture in which the computations inherent in executing a branch instruction can be shared by a number of branch instructions.
It is a further object of the present invention to provide a computer architecture in which a plurality of branch instructions can be in different stages of processing at any given time.
It is a still further object of the present invention to provide a computer architecture in which the order, as well as the timing, of the address computation and condition computation may be changed relative to that implied by the ordering of the instructions in the code.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.