In spite of the numerous improvements in the operating speed of computers, there continues to be a need for computers that operate at higher effective throughput. Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One class of improvements obtained through parallel processing reduces the delays due to the latency time associated with the computer instructions. For the purposes of this discussion, the latency time is defined as the delay between the initiation of an instruction and the time the instruction is actually executed.
Consider an instruction which references data stored in a specified register. This instruction may require 5 machine cycles to execute. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the contents of the register are fetched. In the fourth cycle, the instruction is actually executed, and in the fifth cycle, data is written back to the appropriate location. If one were to wait until the instruction execution is completed, only one instruction would be executed every 5 machine cycles.
The effects of the latency time are reduced in pipelined processors by initiating the processing of a second instruction before the actual execution of the first instruction is completed. In the above example, 5 instructions would be in various stages of processing at any given time. The processor would include 5 pipeline stages working in parallel, each stage carrying out one of the 5 tasks involved in executing an instruction. While the data for the oldest instruction is being written back to memory or a register, the next to the oldest instruction would be executed by the execution hardware. The register contents needed for the instruction to be executed next would be simultaneously retrieved by the register hardware, and so on.
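The five-stage example above can be illustrated with a minimal sketch. It assumes an idealized pipeline with no stalls, one instruction entering per machine cycle, and instructions numbered from 0; the stage names and function names are chosen here for illustration only and do not come from any particular processor.

```python
# Idealized model of the 5-stage example: no stalls, no branches.
STAGES = ("fetch", "decode", "register read", "execute", "write back")


def cycles_unpipelined(n_instructions: int, stages: int = len(STAGES)) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * stages


def cycles_pipelined(n_instructions: int, stages: int = len(STAGES)) -> int:
    """The first instruction takes `stages` cycles; each later one finishes one cycle after."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def pipeline_snapshot(cycle: int, n_instructions: int) -> dict:
    """Return which instruction (by index) occupies each stage at a given cycle.

    Cycles are counted from 0. A stage holding no instruction maps to None.
    """
    snapshot = {}
    for stage_index, stage_name in enumerate(STAGES):
        instruction = cycle - stage_index  # instruction i enters fetch at cycle i
        in_flight = 0 <= instruction < n_instructions
        snapshot[stage_name] = instruction if in_flight else None
    return snapshot


if __name__ == "__main__":
    print(cycles_unpipelined(100), cycles_pipelined(100))
    print(pipeline_snapshot(4, 10))
```

Under these assumptions, the snapshot for cycle 4 shows the oldest instruction in the write-back stage while the newest is being fetched, which is the situation described above.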
In principle, pipelined processors can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Unfortunately, computer programs include branch instructions which interrupt the instruction flow. Consider the instruction sequence:
    if x=0 then goto newstart
    y=z
    goto somewhere
    newstart: y=k
The first instruction is translated to a branch instruction based on the contents of x. By the time this instruction reaches the execution unit in the pipeline, several additional instructions will have entered the pipeline. However, the computer has no method to determine whether the instruction following the branch should be "y=z" or "y=k" until it actually executes the branch. Thus, it is impossible to determine which instructions should be loaded in the pipeline after the branch instruction. Prior art systems have attempted to reduce branch delays by predicting the outcome of the branch instruction and then loading the instructions corresponding to the predicted outcome. However, there is no prediction scheme which is 100% accurate. Hence, delays are still encountered.
If the wrong sequence of instructions is loaded, the computer must be stalled for a time sufficient to empty and refill the pipeline. Thus, if the instructions corresponding to "y=z" were loaded after the first branch instruction and the branch is taken, then the pipeline must be flushed and the instructions corresponding to "y=k" loaded for execution. This delays the execution of the program by a time that depends on the number of stages in the pipeline.
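The dependence of this delay on pipeline depth can be made concrete with a rough estimate. The sketch below assumes that the branch is resolved in a given pipeline stage and that every instruction fetched behind it is discarded on a misprediction. The function names and the numbers in the usage lines are illustrative assumptions, not measurements.

```python
def flush_penalty(resolve_stage: int) -> int:
    """Cycles lost on a misprediction.

    Assumes the branch is resolved in stage `resolve_stage` (counting from 1),
    so the instructions occupying the earlier stages must be discarded.
    """
    return resolve_stage - 1


def effective_cpi(branch_fraction: float, mispredict_rate: float, penalty: int) -> float:
    """Average cycles per instruction for an otherwise ideal pipeline.

    One cycle per instruction, plus the flush penalty weighted by how often
    an instruction is a branch and how often that branch is mispredicted.
    """
    return 1.0 + branch_fraction * mispredict_rate * penalty


if __name__ == "__main__":
    # Branch resolved in the execute stage (stage 4 of the 5-stage example).
    penalty = flush_penalty(4)
    print(penalty, effective_cpi(0.2, 0.1, penalty))
```

With the branch resolved in the fourth stage, three cycles are lost per misprediction under this model; a deeper pipeline that resolves branches later loses proportionally more.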
Branch instructions also cause memory related latency delays. Most modern computer systems utilize cache sub-systems to improve the effective access time to the computer's main memory. The cache consists of a high speed associative memory which stores the most recently used instructions and data. When the processor requests the contents of a particular memory location, the cache processor intercepts the request and checks the cache memory to determine if the requested information is in the cache. If the requested information is in the cache, it is returned to the processor with minimal delay. If, however, the requested information is only in the main memory, the processor is stalled while the cache retrieves the information. Since main memory speeds are significantly slower than the cache speed, such cache "misses" introduce significant delays.
Branch instructions often result in cache misses. A branch often causes the computer to continue operation at a memory location that is far from that of the branch instruction. Caches store the most recently used information and information that is close to this information in the main memory. Hence, if the branch is to a distant location that has not been recently visited, it is unlikely that the next instruction is in the cache.
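The effect of a distant branch on the cache can be shown with a small model. The sketch below assumes a fully associative cache with least-recently-used replacement and a fixed line size; the class name, line count, line size, and addresses are all hypothetical values chosen for illustration.

```python
from collections import OrderedDict


class TinyCache:
    """Fully associative cache with least-recently-used replacement (illustrative only)."""

    def __init__(self, num_lines: int = 4, line_size: int = 16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # line tag -> True, ordered oldest to newest

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the line containing `address`."""
        tag = address // self.line_size
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark the line as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = True
        return False


if __name__ == "__main__":
    cache = TinyCache()
    # Sequential fetches within one line, then a branch to a distant address.
    print([cache.access(a) for a in (0, 4, 8, 12, 4096)])
```

In this model the sequential fetches hit after the first access brings in the line, because nearby information is loaded together, while the branch to address 4096 misses because its line has not been visited.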
Another problem encountered with prior art systems for dealing with branch instructions is the inability of these systems to use data generated in processing a first branch instruction to reduce the processing needed for a second branch instruction. A conditional branch instruction may be viewed as comprising three linked instructions. The first instruction computes the target address, i.e., the address of the next instruction to be executed if the branch is taken. The second instruction performs the comparison, the outcome of which determines if the branch is to be taken. The third instruction is the actual transfer of control.
In many cases, a number of branch instructions having the same target address will be present in the program. Hence, in principle, a significant amount of processing time could be saved if the results of the target address calculation from the first instruction could be used in the remaining instructions. Prior art computer architectures do not provide an effective method for accomplishing this; hence, the target address is recomputed for each branch. Similarly, the comparison calculation may determine the outcome of several branches.
Finally, prior art systems only provide a means for executing the computations corresponding to one branch instruction at any given time. One important strategy in reducing the effects of latency times involves moving instructions within the instruction sequence. For example, if the compiler knows that a load operation has a latency delay, the compiler can move other instructions in the instruction sequence so that these instructions are being executed during the latency period. This strategy reduces the effects of the latency delay. Unfortunately, the compiler's ability to fill in these latency delays by performing computations needed for branch instructions is limited by the inability to complete the entire branch computation. For example, prior systems do not provide an effective means for separating the target address computation from the comparison operation to allow the target address to be computed out of order. At most, prior art systems can work on one branch instruction at a time, and if the information for that branch instruction is not available, the branch information cannot be computed ahead of time.
The computer architecture and instruction set taught in U.S. Pat. Ser. No.: 08/058,858 mentioned above provide a significant improvement over the prior art with respect to the above-described problems. The computer system described therein uses a register file, connected to the instruction processor, to facilitate the execution of branch instructions. The register file includes a number of registers. Each register is used to store information needed in executing a branch instruction. Each register includes space for storing a target address of a branch instruction and space for storing a flag having first and second states. The first state indicates that the instruction processor should branch to the instruction specified by the target address when an execute branch instruction referencing the register is executed. The second state indicates that the instruction processor should continue executing instructions in the sequential order when an execute branch instruction referencing the register is executed by the instruction processor. The computer system utilizes a "prepare to branch" instruction to assign a register and load it with a target address. Conditional branch instructions are implemented with the aid of a compare instruction which sets the flag in a register referenced by the instruction if a specified condition is met.
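The interplay of the three instructions described above can be sketched in software. The model below is a simplification under stated assumptions: registers are addressed by a plain index, the flag is assumed to start in the "continue sequentially" state after a prepare to branch instruction, and sequential execution is modeled as pc + 1. The class and method names are hypothetical and are not taken from the referenced application.

```python
from dataclasses import dataclass


@dataclass
class BranchRegister:
    target: int = 0      # target address of the branch
    taken: bool = False  # flag: True = branch to target, False = continue sequentially


class BranchRegisterFile:
    """Software model of a branch register file (illustrative only)."""

    def __init__(self, size: int = 4):
        self.regs = [BranchRegister() for _ in range(size)]

    def prepare_to_branch(self, reg: int, target: int) -> None:
        """Assign a register and load it with a target address."""
        # Assumption: the flag is cleared when the register is assigned.
        self.regs[reg] = BranchRegister(target=target, taken=False)

    def compare(self, reg: int, condition: bool) -> None:
        """Set the flag in the referenced register if the condition is met."""
        if condition:
            self.regs[reg].taken = True

    def execute_branch(self, reg: int, pc: int) -> int:
        """Return the address of the next instruction to execute."""
        register = self.regs[reg]
        return register.target if register.taken else pc + 1


if __name__ == "__main__":
    x = 0
    brf = BranchRegisterFile()
    brf.prepare_to_branch(0, target=100)  # target computed ahead of time
    brf.compare(0, x == 0)                # outcome computed separately
    print(brf.execute_branch(0, pc=10))   # transfer of control
```

Because the target address lives in the register rather than in the branch itself, a second execute branch instruction referencing the same register can reuse it without recomputation, and the prepare and compare steps can be scheduled independently of the transfer of control.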
A prefetch instruction is sent to a cache memory when a flag in one of the registers is set to indicate that a branch is to be taken. The prefetch does not wait for the ultimate value of the flag. The prefetch instruction causes the cache line containing the target address to be loaded into the cache if the cache line in question is not already in the cache.
The register referenced by prepare to branch, execute branch, and compare instructions is specified by a pointer register. The contents of the pointer register may be set with the aid of a separate instruction. Alternatively, the contents of the pointer register may be set by an execute branch instruction using data stored in the register file.
While this computer architecture significantly reduces the delays caused by branches, it is less than optimal on computers which issue multiple instructions per machine cycle, such as superscalar or VLIW computers. On such computers, two or more instructions are executed on each machine cycle. Hence, multiple branches can, in principle, be processed at the same time.
Even in architectures that execute only one instruction per machine cycle, the determination of the outcome of a branch can be used to affect actions being taken on a plurality of registers. As noted above, the system described above preferably uses prefetch instructions to reduce the effects of memory latency time. Consider the case in which the computer determines that a particular branch will be taken and there were a number of branches on the alternate path that have already been assigned to other registers. Prefetch instructions for the addresses indicated in these other registers are also being processed by the register control hardware. Since these addresses will not be referenced, it is a waste of computer resources to continue with the prefetching operations. In fact, the movement of the corresponding data into the cache may actually reduce the cache's performance by displacing more useful data from the cache.
In general, the register control hardware will, in effect, have a queue of prefetch instructions that it is feeding to the cache. If it is known that a particular branch is not being taken, any corresponding prefetch instructions that have not yet been issued can, in principle, be removed from the queue, thereby increasing the priority of data that is more likely to be accessed through the cache. Unfortunately, the control hardware must trace through a linked list having one entry per register to determine which registers and corresponding prefetch instructions are to be retired by any given branch outcome determination. The time needed to traverse this linked list renders this approach untenable.
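The cost of this traversal can be illustrated with a minimal sketch. The layout of the linked list is not specified above, so the model assumes one link per register pointing to the next register on the same untaken path; the function names, register numbers, and addresses are hypothetical.

```python
from collections import deque


def registers_to_retire(next_link: dict, head):
    """Walk the chain of registers on the untaken path, one step per register.

    `next_link` maps a register number to the next register on the same path,
    or to None at the end of the chain (an assumed layout).
    """
    retired = []
    reg = head
    while reg is not None:
        retired.append(reg)
        reg = next_link.get(reg)
    return retired


def cancel_prefetches(queue: deque, retired_regs) -> deque:
    """Drop pending (register, address) prefetches that belong to retired registers."""
    retired = set(retired_regs)
    return deque((reg, addr) for reg, addr in queue if reg not in retired)


if __name__ == "__main__":
    # Registers 2, 3 and 5 hold branches on the path that will not be executed.
    links = {2: 3, 3: 5, 5: None}
    pending = deque([(1, 0x100), (2, 0x900), (3, 0xA00), (4, 0x200), (5, 0xB00)])
    retired = registers_to_retire(links, head=2)
    print(retired, list(cancel_prefetches(pending, retired)))
```

In this model the number of steps in the walk grows with the number of registers on the untaken path, and the queue cannot be pruned until the walk finishes, which is the delay identified above.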
Broadly, it is the object of the present invention to provide an improved computer architecture and instruction set for executing branch instructions.
It is a further object of the present invention to provide a computer architecture which more efficiently executes multiple branches from a source program in a single machine cycle.
It is a still further object of the present invention to provide a computer architecture which reduces the number of unnecessary prefetch instructions that reach the cache.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.