The present invention relates generally to cache structures in computer systems, and more specifically to cache structures that aid in predicting conditional and unconditional branches.
As computer designers have designed increasingly higher performance implementations of various computer architectures, a number of classes of techniques have been developed to achieve these increases in performance. Broadly speaking, many of these techniques can be categorized as forms of pipelining, caching, and hardware parallelism. Some of these techniques are generally applicable to and effective in the implementation of most types of computer architectures, while others are most appropriate in the context of speeding up the implementations of complex instruction set computers (CISCs).
Due to the nature of typical CISC instruction sets, the processing of each instruction often requires a relatively long sequence of operations to be performed. Lower performance implementations consequently spend a large number of processor cycles performing these operations in a largely sequential, though possibly somewhat overlapped, manner. High performance implementations, on the other hand, often resort to using large degrees of hardware parallelism and pipelining to improve the processing throughput rate of the central processing unit (CPU).
In both cases, the processing latency for each instruction is large; in the latter case, though, the goal is to achieve the appearance of each instruction requiring only one or a few processor/clock cycles to be processed. As long as the processing of successive instructions can be successfully pipelined and more generally overlapped, this goal is achieved. Typically, however, various types of dependencies between neighboring instructions result in processing delays.
A number of techniques are available to reduce or eliminate the impact of these dependencies. One area where this is critical is in the handling of control dependencies, i.e., branch-type instructions. In the context of a CISC architecture implementation, the handling of such dependencies is difficult. Such an implementation requires the ability to quickly calculate or otherwise determine the target address of the branch, to quickly resolve the proper path of subsequent instruction processing in the case of conditional branches, and in all cases to then quickly restart the fetching of instructions at the new address. To the extent that these operations cannot be performed quickly, pipeline processing delays result.
Relatively long pipelines, or at least large processing latencies, typical in a high performance CISC implementation, make these operations difficult to consistently speed up. These latencies, in conjunction with inter- and intra-instruction dependencies, result in inherent delays in the performance of these operations.
Various prediction and caching techniques can be applied to minimize the actual impact of these delays on processing throughput. These techniques attempt to consistently and accurately predict the information to be produced by the above operations. Such information may include branch target address, conditional branch direction, and the first one or more instructions at the branch target address. The percentage success rates of these prediction techniques then reduce the effective delay penalties incurred by the above three operations by corresponding amounts. In the extreme and ideal case of 100% success rates, these delays potentially can be eliminated.
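The relationship between success rate and remaining delay can be illustrated with a short calculation. This is only a sketch under simplifying assumptions: the function name `effective_penalty` and the example figures are hypothetical, and it assumes that a failed prediction costs no more than the delay that would have been incurred with no prediction at all.

```python
def effective_penalty(penalty_cycles: float, success_rate: float) -> float:
    """Average delay per branch, in cycles, that remains after prediction.

    Assumes a correct prediction removes the whole delay and an incorrect
    one leaves it unchanged (an assumption made for this illustration).
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return penalty_cycles * (1.0 - success_rate)


# Illustrative numbers only: a 4-cycle delay predicted correctly 75% of the time.
average_delay = effective_penalty(4, 0.75)
# The ideal case of a 100% success rate eliminates the delay.
ideal_delay = effective_penalty(4, 1.0)
```

With these assumed figures, a 75% success rate leaves one quarter of the original delay on average, and a 100% success rate leaves none.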
Many of these prediction and caching techniques are based on the retention or caching of information from the prior processing of branch instructions. When a branch instruction is encountered again, and information from previous processing(s) of this instruction is still to be found in the prediction cache structure, this cached information is then used to make an intelligent dynamic prediction for the current occurrence of the branch. When no such information is to be found in the prediction cache structure, either a less intelligent static prediction must be made, or normal processing, with the attendant possibility of incurring delays, must be performed.
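The choice between a dynamic and a static prediction can be sketched as follows. This is a minimal illustration rather than the design of any particular processor: the names `prediction_cache`, `static_predict`, `predict_taken`, and `record_outcome` are hypothetical, the cache here remembers only the last direction of each branch, and the static rule (backward branches predicted taken, forward branches predicted not taken) is an assumption chosen for the example.

```python
from typing import Dict, Tuple

# Hypothetical prediction cache: branch address -> direction taken last time.
prediction_cache: Dict[int, bool] = {}


def static_predict(branch_addr: int, target_addr: int) -> bool:
    """Assumed static rule: backward branches (typically loops) are taken."""
    return target_addr < branch_addr


def predict_taken(branch_addr: int, target_addr: int) -> Tuple[bool, str]:
    """Return (predicted direction, kind of prediction that produced it)."""
    if branch_addr in prediction_cache:
        # Information from a prior occurrence is still cached: dynamic prediction.
        return prediction_cache[branch_addr], "dynamic"
    # Nothing cached for this branch: fall back to the static rule.
    return static_predict(branch_addr, target_addr), "static"


def record_outcome(branch_addr: int, taken: bool) -> None:
    """Retain the resolved direction for the next occurrence of the branch."""
    prediction_cache[branch_addr] = taken
```

A branch seen for the first time is predicted statically; once `record_outcome` has been called for it, later occurrences are predicted from the cached direction.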
Past high performance CISC designs have used forms of cache structures to hold various combinations of information to be used in predicting one or more of the three types of information mentioned above. An aggressive all-encompassing design could be envisioned in which a fully associative cache of a few thousand entries was utilized. Each entry would hold a record of the actual target address associated with the last occurrence of the branch; a copy of the first several instructions at this target address; and in the case of conditional branches a history record of the direction taken by each of the past several branch occurrences.
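The contents of such an entry, and the fully associative organization, can be modeled in software as shown below. This is a sketch under stated assumptions, not a description of actual hardware: the class names, the history length of 4, the capacity of 2048, and the least-recently-used replacement policy are all hypothetical choices, since the text specifies only "a few thousand entries" and "the past several" occurrences.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional

HISTORY_LEN = 4   # assumed number of past directions kept per branch
CAPACITY = 2048   # assumed size, standing in for "a few thousand entries"


@dataclass
class BranchEntry:
    target_addr: int            # target address of the last occurrence
    target_instrs: List[bytes]  # copy of the first instructions at the target
    history: List[bool] = field(default_factory=list)  # True = taken, newest last


class BranchPredictionCache:
    """Fully associative model: any branch address may occupy any entry."""

    def __init__(self, capacity: int = CAPACITY) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, BranchEntry]" = OrderedDict()

    def lookup(self, branch_addr: int) -> Optional[BranchEntry]:
        entry = self.entries.get(branch_addr)
        if entry is not None:
            self.entries.move_to_end(branch_addr)  # mark as most recently used
        return entry

    def update(self, branch_addr: int, taken: bool,
               target_addr: int, target_instrs: List[bytes]) -> None:
        entry = self.entries.get(branch_addr)
        if entry is None:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used (assumed)
            entry = BranchEntry(target_addr, list(target_instrs))
            self.entries[branch_addr] = entry
        else:
            entry.target_addr = target_addr
            entry.target_instrs = list(target_instrs)
        # Append the newest direction and keep only the last HISTORY_LEN of them.
        entry.history = (entry.history + [taken])[-HISTORY_LEN:]
        self.entries.move_to_end(branch_addr)
```

In hardware, a fully associative lookup compares the address against every entry in parallel, which is a large part of the cost; the dictionary here only models the behavior, not that cost.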
In parallel with the fetching and/or decoding of a branch instruction, the instruction would also be looked up in the branch prediction cache. Generally, this look-up would be based on the fetch address of the branch or a closely related address. As the instruction is being decoded, the branch history information would be used to predict the direction of conditional branches; this would determine whether subsequent instruction processing should continue with the instructions sequentially following the branch, or with the sequence of instructions starting at the target address.
Whether the branch is conditional or unconditional, if processing is to continue with the target instruction stream, then the processing of successive instructions would proceed without delay using the branch target instructions from the cache. At the same time, fetching of further non-cached instructions would be initiated immediately using the predicted branch target address, plus an appropriate increment.
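The steering of instruction processing after a look-up can be sketched as follows. The names `Entry`, `predict_direction`, and `steer_fetch` are hypothetical, and several details are assumptions made for illustration: the history is reduced to a direction by majority vote, a miss simply continues sequentially, and the "appropriate increment" is taken to be the total byte length of the cached target instructions.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    target_addr: int
    target_instrs: List[bytes]  # encoded instructions cached from the target
    history: List[bool]         # recent directions, True = taken


def predict_direction(history: List[bool]) -> bool:
    """Assumed policy: majority vote; ties and an empty history predict taken."""
    return 2 * sum(history) >= len(history)


def steer_fetch(entry: Optional[Entry], is_conditional: bool,
                fallthrough_addr: int) -> Tuple[List[bytes], int]:
    """Return (instructions supplied by the cache, address to fetch from next)."""
    if entry is None:
        return [], fallthrough_addr  # miss: continue with sequential instructions
    # Unconditional branches always go to the target; conditional ones are predicted.
    taken = predict_direction(entry.history) if is_conditional else True
    if not taken:
        return [], fallthrough_addr
    # Cached instructions are processed at once, while fetching resumes just past them.
    increment = sum(len(instr) for instr in entry.target_instrs)
    return entry.target_instrs, entry.target_addr + increment
```

For a predicted-taken branch, the caller receives the cached target instructions to process immediately, together with the address at which fetching of non-cached instructions should resume.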
While this branch prediction design offers the possibility of high prediction rates and fast processing of predicted branches, a large amount of relatively fast hardware would be required for implementation. For most CPU implementations, a design of this scale and scope would be impractical and/or involve a poor cost/performance tradeoff. While the performance potential of a branch prediction scheme of this sort is very desirable, the associated hardware cost is simply too high.
In most instances, the resulting cost and performance do not justify this type of branch prediction scheme, and a more cost-effective approach is required.
Typically, a branch prediction cache design incorporated into a high performance CPU implementation would be of a smaller scope. It would not attempt dynamic prediction of all three of the following: target address, target instructions, and conditional branch direction. It would also be of a smaller scale. Prediction success rates and degree of branch processing acceleration are necessarily traded off to reduce the cost to an acceptable level.