As computer designers have produced increasingly higher-performance implementations of various computer architectures, a number of classes of techniques have been developed to achieve these increases in performance. Broadly speaking, many of these techniques can be categorized as forms of pipelining, caching, and hardware parallelism. Some of these techniques are generally applicable to, and effective in, the implementation of most types of computer architectures, while others are most appropriate for speeding up implementations of complex instruction set computers (CISCs).
Due to the nature of typical CISC instruction sets, the processing of each instruction often requires a relatively long sequence of operations to be performed. Lower performance implementations consequently spend a large number of processor cycles performing these operations in a largely sequential, though possibly somewhat overlapped, manner. High performance implementations, on the other hand, often resort to using large degrees of hardware parallelism and pipelining to improve the processing throughput rate of the central processing unit (CPU).
In both cases, the processing latency for each instruction is large; in the latter case, though, the goal is to achieve the appearance of each instruction requiring only one or a few processor clock cycles to be processed. As long as the processing of successive instructions can be successfully pipelined and, more generally, overlapped, this goal is achieved. Typically, however, various types of dependencies between neighboring instructions result in processing delays.
A number of techniques are available to reduce or eliminate the impact of these dependencies. One area where this is critical is the handling of control dependencies, i.e., branch-type instructions. In the context of a CISC architecture implementation, the handling of such dependencies is difficult. It requires the ability to quickly calculate or otherwise determine the target address of the branch, to quickly resolve the proper path of subsequent instruction processing in the case of conditional branches, and in all cases to then quickly restart the fetching of instructions at the new address. To the extent that these operations cannot be performed quickly, pipeline processing delays result.
Relatively long pipelines, or at least large processing latencies, typical in a high-performance CISC implementation, make these operations difficult to speed up consistently. These latencies, in conjunction with inter- and intra-instruction dependencies, result in inherent delays in the performance of these operations.
Various prediction and caching techniques can be applied to minimize the actual impact of these delays on processing throughput. These techniques attempt to consistently and accurately predict the information to be produced by the above operations. Such information may include branch target address, conditional branch direction, and the first one or more instructions at the branch target address. The percentage success rates of these prediction techniques then reduce the effective delay penalties incurred by the above three operations by corresponding amounts. In the extreme and ideal case of 100% success rates, these delays potentially can be eliminated.
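The relationship between success rate and effective delay can be illustrated with a short calculation. The sketch below assumes a simple linear model in which each operation carries a fixed cycle penalty that is incurred only when its prediction fails. The function name and the cycle counts used in the comments are hypothetical and are not taken from any particular design.

```python
def effective_penalty(penalty_cycles, success_rate):
    """Expected delay per branch for one operation.

    Assumption: the full penalty is paid on a misprediction and
    nothing is paid on a correct prediction.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return penalty_cycles * (1.0 - success_rate)


def total_effective_penalty(penalties, success_rates):
    """Sum the expected delays over all predicted operations.

    Both arguments are dicts keyed by operation name, for example
    'target_address', 'direction', and 'target_instructions'.
    """
    return sum(
        effective_penalty(penalties[name], success_rates[name])
        for name in penalties
    )
```

Under this model, a hypothetical 8-cycle penalty predicted correctly 75% of the time costs 2 cycles per branch on average, and a 100% success rate reduces the expected delay to zero, which matches the ideal case described above.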
There are various forms of "static" prediction techniques for predicting the direction of a conditional branch. These relatively simple techniques take advantage of the statistical bias that generally exists in the direction taken by different types of branches. Each time a given type of branch is encountered, a fixed prediction is made based on the expected statistic regarding the likelihood of that type of branch being taken.
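A static scheme of this kind amounts to a fixed lookup keyed on branch type. The following sketch is illustrative only: the branch categories and the taken or not-taken bias assigned to each are assumptions chosen for the example, not measured statistics.

```python
# Hypothetical branch types mapped to a fixed prediction (True = taken).
# The biases below are assumptions for illustration only.
STATIC_BIAS = {
    "loop_closing": True,          # assumed: loop-closing branches are usually taken
    "unconditional": True,         # always taken by definition
    "forward_conditional": False,  # assumed: forward conditionals usually fall through
}


def static_predict(branch_type, default=False):
    """Return the fixed prediction for a branch type.

    The same answer is given every time a branch of this type is
    encountered; no run-time history is consulted.
    """
    return STATIC_BIAS.get(branch_type, default)
```

Because the table never changes during execution, the accuracy of such a predictor is bounded by how strongly each branch type is biased in practice.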
More sophisticated and generally more successful techniques attempt to retain dynamic information accumulated during the execution of a program. By retaining or caching information from the prior processing of branch instructions, it is possible to make "more intelligent", or statistically more successful, future predictions. Further, by caching appropriate information, it is possible to make worthwhile predictions regarding not only branch directions, but also branch target addresses and target instructions.
When a branch instruction is encountered again, and information from previous processing of this instruction is still to be found in the prediction cache structure, this cached information is then used to make a dynamic prediction for the current occurrence of the branch. When no such information is to be found in the prediction cache structure, either a "blind" static prediction must be made, or normal processing, with the attendant possibility of incurring delays, must be performed.
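The lookup-with-fallback behavior described above can be sketched in software as follows. This is a minimal model under stated assumptions, not a description of any particular hardware: the class name, the fully associative organization, the least-recently-used replacement, and the "repeat the last outcome" policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class PredictionCache:
    """Toy prediction cache keyed by branch address.

    Assumptions: fully associative, LRU replacement, and each entry
    holds only the most recent direction and target of the branch.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # branch address -> (taken, target address)

    def predict(self, branch_addr, static_guess=False):
        """Return (taken, target, hit) for the given branch address."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            # Miss: fall back to a "blind" static prediction; no target is known.
            return static_guess, None, False
        self.entries.move_to_end(branch_addr)  # mark as most recently used
        taken, target = entry
        return taken, target, True

    def update(self, branch_addr, taken, target):
        """Record the actual outcome once the branch has been resolved."""
        self.entries[branch_addr] = (taken, target)
        self.entries.move_to_end(branch_addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

A caller would invoke `predict` when a branch is fetched and `update` once its direction and target are known. On a miss, the `static_guess` argument stands in for whatever static prediction the design would otherwise make.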
Past high-performance CISC designs have attempted to perform dynamic prediction of only one or of a subset of the three types of information mentioned above. For each form of dynamic prediction, an appropriate structure is designed to cache the necessary information from past branches. For the handling of the other aspects of a branch, either static prediction or simply normal processing is utilized.
In an aggressive, all-encompassing design, one could envision caching a range of information sufficient to enable relatively successful dynamic prediction of all three types of branch information. Past design approaches have, in essence, traded off the performance potential of this type of design for reductions in the hardware cost and design complexity of branch-processing circuitry.
Incremental improvements upon the branch processing capabilities of a modest design by the addition of dynamic prediction for each aspect of branch instructions can incur large hardware and design complexity costs. Not only are there inherent costs for storing the requisite information from past branches and utilizing it in the processing of future branches, but there are also significant overhead costs.
For each information caching structure there is significant peripheral circuitry necessary for accessing the structure in various ways (e.g. reading and/or writing in indexed and/or associative manners). There is also significant control circuitry for managing the accessing of each structure and the overall operation of these structures in conjunction with the processing of branch instructions. With the inclusion of each further form of branch prediction, the overhead hardware costs are additive and the design costs are at best additive and possibly multiplicative.
While these incremental overhead costs cannot be eliminated, to the extent that they can be reduced (i.e., made less than additive in hardware cost and design complexity), the cost-performance trade-off can be shifted toward greater branch prediction capability and thus higher performance.