The present invention relates generally to cache structures in computer systems, and more specifically to cache structures that aid in predicting conditional and unconditional branches.
As computer designers have produced increasingly high-performance implementations of various computer architectures, a number of classes of techniques have been developed to achieve these increases in performance. Broadly speaking, many of these techniques can be categorized as forms of pipelining, caching, and hardware parallelism. Some of these techniques are generally applicable to and effective in the implementation of most types of computer architectures, while others are most appropriate for speeding up implementations of complex instruction set computers (CISCs).
Due to the nature of typical CISC instruction sets, the processing of each instruction often requires a relatively long sequence of operations to be performed. Lower performance implementations consequently spend a large number of processor cycles performing these operations in a largely sequential, though possibly somewhat overlapped, manner. High performance implementations, on the other hand, often resort to using large degrees of hardware parallelism and pipelining to improve the processing throughput rate of the central processing unit (CPU).
In both cases, the processing latency for each instruction is large; in the latter case, though, the goal is to achieve the appearance of each instruction requiring only one or a few processor/clock cycles to be processed. As long as the processing of successive instructions can be successfully pipelined and more generally overlapped, this goal is achieved. Typically, however, various types of dependencies between neighboring instructions result in processing delays.
A number of techniques are available to reduce or eliminate the impact of these dependencies. One area where this is critical is in the handling of control dependencies, i.e., branching type instructions. In the context of a CISC architecture implementation, the handling of such dependencies is difficult. A CISC implementation requires the ability to quickly calculate or otherwise determine the target address of the branch, to quickly resolve the proper path of subsequent instruction processing in the case of conditional branches, and in all cases to then quickly restart the fetching of instructions at the new address. To the extent that these operations cannot be performed quickly, pipeline processing delays result.
Relatively long pipelines, or at least large processing latencies, typical in a high performance CISC implementation, make these operations difficult to consistently speed up. These latencies, in conjunction with inter- and intra-instruction dependencies, result in inherent delays in the performance of these operations.
Various prediction and caching techniques can be applied to minimize the actual impact of these delays on processing throughput. These techniques attempt to consistently and accurately predict the information to be produced by the above operations. Such information may include branch target address, conditional branch direction, and the first one or more instructions at the branch target address. The percentage success rates of these prediction techniques then reduce the effective delay penalties incurred by the above three operations by corresponding amounts. In the extreme and ideal case of 100% success rates, these delays potentially can be eliminated.
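The relationship between prediction success rate and effective delay can be illustrated with a small calculation. This is only a sketch under simplifying assumptions that are not part of the description above: a correctly predicted branch is taken to incur no delay, a mispredicted branch is taken to incur the full delay, and the 4-cycle base penalty is a hypothetical figure chosen for illustration.

```python
def effective_penalty(base_penalty_cycles: float, success_rate: float) -> float:
    """Average delay per branch when a fraction `success_rate` is predicted correctly.

    ASSUMPTION: correct predictions cost nothing and mispredictions cost the
    full base penalty. Real pipelines may behave less cleanly.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return base_penalty_cycles * (1.0 - success_rate)


if __name__ == "__main__":
    # Hypothetical 4-cycle branch penalty at several success rates.
    for rate in (0.0, 0.5, 0.9, 1.0):
        print(f"success rate {rate:.0%}: {effective_penalty(4.0, rate):.2f} cycles")
```

Under these assumptions the delay falls linearly with the success rate and reaches zero at a 100% success rate, which corresponds to the ideal case described above.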
Many of these prediction and caching techniques are based on the retention or caching of information from the prior processing of branch instructions. When a branch instruction is encountered again, and information from previous processing(s) of this instruction is still to be found in the prediction cache structure, this cached information is then used to make an intelligent dynamic prediction for the current occurrence of the branch. When no such information is to be found in the prediction cache structure, either a less intelligent static prediction must be made, or normal processing, with the attendant possibility of incurring delays, must be performed.
Past high performance CISC designs have used forms of cache structures to hold various combinations of information to be used in predicting one or more of the three types of information mentioned above. An aggressive all-encompassing design could be envisioned in which a fully associative cache of a few thousand entries was utilized. Each entry would hold a record of the actual target address associated with the last occurrence of the branch; a copy of the first several instructions at this target address; and in the case of conditional branches a history record of the direction taken by each of the past several branch occurrences.
In parallel with the fetching and/or decoding of a branch instruction, the instruction would also be looked up in the branch prediction cache. Generally, this look-up would be based on the fetch address of the branch or a closely related address. As the instruction is being decoded the branch history information would be used to predict the direction of conditional branches; this would determine whether subsequent instruction processing should continue with the instructions sequentially following the branch, or with the sequence of instructions starting at the target address.
Whether the branch is conditional or unconditional, if processing is to continue with the target instruction stream, then the processing of successive instructions would proceed without delay using the branch target instructions from the cache. At the same time, fetching of further non-cached instructions would be initiated immediately using the predicted branch target address, plus an appropriate increment.
While this branch prediction design offers the possibility of high prediction rates and fast processing of predicted branches, a large amount of relatively fast hardware would be required for implementation. For most CPU implementations, a design of this scale and scope would be impractical and/or involve a poor cost/performance trade-off. While the performance potential of a branch prediction scheme of this sort is very desirable, the associated hardware cost is simply too high.
In most instances, the resulting cost and performance do not justify this type of branch prediction scheme, and a more cost-effective approach is required.
Typically, a branch prediction cache design incorporated into a high performance CPU implementation would be of a smaller scope. It would not attempt dynamic prediction of all three of the following: target address, target instructions, and conditional branch direction. It would also be of a smaller scale. Prediction success rates and degree of branch processing acceleration are necessarily traded off to reduce the cost to an acceptable level.
The present invention is an improved branch prediction cache (BPC) scheme that utilizes a hybrid cache structure to achieve a more cost-effective trade-off between hardware cost and performance improvement. The invention combines most of the branch processing acceleration of a full scope prediction cache design with the lower cost of simpler and/or smaller prediction cache structures.
In brief, the invention provides two levels of branch information caching. The first level BPC is a shallow (36 entries) but wide (entries are approximately 32 bytes each) structure which caches full prediction information for a limited number of branch instructions. The second level BPC is a deep (256 entries) but narrow (entries are approximately 2 bytes each) structure which caches only partial prediction information, but does so for a much larger number of branch instructions. As each branch instruction is fetched and decoded, its address is used to perform parallel look-ups in the two branch prediction caches.
Through proper organization of these two cache structures individually, and proper combining and coordination of their joint operation, an overall performance benefit is achieved approximating that of a single cache structure with the prediction information width of the first level BPC and a number of cache entries equal to that of the second level BPC. The amount of hardware circuitry required to implement the two-level hybrid cache structure is much less than for a single deep and wide cache structure.
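The hardware saving can be made concrete with a rough storage comparison based on the approximate entry counts and entry sizes given above. This sketch counts only the prediction information storage; tag storage, comparators, and control logic are left out, which is a simplifying assumption.

```python
# Approximate figures taken from the description above.
L1_ENTRIES, L1_ENTRY_BYTES = 36, 32    # shallow but wide
L2_ENTRIES, L2_ENTRY_BYTES = 256, 2    # deep but narrow


def storage_bytes(entries: int, entry_bytes: int) -> int:
    """Data storage for a cache, ignoring tags and control logic (assumption)."""
    return entries * entry_bytes


# Two-level hybrid structure: 36 * 32 + 256 * 2 bytes.
two_level = storage_bytes(L1_ENTRIES, L1_ENTRY_BYTES) + storage_bytes(L2_ENTRIES, L2_ENTRY_BYTES)

# Single deep-and-wide structure: first level width, second level depth.
single_deep_wide = storage_bytes(L2_ENTRIES, L1_ENTRY_BYTES)

if __name__ == "__main__":
    print("two-level hybrid :", two_level, "bytes")
    print("single deep/wide :", single_deep_wide, "bytes")
    print("ratio            :", round(single_deep_wide / two_level, 2))
```

With these approximate figures the hybrid structure needs about 1,664 bytes of prediction storage, against about 8,192 bytes for a single 256-entry cache with 32-byte entries, or roughly one fifth as much.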
The first level BPC comprises entries containing a relatively large amount of prediction information. In the preferred embodiment, each first level cache entry contains the target address from when the branch instruction was last executed; up to the first 24 bytes of sequential instruction stream starting at the target address; and two history bits recording the direction taken during the most recent two executions of the branch instruction.
The first level BPC preferably uses an associative access method. A cache tag for each entry, namely the instruction address of the branch associated with the entry, is stored in a content addressable memory (CAM) array. A first level cache look-up is performed by accessing this array using the above next instruction address, and then reading out the prediction information from any, at most one, entry for which there was a tag match. Full associativity maximizes the prediction cache's hit rate for a given cache size (number of entries).
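The behavior of the first level BPC can be modeled in software as follows. This is a minimal sketch, not the hardware design: a dictionary keyed by branch address stands in for the CAM tag match, and the names `L1Entry` and `FirstLevelBPC` are hypothetical. The replacement policy (evicting the least recently updated entry) and the history bit ordering are assumptions, since neither is specified above.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional

L1_ENTRIES = 36          # number of first level entries
MAX_TARGET_BYTES = 24    # cached instruction bytes starting at the target


@dataclass
class L1Entry:
    target_address: int  # target from the last execution of the branch
    target_bytes: bytes  # up to 24 bytes of instruction stream at the target
    history: int         # two bits; ASSUMPTION: bit 0 is the most recent direction


class FirstLevelBPC:
    def __init__(self, capacity: int = L1_ENTRIES) -> None:
        self.capacity = capacity
        # Keyed by the branch instruction address (the tag). A dict lookup
        # models the CAM match: at most one entry can match a given address.
        self._entries: "OrderedDict[int, L1Entry]" = OrderedDict()

    def lookup(self, address: int) -> Optional[L1Entry]:
        """Return the matching entry, or None on a miss."""
        return self._entries.get(address)

    def update(self, address: int, target_address: int,
               target_bytes: bytes, taken: bool) -> None:
        """Record the outcome of an executed branch."""
        old = self._entries.pop(address, None)
        if old is None:
            history = int(taken)
        else:
            # Shift in the newest direction and keep only two bits.
            history = ((old.history << 1) | int(taken)) & 0b11
        if len(self._entries) >= self.capacity:
            # ASSUMPTION: evict the least recently updated entry.
            self._entries.popitem(last=False)
        self._entries[address] = L1Entry(
            target_address, target_bytes[:MAX_TARGET_BYTES], history)
```

Because any entry can hold any branch, a miss occurs only when the branch has not been seen recently enough to remain among the 36 cached entries.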
The second level BPC, unlike the first level BPC, comprises entries containing only a limited amount of prediction information. This is offset by the much larger number of entries, compared to the first level cache, that can be implemented with a given amount of hardware circuitry.
The size of a second level cache entry is dramatically reduced by not caching target instructions, while the extent to which branch processing can be accelerated is reduced only moderately. This especially makes sense given that the second level cache serves as a backup to the first level cache. With this much larger size, even given the direct-mapped organization, the second level cache provides an effective backup to the first level cache.
In the preferred embodiment of this invention, a second level cache entry holds only a partial target address and one history bit. The predicted direction of a conditional branch is based simply on the direction last taken by that branch. The branch target address is assumed to be within a subset of the instruction address space also containing the branch instruction. The full predicted target address is a concatenation of the upper address bits from the branch instruction's address with the lower 16 bits from the cache entry.
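The concatenation described above can be sketched as a pair of bit operations. The 32-bit address width is an assumption made for illustration only, and the function names are hypothetical.

```python
ADDRESS_BITS = 32                      # ASSUMPTION: width of an instruction address
LOW_BITS = 16                          # bits of the target held in the cache entry
LOW_MASK = (1 << LOW_BITS) - 1
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def predicted_target(branch_address: int, cached_low16: int) -> int:
    """Upper bits of the branch address joined with the cached lower 16 bits."""
    upper = branch_address & ADDRESS_MASK & ~LOW_MASK
    return upper | (cached_low16 & LOW_MASK)


def same_region(branch_address: int, actual_target: int) -> bool:
    """True when the branch and its target share the same upper address bits,
    which is the case in which the reconstructed target is exact."""
    return ((branch_address ^ actual_target) & ADDRESS_MASK & ~LOW_MASK) == 0


if __name__ == "__main__":
    branch = 0x00123456
    print(hex(predicted_target(branch, 0x9ABC)))   # 0x129abc
    print(same_region(branch, 0x00129ABC))         # True
    print(same_region(branch, 0x00139ABC))         # False
```

With 16 cached bits, the reconstructed target is exact whenever the branch and its target lie in the same 64 KB aligned region, which is the subset of the address space referred to above.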
The second level BPC uses a direct-mapped access method, versus a set or fully associative method. This is acceptable due to the relatively large number of second level cache entries. Even more significantly, it is then possible to discard the tag and tag storage associated with each cache entry. In essence, when a cache look-up accesses a selected entry, it is simply assumed that the tag and look-up address match.
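A tagless direct-mapped look-up can be modeled as a plain array indexed by part of the address. This is a minimal sketch under stated assumptions: the entry is selected by the low-order 8 bits of the branch address, and entries start out as "not taken" with a zero partial target. Neither detail is specified above, and the class name `SecondLevelBPC` is hypothetical.

```python
from typing import List, Tuple

L2_ENTRIES = 256


class SecondLevelBPC:
    def __init__(self) -> None:
        # Each entry holds (lower 16 bits of target, direction last taken).
        # ASSUMPTION: entries are initialized to a zero target, not taken.
        self._entries: List[Tuple[int, bool]] = [(0, False)] * L2_ENTRIES

    @staticmethod
    def index(address: int) -> int:
        # ASSUMPTION: the low-order 8 address bits select one of 256 entries.
        return address % L2_ENTRIES

    def lookup(self, address: int) -> Tuple[int, bool]:
        # No tag is stored or compared: the selected entry is simply
        # assumed to belong to the branch being looked up.
        return self._entries[self.index(address)]

    def update(self, address: int, target_address: int, taken: bool) -> None:
        self._entries[self.index(address)] = (target_address & 0xFFFF, bool(taken))


if __name__ == "__main__":
    bpc = SecondLevelBPC()
    bpc.update(0x1010, 0x1ABC, True)
    # A different branch that maps to the same entry sees the same data.
    print(bpc.lookup(0x2010))
```

The cost of discarding tags is visible in the usage example: two branches whose addresses select the same entry share its contents, so a look-up for one may return prediction information recorded for the other. Such a look-up yields a possibly wrong prediction rather than a miss.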
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.