The present invention relates to branch prediction caches, and in particular to methods and apparatus for handling the modification of an instruction in the branch prediction cache or already provided to an instruction pipeline by an instruction farther along the pipeline.
As computer designers have designed increasingly higher performance implementations of various computer architectures, a number of classes of techniques have been developed to achieve these increases in performance. Broadly speaking, many of these techniques can be categorized as forms of pipelining, caching, and hardware parallelism. Some of these techniques are generally applicable to and effective in the implementation of most types of computer architectures, while others are most appropriate in the context of speeding up the implementations of Complex Instruction Set Computers (CISC).
The purpose of creating a high-performance implementation of a CISC architecture system is to achieve the appearance of each instruction having a processing time of one or a few processor/clock cycles. However, such timing typically is only approximated. One of the principal reasons for not achieving a one-cycle processing time is the existence of various types of dependencies between neighboring instructions. The dependencies may result in the occurrence of processing delays.
One critical area in the achievement of a one-cycle processing time is the handling of control dependencies, i.e. branch-type instructions. In the context of a CISC architecture implementation, these instructions are difficult because the implementation must quickly calculate or otherwise determine the target address of a branch, quickly resolve the proper path of subsequent instruction processing in the case of conditional branches, and in all cases then quickly restart the fetching of instructions at the new address. Pipeline processing delays result when these operations cannot be performed quickly.
To minimize the actual impact of these delays on processing throughput, various types of prediction and caching techniques are available. The designer's goal in applying these techniques is to predict accurately the information to be produced by the above operations, i.e. branch target address, conditional branch direction, and the first one or more instructions at the branch target address. The success rates of these prediction techniques then reduce the effective delay penalties incurred by the above three operations by corresponding amounts.
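As a rough numerical illustration of this last point, the sketch below assumes a simplified model in which a correct prediction hides the entire delay and an incorrect one incurs it in full. The function name and the figures are hypothetical and are not taken from any particular design.

```python
def effective_penalty(full_penalty_cycles: float, success_rate: float) -> float:
    """Average delay per branch, in cycles, under the simplified model above."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie between 0 and 1")
    # Only the mispredicted fraction of branches pays the full penalty.
    return full_penalty_cycles * (1.0 - success_rate)


# Hypothetical figures: a 4-cycle delay with a 75% prediction success rate.
average_delay = effective_penalty(4, 0.75)
```

Under these assumed figures the average delay per branch falls from four cycles to one.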
Generally speaking, existing techniques are based on the retention or caching of information from the prior processing of branch instructions. When a branch instruction is encountered again, and information from previous processing of this instruction is still to be found in the prediction cache structure, the cached information is then used to make an "intelligent" dynamic prediction for the current occurrence of the branch. When no such information is to be found in the prediction cache structure, either a "dumber" static prediction must be made, or normal processing, with the attendant possibility of incurring delays, must be performed.
Existing high-performance CISC designs use forms of cache structures to hold various combinations of information with the intention of predicting one or more of the three types of information mentioned above. In an aggressive all-encompassing design each entry holds: a record of the actual target address associated with the last occurrence of the branch; a copy of the first several instructions at this target address; and, in the case of conditional branches, a history record of the direction taken by each of the past couple of branch occurrences.
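The contents of such an entry can be pictured with a minimal software model. This is only a sketch: the class name, field names, and the history depth of two are assumptions chosen for illustration, and a hardware entry would of course be a fixed-width record rather than a Python object.

```python
from dataclasses import dataclass
from typing import List, Optional

HISTORY_DEPTH = 2  # assumption: "the past couple" of branch occurrences


@dataclass
class BranchPredictionEntry:
    target_address: int                   # actual target at the last occurrence
    target_instructions: bytes            # copy of the first several instruction bytes at the target
    history: Optional[List[bool]] = None  # recent directions (True = taken); None if unconditional

    def record_outcome(self, taken: bool) -> None:
        """Append the newest direction and keep only the most recent outcomes."""
        if self.history is None:
            return  # unconditional branches carry no direction history
        self.history.append(taken)
        self.history = self.history[-HISTORY_DEPTH:]


# Hypothetical conditional branch whose target lies at 0x2000.
entry = BranchPredictionEntry(0x2000, b"\x90\x90\x90\x90", history=[])
for outcome in (True, False, True):
    entry.record_outcome(outcome)
```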
In parallel with the fetching and/or decoding of a branch instruction, the instruction is also looked up in the branch prediction cache. Generally, this look-up is based on the fetch address of the branch or on a closely related address. As the instruction is being decoded, the branch history information is used to predict the direction of conditional branches. The history information determines whether subsequent instruction processing should continue with the instructions sequentially following the branch, or with the sequence of instructions starting at the target address.
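The look-up and direction prediction just described might be modeled as follows. The cache contents, the function name, and the policy of repeating the most recent direction are all assumptions made for the sake of a short example; actual designs may interpret the history bits differently.

```python
from typing import Dict, List, Optional

# Hypothetical cache contents: branch fetch address -> recent directions (True = taken).
history_by_branch_address: Dict[int, List[bool]] = {
    0x1000: [True, True],
    0x1040: [False, True],
    0x1080: [False, False],
}


def predict_taken(branch_fetch_address: int) -> Optional[bool]:
    """Return the predicted direction, or None when no history is cached."""
    history = history_by_branch_address.get(branch_fetch_address)
    if not history:
        # Miss: the caller must fall back to a static prediction
        # or to normal processing of the branch.
        return None
    # Assumed policy: predict the same direction as the most recent occurrence.
    return history[-1]
```

A result of True means processing continues with the target instruction stream, False means it continues with the instructions sequentially following the branch.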
Whether the branch is conditional or unconditional, if processing is to continue with the target instruction stream, then the processing of successive instructions proceeds without delay using the branch target instructions from the cache. At the same time, fetching of further non-cached instructions is immediately initiated using the predicted branch target address, plus an appropriate increment.
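The "appropriate increment" is not specified further here. One plausible reading, offered only as an assumption, is that it equals the number of target instruction bytes already supplied from the cache, so that fetching resumes just past them. The function name below is hypothetical.

```python
def resume_fetch_address(predicted_target: int, cached_target_bytes: bytes) -> int:
    """Address at which fetching of non-cached instructions would resume.

    Assumption: the increment is the length of the cached target bytes.
    """
    return predicted_target + len(cached_target_bytes)


# Hypothetical case: 8 bytes of target instructions are held in the cache.
next_fetch = resume_fetch_address(0x2000, bytes(8))
```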
While this branch prediction design offers the possibility of fast, efficient processing of the predicted branches, it also introduces the possibility of processing a branch instruction based on erroneous information. In general, handling mispredicted aspects of processing a branch must be an integral part of the overall central processing unit (CPU) design.
There are subtler issues stemming from the nature of the implemented architecture. In the case of many CISC architectures, one part of a program may modify other parts. When the modified program parts are then executed, it is the modified image of these instructions, not the original image, that is executed.
For some CISC architectures that allow programs to modify themselves, this type of programming has become an established practice within a significant portion of the existing software base. Consequently, to maintain backward software compatibility, new CPU implementations often must not only implement the direct semantics of the architecture's instruction set, but also maintain the appearance of this expected secondary semantic behavior. In the case of higher performance implementations, this can become a significant, and potentially difficult, requirement to satisfy.
When cache-based dynamic branch prediction is incorporated into the design, difficulties arise for high-performance implementations. It is possible that the target address of a branch instruction, which might otherwise be a constant value for that instruction instance in memory, may be modified by what is normally an instruction that writes data to memory. Or, for that matter, the branch instruction opcode may be changed to that of a different type of branch instruction or to that of a non-branch type of instruction.
Further, even if none of the bytes of a branch instruction itself are modified, one of the target instructions may be modified. To the extent that these instructions are fetched from main memory after any such modification has taken effect, there is no problem. But if, for example, a copy of the modified instruction is held in a branch prediction cache such as described above, and is fetched from there instead of main memory, then a consistency problem exists.
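The inconsistency can be reproduced in a toy model. The memory size, addresses, and byte values below are arbitrary assumptions; the point is only that the cached copy and main memory diverge once a store modifies a target instruction.

```python
memory = bytearray(64)  # toy main memory
TARGET = 0x20           # hypothetical branch target address
memory[TARGET:TARGET + 4] = b"\x01\x02\x03\x04"  # original target instruction bytes

# The branch prediction cache keeps its own copy of the target bytes.
cached_target = bytes(memory[TARGET:TARGET + 4])

# A store instruction later overwrites the first target instruction byte.
memory[TARGET] = 0xFF

from_memory = bytes(memory[TARGET:TARGET + 4])  # reflects the modification
from_cache = cached_target                      # still holds the original image
is_consistent = from_memory == from_cache       # False: the cached copy is stale
```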
For both branch and target instruction modifications, the design of a branch prediction cache and associated control circuitry must maintain a sufficient degree of consistency to ensure proper processing of instructions. The maintenance of consistency must encompass not only conventional data/instruction cache consistency, but also consistency with respect to memory store instructions modifying other instructions which are executed shortly thereafter.
The consistency problem is essentially similar to that encountered with more conventional data/instruction cache structures used in high-performance CPU designs. Data writes by the CPU must be appropriately reflected in the state and/or contents of any affected cache entries. But the scope of the problem is more general than this. When other devices within the system, such as Direct Memory Access (DMA) devices or other CPUs, modify main memory, the issue of cache/main memory consistency again arises. For a branch prediction cache these other devices are additional sources or causes of inconsistencies which must also be covered.
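One simple way to reflect a write in the state of affected entries is to invalidate any entry whose branch bytes or cached target bytes overlap the written range, regardless of whether the write came from the CPU, a DMA device, or another processor. The sketch below illustrates that idea only; the names and the invalidate-rather-than-update policy are assumptions, not a description of any particular design.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    branch_length: int   # bytes occupied by the branch instruction
    target_address: int  # start of the cached target instructions
    target_length: int   # bytes of target instructions held in the entry


def overlaps(start_a: int, len_a: int, start_b: int, len_b: int) -> bool:
    """True when byte ranges [start, start + len) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a


def invalidate_on_write(cache: Dict[int, Entry], address: int, length: int) -> int:
    """Drop entries touched by a write; the cache is keyed by branch address."""
    stale = [
        branch_address
        for branch_address, e in cache.items()
        if overlaps(address, length, branch_address, e.branch_length)
        or overlaps(address, length, e.target_address, e.target_length)
    ]
    for branch_address in stale:
        del cache[branch_address]
    return len(stale)


# Hypothetical contents: two branches with 8 cached target bytes each.
cache = {0x1000: Entry(2, 0x2000, 8), 0x1100: Entry(2, 0x3000, 8)}
removed = invalidate_on_write(cache, 0x2004, 1)  # write lands inside the first target
```

Invalidation is the conservative choice: the next occurrence of the branch simply misses in the cache and is processed normally.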
In extreme "store-into-instruction-stream" cases, such as a modifying instruction immediately followed by a branch and then a modified target instruction, maintaining this consistency can be difficult. Particularly for highly pipelined, high-performance CPU designs, implementing it can prove expensive in terms of additional hardware circuitry, complex to design, and/or detrimental to the overall performance attainable by the design.
Pipelining, particularly the deep pipelining that is common in high-performance implementations of CISC architectures, results in large instruction processing latencies and high degrees of overlap between the processing of successive instructions. Access of a branch prediction cache and usage of the resultant prediction information and target instructions generally occurs early in such pipelines. Execution of memory writes by instructions storing to memory, on the other hand, generally takes place late in such pipelines.
Consequently, actions such as fetching target instructions from a branch prediction cache (BPC) can easily occur before architecturally preceding memory writes modifying such target instructions have actually been performed. Such actions may even occur before the store addresses have been generated, this being the earliest point at which potential consistency problems could be checked for and detected. The result is that consistency must be maintained with respect to not only the explicit contents of the branch prediction cache, but also with respect to the implicit contents associated with instructions currently being processed.
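The check against instructions currently being processed can be sketched as an address comparison performed once the store address has been generated. Everything here is hypothetical, including the record layout and the use of a sequence number to represent architectural order; it merely illustrates what detecting such a conflict involves.

```python
from typing import List, NamedTuple


class InFlight(NamedTuple):
    sequence: int  # architectural program order; larger means later
    address: int   # fetch address of the instruction
    length: int    # instruction length in bytes


def instructions_to_refetch(
    store_sequence: int,
    store_address: int,
    store_length: int,
    in_flight: List[InFlight],
) -> List[InFlight]:
    """Later instructions whose bytes overlap the store and so may be stale."""
    return [
        i
        for i in in_flight
        if i.sequence > store_sequence
        and i.address < store_address + store_length
        and store_address < i.address + i.length
    ]


# Hypothetical pipeline: a store, a branch, and a target taken from the cache.
pipeline = [
    InFlight(10, 0x1000, 3),  # the store instruction itself
    InFlight(11, 0x1003, 2),  # the branch that follows it
    InFlight(12, 0x2000, 4),  # target instruction already supplied from the cache
]
stale = instructions_to_refetch(10, 0x2001, 1, pipeline)  # store writes one byte at 0x2001
```

In this example only the target instruction is flagged, since it entered the pipeline before the architecturally earlier store had written its bytes.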
Insofar as these consistency issues apply to target instructions fetched from the cache, they also apply to instructions temporarily stored in and then taken from instruction pre-fetch queues. As a transient form of cache, a pre-fetch queue can lead to similar inconsistencies.
Overall, the need to maintain branch prediction cache and more general fetched instruction consistency with respect to cases of "store-into-instruction-stream", as well as with respect to more generic modifications of main memory blocks, is a difficult problem.