This invention relates generally to branch prediction in a computer system, and more particularly to providing asynchronous dynamic millicode entry prediction in a processor.
Millicode is internally licensed, handwritten low-level code that has access to privileged instructions and to special support hardware designed into a microprocessor. The use of millicode in a high-performance pipelined microprocessor has several key benefits. For instance, millicode can remove the burden of supporting complex or non-performance-critical instructions directly in hardware while still providing them below the software level for backwards compatibility, with predictable and acceptable performance. Millicode can also provide workarounds for hardware problems by forcing instructions to execute in millicode. Because millicode has access to privileged and restricted hardware resources, it also allows generation ‘N’ instructions to be added to the instruction set architecture of a generation ‘N−1’ processor.
Executing an instruction in millicode appears like a branch to a subroutine, where the subroutine implements the function of the instruction using a multitude of millicode instructions. This allows a complex sequence of instructions to be executed using a single program instruction. Any program instruction that is executed by a millicode routine can be referred to as a millicode entry point (mcentry). Once a processor redirects to a millicode routine as a result of encountering an mcentry instruction, the instructions that make up the millicoded sequence are executed, ending with a millicode end (mcend) instruction that returns the program to the sequential program instruction following the mcentry instruction.
A processor pipeline can include multiple functional units, with each unit further divided into multiple stages. Highest performance in the processor is generally achieved by keeping as many stages as possible active at the same time and by avoiding stopping and restarting the pipeline. For example, the pipeline can include separate units for instruction fetching, decoding, execution, and put away. The flow of instructions through the pipeline is referred to as an instruction stream. One approach to handling an mcentry instruction in an instruction stream is to wait for the instruction to reach the point of decode in the processor pipeline. The target address of the millicode subroutine can then be computed and a sequential return address saved. Any younger instructions in the processor pipeline are flushed. The processor pipeline is restarted at the computed target address, allowing an instruction stream for the subroutine to flow through the processor pipeline. Once the return point of the subroutine is reached, a similar flush and restart are performed using the return address. A typical structure used to support this is a call return stack. Upon decoding the branching instruction, the return address is pushed onto the call return stack. When the returning instruction of the subroutine is reached, the flush and restart occur once again: the return address is popped off the call return stack and used to restart the pipeline.
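The decode-time approach described above can be illustrated with a minimal sketch. It is a toy software model, not a description of any actual hardware: the class name `DecodeTimeRedirect`, the fixed `flush_penalty` cycle count, the instruction length, and the addresses are all assumptions chosen for illustration.

```python
class DecodeTimeRedirect:
    """Toy model of decode-time mcentry handling with a call return stack."""

    def __init__(self, flush_penalty=5):
        self.return_stack = []              # call return stack of saved return addresses
        self.flush_penalty = flush_penalty  # assumed cycles lost per flush and restart
        self.stall_cycles = 0               # total cycles spent without advancing instructions

    def _flush_and_restart(self, address):
        # Younger instructions are discarded and fetching resumes at `address`.
        self.stall_cycles += self.flush_penalty
        return address

    def on_mcentry(self, mcentry_address, instruction_length, routine_address):
        # Save the address of the next sequential program instruction,
        # then redirect the pipeline to the millicode routine.
        self.return_stack.append(mcentry_address + instruction_length)
        return self._flush_and_restart(routine_address)

    def on_mcend(self):
        # Pop the saved return address and redirect the pipeline back to it.
        return self._flush_and_restart(self.return_stack.pop())


# Hypothetical addresses: an mcentry at 0x1000, 4 bytes long, routine at 0x8000.
pipeline = DecodeTimeRedirect(flush_penalty=5)
next_fetch = pipeline.on_mcentry(0x1000, 4, 0x8000)
resume = pipeline.on_mcend()
```

In this model every entry into and return from a millicode routine costs one flush and restart, so a single millicoded instruction incurs two such penalties.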
While this method is functionally sound, the flush and restart processes can create large performance-degrading gaps in which cycles pass without instructions advancing through the pipeline. If the target address is not cached locally, the delays can be more severe, as multiple levels of the memory hierarchy are accessed (e.g., level-2 cache, main memory, disk storage, etc.). The use of a call return stack may also prevent accurate asynchronous branch prediction from proceeding beyond the mcentry instruction.
Many high-performance processors contain a branch target buffer (BTB), which stores branch address and target address bits associated with a given branch. This mechanism can be used to enhance the performance of executing subroutines by predicting in advance when a branch to a subroutine will occur and where it will return. However, this mechanism does have some limitations. Existing BTB designs lack awareness of being in predicted millicode address space, as well as the ability to save a return point that can be modified and used as the target of the routine-ending branch.
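As a rough illustration of the kind of structure a BTB is, the following sketch models a small direct-mapped table that maps a branch address to a predicted target. The class name `BranchTargetBuffer`, the direct-mapped organization, the table size, and the addresses are assumptions made for this example and do not describe any particular processor.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: one (tag, target) pair per entry."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index_and_tag(self, branch_address):
        # Low-order part of the address selects the entry; the rest is the tag.
        return branch_address % self.num_entries, branch_address // self.num_entries

    def update(self, branch_address, target_address):
        # Record the resolved target of a taken branch, replacing any older entry.
        index, tag = self._index_and_tag(branch_address)
        self.entries[index] = (tag, target_address)

    def predict(self, branch_address):
        # Return the predicted target on a tag match, or None when there is no prediction.
        index, tag = self._index_and_tag(branch_address)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None


# Hypothetical addresses for illustration only.
btb = BranchTargetBuffer(num_entries=16)
btb.update(0x1000, 0x8000)
hit = btb.predict(0x1000)    # same branch address: predicted target is returned
miss = btb.predict(0x1010)   # maps to the same entry but the tag differs: no prediction
```

Each entry in this sketch holds only a fixed target for a given branch address. It carries no indication that the target lies in millicode address space and no separately saved, modifiable return point, which corresponds to the limitations noted above.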
Therefore, it would be beneficial to improve the handling of entry into and return from millicode subroutines in order to enhance prediction capabilities and reduce the delays associated with pipeline flushes. Accordingly, there is a need in the art for providing asynchronous dynamic millicode entry prediction in a processor.