1. Technical Field
The present invention relates to data processing and in particular to branch prediction in data processing systems. Still more particularly, the present invention relates to a method and system for efficiently handling simultaneous multi-threaded operations within a branch prediction mechanism of a data processing system.
2. Description of the Related Art
Branch prediction within processing systems is well known in the art. When instructions are initially fetched from cache or memory for execution at the processing units, a prediction mechanism within the processing unit predicts a path that will be taken by branch instructions within the group of fetched instructions. The instructions are address operations and the path is identified by an address, referred to as a target address. When the instruction is actually executed, a check is made whether the predictions were correct.
Specific hardware and/or logic structures within the processor carry out the branch direction prediction and subsequent analysis of whether the path was correctly predicted. Some current systems utilize branch prediction logic that includes three branch history tables (BHTs), which store predictors for fetched branches, and a predicted target address cache (referred to hereinafter as a “count cache”), which stores predicted target addresses for some of the fetched branch instructions. One BHT, referred to as the “local predictor,” is indexed by partial branch addresses. The prediction direction is associated with the address in the local predictor. The other two BHTs, the “global predictor” and the “selector,” are indexed by a hash of the partial branch address and the recent path of execution. The count cache is utilized for certain types of branch instructions whose target addresses cannot be directly computed from information in the branch instruction itself, and operates by associating target addresses with branch instruction addresses.
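The arrangement of the three BHTs and the count cache described above can be sketched in software as follows. This is a minimal illustration only, not the logic of any particular processor: the table size (`INDEX_BITS`), the XOR hash, the 1-bit predictors, the selector update rule, and all names (`BranchPredictionSketch`, `low_bits`, `record_target`, and so on) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field

INDEX_BITS = 4             # assumed: each table holds 2**4 entries
ENTRIES = 1 << INDEX_BITS


def low_bits(value):
    """Keep only the INDEX_BITS lowest-order bits of value."""
    return value & (ENTRIES - 1)


@dataclass
class BranchPredictionSketch:
    # Three BHTs of 1-bit predictors (1 = taken, 0 = not taken).
    local: list = field(default_factory=lambda: [0] * ENTRIES)
    global_: list = field(default_factory=lambda: [0] * ENTRIES)
    selector: list = field(default_factory=lambda: [0] * ENTRIES)
    # Count cache: one predicted target address per entry.
    count_cache: list = field(default_factory=lambda: [None] * ENTRIES)
    path_history: int = 0  # recent taken/not-taken outcomes, newest in bit 0

    def local_index(self, branch_addr):
        # Local predictor: indexed by the partial branch address.
        return low_bits(branch_addr)

    def hashed_index(self, branch_addr):
        # Global predictor and selector: indexed by a hash of the partial
        # branch address and the recent path (XOR is an assumed hash).
        return low_bits(branch_addr ^ self.path_history)

    def predict_taken(self, branch_addr):
        hashed = self.hashed_index(branch_addr)
        if self.selector[hashed] == 1:
            return self.global_[hashed] == 1
        return self.local[self.local_index(branch_addr)] == 1

    def update(self, branch_addr, taken):
        outcome = int(taken)
        li = self.local_index(branch_addr)
        hi = self.hashed_index(branch_addr)
        local_ok = self.local[li] == outcome
        global_ok = self.global_[hi] == outcome
        # Assumed rule: when the two predictors disagree, the selector
        # is pointed at whichever one was correct.
        if local_ok != global_ok:
            self.selector[hi] = 1 if global_ok else 0
        self.local[li] = outcome
        self.global_[hi] = outcome
        self.path_history = low_bits((self.path_history << 1) | outcome)

    def record_target(self, branch_addr, target_addr):
        self.count_cache[low_bits(branch_addr)] = target_addr

    def predicted_target(self, branch_addr):
        return self.count_cache[low_bits(branch_addr)]
```

In this sketch, two branch addresses that agree in their lowest `INDEX_BITS` bits select the same local predictor entry and the same count cache entry, regardless of their higher-order bits.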
One improvement in data processing that affects how application instructions are executed by the processor, and subsequently the reliability of branch prediction, is the implementation of simultaneous multi-threading (SMT). With SMT, program applications executing on the processor are executed as one or more threads. Each thread comprises a stream of instructions. At any given time, information from multiple threads may exist in various parts of the machine. For example, with two executing threads, the two threads appear to the operating system (OS) as two separate processors. Each of the two threads has (or appears to the OS to have) its own copy of all the normal architected registers that a program can access and/or modify.
Often, multiple copies of the same application are executed concurrently in order to speed up the overall processing of the application on the system and ensure more efficient utilization of processor resources. When this occurs, each copy provides its own set of threads, and each thread shares similar program/instruction addresses within the memory subsystem. The branch prediction information (written to the BHTs and count cache) is also the same and can be merged. It is also common, however, for the threads executing on the processor to belong to different applications and thus to have different program/instruction addresses within the memory subsystem. Even so, the partial addresses of the instructions stored within the BHTs and the count cache may be similar, resulting in some conflict at the BHTs and count cache and in accuracy problems with branch prediction.
At the processor level, the addresses utilized during processing are typically effective addresses (EAs). Each of these effective addresses maps to a specific real address (RA) within the physical memory space. When the instructions are initially retrieved from memory, they are assigned an effective address. A common practice is to begin assignment of lower order bits of effective addresses for each application at a particular address to ensure that the number of effective addresses required for operations within the processor is not excessively large. The lower order bits of effective addresses are thus utilized and re-utilized for each thread, and threads of different applications with different physical addresses are often assigned the same lower order bits of effective addresses. For example, the compiler may always start a program at the same effective address when it begins loading, irrespective of whether another thread (of the same or another program) has been assigned the same effective address. Thus, in the multi-threaded environment, different threads from different applications utilizing processor resources may share the same EAs, but because they map to different RAs, the threads necessarily provide very different targets and direction predictions and should not be handled in the same manner when completing way prediction.
Typically, the part of the instruction address utilized to index into the BHTs and the count cache is the lower order bits, which will tend to be unique for each instruction (or group of instructions in a superscalar machine) of a single application. Each BHT provides an array of 1-bit or 2-bit wide registers to store the predictors, and the count cache provides an array of registers the width of an instruction address. Assuming the number of lower order instruction address bits used to index into the array is x, the number of possible register entries per array is 2^x, which accommodates all possible index values. The number of low order instruction address bits used to index into the count cache need not be the same as the number of bits used to index into the BHTs.
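The relationship between the number of index bits x and the array size can be illustrated with a short sketch. The function names and the specific bit counts below are hypothetical and chosen only for the example; real designs may also discard alignment bits before indexing, which is not modeled here.

```python
def table_entries(x):
    """Number of entries in an array indexed by x address bits (2^x)."""
    return 1 << x


def table_index(instruction_addr, x):
    """Index formed from the x lowest-order bits of the instruction address."""
    return instruction_addr & ((1 << x) - 1)


# Assumed example widths: the BHTs and the count cache may use different x.
BHT_INDEX_BITS = 10
COUNT_CACHE_INDEX_BITS = 5

addr = 0x12345
bht_index = table_index(addr, BHT_INDEX_BITS)
count_cache_index = table_index(addr, COUNT_CACHE_INDEX_BITS)
```

With these assumed widths, the BHT has 1024 entries and the count cache has 32 entries, and the same instruction address yields a different index into each array.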
In SMT mode, two threads share the three BHTs and the count cache. When both threads are running the same code, i.e., threads of the same application, there is an advantage to both threads sharing common BHTs and a common count cache, and it is thus important that both threads be able to share BHT and count cache entries. However, when each thread is running different code, the current system by which the threads share common BHTs and a common count cache may result in faulty predictions because of the overlap in addresses that may be placed within the BHTs and count cache. Within a multiple application environment, this sharing of cache lines would cause some amount of thrashing within the branch prediction mechanism. Currently, there is no implementation in which way branch prediction logic can accurately ensure that prediction from within the BHTs and count cache is not faulty due to the sharing of effective addresses between threads of different program code.
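The thrashing described above can be illustrated with a small simulation. It is a sketch under stated assumptions, not a model of any actual processor: a single shared BHT of 1-bit predictors is indexed only by low order effective address bits, and two hypothetical threads of different applications execute a branch at the same effective address with opposite outcomes. The function name, the trace format, and the addresses are invented for the example.

```python
def count_mispredictions(trace, index_bits=4):
    """Run a trace of (thread_id, effective_addr, taken) tuples through one
    shared table of 1-bit predictors and count the wrong predictions."""
    table = [0] * (1 << index_bits)   # 0 = predict not taken
    mask = (1 << index_bits) - 1
    misses = 0
    for _thread_id, effective_addr, taken in trace:
        index = effective_addr & mask  # thread identity is not part of the index
        if table[index] != int(taken):
            misses += 1
        table[index] = int(taken)      # remember the most recent outcome
    return misses


SHARED_EA = 0x1000  # assumed: both programs place a branch at this EA

thread_a = [("A", SHARED_EA, True)] * 10    # branch always taken
thread_b = [("B", SHARED_EA, False)] * 10   # branch never taken

# Alternate the two threads, as might occur when both are active in SMT mode.
interleaved = [step for pair in zip(thread_a, thread_b) for step in pair]

alone_a = count_mispredictions(thread_a)
alone_b = count_mispredictions(thread_b)
together = count_mispredictions(interleaved)
```

Under these assumptions, each thread running alone is mispredicted at most once, whereas the interleaved trace is mispredicted on every branch, because each thread overwrites the entry the other thread relies on.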
The present invention thus recognizes that it would be desirable to provide a method, processing system, and branch prediction mechanism that substantially eliminate faulty predictions caused by SMT operations for different program code. A method, processing system, and branch prediction mechanism that enable correct way-prediction when threads of different applications share lower order effective address bits but map to different real addresses would be a welcome improvement. The invention further recognizes that it would be beneficial to provide each thread in an SMT processor the protection of its own private BHT and count cache spaces, inaccessible to the other thread, without substantially increasing hardware or logic costs (i.e., by sharing current hardware in a non-overlapping way). These and other benefits are provided by the invention described herein.