Pipeline processing is one of the techniques that improve the performance of computer processors. In pipeline processing, the execution of each instruction is divided into a plurality of stages, such as fetch, decode, execute, memory access, and so on, and different stages of different instructions are executed in parallel. That is to say, while one instruction is at one stage (for example, a fetch stage), another instruction is processed at another stage (for example, a decode stage).
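The overlap of stages may be illustrated with a minimal sketch. The stage list below (including a final "writeback" stage), the one-cycle-per-stage timing, and the function name `pipeline_schedule` are assumptions made for illustration only and are not taken from any particular processor.

```python
# Assumed five-stage pipeline; the source names fetch, decode, execute and
# memory access, and "writeback" is added here only as an illustrative stage.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return, for each cycle, a dict mapping stage name -> instruction number.

    Assumes every stage takes exactly one cycle and no stage ever stalls.
    """
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        occupancy = {}
        for position, stage in enumerate(stages):
            # Instruction i enters stage number `position` at cycle i + position.
            instruction = cycle - position
            if 0 <= instruction < num_instructions:
                occupancy[stage] = instruction
        schedule.append(occupancy)
    return schedule


if __name__ == "__main__":
    for cycle, occupancy in enumerate(pipeline_schedule(3)):
        print(cycle, occupancy)
```

In this idealized model, n instructions finish in n + k - 1 cycles on a k-stage pipeline, instead of n * k cycles without overlap.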
Ideally, instructions are placed in a pipeline such that no stage is idle. In practice, however, some stages become idle for various reasons, and when this happens the utilization of the pipeline degrades. One of the reasons is that a program includes branch instructions indicating conditional branches. When the processor encounters a branch instruction, it either continues to fetch the instruction at the next sequential address without jumping (not-taken) or jumps to fetch an instruction at a remote address (taken), depending on the result of executing the branch instruction. The instruction to be fetched after the branch instruction is not determined until the branch instruction exits the execute stage. If the next instruction is placed in the pipeline only after the execution of the branch instruction is completed, some stages may become idle.
To deal with this problem, there is an approach that implements branch prediction techniques in a processor. A branch prediction circuit that is provided as hardware in the processor stores history information about previous branch directions of branch instructions. For example, the branch prediction circuit stores, for each branch instruction, a bit sequence indicating the most recent several to several tens of branch directions (“taken” or “not-taken”).
When a branch instruction is placed in a pipeline, the branch prediction circuit predicts a branch direction of this branch instruction on the basis of the history information. For example, in the case where there is a high probability that the same branch direction is selected successively, the branch prediction circuit is able to predict that the next branch direction will be the same as several most recent branch directions. In addition, for example, in the case where a branch is alternately taken and not taken with a pattern, the branch prediction circuit is able to predict the next branch direction of the branch instruction according to the pattern.
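A history-based prediction of this kind may be sketched as follows. This is only one possible scheme, assumed here for illustration: a per-branch history of recent directions selects a 2-bit saturating counter. The class name `LocalHistoryPredictor`, the history length of 4, and the initial counter value are all hypothetical choices, not details of any specific branch prediction circuit.

```python
class LocalHistoryPredictor:
    """Illustrative predictor: recent directions of a branch select a counter."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = {}   # branch address -> tuple of recent directions
        self.counters = {}  # (address, history) -> 2-bit counter in 0..3

    def predict(self, addr):
        """Return True for "taken", False for "not-taken"."""
        hist = self.history.get(addr, ())
        # Counters start at 1 (weakly not-taken); 2 or 3 means predict taken.
        return self.counters.get((addr, hist), 1) >= 2

    def update(self, addr, taken):
        """Record the actual direction after the branch has executed."""
        hist = self.history.get(addr, ())
        key = (addr, hist)
        counter = self.counters.get(key, 1)
        self.counters[key] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Keep only the most recent `history_bits` directions.
        self.history[addr] = (hist + (taken,))[-self.history_bits:]


def accuracy(predictor, addr, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    print(accuracy(LocalHistoryPredictor(), 0x400, [True] * 100))
    print(accuracy(LocalHistoryPredictor(), 0x400, [True, False] * 50))
```

Under these assumptions, both a repeated direction and a strictly alternating pattern become predictable after a short warm-up, because each distinct recent history has its own counter.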
Once a branch direction is predicted, the instruction in the predicted branch direction is placed next to the branch instruction in the pipeline (speculative execution). If the predicted branch direction matches the actual branch direction, the processor simply continues the pipeline processing. If the predicted branch direction is incorrect, by contrast, the processor removes the instruction placed based on the prediction from the pipeline and places the appropriate instruction instead. The cycles lost in doing so are a misprediction penalty. Therefore, it may be said that the efficiency of the pipeline processing depends on the accuracy of the branch prediction.
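The dependence of pipeline efficiency on prediction accuracy may be made concrete with a commonly used first-order cost model. The function name `effective_cpi` and all numeric values below are illustrative assumptions, not measurements of any processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order estimate of cycles per instruction with mispredictions.

    base_cpi        -- cycles per instruction with perfect prediction
    branch_fraction -- fraction of executed instructions that are branches
    mispredict_rate -- fraction of branches whose direction is mispredicted
    penalty_cycles  -- cycles lost per misprediction (flush and refill)
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches and a 10-cycle penalty.
    for rate in (0.0, 0.05, 0.10):
        print(rate, effective_cpi(1.0, 0.2, rate, 10))
```

The model ignores all other sources of stalls; it is meant only to show that the added cost grows linearly with the misprediction rate.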
Hardware multithreading is another of the techniques that improve the performance of processors. While the instruction sequence of a single thread is executed, short waiting times may occur intermittently for various reasons other than the above-described conditional branches, such as memory access. With Operating System (OS)-level multithreading, which involves context switches such as replacement of register data, it is difficult to reduce these intermittent short waiting times. Therefore, in the situation where a single processor or processor core executes only a single thread at a time, there is a limit on the improvement of the utilization of resources including the pipeline stages and others.
Hardware multithreading enables a plurality of threads to share the resources of a single processor or processor core at the same time. From the processor's standpoint, such threads may be called “hardware threads”. The processor stores data for the plurality of hardware threads in registers of the processor. If a waiting time occurs for one hardware thread, for example, the processor places instructions from another hardware thread in the pipeline to keep the stages of the pipeline busy during the waiting time. In this case, the instructions from the hardware thread and the instructions from the other hardware thread coexist in the pipeline and are executed in parallel. Since context switches are not involved, it is possible to switch between the hardware threads at a high speed.
An OS recognizes that a plurality of threads is executed physically in parallel on this processor or processor core. Therefore, from the OS standpoint, the single processor or processor core that executes the plurality of hardware threads logically appears to be as many processors or processor cores as the number of hardware threads.
Both hardware multithreading and branch prediction may be implemented in a single processor. In such a processor, a plurality of hardware threads may share a branch prediction circuit that stores history information on branch instructions, as one of the resources provided in the processor or processor core. The history information may be stored in tabular form. For example, when a branch instruction is executed by one of the plurality of hardware threads, the branch prediction circuit converts the address of the branch instruction into an index of the table with a hash function or another algorithm and updates the entry indicated by the index.
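The conversion from a branch address to a table index, followed by an update of the indexed entry, may be sketched as below. The table size, the XOR-folding hash, the 8-bit history per entry, and the names `table_index` and `SharedHistoryTable` are assumptions chosen for illustration; an actual circuit may use a different hash and entry format.

```python
TABLE_SIZE = 1024  # assumed number of entries


def table_index(branch_addr, table_size=TABLE_SIZE):
    """Map a branch instruction address to a table index (assumed hash)."""
    word = branch_addr >> 2  # assume 4-byte aligned instructions
    return (word ^ (word >> 10)) % table_size


class SharedHistoryTable:
    """History table shared by all hardware threads (illustrative)."""

    def __init__(self, table_size=TABLE_SIZE, history_bits=8):
        self.size = table_size
        self.mask = (1 << history_bits) - 1
        self.entries = [0] * table_size  # each entry is a bit sequence

    def record(self, branch_addr, taken):
        """Shift the latest direction (1 = taken) into the indexed entry."""
        idx = table_index(branch_addr, self.size)
        self.entries[idx] = ((self.entries[idx] << 1) | int(taken)) & self.mask
        return idx


if __name__ == "__main__":
    table = SharedHistoryTable()
    idx = table.record(0x4000, True)
    table.record(0x4000, False)
    print(idx, bin(table.entries[idx]))
```

Because the index depends only on the address in this sketch, every hardware thread that executes the branch at the same address updates the same entry.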
There is the following technique proposed for sharing a branch prediction circuit among threads. While the same code is executed by two threads, a processor operates in a “unified mode” in which all the indexes of a table are shared. While different codes are executed by two threads, on the other hand, the processor operates in a “split mode” in which one table is split into halves, and a half of the indexes are allocated to one thread and the other half are allocated to the other thread. In the split mode, an index of the table is calculated from the address of a branch instruction such that the most significant bit corresponds to a thread identifier, thereby splitting the table into halves. This usage of the table is implemented as hardware in the processor.
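The index calculation in the two modes may be sketched as follows. The 10-bit index width, the use of the low-order address bits as the hash, and the function name `predictor_index` are assumptions for illustration; only the placement of the thread identifier in the most significant index bit follows the description above.

```python
def predictor_index(branch_addr, thread_id, split_mode, index_bits=10):
    """Compute a table index in "unified mode" or "split mode" (illustrative).

    thread_id is assumed to be 0 or 1 (two hardware threads).
    """
    word = branch_addr >> 2  # assume 4-byte aligned instructions
    if not split_mode:
        # Unified mode: all index bits come from the address, so both
        # threads share every entry of the table.
        return word & ((1 << index_bits) - 1)
    # Split mode: the most significant index bit is the thread identifier,
    # so each thread is confined to its own half of the table.
    low = word & ((1 << (index_bits - 1)) - 1)
    return ((thread_id & 1) << (index_bits - 1)) | low


if __name__ == "__main__":
    addr = 0x1000
    print(predictor_index(addr, 0, False), predictor_index(addr, 1, False))
    print(predictor_index(addr, 0, True), predictor_index(addr, 1, True))
```

In split mode each thread can use only half as many entries, which is the cost of keeping the two threads' history information separate.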
Please see, for example, Japanese Laid-open Patent Publication No. 2004-326785.
In a processor that is able to execute a plurality of threads, the plurality of threads may start to run under the same program. This case raises the problem of how to share the storage space for storing information to be used for branch prediction (for example, storage space for storing a table).
When a branch instruction is executed by one thread, the result of executing the branch instruction (for example, a bit indicating “taken” or “not-taken”) is written to the space corresponding to the address of the branch instruction or the like. Likewise, when a branch instruction is executed by another thread, the result of executing the branch instruction is written to the space corresponding to the address of the branch instruction or the like. At this time, since these two threads execute the same program, there is a possibility that these two branch instructions are the same instruction (i.e., have the same instruction address). If so, the same information is updated by the two threads, and therefore the execution results obtained by both threads are mixed in the information.
Branch prediction based on information including all execution results obtained by a plurality of threads has a problem of degrading the accuracy of the branch prediction. For example, in the case where the previous branch directions of a branch instruction were “taken” several times in a row in one thread, it is predictable for the thread that the next branch direction of the branch instruction will be “taken”. In addition, in the case where the previous branch directions of the branch instruction were “not-taken” several times in a row in another thread, it is predictable for the other thread that the next branch direction of the branch instruction will be “not-taken”. However, using information that includes both “taken” and “not-taken” obtained by the two threads makes it difficult to perform such branch prediction.
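The loss of accuracy may be reproduced with a small simulation. It is a deliberately simplified sketch: a single 2-bit saturating counter stands in for the history information, the two threads execute the same branch in strict alternation, and the function name `prediction_accuracy` is hypothetical.

```python
def prediction_accuracy(shared, rounds=100):
    """Simulate two threads executing the same branch instruction.

    Thread 0 always takes the branch; thread 1 never takes it.
    If `shared` is True, both threads update one counter; otherwise each
    thread has its own counter. Returns the fraction of correct predictions.
    """
    counters = {}  # key -> 2-bit saturating counter in 0..3
    correct = 0
    total = 0
    for _ in range(rounds):
        for thread_id, taken in ((0, True), (1, False)):
            key = "branch" if shared else ("branch", thread_id)
            counter = counters.get(key, 1)  # start weakly not-taken
            predicted_taken = counter >= 2
            if predicted_taken == taken:
                correct += 1
            counters[key] = min(3, counter + 1) if taken else max(0, counter - 1)
            total += 1
    return correct / total


if __name__ == "__main__":
    print("shared entry:  ", prediction_accuracy(shared=True))
    print("separate entry:", prediction_accuracy(shared=False))
```

In this worst-case interleaving the shared counter is pushed back and forth by the two threads and mispredicts every time, whereas separate counters mispredict only once during warm-up. Real workloads interleave less regularly, so the actual degradation depends on the program.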
To deal with this problem, there is an approach that modifies the hardware of the branch prediction circuit so as to allocate different storage space to two threads that start to run under the same program. Because this approach requires a change to the hardware, however, it fails to increase the branch prediction accuracy of existing processors.