Pipeline processors and branch prediction will be described first. In a simple implementation of a pipelined processor (pipeline processor), when a branch instruction is executed during execution of a program, the pipeline is stopped (stalled) for the clock cycles of the pipeline stages from instruction fetch to execution of the branch instruction. As a result, the efficiency of the pipeline processing is reduced.
To prevent this, when a branch instruction is fetched, the pipeline processor predicts the instruction address of the branch target or the branch direction, predicting “taken” when the branch is expected to be branched and “not taken” when it is expected not to be branched. Alternatively, the pipeline processor predicts both “taken” and “not taken”. The pipeline processor generally executes branch prediction processing in which the instruction fetch continues without waiting for the completion of the execution of the branch instruction. Because instructions continue to be issued to the pipeline of the processor based on the branch prediction, the processing efficiency of the pipeline can be improved.
The branch prediction includes one or both of branch target prediction and branch direction prediction. The branch target prediction denotes prediction of the instruction address of the instruction (branch target instruction) executed after a branch instruction. The branch direction prediction denotes prediction of whether a branch instruction is branched (taken) or not branched (not taken). An example of a known branch prediction system is disclosed in Japanese Laid-open Patent Publication No. 06-89173.
Branch prediction systems include static branch prediction, which has low branch prediction latency but inferior branch prediction accuracy, and dynamic branch prediction, which has high branch prediction latency but high branch prediction accuracy. In the static branch prediction, for example, the subsequent address is simply read (the branch is assumed “not taken”) or a prediction bit in the instruction code is used. In the dynamic branch prediction, a past branch history of executed programs is referenced, and the regularity of the branch behavior of the programs is used to predict one or both of the branch direction and the branch target instruction address for a branch instruction whose execution is not completed.
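The contrast between the two systems can be illustrated with a minimal sketch. The static predictor below always assumes “not taken”. The dynamic predictor uses a table of two-bit saturating counters, which is a common textbook scheme chosen here only for illustration and is not taken from the cited literature. All class names, the table size, and the branch trace are hypothetical, and the sketch models prediction accuracy only, not prediction latency.

```python
class StaticNotTaken:
    """Static prediction: always assume the branch is not taken."""

    def predict(self, pc):
        return False

    def update(self, pc, taken):
        pass  # a static predictor keeps no history


class TwoBitPredictor:
    """Dynamic prediction with two-bit saturating counters (illustrative only)."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0..3; 0-1 predict "not taken", 2-3 predict "taken".
        self.table = [1] * entries

    def _index(self, pc):
        # Assumes 4-byte aligned instructions (an assumption of this sketch).
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_correct(predictor, trace):
    """Replay (pc, taken) pairs and count correct predictions."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# Hypothetical loop branch: taken nine times, then not taken, repeated ten times.
loop_trace = [(0x400, i % 10 != 9) for i in range(100)]

static_correct = count_correct(StaticNotTaken(), loop_trace)
dynamic_correct = count_correct(TwoBitPredictor(), loop_trace)
```

On this hypothetical trace, the static predictor is correct only on the ten loop exits, while the counter-based predictor mispredicts the first iteration and each loop exit.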
In general, a large-scale history is used in the dynamic branch prediction. Compared to when the dynamic branch prediction is not used, the dynamic branch prediction needs more complicated circuits and needs a plurality of clock cycles until the completion of the branch prediction. In a processor core with a large number of pipeline stages and a large number of simultaneously executed instructions, a failure in the branch prediction significantly reduces performance and power efficiency. Therefore, a dynamic branch prediction circuit with high prediction accuracy is generally used in such a processor core.
Multithreading will be described. A pipelined system that processes a plurality of threads in a time-division multiplexing manner emerged around 1960 to efficiently use a computing unit with a long latency.
In a modern computer system, the memory latency, which is the time taken for memory reading and writing, is higher than the operation latency in the processor. A throughput-oriented processor generally adopts hardware multithreading, which is a multithread processing system based on hardware, to conceal the memory latency and thereby improve the throughput of the processor.
The processor core holds one or more sets of hardware threads (also called “strands”) as hardware resources for executing threads. Each hardware thread holds an architectural state and a processor state derived from the architectural state. Therefore, in most cases, the instruction fetch address handled by the instruction fetch unit has a different value for each thread.
The hardware multithreading is a system in which a single processor core switches among a plurality of hardware threads in a cycle-by-cycle time division manner or simultaneously processes the plurality of hardware threads within the same cycle.
The definition of hardware threads and multithreading as terms related to micro-architecture and the definition of threads and multithreading as terms in a field related to software, such as an operating system, are usually different. Hereinafter, the meaning described above will be used.
Parallel execution of the instructions of a plurality of threads improves the use efficiency of the computing unit between the plurality of threads or realizes parallel (overlapped) usage of the memory access and the computing unit between the plurality of threads. Therefore, the throughput of the processor improves.
A processor (multithread processor) that processes a plurality of threads by hardware to process the plurality of threads in parallel generally fetches instructions by switching the target threads cycle-by-cycle at a stage of instruction fetch. The time unit of switching the threads for fetching is one clock cycle at the minimum, and there is also a system with larger temporal granularity. The former is generally called FGMT (Fine-grained multithreading) or the like, and the latter is called VMT (Vertical multithreading), CGMT (Coarse-grained multithreading), or the like.
An example of a system for switching the threads (thread scheduling) in the instruction fetch unit is a round-robin system. This is a method of switching the scheduled threads in a predetermined order in each cycle.
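A minimal sketch of such a round-robin fetch schedule is shown below, assuming that the predetermined order is simply ascending thread ID and that every thread is always ready to fetch. The function name and parameters are hypothetical.

```python
def round_robin_schedule(num_threads, num_cycles):
    """Return the ID of the thread granted the fetch slot in each cycle.

    Assumes a fixed ascending order of thread IDs and that no thread
    is ever skipped (both are assumptions of this sketch).
    """
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    return [cycle % num_threads for cycle in range(num_cycles)]


# Three hardware threads over seven cycles.
schedule = round_robin_schedule(3, 7)
```

With three threads, each thread receives one fetch slot every three cycles, regardless of whether it has a pending branch prediction.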
There is also a known data processor that processes a plurality of interleaved instruction threads in cycles according to a priority rule and that adjusts the priority allocated to a specific instruction thread based on an event or a condition related to the instruction threads (for example, see Japanese Patent No. 4179555).
There is also a known data processor that processes a plurality of interleaved instruction threads in cycles according to a priority rule and that executes processing based on an event or a condition related to the instruction threads to select a specific instruction thread (for example, see Japanese Patent No. 4086808).
There is also a known processor capable of processing a plurality of instruction threads, wherein randomization is implemented in a method of interleaving instruction threads for processing, and at the same time, the overall rate or level of interleaving between the instruction threads is maintained at a desired rate or level (for example, see Japanese Patent No. 4086809).
There is also a known computer apparatus including a branch prediction mechanism that can predict an instruction to be executed after a conditional branch instruction to read a sequence of the instruction in advance, the branch prediction apparatus including, in addition to the branch prediction mechanism: means for obtaining a hint of a conditional branch direction; means for acquiring a hint of the conditional branch direction from an execution result of a specific instruction; means for transmitting the hint to the branch prediction apparatus; and means for determining the branch direction according to the hint (for example, see Japanese Laid-open Patent Publication No. 2001-5665).
There is also a known scheduling method in a multithreading processor, wherein a plurality of executable threads are allocated, the number of threads to be executed is dynamically determined according to an operational state of the multithreading processor, the determined number of threads are selected from the plurality of allocated threads, and instructions of the threads selected within the same period are fetched and executed (for example, see International Publication Pamphlet No. WO 04/044745).
Patent Literature 1: Japanese Laid-open Patent Publication No. 6-89173
Patent Literature 2: Japanese Patent No. 4179555
Patent Literature 3: Japanese Patent No. 4086808
Patent Literature 4: Japanese Patent No. 4086809
Patent Literature 5: Japanese Laid-open Patent Publication No. 2001-5665
Patent Literature 6: International Publication Pamphlet No. WO 04/044745
Problems of latency will be described. In a pipeline processor that fetches the instructions of each thread in a time division manner, the latency from the start of the instruction fetch of a thread to the acquisition of the branch prediction result may not match the timing of thread switching. The mismatch occurs when, even though a branch prediction result of a thread has been acquired, the pipeline stage that would receive the branch prediction result to fetch the instruction is processing another thread. In that case, the branch prediction result cannot be used immediately, and the branch prediction latency is effectively extended. The number of execution cycles of the thread may increase due to this extension. For example, if pure round-robin thread scheduling is used and the number of cycles of the branch prediction latency is not a multiple of the number of hardware threads, the hardware multithreading operation makes the number of cycles until the branch prediction result is used for the next instruction fetch greater than in a single-thread operation.
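The round-robin example can be made concrete with a small calculation. The sketch assumes a simplified model: a thread fetches in cycle 0, its branch prediction result becomes available after `latency` cycles, and the thread owns a fetch slot only in cycles that are multiples of `num_threads`. The function name and this timing model are assumptions made for illustration.

```python
def effective_prediction_latency(latency, num_threads):
    """Cycles until the prediction result can actually be used for a fetch.

    Under pure round-robin scheduling, the result is used at the thread's
    first fetch slot at or after `latency` cycles, so the latency is
    rounded up to the next multiple of `num_threads`.
    """
    if latency <= 0 or num_threads <= 0:
        raise ValueError("latency and num_threads must be positive")
    slots_needed = -(-latency // num_threads)  # ceiling division
    return slots_needed * num_threads


# A 3-cycle prediction latency with 1, 2, and 3 hardware threads.
single_thread = effective_prediction_latency(3, 1)
two_threads = effective_prediction_latency(3, 2)
three_threads = effective_prediction_latency(3, 3)
```

In this model, a 3-cycle latency stays at 3 cycles with one or three threads, but is extended to 4 cycles with two threads, because 3 is not a multiple of 2.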
Problems of throughput will be described. A processor core that schedules the threads by the conventional system, without forecasting the branch direction, cannot conceal all of the branch prediction latency by hardware multithreading if the number of cycles from the start of the instruction fetch to the acquisition of the branch prediction result (the branch prediction latency) is greater than the number of threads handled by the hardware. In that case, the branch prediction result obtained by the dynamic branch predictor cannot be used immediately for the instruction fetch, and the pipeline is stalled. Otherwise, low-latency but inaccurate static branch prediction needs to be used instead of the dynamic branch predictor. The use of inaccurate branch prediction increases the possibility of unnecessary instruction fetch and reduces the instruction fetch throughput. The reduction in the instruction fetch throughput may also reduce the instruction execution efficiency of the entire processor. Consider a multithread processor that, in case the completion of the dynamic branch prediction is late, uses static branch prediction control to fetch the subsequent instruction at the consecutive address (assuming “not branched”). The throughput of the instruction fetch pipeline is wasted if the actual execution result of the branch instruction is “branched” when the prediction indicates “not branched”. Moreover, if the branch direction that is eventually predicted is “branched”, the subsequent instruction at the next address is not executed even though it has been fetched. The pipeline processing of the subsequent instruction is canceled so that the predicted branch target instruction can be fetched instead, and the canceled fetch is always wasted. As a result, the throughput performance of the multithread processor is reduced, and power is wasted.
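The condition under which the latency is not concealed can be sketched with the same simplified round-robin model: a thread fetches in cycle 0, owns a fetch slot every `num_threads` cycles, and obtains its dynamic prediction result after `latency` cycles. The function names and the model are hypothetical. Each fetch slot that arrives before the result is ready must either stall or fetch the sequential address under the static “not branched” assumption.

```python
def fetch_slots_before_prediction(latency, num_threads):
    """Fetch slots of the same thread that arrive before the result is ready.

    Returns 0 when `latency` <= `num_threads`, that is, when hardware
    multithreading fully conceals the branch prediction latency.
    """
    if latency <= 0 or num_threads <= 0:
        raise ValueError("latency and num_threads must be positive")
    return -(-latency // num_threads) - 1  # ceil(latency / num_threads) - 1


def wasted_sequential_fetches(latency, num_threads, predicted_taken):
    """Sequential fetches canceled once the dynamic prediction arrives.

    Assumes every early slot fetches the sequential address; those fetches
    are canceled only if the eventual prediction is "branched".
    """
    slots = fetch_slots_before_prediction(latency, num_threads)
    return slots if predicted_taken else 0


# A 5-cycle prediction latency with only 2 hardware threads.
early_slots = fetch_slots_before_prediction(5, 2)
wasted_if_taken = wasted_sequential_fetches(5, 2, predicted_taken=True)
wasted_if_not_taken = wasted_sequential_fetches(5, 2, predicted_taken=False)
```

In this model, a 5-cycle latency with two threads leaves two fetch slots unconcealed, and both sequential fetches are canceled when the prediction turns out to be “branched”.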
The processor usually includes storage resources, such as a cache, a TLB (Translation Lookaside Buffer), and a branch history table, the content of which is stored based on the instruction fetch history. The content of the storage resources may be contaminated by inaccurate instruction fetch, and the performance of the processor may be reduced.