Higher speed processing is demanded of a CPU (Central Processing Unit), and to meet this demand the processing of a CPU has been improved using various technologies. The methods used for this purpose are pipeline processing, a superscalar system which performs parallel processing, and an out-of-order execution system which executes with priority those instructions whose input data is complete, rather than executing instructions in the sequence assigned in the program.
The out-of-order execution system is a technology to improve the performance of a CPU by executing a subsequent instruction first when the data required for a preceding instruction is not yet ready but the data required for the subsequent instruction is ready (e.g. see Patent Document 1).
For example, in the case of processing instructions in the sequence written in a program, if a preceding instruction processing 1 is an instruction involving memory access, and a subsequent instruction processing 2 is an instruction which does not involve memory access, then the instruction processing 2 is executed in parallel with the memory access of the instruction processing 1, and the instruction processing 1 is executed after the instruction processing 2.
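The selection behavior described above can be illustrated with a minimal sketch (hypothetical model, not the mechanism of the patented apparatus; the dictionary fields and function name are invented for illustration): among queued instructions, the oldest one whose input data is complete is executed first, regardless of program order.

```python
# Hypothetical sketch of out-of-order selection: instructions whose
# input data is complete are executed before older instructions that
# are still waiting for data (e.g. an outstanding memory access).
instructions = [
    {"name": "instruction 1", "data_ready": False},  # waiting on memory access
    {"name": "instruction 2", "data_ready": True},   # operand data complete
]

def select_next(insts):
    # Prefer the oldest instruction whose data is ready.
    for inst in insts:
        if inst["data_ready"]:
            return inst["name"]
    return None  # nothing is ready to execute this cycle

print(select_next(instructions))  # instruction 2 runs before instruction 1
```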
Further, a multi-thread system, which improves the processing of a CPU by allowing not a single program but a plurality of programs to run, has been proposed (e.g. see Patent Document 2).
In this multi-thread system, which allows a plurality of programs to run, a plurality of sets of programmable resources are provided for a CPU, so that the system is equivalent to a plurality of CPUs operating under software emulation; therefore a plurality of programs can be executed.
One example of this multi-thread system is a VMT (Vertical Multi-Threading) system. According to this system, only one program can run at a time, but programs are switched when a long data wait is generated or when a predetermined interval of time elapses. As for the circuit amount of a VMT system, programmable resources are provided for the number of programs, but the amount of circuitry to be added is small because only one program runs at a time, so the system is easily implemented.
Another example of a multi-thread system is a simultaneous multi-thread (SMT) system, which allows a plurality of programs to run simultaneously. Since a plurality of programs run simultaneously, circuit control becomes more difficult and resources increase compared with the case of running a single program, but the circuits can be used efficiently since a plurality of programs run at the same time.
Control of the reservation station, which handles out-of-order execution, allows a function to be executed with priority from an entry which is ready for function execution.
When functions are executed by pipeline processing and instructions of types requiring different times for function execution are executed, the reservation station controls the execution of the entries so that the timings of outputting the respective function execution results do not overlap.
FIG. 15 is a time chart depicting the entry execution control of a reservation station for floating point. In the case of floating point computing, an execution result of the pipeline processing is stored in a result register, and at this time the reservation station selects the entry to be executed so that the timing of storing in the result register does not overlap with the timing of storing another execution result.
FIG. 15 depicts a control example for a subsequent instruction when the entry to be executed by the reservation station is an entry which requires 4 cycles for execution (precedent instruction) and the subsequent instruction is an entry which requires 2 cycles for execution.
In FIG. 15, T1 to T7 are cycles, P is a processing to select an entry to be executed from the reservation station, B is a processing to read the operand data required for executing a function, X is a processing of function execution, in the last cycle of which the execution result is stored in the result register, and U is a processing to store the function execution result in the register update buffer.
The precedent instruction requires 4 cycles, X1, X2, X3 and X4, for executing its function, and the subsequent instruction requires 2 cycles, X1 and X2. When the reservation station selects the precedent instruction in cycle T1, the U processing does not overlap if the subsequent instruction requiring 2 cycles is executed at timing T2, so it can be executed there. At timing T3, however, if the subsequent instruction requiring 2 cycles were executed, its timing to store the execution result (U processing) would become T7, the same as that of the precedent instruction requiring 4 cycles, therefore the subsequent instruction cannot be executed here. At timing T4, the subsequent instruction requiring 2 cycles can be executed.
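The timing relationship above can be sketched as a small calculation (an illustrative model, not part of the patented apparatus; it assumes the P, B and U processings each take one cycle, which matches the FIG. 15 description, and the function names are hypothetical):

```python
def u_cycle(select_cycle, exec_cycles):
    """Cycle in which the U processing occurs for an entry selected
    (P) at select_cycle: P, then B, then exec_cycles of X, then U."""
    return select_cycle + 1 + exec_cycles + 1

def can_select(select_cycle, exec_cycles, reserved_u_cycles):
    """An entry may be selected only if its U timing does not collide
    with the U timing of a previously selected entry."""
    return u_cycle(select_cycle, exec_cycles) not in reserved_u_cycles

# Precedent 4-cycle instruction selected in T1: its U occurs in T7.
reserved = {u_cycle(1, 4)}          # {7}
print(can_select(2, 2, reserved))   # True  (U in T6, no overlap)
print(can_select(3, 2, reserved))   # False (U in T7 collides)
print(can_select(4, 2, reserved))   # True  (U in T8, no overlap)
```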
FIG. 16 depicts an example of executing entries operating in a single thread by the selection operation of the reservation station under this pipeline control. The meaning of P, B, X and U in FIG. 16 is the same as in FIG. 15.
FIG. 16 is a time chart for the case when an entry requiring 4 cycles for execution is continuously selected by the reservation station, and in this state an entry requiring 2 cycles for execution is decoded by the instruction decoder, after which entries requiring 4 cycles for execution continue to be decoded by the instruction decoder.
The reservation station issues (executes) entries sequentially as an entry becomes ready for execution. When there are a plurality of entries that can be executed at the same time, the entries are selected and executed in the decoded sequence.
Therefore even if an entry is ready for function execution, it may not become an executable entry, depending on the result output timing of a preceding entry being executed.
When such a state continues for a long time, the entry cannot be executed from the reservation station at all. In the case of FIG. 16, even if the entry requiring 2 cycles for execution becomes ready and the reservation station attempts to execute it, execution is impossible, since the timing of storing its execution result in the result register would be the same as that of a precedent instruction requiring 4 cycles.
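The starvation described above can be reproduced with a small simulation (a hypothetical model under the same one-cycle P/B/U assumptions used for FIG. 15; the entry names and the one-selection-per-cycle rule are illustrative). Entries are considered in decoded order, and an entry is skipped whenever its result timing would collide with an already reserved one:

```python
def simulate(entries, cycles):
    """entries: list of (name, exec_cycles) in decoded order, oldest first.
    Returns a dict mapping each entry name to the cycle it was selected in."""
    reserved_u = set()   # result-output cycles already claimed
    selected = {}
    pending = list(entries)
    for t in range(1, cycles + 1):
        for e in pending:
            name, n = e
            u = t + 1 + n + 1          # P, B, n cycles of X, then U
            if u not in reserved_u:    # result timing must not collide
                reserved_u.add(u)
                selected[name] = t
                pending.remove(e)
                break                  # at most one selection per cycle
    return selected

# A 2-cycle entry decoded behind a stream of 4-cycle entries: younger
# 4-cycle entries bypass it, and it is selected only much later.
entries = [("4cyc-a", 4), ("4cyc-b", 4), ("2cyc", 2), ("4cyc-c", 4)]
print(simulate(entries, 6))
```

In this run the 2-cycle entry is blocked in cycles 3, 4 and 5 because its U timing collides with one of the 4-cycle entries, and it is finally selected in cycle 6, after the younger entry 4cyc-c has already been selected in cycle 3.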
In the case of a single thread, if a predetermined number of instructions are decoded by the instruction decoder after the state where the entry cannot be issued from the reservation station is generated, the entries of the resource having the instruction completion control function become FULL.
In other words, the instruction cannot be completed because its entry cannot be issued from the reservation station. Subsequent instructions can be executed from the reservation station, but since instructions are completed in program sequence, they cannot be completed either.
As a result, the entries of the resource having the function to control completion of instructions become FULL, and instructions cannot be decoded by the instruction decoder (instruction decoder stopping state). Since instructions are no longer decoded and no new entry is created in the reservation station, the entry which could not be executed (the entry requiring 2 cycles in FIG. 16) can be executed in cycle T5, for example, and the instruction can be completed.
    Patent Document 1: Japanese Patent Application Laid-Open No. 2007-87108
    Patent Document 2: Published Japanese Translation of PCT application No. 2006-502504 (WO 2004/034209)
In the case of a simultaneous multi-thread system, on the other hand, when the entries of the reservation station are shared by threads, an entry whose result output timing is not the same as that of a precedent entry is selected, from among the entries ready for function execution, and executed from the reservation station as an executable entry, regardless of the thread to which the entry belongs.
In this simultaneous multi-thread system as well, just as in the single thread system, even if an entry is ready for function execution, it may not become an executable entry, depending on the result output timing of a preceding entry being executed. If such a state continues for a long time, executing the entry from the reservation station becomes impossible.
FIG. 17 depicts an example in which entries requiring 4 cycles for execution are executed continuously from the reservation station and instructions requiring 4 cycles for execution are decoded by the instruction decoder in thread 0, in a state where threads 0 and 1 are operating in simultaneous multi-threading.
In this state, if an entry requiring 2 cycles for execution is decoded by the instruction decoder in thread 1, and then entries requiring 4 cycles for execution continue to be decoded in thread 0, the entry requiring 2 cycles for execution in thread 1 cannot be executed from the reservation station even if this is attempted, since the timing of storing its result in the result register becomes the same as that of a precedent instruction.
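Under the same hypothetical cycle model as above (one cycle each for P, B and U; an illustration, not the patented control itself), the permanence of this blocking follows from simple arithmetic: a 4-cycle entry of thread 0 selected at cycle t stores its result at t + 6, and a 2-cycle entry of thread 1 selected at cycle t stores its result at t + 4, which is exactly the slot claimed by the 4-cycle entry selected two cycles earlier.

```python
# Hypothetical cycle model: an entry selected at cycle t with n execution
# cycles stores its result at cycle t + n + 2 (P, B, n cycles of X, then U).
def result_cycle(t, n):
    return t + n + 2

# Thread 0: a 4-cycle entry is selected every cycle starting at cycle 1.
thread0_results = {result_cycle(t, 4) for t in range(1, 20)}

# Thread 1: the 2-cycle entry collides at every candidate cycle,
# so it can never be selected while the thread 0 stream continues.
blocked = [result_cycle(t, 2) in thread0_results for t in range(3, 10)]
print(all(blocked))  # True
```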
In the case of the simultaneous multi-thread system, even if an entry which cannot be executed from the reservation station is generated, the other thread can keep operating without its resources entering the FULL state, and the instruction decoder does not stop, unlike the case of the single thread system.
In other words, in the case of the simultaneous multi-thread system, instructions in thread 0 can be completed after they are executed, so the instruction decoder can continue to decode instructions in thread 0. Hence thread 0 can operate continuously without stopping.
An entry in thread 1, however, cannot be executed from the reservation station, so its instruction cannot be completed, and the device enters a hang state.
In other words, in the state where execution from the reservation station is impossible, a state where an instruction cannot be completed for a predetermined period (hang state) is detected as an abnormal state, and the CPU stops operation.
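The detection described above can be sketched as a simple watchdog (an illustrative sketch only; the class name, the threshold, and the per-cycle interface are hypothetical, not the detection circuit of an actual CPU): if no instruction completes for a predetermined number of cycles, an abnormal state is reported.

```python
class HangDetector:
    """Hypothetical sketch: if no instruction of a thread completes
    for a predetermined number of cycles, report an abnormal state."""
    def __init__(self, limit):
        self.limit = limit       # predetermined period, in cycles
        self.idle_cycles = 0     # cycles since the last completion

    def tick(self, completed):
        # Called once per cycle; completed is True if an instruction
        # of the monitored thread completed in this cycle.
        if completed:
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
        return self.idle_cycles >= self.limit  # True -> hang detected

det = HangDetector(limit=3)
print([det.tick(c) for c in [True, False, False, False]])
# [False, False, False, True]
```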