When a processor executes a plurality of threads, the conventional practice has been to use a single thread processor, which sequentially executes the instructions of one thread in each clock cycle. This single thread processor sends instructions from a primary instruction cache memory to an instruction decoder. Further, it registers all instructions decoded by the instruction decoder in a commit stack entry unit (usually abbreviated as “CSE”) and simultaneously registers them at reservation stations (usually abbreviated as “RS”) for controlling out-of-order execution. In the buffer cycle, it reads out from a register the instructions determined to be executable in the priority cycle of the RS's, loads them into the arithmetic logic units, and executes the operations in the operation execution cycle.
The results of execution of such operations are stored in the register update cycle in an update buffer, where instruction end (“commit”) processing is awaited. Commit processing is performed in order upon receiving reports of the end of execution of operations, the end of transfer of data from the primary data cache memory, the end of branch judgment from the branch prediction mechanism, etc. Further, in the register write cycle, the processor writes these results from the update buffer into the register and updates the program counter (usually abbreviated as “PC”) and the next program counter (NEXT PC). A single thread processor is usually provided with a performance analysis (usually abbreviated as “PA”) circuit having the function of dynamically analyzing the number of instruction executions, the state of occurrence of other events, and the frequency of usage of resources. In this performance analysis circuit, software selects the types of events sent from the parts of the processor, and the circuit counts and stores the selected events. The stored events can be read out by software after the end of analysis, and the combination of events is used for evaluation of the performance of the processor. A conventional single thread processor registers the instructions of one thread in the CSE, registers every clock cycle, in a commit scope register, the commit candidates of the one thread selected by a pointer selection circuit representing a head entry of the CSE, and performs commit processing.
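The select-then-count behavior of the performance analysis circuit described above can be illustrated by the following minimal sketch. It is a software model only, not the circuit itself; the class name PerformanceCounter, its methods, and the sample event stream are hypothetical and are introduced purely for illustration.

```python
from collections import Counter


class PerformanceCounter:
    """Hypothetical software model of a PA circuit (illustration only).

    Software selects the event types of interest; only those are counted.
    """

    def __init__(self, selected_events):
        self.selected = set(selected_events)
        self.counts = Counter()

    def observe(self, events):
        """Record the events reported by the processor in one clock cycle."""
        for event in events:
            if event in self.selected:
                self.counts[event] += 1

    def read(self):
        """Read out the stored counts after the end of analysis."""
        return dict(self.counts)


# Assumed sample stream: one list of reported events per clock cycle.
stream = [
    ["4 end-op"],
    ["0 end-op", "EU comp-wait"],
    ["2 end-op"],
    ["0 end-op", "CSE empty"],
]

pa = PerformanceCounter({"4 end-op", "0 end-op"})
for cycle_events in stream:
    pa.observe(cycle_events)
print(pa.read())
```

Events whose type was not selected, such as "2 end-op" in this sample, are simply ignored by the model.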
As one indicator of performance of a processor, the value of the CPI (Cycles Per Instruction), representing the average number of clock cycles required for completion of one instruction of a program, may be mentioned. This value of CPI is found by dividing the number of cycles by the number of executed instructions. If viewing the value of CPI from the perspective of the commit processing, when a number of instructions can be ended in the same clock cycle, for example, when four instructions can be simultaneously committed, the CPI becomes the result of dividing the number of cycles measured for each commit event, that is, 0 end-op (end of zero operations), 1 end-op (end of one operation), 2 end-op (end of two operations), 3 end-op (end of three operations), and 4 end-op (end of four operations), by the number of executed instructions and cumulatively adding the values. In particular, the case of 0 end-op indicates that the commit processing of the head instruction (usually abbreviated as “TOQ” (Top of Queue)) in in-order commit processing was not possible. In this case, commit processing of the next instruction is also not possible, so analysis of 0 end-op and analysis of the factors of the same, that is, EU comp-wait (waiting for completion of operation), BR comp-wait (waiting for completion of branching), FCH comp-wait (waiting for completion of forwarding of data from cache memory), CSE empty (state of nothing registered in the commit stack entry unit), etc. become important. The factors of the CPI and the factors of 0 end-op can all be obtained as events from the commit scope register. Further, for the factors of 0 end-op, the events obtained are always limited to one factor for each clock cycle.
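The decomposition of the CPI by commit event described above can be worked through as in the following minimal sketch. The function name cpi_breakdown and all cycle counts are assumed values chosen for illustration, not measurements. The number of executed instructions is taken as the sum of k times the number of k end-op cycles.

```python
def cpi_breakdown(cycles_by_end_op):
    """Split the CPI into one contribution per commit event.

    cycles_by_end_op maps k (instructions committed in a cycle, 0 to 4)
    to the number of clock cycles in which exactly k instructions committed.
    """
    instructions = sum(k * cycles for k, cycles in cycles_by_end_op.items())
    if instructions == 0:
        raise ValueError("no instructions were committed")
    contributions = {
        k: cycles / instructions for k, cycles in cycles_by_end_op.items()
    }
    return contributions, sum(contributions.values())


# Assumed sample counts: 95 cycles in total, 100 instructions committed.
cycles_by_end_op = {0: 50, 1: 20, 2: 10, 3: 0, 4: 15}

# Assumed factors of 0 end-op. Exactly one factor is reported per cycle,
# so the factor counts add up to the number of 0 end-op cycles.
zero_end_op_factors = {
    "EU comp-wait": 30,
    "BR comp-wait": 5,
    "FCH comp-wait": 10,
    "CSE empty": 5,
}

contributions, cpi = cpi_breakdown(cycles_by_end_op)
for k, value in contributions.items():
    print(f"{k} end-op: {value:.2f}")
print(f"CPI: {cpi:.2f}")
```

With these assumed counts, the 0 end-op cycles account for more than half of the CPI, which is why the breakdown of 0 end-op into its factors is the part of the analysis that matters most.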
With a single thread processor, there is only one thread, so the factors of the CPI could be easily analyzed by analyzing the events of the one thread sent out from the commit scope register by the performance analysis circuit and cumulatively adding the factors.
In this regard, the technique of “multithreading” is generally known as a way to improve the efficiency of use of the resources required for execution of instructions by the processor, such as the cache memory, pipeline, and arithmetic logic units, and to draw out the maximum performance of the processor. Multithreading includes “simultaneous multithreading (SMT)”, which has the function of simultaneously executing a plurality of threads. In this simultaneous multithreading, two or more threads are simultaneously executed and the instructions of the threads are registered in the commit stack entry unit. Commit processing is performed by copying the entries of the commit candidates of the threads selected by the thread selection circuit, for example alternately each clock cycle, into a commit scope register which, as in the single thread case, is limited to one or more threads. Performance is analyzed by the performance analysis circuit of each thread.
In this simultaneous multithreading, in the same way as with the above-mentioned single thread method, it is desirable to analyze, for each thread, the factors of the CPI, that is, 0 end-op (end of zero operations), 1 end-op (end of one operation), 2 end-op (end of two operations), 3 end-op (end of three operations), and 4 end-op (end of four operations), and the factors of the 0 end-op. The commit stack entry unit has a plurality of threads registered in it, but the commit scope register has registered in it only the commit candidates of the part of the threads selected by the thread selection circuit for each clock cycle. Accordingly, commit processing is performed only for those selected threads. Further, only the events of the selected threads are sent from the commit scope register to the performance analysis circuits, so the events of the threads not selected are not analyzed. In simultaneous multithreading as well, to analyze the CPI accurately for each thread in the same way as with the single thread method, it is necessary to simultaneously analyze the events of all of the threads (first problem).
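The first problem can be illustrated by the following minimal sketch, which compares the events counted when only the selected thread reports with the events counted when all threads report. It is a simplified model under stated assumptions: two threads, strictly alternating selection each clock cycle, and a hypothetical trace in which a thread not selected in a cycle commits nothing and so would report 0 end-op. The function name analyze and the trace values are illustrative only.

```python
from collections import Counter


def analyze(per_cycle_events, num_threads, all_threads=False):
    """Count commit events per thread.

    per_cycle_events holds, for each clock cycle, one event per thread.
    With all_threads=False only the thread selected in that cycle is
    counted, modeling a commit scope register limited to that thread.
    """
    counts = [Counter() for _ in range(num_threads)]
    for cycle, events in enumerate(per_cycle_events):
        selected = cycle % num_threads  # assumed: alternate selection
        for thread, event in enumerate(events):
            if all_threads or thread == selected:
                counts[thread][event] += 1
    return counts


# Hypothetical trace: (event of thread 0, event of thread 1) per cycle.
trace = [
    ("4 end-op", "0 end-op"),  # cycle 0: thread 0 selected
    ("0 end-op", "0 end-op"),  # cycle 1: thread 1 selected
    ("2 end-op", "0 end-op"),  # cycle 2: thread 0 selected
    ("0 end-op", "1 end-op"),  # cycle 3: thread 1 selected
]

selected_only = analyze(trace, num_threads=2)
everything = analyze(trace, num_threads=2, all_threads=True)

for thread in range(2):
    print(f"thread {thread} selected only:", dict(selected_only[thread]))
    print(f"thread {thread} all threads:  ", dict(everything[thread]))
```

In this trace, thread 0 spends two cycles at 0 end-op, yet none of them is counted when only the selected thread reports, so the cycle count entering the CPI of thread 0 is understated.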
Further, in simultaneous multithreading, it is also desirable to analyze the CPI of a core as a whole, that is, the CPI when combining the plurality of threads of a core comprised of a plurality of threads. In this simultaneous multithreading, by executing a plurality of threads, it becomes possible to improve the efficiency of use of the core over the case of execution of only a single thread. As one example, in a clock cycle in which all threads have no instruction commits, the processing as a core also has no instruction commits, but in a clock cycle in which one thread has no instruction commits, if the other threads have, for example, four simultaneous instruction commits, the processing as a core has four simultaneous instruction commits. Here, in a performance analysis circuit for analysis of the CPI of a core comprised of a plurality of threads, the 1 end-op (end of one operation), 2 end-op (end of two operations), 3 end-op (end of three operations), and 4 end-op (end of four operations) are independent for each thread and so can be accurately analyzed, but 0 end-op (end of zero operations) ends up being detected even for a thread not registered in the commit scope register. Due to this, with this method of analysis, it is not possible to accurately analyze the CPI of the processing of a combination of a plurality of threads of a core. Accordingly, in simultaneous multithreading, to accurately analyze the CPI of all threads even for a core comprised of a plurality of threads, it is necessary to accurately analyze the events of 0 end-op (end of zero operations) (second problem).
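The second problem can be illustrated by the following minimal sketch, which contrasts the number of cycles in which the core as a whole commits nothing with a naive sum of the 0 end-op events of the individual threads. The function names and the per-cycle commit counts are hypothetical values chosen for illustration; the rule that the core has 0 end-op only when every thread commits nothing follows the example given above.

```python
def core_zero_end_op_cycles(commits_per_cycle):
    """Cycles in which the core commits nothing: every thread has 0 commits."""
    return sum(
        1 for commits in commits_per_cycle if all(c == 0 for c in commits)
    )


def naive_zero_end_op_events(commits_per_cycle):
    """Naive count: one 0 end-op event per thread per cycle without commits."""
    return sum(
        1 for commits in commits_per_cycle for c in commits if c == 0
    )


# Hypothetical trace: (commits of thread 0, commits of thread 1) per cycle.
commits_per_cycle = [(4, 0), (0, 0), (0, 2), (0, 1)]

core_zero = core_zero_end_op_cycles(commits_per_cycle)
naive_zero = naive_zero_end_op_events(commits_per_cycle)

cycles = len(commits_per_cycle)
instructions = sum(sum(commits) for commits in commits_per_cycle)
core_cpi = cycles / instructions

print("core 0 end-op cycles:", core_zero)
print("naive per-thread 0 end-op events:", naive_zero)
print(f"core CPI: {core_cpi:.3f}")
```

Here the core is idle in only one of the four cycles, while the naive per-thread count reports five 0 end-op events, more than the number of cycles that elapsed.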
Here, for reference, the following Patent Literature 1 and Patent Literature 2 relating to conventional multithreading are presented as prior art literature.
Patent Literature 1 discloses a performance monitoring system supporting independent monitoring of performance for each of a plurality of parallel threads supported by a processor.
However, in Patent Literature 1, for example, two parallel threads are executed by VMT (Vertical Multi-Threading), in which the active states and inactive states of the two parallel threads are switched at different timings. Due to this, the two parallel threads are not simultaneously executed as they are in simultaneous multithreading, so the above problems never occur.
Patent Literature 2 discloses a device and method for changing the selection of instruction threads when selecting instruction threads in a multithread processor. However, Patent Literature 2 does not allude at all to the configuration and operation of a simultaneous multithreading type processor.
Therefore, neither Patent Literature 1 nor Patent Literature 2 can deal with the problems arising in conventional simultaneous multithreading. Patent Literature 1 is Japanese Laid-open Patent Publication No. 10-275100. Patent Literature 2 is Japanese Laid-open Patent Publication No. 2004-326765.
Note that the configuration of a conventional single thread processor and the problems in simultaneous multithreading will be explained in detail later with reference to the drawings.