In recent years, a method of improving the performance of a processor by increasing the number of its cores has become widespread, and the number of cores included in one processor is increasing (multi-coring). In addition, the adoption of technologies such as simultaneous multi-threading, in which plural threads are simultaneously executable in one core, has advanced (multi-threading). As the multi-coring and the multi-threading advance, the number of threads simultaneously executed on the processor tends to increase.
In a processor simultaneously executing multiple threads, program execution performance is lowered if cache misses and TLB (Translation Look-aside Buffer) misses occur frequently. For example, when the number of memory areas (pages) simultaneously accessed by execution of a program is equal to or less than the number of TLB entries, no TLB miss occurs while the memory areas are accessed. However, when the number of simultaneously accessed memory areas (pages) exceeds the number of TLB entries, TLB misses occur frequently (TLB thrashing).
In a processor simultaneously executing multiple threads, the cache memory and the TLB are shared by plural threads. Accordingly, when the plural threads are used in a data-parallel mode, the number of occurrences of cache misses and TLB misses may increase unless the data is properly assigned to each thread in consideration of the configurations of the cache memory and the TLB. For example, when the processes of a program illustrated in FIG. 10A are executed in parallel by four threads, thread 1 to thread 4, the part accessed by each thread within a certain period of time is dispersed under a block assignment and concentrated under a cyclic assignment, as illustrated by the arrows in FIG. 10B. Namely, the occurrence frequency of cache misses and TLB misses increases unless the data is assigned to thread 1 to thread 4 by the cyclic assignment.
A processor simultaneously executing multiple threads applies not SIMD (Single Instruction Multiple Data) but SPMD (Single Program Multiple Data) as its execution model. In SIMD, it is necessary to frequently synchronize the threads, and the synchronization cost grows as the number of threads increases, which lowers the execution performance of the program. In SPMD, on the other hand, it is not necessary to synchronize the threads. Accordingly, in SPMD, it is possible to reduce the idle time (a time when nothing is executed) of each core of the processor by executing whichever thread is ready to execute, regardless of the data sequence.
For a computer executing plural threads in parallel, a technology is proposed in which, when a reference thread whose data is referred to by a current thread has not reached a synchronization point at the time the current thread reaches the synchronization point, the progress of the current thread is controlled in accordance with the waiting time of the current thread until the reference thread reaches the synchronization point and with the magnitude of the quality difference that arises when the current thread executes its process without referring to the data at the synchronization point of the reference thread (for example, refer to Patent Document 1).
[Patent Document 1] International Publication Pamphlet No. WO2009/090964
[Patent Document 2] Japanese Laid-open Patent Publication No. 2003-108392
In a processor simultaneously executing multiple threads, SPMD is applied as the execution model, and therefore some threads may progress quickly while others progress slowly. When the progress difference between threads becomes large, the part of the memory accessed by each thread within a certain period of time is dispersed even if the data is assigned to each thread by a proper assignment such as the cyclic assignment, and the number of occurrences of cache misses and TLB misses may increase. When the number of occurrences of cache misses and TLB misses increases, the execution performance of programs on the processor is lowered.
If synchronization is taken among all threads so that the progress difference among the threads does not grow, the occurrences of cache misses and TLB misses caused by the progress difference can be suppressed; however, the synchronization cost becomes large when synchronization is taken among all threads. Moreover, in a processor simultaneously executing multiple threads, state transitions occur in a software thread as exemplified in FIG. 11. A thread in a waiting state transits to an active state when a vacancy occurs in an active pool. A thread in the active state is executed while its assignment to a hardware thread is frequently switched, and transits to a done state when the execution of its process completes. As stated above, in a processor simultaneously executing multiple threads, it is often the case that not all threads are simultaneously in the active state, and the execution of the program may therefore stall; consequently, it is impossible to take synchronization among all threads.