A technology known as “multi-core” integrates a plurality of processor cores into a single processing unit. In particular, a processing unit having a large number of processor cores is also referred to as a many-core accelerator. Patent Document 1 describes an example of a virtual architecture and an instruction set for parallel computing on such multi-core or many-core accelerators. In the virtual architecture, parallel processing is executed on the basis of CTAs (Cooperative Thread Arrays). A CTA is a group of n threads which concurrently execute the same program. A plurality of CTAs may operate in parallel. A group of CTAs operating in parallel with each other is referred to as a grid. The inclusion relationships among grids, CTAs and threads are shown in FIG. 23. An ID is assigned to each of the grids, CTAs and threads. In such a virtual architecture, by using the IDs, different grids, different CTAs and different threads can process different data. The thread IDs, CTA IDs and grid IDs may be defined in a multidimensional manner. In FIG. 23, the thread IDs and CTA IDs are each defined in two dimensions.
For example, when processing one-dimensional array data, the CTA IDs and thread IDs are defined in one dimension. In that case, as shown in FIG. 24 (b), the position data_idx of the data to be processed by each thread can be calculated from the CTA ID (cta_id), the total number of threads included in the CTA (cta_size) and the thread ID (thread_id).
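The one-dimensional calculation can be sketched as follows. The formula is an assumption based on the usual convention for such architectures (CTA offset plus thread offset); FIG. 24 (b) itself is not reproduced here, and the function name data_index_1d is hypothetical.

```python
def data_index_1d(cta_id: int, cta_size: int, thread_id: int) -> int:
    """Position of the element handled by one thread in a 1-D array.

    Assumed convention: each CTA owns a contiguous run of cta_size
    elements, and thread_id selects one element inside that run.
    """
    if not 0 <= thread_id < cta_size:
        raise ValueError("thread_id must lie in [0, cta_size)")
    return cta_id * cta_size + thread_id


# Two CTAs of four threads each; every thread computes its own data_idx.
indices = [data_index_1d(c, 4, t) for c in range(2) for t in range(4)]
print(indices)
```

Under this convention, no two threads compute the same position, so each element of the array is processed by exactly one thread.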
When processing two-dimensional matrix data, the CTA IDs and thread IDs are defined in two dimensions. In that case, as shown in FIG. 25 (b), the x and y coordinates of the position data_idx of the data to be processed by each thread can be calculated from the x and y values, respectively, of the CTA ID, the total number of threads included in the CTA and the thread ID.
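A minimal sketch of the two-dimensional case follows. It assumes that the CTA size is given per axis and that each coordinate is computed independently with the same offset rule as in one dimension; the function name data_index_2d and the tuple layout are hypothetical and are not taken from FIG. 25 (b).

```python
from typing import Tuple

Pair = Tuple[int, int]


def data_index_2d(cta_id: Pair, cta_size: Pair, thread_id: Pair) -> Pair:
    """(x, y) position of the element handled by one thread.

    Assumption: cta_size holds the number of threads per CTA along x
    and along y, and each axis is treated independently.
    """
    cx, cy = cta_id
    sx, sy = cta_size
    tx, ty = thread_id
    if not (0 <= tx < sx and 0 <= ty < sy):
        raise ValueError("thread_id must lie inside the CTA extent")
    return cx * sx + tx, cy * sy + ty


# A 2 x 2 grid of CTAs, each holding 2 x 2 threads, covers a 4 x 4 matrix.
positions = {
    data_index_2d((cx, cy), (2, 2), (tx, ty))
    for cx in range(2) for cy in range(2)
    for tx in range(2) for ty in range(2)
}
print(len(positions))
```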
In the virtual architecture, each thread can also share data with other threads via a memory. A one-to-one correspondence between logical threads and physical processor cores is not necessarily required, and a larger number of threads than processor cores may exist. In the virtual architecture, when a larger number of threads or CTAs than processor cores are generated, only some of the generated threads or CTAs are concurrently executed. Further, although threads included in the same CTA operate in coordination with each other, operations of individual CTAs are independent of each other.
Patent Document 2 describes a technology for hiding memory access latency in multithread processing. In this technology, when a plurality of threads each consist of a mixture of arithmetic operation instructions with low latency (delay time) and memory access instructions with high latency, processing of one thread is swapped for processing of another thread after the former executes a memory access instruction. That is, this technology hides memory access latency by executing operations of another thread while waiting for completion of the memory access of one thread. An example of operation of a device employing this technology is shown in FIG. 26. In the example in FIG. 26, a thread n executes arithmetic operations i to i+2 sequentially. After that, when the thread n executes a memory access (memory load j), this device swaps the thread n for another thread m while waiting for the load to complete. Then, the thread m executes arithmetic operations s to s+1 sequentially. Then, when the thread m executes a memory access (memory load t), this device swaps the thread m for the thread n, which has completed the memory load j. Here, n and m are values of a thread identifier. The values i, s, j and t are positive integers and represent the processing order of the arithmetic operation and memory load instructions within each of the threads. The technology described in Patent Document 2 is particularly effective in a process where a large number of threads can be concurrently executed on the same processor. On the other hand, in a process where the number of concurrently executable threads is small, it often occurs that no other thread is able to execute operations while one thread waits for completion of its memory access and, accordingly, the technology described in Patent Document 2 cannot hide the memory access latency.
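The effect of swapping threads at each memory load can be illustrated with a small simulation. This is a sketch under stated assumptions, not the mechanism of Patent Document 2: a single core, one cycle per instruction, a fixed load latency, and a thread that keeps the core until it issues a load. The function simulate and the instruction names "alu" and "load" are hypothetical.

```python
def simulate(threads, load_latency):
    """Run instruction lists on one core, swapping threads at each load.

    threads: list of instruction lists containing "alu" or "load".
    Returns (total_cycles, idle_cycles). A program ending in "load" is
    not waited for, so the example programs end with "alu".
    """
    pc = [0] * len(threads)        # next instruction of each thread
    ready_at = [0] * len(threads)  # cycle at which each thread may resume
    cycle = idle = 0
    current = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        runnable = [i for i, t in enumerate(threads)
                    if pc[i] < len(t) and ready_at[i] <= cycle]
        if not runnable:           # every unfinished thread awaits memory
            idle += 1
            cycle += 1
            continue
        if current not in runnable:
            current = runnable[0]  # swap to a thread that is able to run
        op = threads[current][pc[current]]
        pc[current] += 1
        cycle += 1
        if op == "load":           # thread is blocked until the data arrives
            ready_at[current] = cycle + load_latency
    return cycle, idle


program = ["alu", "alu", "alu", "load", "alu"]
one_thread = simulate([program], load_latency=4)
two_threads = simulate([program, program], load_latency=4)
print(one_thread, two_threads)
```

With a single thread the core idles for the whole load latency, whereas with two threads part of each wait is overlapped with the arithmetic operations of the other thread. With too few threads, some idle cycles remain, which corresponds to the limitation noted above.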
As one implementation of the virtual architecture described in Patent Document 1, CUDA (Compute Unified Device Architecture) is described in Non-patent Document 3. In CUDA, there is an upper limit to the number of concurrently executable CTAs. Because this restriction is independent of the number of threads included in one CTA, when the number of threads in one CTA is small, the total number of concurrently executed threads also becomes small owing to the upper limit on the number of CTAs. Consequently, the number of threads per processor core becomes small as well. Accordingly, a device employing CUDA cannot hide memory access latency in a process containing only a small number of threads within each CTA.
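The arithmetic behind this limitation can be sketched as follows. The limit of 8 concurrently executable CTAs is a hypothetical value chosen for illustration and is not a figure taken from Non-patent Document 3; the function name concurrent_threads is likewise hypothetical.

```python
def concurrent_threads(total_ctas: int, threads_per_cta: int,
                       max_concurrent_ctas: int) -> int:
    """Number of threads in flight when the CTA count is capped."""
    return min(total_ctas, max_concurrent_ctas) * threads_per_cta


MAX_CTAS = 8  # hypothetical upper limit on concurrently executable CTAs

# The same 1024 threads of work, grouped in two different ways.
small_ctas = concurrent_threads(256, 4, MAX_CTAS)   # 256 CTAs of 4 threads
large_ctas = concurrent_threads(16, 64, MAX_CTAS)   # 16 CTAs of 64 threads
print(small_ctas, large_ctas)
```

Because the cap applies to CTAs rather than to threads, the grouping into small CTAs leaves far fewer threads available to cover a pending memory access.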
Patent Document 1 also describes a device which performs processing using a plurality of CTAs, taking high-definition television image generation as an example. In that case, because the images to be processed are two-dimensional, threads and CTAs are defined in two dimensions, as shown in FIG. 25 (a). Each of the threads processes one pixel. Here, the total number of pixels of a high-definition television image exceeds the number of threads that can be processed in a single CTA. Accordingly, this device divides an image into appropriate areas. Then, as shown in FIG. 25 (a), each of the CTAs processes one of the divided areas. As shown in FIG. 25 (b), each of the threads determines the location (data_idx) from which to read input data and to which to write output data, using its CTA ID and thread ID. Hereafter, each of the processes into which the whole process of an application, such as high-definition television image generation, is divided and which is allocated to a CTA is referred to as a task.
A configuration of a parallel processing device employing the technology described in Patent Document 1 is shown in FIG. 27. This parallel processing device includes a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit) which is a many-core accelerator. When expressed in terms of functional blocks, this parallel processing device comprises an intra-CTA (per-CTA) thread number setting unit 911, a CTA number setting unit 912, a task division unit 913, a CTA control unit 924, a processing task determination unit 925 and a task execution unit 926. Here, the intra-CTA thread number setting unit 911, the CTA number setting unit 912 and the task division unit 913 are implemented by the CPU. The CTA control unit 924, the processing task determination unit 925 and the task execution unit 926 are implemented by the GPU. The intra-CTA thread number setting unit 911 sets the number of threads included in each CTA, which is referred to as an intra-CTA thread number. As this intra-CTA thread number, for example, a value inputted by a user in consideration of the number of threads processable within one CTA is set. The CTA number setting unit 912 sets the total number of CTAs, referred to as a total CTA number, using the intra-CTA thread number. In the case of high-definition television image generation, the total thread number equals the number of pixels and is thus fixed. Therefore, once the intra-CTA thread number is determined, the total CTA number is also determined. The task division unit 913 divides the whole process into tasks in accordance with the intra-CTA thread number, as shown in FIG. 28. The CTA control unit 924 generates threads and CTAs on the basis of the inputted intra-CTA thread number and the calculated total CTA number. The CTA control unit 924 assigns an ID to each of the threads and each of the CTAs and controls their execution.
The processing task determination unit 925 and the task execution unit 926 operate with respect to each of the CTAs individually. The processing task determination unit 925 determines a task to be processed by each CTA on the basis of the intra-CTA thread number and the CTA ID of the CTA. The task execution unit 926 executes the task determined by the processing task determination unit 925.
FIG. 29 shows the operation of such a parallel processing device employing the technology described in Patent Document 1. First, as shown in FIG. 29 (a), the intra-CTA thread number setting unit 911 sets, as the intra-CTA thread number, for example, a value inputted by a user in consideration of the number of threads processable within one CTA (step S801). Next, the task division unit 913 divides the whole process into tasks in accordance with the intra-CTA thread number (step S802). At that time, the task division unit 913 defines the task numbers as one-dimensional values, as shown in FIG. 28. In FIG. 28, k equals the number of tasks in the x-direction. The threads within each CTA are defined in two dimensions. Next, the CTA number setting unit 912 sets the total CTA number in one dimension, using the intra-CTA thread number (step S803). Here, the order of executing the steps S802 and S803 may be reversed. Next, the CTA control unit 924 generates the numbers of CTAs and threads thus set. Then, the CTA control unit 924 gives an ID to each of the CTAs and each of the threads (step S804). Next, the CTA control unit 924 controls execution of each of the CTAs and each of the threads (step S805). A process executed in each of the CTAs under such control by the CTA control unit 924 is shown in FIG. 29 (b). Here, the processing task determination unit 925 first acquires a CTA ID n (step S806). Then, the processing task determination unit 925 calculates the location of the target data to be processed by each thread in the CTA in the n-th task. Next, the task execution unit 926 executes the n-th task in each of the threads (step S807). Here, the steps S801 to S803 are carried out by the CPU. The steps S804 and S805 are carried out by the GPU. The steps S806 and S807 are carried out by the GPU with respect to each CTA.
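The flow of the steps S801 to S807 can be sketched as follows. This is an illustrative model only: the CTAs are run one after another rather than in parallel, the one-dimensional task numbers are assumed to be assigned in row-major order with k tasks per row, and the names plan_tasks, pixel_for_thread, run and kernel are hypothetical.

```python
def plan_tasks(width, height, tile_w, tile_h):
    """Host side (steps S801 to S803): fix the threads per CTA, divide the
    image into tasks, and derive the total CTA number."""
    if width % tile_w or height % tile_h:
        raise ValueError("sketch assumes the image divides evenly into tasks")
    k = width // tile_w                  # number of tasks in the x-direction
    total_ctas = k * (height // tile_h)  # one CTA per task
    return k, total_ctas


def pixel_for_thread(n, k, tile_w, tile_h, tx, ty):
    """Per CTA (steps S806 and S807): CTA ID n selects the n-th task, and
    the thread (tx, ty) locates the pixel it has to process."""
    task_x, task_y = n % k, n // k       # assumed row-major task numbering
    return task_x * tile_w + tx, task_y * tile_h + ty


def run(width, height, tile_w, tile_h, kernel):
    """Execute every CTA and thread; serialized here for illustration."""
    k, total_ctas = plan_tasks(width, height, tile_w, tile_h)
    out = {}
    for n in range(total_ctas):          # steps S804 and S805
        for ty in range(tile_h):
            for tx in range(tile_w):
                x, y = pixel_for_thread(n, k, tile_w, tile_h, tx, ty)
                out[(x, y)] = kernel(x, y)
    return out


# An 8 x 8 image processed by CTAs of 4 x 4 threads (16 threads per CTA).
image = run(8, 8, 4, 4, kernel=lambda x, y: x + y)
print(len(image))
```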
In cases such as the high-definition television image generation process, where the operations on all elements are the same and are executed with the same process flow, the parallel processing device may divide the whole process into tasks of any size. Accordingly, the parallel processing device may set the intra-CTA thread number and the total CTA number to any values. Therefore, even when there is a restriction on the number of concurrently executed CTAs, the parallel processing device can increase the number of concurrently executed threads by increasing the number of threads per CTA, and can thereby hide memory access latency. For example, the parallel processing device may reduce the number of threads per CTA when it is desirable to increase the total CTA number, and may reduce the total CTA number when it is desirable to increase the number of threads per CTA. As an example, consider increasing the total CTA number from that in the case of FIG. 28. In this case, as shown in FIG. 30, the parallel processing device reduces the number of threads per CTA by narrowing the area per task. Here, as a result of narrowing the area per task from one containing 16 pixels to one containing 4 pixels, the number of threads per CTA is decreased from 16 to 4. In this way, the parallel processing device can perform adjustment to increase the total CTA number.
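The trade-off between the number of threads per CTA and the total CTA number can be sketched as follows, assuming one thread per pixel and one CTA per task. The image size of 64 pixels is a hypothetical value; only the task areas of 16 and 4 pixels come from the description above.

```python
def cta_configuration(total_pixels: int, pixels_per_task: int):
    """Return (threads per CTA, total CTA number) for a given task area."""
    threads_per_cta = pixels_per_task                 # one thread per pixel
    total_ctas = -(-total_pixels // pixels_per_task)  # ceiling division
    return threads_per_cta, total_ctas


TOTAL_PIXELS = 64  # hypothetical image size used only for illustration

wide_tasks = cta_configuration(TOTAL_PIXELS, 16)   # 16 pixels per task
narrow_tasks = cta_configuration(TOTAL_PIXELS, 4)  # 4 pixels per task
print(wide_tasks, narrow_tasks)
```

Since the total number of threads is fixed by the number of pixels, narrowing the area per task raises the total CTA number by the same factor by which it lowers the number of threads per CTA.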
Furthermore, the optimum value of the number of concurrently executed CTAs changes with the runtime environment. Accordingly, Non-patent Document 1 describes a method of automatically tuning the total CTA number and the number of threads per CTA in accordance with the runtime environment. The technology described in Non-patent Document 1 sets the intra-CTA thread number to various values, measures the processing time for each value, and then employs the value of the intra-CTA thread number giving the shortest processing time as the final optimum value.
A device configuration of the technology described in Non-patent Document 1 is shown in FIG. 31. The device according to the technology described in Non-patent Document 1 includes an application execution unit 900 which comprises the same functional blocks as those of the parallel processing device shown in FIG. 27, a parameter modification unit 931, an execution time acquisition unit 932 and an optimum parameter selection unit 933. The parameter modification unit 931 outputs several different values of the intra-CTA thread number to the intra-CTA thread number setting unit 911. The execution time acquisition unit 932 measures the time taken to execute an application. The optimum parameter selection unit 933 determines the value of the intra-CTA thread number giving the shortest processing time to be the optimum value.
Operation according to the technology described in Non-patent Document 1 is shown in FIG. 32. If tests on all parameter values, that is, all planned values of the intra-CTA thread number, have not been completed (No at step S1101), the parameter modification unit 931 sets a new value of the intra-CTA thread number (step S1102). Then, the application execution unit 900 executes the application using the set value of the intra-CTA thread number (step S1103). Then, the execution time acquisition unit 932 measures the time taken to execute the application (step S1104). Then, if the time measured by the execution time acquisition unit 932 is shorter than the execution times for the previously tested parameter values (Yes at step S1105), the optimum parameter selection unit 933 updates the optimum parameter (step S1106). This device repeats the processes of the steps S1101 to S1106 until tests on all parameter values are completed.
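The loop of the steps S1101 to S1106 can be sketched as follows. This is a minimal illustration, not the implementation of Non-patent Document 1: the function autotune, the callback run_application and the injectable clock parameter are all hypothetical names introduced here.

```python
import time


def autotune(run_application, candidates, clock=time.perf_counter):
    """Try each intra-CTA thread number and keep the fastest one.

    run_application: callable taking the intra-CTA thread number.
    candidates: the planned values of the intra-CTA thread number.
    Returns (best value, best time); the value is None if no candidate
    was supplied.
    """
    best_value, best_time = None, float("inf")
    for threads_per_cta in candidates:       # steps S1101 and S1102
        start = clock()
        run_application(threads_per_cta)     # step S1103
        elapsed = clock() - start            # step S1104
        if elapsed < best_time:              # step S1105
            best_value, best_time = threads_per_cta, elapsed  # step S1106
    return best_value, best_time


# Usage with a fake application whose cost per setting is known in advance.
now = [0.0]
cost = {32: 5.0, 64: 2.0, 128: 3.0}


def fake_application(threads_per_cta):
    now[0] += cost[threads_per_cta]          # pretend this much time passed


best, best_time = autotune(fake_application, [32, 64, 128],
                           clock=lambda: now[0])
print(best, best_time)
```

Passing the clock as a parameter is a design choice made only so that the sketch can be exercised deterministically; on a real device the measured times vary from run to run.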
One cause of a decrease in the operating rate of each processor core contained in a many-core accelerator is that the total number of threads required for processing an application is small. For example, in the above-mentioned high-definition television image generation process, there may be a case where the number of pixels to be processed is small. In such a case, the parallel processing device described above cannot suppress the decrease in the operating rates of the processor cores however the number of threads per CTA is changed, because the total number of threads never becomes large enough. In this respect, Non-patent Document 2 describes a technology for improving the operating rates of processor cores by merging a plurality of applications, each of which requires a small total number of threads, and thereby executing them in parallel, as shown in FIG. 33. In FIG. 33, this technology divides each application into tasks of an appropriate size in accordance with the runtime environment. Here, it is assumed that, for each of an application A containing 3 tasks and an application B containing 8 tasks, the number of tasks executable in parallel is smaller than that calculated from the processing performance of the many-core accelerator employed. In that case, if the above-described parallel processing device executes the applications A and B separately using the many-core accelerator, the operating rates of the processor cores are lowered because of the low degree of parallelism. Accordingly, the technology described in Non-patent Document 2 executes the applications A and B in parallel with each other and thereby concurrently executes a larger number of tasks than when the applications are executed separately. In this way, this technology can improve the operating rates of processor cores.
A device configuration of the technology described in Non-patent Document 2 is shown in FIG. 34. In FIG. 34, the device according to the technology described in Non-patent Document 2 comprises, in addition to the same functional block configuration as that according to the technology described in Patent Document 1 shown in FIG. 27, an application merging unit 941 and a processing application selection unit 942. The application merging unit 941 is implemented by the CPU. The processing application selection unit 942 is implemented by the GPU with respect to each CTA.
Operation according to the technology described in Non-patent Document 2 is shown in FIG. 35. First, as shown in FIG. 35 (a), the intra-CTA thread number setting unit 911 sets the number of threads per CTA acquired through a user's input or the like (step S801). Then, in accordance with the intra-CTA thread number, the task division unit 913 divides the process of each application into tasks of the CTA size (step S802). Next, the application merging unit 941 merges a plurality of applications to make them executable in parallel with each other (step S903). Then, the CTA number setting unit 912 sets the sum of the numbers of CTAs required for the respective applications as the total number of CTAs required for the applications as a whole (step S904). Next, on the basis of the intra-CTA thread number and total CTA number thus set, the CTA control unit 924 generates threads and CTAs. Then, the CTA control unit 924 gives an ID to each of the threads and each of the CTAs (step S804). Then, the CTA control unit 924 controls execution of each of the CTAs (step S805). A process executed with respect to each of the CTAs under such control by the CTA control unit 924 is shown in FIG. 35 (b). Here, for each of the CTAs, the processing application selection unit 942 first acquires the CTA ID (step S806) and, on the basis of the CTA ID, selects applications to be processed by the CTA (step S907). Then, on the basis of the CTA ID and the like, the processing task determination unit 925 determines which tasks of the selected applications are to be processed by the CTA. Then, the task execution unit 926 executes the tasks determined by the processing task determination unit 925 in each of the corresponding threads (step S908).
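The selection of an application and a task from a CTA ID can be sketched as follows. The sketch assumes that each CTA processes exactly one task of one application and that the merged applications occupy contiguous ranges of CTA IDs; neither assumption is confirmed by Non-patent Document 2, and the names merge_applications and select_task are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def merge_applications(task_counts):
    """Steps S903 and S904: cumulative CTA counts of the merged applications.

    The last entry is the total CTA number for all applications together.
    At least one application is expected.
    """
    return list(accumulate(task_counts))


def select_task(cta_id, ends):
    """Step S907 and task determination: map a CTA ID to
    (application index, task index within that application)."""
    if not 0 <= cta_id < ends[-1]:
        raise ValueError("cta_id is outside the merged CTA range")
    app = bisect_right(ends, cta_id)     # first range ending after cta_id
    start = ends[app - 1] if app else 0  # first CTA ID of that application
    return app, cta_id - start


# Application A (index 0) has 3 tasks and application B (index 1) has 8.
ends = merge_applications([3, 8])
assignments = [select_task(cta_id, ends) for cta_id in range(ends[-1])]
print(ends[-1], assignments)
```

With the applications A and B of the example above, 11 CTAs are generated in total, so that more tasks are available for concurrent execution than when either application is executed alone.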