In some cases, a parallel computing device is used that is able to execute a plurality of threads in parallel by using a plurality of processors (processor cores, for example). In one example of parallel processing executed by such a parallel computing device, different threads execute the same kind of operation on different input data in parallel, and each of the threads generates intermediate data. The intermediate data generated by the plurality of threads is then aggregated to obtain resultant data. This parallel processing is sometimes called reduction processing. Among the compilers that generate object code executed by parallel computing devices, some generate, through optimization, object code for reduction processing from source code that was not written for parallel processing.
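The reduction processing described above can be illustrated with a minimal sketch. It assumes that the operation is a sum and that the input is divided among the threads in an interleaved manner; the function name `parallel_sum` and the parameter `num_threads` are hypothetical and are not taken from any of the cited publications.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(data, num_threads=4):
    """Hypothetical reduction: per-thread partial sums, then one aggregation."""
    # Divide the input so that each thread works on different input data.
    chunks = [data[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Every thread executes the same kind of operation (a sum) and
        # produces one piece of intermediate data.
        partials = list(pool.map(sum, chunks))
    # Aggregate the intermediate data to obtain the resultant data.
    return sum(partials)
```

Because addition is associative and commutative, the aggregated result equals the result of a sequential loop over the same input, which is what allows a compiler to derive this form from sequential source code.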
In addition, there has been proposed a scheduling method for causing a plurality of processors to execute a plurality of threads in parallel by using a shared memory. According to the proposed scheduling method, semaphores, message queues, message buffers, event flags, barriers, mutexes, etc. may be used as synchronization mechanisms for synchronizing the plurality of threads. The kind of synchronization mechanism to be used is selected depending on the class of the threads to be synchronized.
In addition, there has been proposed a compiling device that generates object code executable by a shared-memory parallel computing device. The proposed compiling device generates object code that dynamically selects, at execution time, whether to parallelize the processing within a loop by using a plurality of threads. The generated object code calculates a threshold for the number of loop iterations above which parallelization could improve the execution efficiency, on the basis of the number of instruction cycles per iteration of the processing within the loop and predetermined parallelization overhead information. The object code executes the processing within the loop in parallel when the number of loop iterations determined at execution time is larger than the threshold. Otherwise, the object code executes the processing within the loop sequentially.
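The run-time selection described above can be sketched as follows. The cost model is an assumption made only for illustration and is not the formula of the cited publication: sequential execution is assumed to cost n × c cycles, and parallel execution is assumed to cost a fixed overhead plus n × c / p cycles for p threads, so that parallel execution pays off when n exceeds overhead / (c × (1 − 1/p)). All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_threshold(cycles_per_iteration, overhead_cycles, num_threads):
    """Iteration count above which the ASSUMED cost model favors threads."""
    if num_threads < 2:
        return float("inf")  # a single thread never recovers the overhead
    saved_per_iteration = cycles_per_iteration * (1 - 1 / num_threads)
    return overhead_cycles / saved_per_iteration


def run_loop(body, n, cycles_per_iteration, overhead_cycles, num_threads=4):
    """Run body(i) for i in range(n), choosing the execution mode at run time."""
    threshold = parallel_threshold(cycles_per_iteration, overhead_cycles,
                                   num_threads)
    if n > threshold:
        # Enough iterations to amortize the overhead: execute in parallel.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(body, range(n))), "parallel"
    # Otherwise execute the loop body sequentially.
    return [body(i) for i in range(n)], "sequential"
```

With 10 cycles per iteration, an overhead of 10,000 cycles, and four threads, the assumed threshold is about 1,333 iterations, so a loop of 2,000 iterations is parallelized and a loop of 10 iterations is not. The sketch only illustrates the decision; Python threads do not necessarily speed up CPU-bound loop bodies.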
In addition, there has been proposed an operation processing apparatus that includes a shared memory and a plurality of cores that are able to execute threads in parallel. A single storage area in the shared memory is accessed exclusively. With the proposed operation processing apparatus, when two or more threads update data in the single storage area, these threads perform reduction processing before accessing the shared memory. In the reduction processing, the intermediate data generated by the threads is aggregated. In this way, the number of exclusive accesses to the single storage area in the shared memory is reduced.
See, for example, Japanese Laid-open Patent Publication Nos. 2005-43959, 2007-108838, and 2014-106715.
For example, each of a plurality of threads is provided in advance with a private area (for example, a stack area) on a memory, and the intermediate data generated by an individual thread is stored in the corresponding private area. In a first method for aggregating the intermediate data generated by a plurality of threads, an area for storing resultant data is allocated in a shared area on a memory, and each thread reflects its intermediate data in the resultant data in the shared area. However, according to this first method, the resultant data is accessed by the plurality of threads under exclusive control, and there is a problem that the exclusive control causes overhead.
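The first method can be sketched as follows. This is a minimal illustration that assumes a sum as the operation and a mutex as the exclusive control; a local variable of each worker stands in for the private area, and all names are hypothetical.

```python
import threading


def reduce_with_lock(data, num_threads=4):
    """First method (sketch): each thread updates the shared result itself."""
    result = [0]             # resultant data, held in the shared area
    lock = threading.Lock()  # exclusive control over the resultant data

    def worker(tid):
        # Intermediate data stays in a thread-private local variable.
        local = sum(data[tid::num_threads])
        # Each thread reflects its intermediate data in the shared result.
        # Acquiring the lock is the source of the overhead noted above.
        with lock:
            result[0] += local

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

In this sketch the lock is acquired once per thread, so the threads are serialized at the update of the resultant data, and the cost of that serialization grows with the number of threads.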
In a second method for aggregating the intermediate data generated by a plurality of threads, each of the threads stores its intermediate data in a shared area, and one of the threads aggregates the intermediate data generated by the plurality of threads. However, according to the second method, an area for storing the intermediate data needs to be allocated in the shared area, in addition to the area for storing the resultant data. While each thread is able to use its private area without regard to the other threads, the shared area is shared by the plurality of threads, and allocation in the shared area may therefore be managed by control software such as the operating system (OS). Accordingly, there is a problem that allocating the area for storing the intermediate data causes overhead. In particular, when the intermediate data is variable-length data such as a variable-length array, there is a problem that dynamically allocating the area causes overhead.
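The second method can be sketched as follows. This is a minimal illustration that assumes a sum as the operation and a barrier as the synchronization mechanism; the list `slots` stands in for the additional shared area that has to be allocated for the intermediate data, and all names are hypothetical.

```python
import threading


def reduce_with_shared_slots(data, num_threads=4):
    """Second method (sketch): shared slots, aggregated by a single thread."""
    slots = [None] * num_threads  # extra shared area for intermediate data
    result = [None]               # resultant data, also in the shared area
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Each thread stores its intermediate data in its own shared slot,
        # so no exclusive control is needed for the store.
        slots[tid] = sum(data[tid::num_threads])
        # Wait until every thread has stored its intermediate data.
        barrier.wait()
        if tid == 0:
            # One of the threads aggregates all of the intermediate data.
            result[0] = sum(slots)

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The sketch avoids a lock on the resultant data, but `slots` has to be allocated before the threads start and sized for all of them, which corresponds to the allocation overhead noted above.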