1. Field of the Invention
The present invention relates to an information processing apparatus, an information processing method, and a storage medium which generate instructions for causing a parallel computer to execute reduction processing.
2. Description of the Related Art
In recent years, approaches that improve the processing performance of a computer by using a plurality of CPU cores have been pursued. In particular, GPGPU (General-Purpose computing on Graphics Processing Units), or GPU computing, in which a GPU (Graphics Processing Unit) is controlled to execute processing other than graphics processing, has attracted a lot of attention. A GPU has from several tens to 1,000 or more calculation cores, and its peak performance when all the calculation cores operate is very high. However, in order to bring out the high performance of the GPU, a programming technique different from the conventional technique is required. The following description will be given taking CUDA, available from NVIDIA Corporation, as an example of the GPGPU. Since CUDA is described in the NVIDIA CUDA C Programming Guide, Version 3.1.1, Jul. 21, 2010, a detailed description thereof will not be given.
The GPGPU normally operates in an SPMD (Single Program, Multiple Data) manner. That is, a single program (kernel) is executed concurrently by respective threads. The calculation performance of the GPGPU improves as a larger number of calculation cores are kept processing without interruption. Most applications require processing for integrating the calculation results of the respective threads into one after the parallel processing. Parallel reduction processing is widely used for this purpose. In the parallel reduction processing, a plurality of data are integrated step by step to obtain a processing result. As the data are integrated, the number of threads which take part in the parallel reduction processing gradually decreases. That is, since the number of threads which do nothing (idle cores) increases, the processing resources are wasted. An example of the parallel reduction is described in detail in “CUDA Technical Training Volume II: CUDA Case Studies Q2 2008”, and a description thereof will not be given.
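The decrease in the number of participating threads can be illustrated by a minimal sketch, which simulates a tree-style reduction sequentially on a CPU. The sketch is not taken from the cited references. The function name `tree_reduce`, the use of addition as the integrating operation, and the restriction to a power-of-two number of data are assumptions made only for illustration.

```python
def tree_reduce(values):
    """Simulate a tree-style parallel reduction (hypothetical helper).

    Returns (result, active_counts), where active_counts[i] is the number
    of simulated threads that do work in step i.
    Assumption: the number of data is a nonzero power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")

    data = list(values)
    active_counts = []
    stride = n // 2
    while stride > 0:
        # Simulated threads 0..stride-1 each integrate one partner element.
        # All remaining threads have nothing to do in this step.
        for tid in range(stride):
            data[tid] += data[tid + stride]
        active_counts.append(stride)
        stride //= 2
    return data[0], active_counts


if __name__ == "__main__":
    result, active = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
    print(result, active)
```

Under these assumptions, eight data are integrated in three steps with four, two, and one active threads, so that most of the eight threads are idle in the later steps.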
Furthermore, in the parallel reduction processing, inter-thread communications take place. When these communications are made via a shared memory, access conflicts occur because a plurality of threads access the shared memory concurrently. Since conflicting accesses are processed one after another, and the remaining accesses must wait until the preceding accesses are complete, the processing speed drops considerably.
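How such conflicts serialize accesses can be estimated with a minimal sketch under a simplified memory model. The model is an assumption and is not described in this document: the shared memory is taken to be divided into 32 word-interleaved banks, accesses by different threads to different words in the same bank are processed in turn, and accesses to the same word do not conflict. The function name `conflict_degree` is hypothetical.

```python
from collections import defaultdict


def conflict_degree(addresses, num_banks=32):
    """Return the worst-case number of serialized accesses (hypothetical helper).

    addresses: word indices accessed concurrently, one per simulated thread.
    Assumed model: bank = address % num_banks; distinct words that fall in
    the same bank are served one after another.
    """
    words_per_bank = defaultdict(set)
    for addr in addresses:
        words_per_bank[addr % num_banks].add(addr)
    # The most heavily loaded bank determines how long the group must wait.
    return max((len(words) for words in words_per_bank.values()), default=0)


if __name__ == "__main__":
    threads = range(32)
    print(conflict_degree([t for t in threads]))       # consecutive words
    print(conflict_degree([2 * t for t in threads]))   # every second word
    print(conflict_degree([32 * t for t in threads]))  # all in one bank
```

In this model, consecutive word accesses by 32 threads are served at once, accesses with a stride of two words are served in two turns, and accesses with a stride of 32 words are served in 32 turns.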
Japanese Patent No. 3311381 discloses a method of compiling a program which runs on a computer including a plurality of calculation units that can operate in parallel. According to the method of Japanese Patent No. 3311381, when the number of registers to be used, which is estimated upon issuance of a certain instruction, is larger than the number of available registers, that instruction is changed to another instruction so as to reduce the number of concurrently active registers.
However, the technique described in Japanese Patent No. 3311381 does not consider the case in which a plurality of cores operate in the SPMD manner, as in the GPGPU. That is, the technique described in Japanese Patent No. 3311381 assumes that the plurality of cores are instructed to operate according to different instructions. In the GPGPU, which does not operate in this way, applying the technique of Japanese Patent No. 3311381 instead lowers the operation speed.