An accelerator such as a GPU (Graphics Processing Unit) includes dozens to several thousand arithmetic processing units, or arithmetic cores, and is provided with a SIMD (Single Instruction Multiple Data) unit capable of processing multiple data values with a single instruction.
Since a reduction operation that is associative and commutative produces the same result regardless of the order in which its terms are processed, the reduction operation may be processed in parallel using, for example, the SIMD unit. Here, a reduction operation is an operation that combines plural data values to obtain a single data value, and includes, for example, an addition, a multiplication, a computation of a maximum value, or a computation of a minimum value.
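The order independence described above can be illustrated with a small sketch. The function name `tree_reduce` and the sample data are hypothetical and are not taken from the cited publications; the sketch runs sequentially and only mimics the pairing pattern that a SIMD unit could process in parallel.

```python
from functools import reduce
import operator


def tree_reduce(values, op):
    """Pairwise (tree-shaped) reduction of a non-empty sequence.

    Within one level, every pair is independent of the others, so a
    SIMD unit could in principle combine all pairs of a level at once.
    """
    items = list(values)
    if not items:
        raise ValueError("tree_reduce requires at least one value")
    while len(items) > 1:
        # Combine neighbouring elements two at a time.
        paired = [op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # an odd element carries over to the next level
        items = paired
    return items[0]


data = [3, 1, 4, 1, 5, 9, 2, 6]

# The tree-shaped grouping gives the same result as a left-to-right reduction.
total_tree = tree_reduce(data, operator.add)
total_sequential = reduce(operator.add, data)
largest = tree_reduce(data, max)
smallest = tree_reduce(data, min)
```

For an operation that is not associative, such as subtraction, the two groupings would generally disagree, which is why the property matters for parallel execution.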
Meanwhile, in a reduction operation such as a histogram computation, in which the destination of the next operation is determined by the operation result of the data, it is desirable to obtain the operation result using an atomic process. An atomic process is a process in which reading data from a storage device, executing an operation such as an addition, and storing the operation result in the storage device are executed without being interrupted by other processes or threads operating in parallel.
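A minimal sketch of a histogram computation built on an atomic read-modify-write is shown below. It is an illustration only: the class `AtomicCounters`, the function `histogram`, and the choice of `value % num_bins` as the bin index are assumptions made for this example, and the atomic process is emulated with a lock rather than with a hardware atomic instruction.

```python
import threading


class AtomicCounters:
    """Array of counters whose updates are emulated atomic operations."""

    def __init__(self, num_bins):
        self._bins = [0] * num_bins
        self._lock = threading.Lock()

    def atomic_add(self, index, amount):
        # Read, add, and store happen as one uninterrupted unit:
        # no other thread can touch the counters while the lock is held.
        with self._lock:
            current = self._bins[index]
            self._bins[index] = current + amount

    def snapshot(self):
        with self._lock:
            return list(self._bins)


def histogram(values, num_bins, num_threads=4):
    """Count values per bin; the bin (write destination) depends on each value."""
    counters = AtomicCounters(num_bins)

    def worker(chunk):
        for value in chunk:
            counters.atomic_add(value % num_bins, 1)

    # Split the input into interleaved chunks, one per thread.
    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counters.snapshot()


counts = histogram([0, 1, 1, 2, 2, 2, 3, 3, 3, 3], num_bins=4)
```

Because the destination bin is known only after each value is examined, two threads may target the same bin at the same time; the uninterrupted read-add-store sequence keeps either update from being lost.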
A technology for executing the reduction operation in a processor capable of parallel processing has been devised. See, for example, Japanese Laid-open Patent Publication No. 2005-514678. A technology for efficiently executing, in the SIMD unit, a reduction operation in which the write destination changes for each term has also been devised. See, for example, Japanese Laid-open Patent Publication No. 2011-514598. A scheme utilizing the atomic process may be considered in order to execute the reduction operation in which the write destination changes for each term in a processor capable of large-scale parallel processing with a degree of parallelism of more than several thousand. However, when accesses to the storage area in which the data is recorded conflict with one another, the operations of the atomic process are executed sequentially, and thus the data processing efficiency deteriorates.
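The cost of conflicting accesses can be made concrete with a deliberately simplified model. The assumption here, which is not stated in the cited publications, is that atomic updates issued together to the same destination are executed one after another, while updates to different destinations proceed in parallel; the function name `serialized_steps` is hypothetical.

```python
from collections import Counter


def serialized_steps(destinations):
    """Sequential steps needed for one batch of atomic updates.

    Simplified model (an assumption of this sketch): updates aimed at the
    same destination run one after another, updates aimed at different
    destinations run at the same time. The batch therefore takes as many
    steps as the most heavily contended destination receives updates.
    """
    if not destinations:
        return 0
    return max(Counter(destinations).values())


no_conflict = serialized_steps([0, 1, 2, 3])       # every lane writes elsewhere
partial_conflict = serialized_steps([0, 0, 1, 2])  # two lanes share a destination
full_conflict = serialized_steps([5, 5, 5, 5])     # all lanes share a destination
```

Under this model, a batch in which every update targets the same destination gains nothing from parallel hardware, which corresponds to the loss of efficiency noted above.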