In a parallel computer (information processing) system in which a parallel operation is performed using nodes (information processing apparatuses), a communication library such as a message passing interface (MPI) is used in some cases. The communication library is a library for providing functions relating to communication of nodes, such as group communication for transmitting and receiving pieces of data between the nodes.
A reduction operation is known as one example of the group communication provided by the communication library. The reduction operation is an operation for aggregating, in one node (root node), results obtained by performing specific operations on pieces of data held by nodes (leaf nodes).
The reduction operation can be implemented in several ways. One such method aggregates results in a root node while relay nodes compute intermediate results of the operations, as exemplified in FIG. 17. The method illustrated in FIG. 17 has the advantage that, even in a large-scale system including a large number of nodes, it suppresses the increase in overall processing time that would otherwise be caused by concentration of communication in the root node.
Note that, in the example in FIG. 17, leaf nodes A to D each transmit data to a root node R or one of relay nodes E to G (see a timing t1 and arrows (1)). The root node R or the relay nodes E to G each confirm reception completion of transmitted data and each perform a specified operation on the received data. When operations are completed, the relay node E transmits data thereof to the root node R, and the relay node G transmits data thereof to the subsequent relay node F (see a timing t2 and arrows (2)). In addition, after performing a specified operation on pieces of data received from the leaf node C and the relay node G, the relay node F transmits the pieces of data to the root node R (see a timing t3 and an arrow (3)).
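The relayed aggregation described for FIG. 17 can be sketched as a small simulation. This is an illustrative sketch only, not the implementation of any communication library. The links C to F, G to F, E to R, and F to R follow the description above; the links A to E, B to R, and D to G, the assumption that every node contributes a value of its own, and the names `PARENT` and `tree_reduce` are hypothetical.

```python
# Child -> parent links. Only C->F, G->F, E->R and F->R come from the
# description of FIG. 17; A->E, B->R and D->G are assumed for illustration.
PARENT = {"A": "E", "B": "R", "C": "F", "D": "G", "E": "R", "G": "F", "F": "R"}


def tree_reduce(values, parent, root, op):
    """Aggregate values toward root; each relay combines before forwarding."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def partial(node):
        # Start from the node's own value (assumption: every node holds data),
        # then fold in the intermediate result forwarded by each child.
        result = values[node]
        for child in children.get(node, []):
            result = op(result, partial(child))
        return result

    return partial(root)


values = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "R": 8}
total = tree_reduce(values, PARENT, "R", lambda x, y: x + y)

# Number of nodes that send directly to the root in this topology.
root_fan_in = sum(1 for par in PARENT.values() if par == "R")
```

In this assumed topology the root receives from three nodes rather than from all seven other nodes, which is the reduction in communication concentration that the relayed method aims for.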
In addition, non-blocking communication is known as an MPI-based method for transmitting and receiving pieces of data. In the non-blocking communication, a processor in a node returns from a communication function as soon as communication processing has been initiated. Therefore, it is possible to overlap another arithmetic processing operation with the relevant communication processing before the communication is completed. In the non-blocking communication, the node that performs the communication processing confirms completion of the communication processing at an appropriate time.
Technologies that contribute to speeding up specific processing include an atomic read modify write (ARMW) operation and hardware offload.
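The initiate, overlap, and confirm pattern of non-blocking communication can be illustrated with an analogy. The sketch below does not use MPI; it stands in a worker thread for the transfer, and the function `send` and its delay are hypothetical. In MPI itself, the corresponding roles are played by calls such as `MPI_Isend` or `MPI_Irecv` for initiation and `MPI_Wait` or `MPI_Test` for completion.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send(payload):
    """Hypothetical stand-in for a transfer that takes a while to finish."""
    time.sleep(0.05)
    return len(payload)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Initiation returns immediately with a handle, before the transfer ends.
    request = pool.submit(send, b"partial result")

    # Other arithmetic processing overlaps with the transfer in progress.
    overlapped = sum(i * i for i in range(1000))

    # Completion is confirmed later, at a point chosen by the caller.
    sent_bytes = request.result()
```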
The ARMW operation is an operation in which a series of processing operations is performed atomically (while ensuring atomicity). The following is an example of such a series of processing operations:
  Reading data from a buffer area (memory area) of a remote node (Read),
  Modifying the read data (remote-side area data) by performing a specific operation on it (Modify), and
  Writing the modified data back to the relevant buffer area (Write).
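The Read, Modify, and Write steps above can be sketched in software as follows. This is a minimal simulation under stated assumptions: the class `RemoteBuffer` and its methods are hypothetical, the remote buffer is an in-process list, and a lock stands in for whatever mechanism the hardware uses to guarantee atomicity.

```python
import threading


class RemoteBuffer:
    """Hypothetical stand-in for a buffer area of a remote node."""

    def __init__(self, size):
        self._cells = [0] * size
        self._lock = threading.Lock()

    def armw(self, address, modify):
        # The lock makes the three steps indivisible for other callers.
        with self._lock:
            old = self._cells[address]      # Read
            new = modify(old)               # Modify
            self._cells[address] = new      # Write
            return old

    def peek(self, address):
        with self._lock:
            return self._cells[address]


buf = RemoteBuffer(4)


def worker():
    for _ in range(1000):
        buf.armw(0, lambda value: value + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value = buf.peek(0)
```

Because each read-modify-write sequence is indivisible, no increment from the four concurrent workers is lost.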
Data (local-side area data) held by a node (local node) that issues an instruction for an ARMW operation (ARMW instruction) may be used for the specific operation performed on the remote-side area data. As exemplified in FIG. 18, the local-side area data is specified by being written directly in the ARMW instruction. Note that, in the example of the ARMW instruction illustrated in FIG. 18, the type of instruction is information indicating the series of processing operations to be performed by the relevant ARMW instruction, and the address information of the remote-side area data is a value associated with, for example, an address of the remote-side area data serving as an operation target. In addition, the coordinates of the remote node are information indicating the location of the remote node within the system.
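The instruction fields named above can be pictured as a small record. The sketch below is hypothetical: the field names, their types, the `OPERATIONS` table, and the `execute` function are assumptions made for illustration and do not reproduce the actual layout of FIG. 18. Atomicity is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmwInstruction:
    instruction_type: str       # which read-modify-write sequence to perform
    remote_coordinates: tuple   # location of the remote node within the system
    remote_address: int         # identifies the remote-side area data
    local_data: int             # local-side area data, written in the instruction


# Assumed set of operations selectable by the instruction type.
OPERATIONS = {"add": lambda remote, local: remote + local, "max": max}


def execute(instruction, memories):
    """Apply the instruction to the simulated memory of the addressed node."""
    memory = memories[instruction.remote_coordinates]
    old = memory[instruction.remote_address]
    operation = OPERATIONS[instruction.instruction_type]
    memory[instruction.remote_address] = operation(old, instruction.local_data)
    return old


memories = {(0, 1): [10, 20, 30]}
old_value = execute(ArmwInstruction("add", (0, 1), 2, 5), memories)
```

Because the local-side area data travels inside the instruction, the remote side needs no further access to the local node in order to perform the operation.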
The hardware offload is a technology for reducing the load on a central processing unit (CPU) by causing hardware, such as a network interface card (NIC), to perform processing that is usually performed by the CPU. Note that the hardware offload may include a function of performing the above-mentioned ARMW operation and a function of automatically transmitting a specific instruction prepared in advance, using completion of a processing operation as a trigger.
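The triggered-transmission function can be sketched as a toy model. Everything here is an assumption made for illustration: the class `SimulatedNic`, its methods, and the event names are hypothetical and do not correspond to any real NIC interface.

```python
class SimulatedNic:
    """Toy model of a NIC that sends prepared instructions on completion events."""

    def __init__(self):
        self.sent = []          # instructions the NIC has transmitted
        self._prepared = {}     # completion event -> instructions prepared in advance

    def prepare(self, event, instruction):
        """Register an instruction to be sent when `event` completes."""
        self._prepared.setdefault(event, []).append(instruction)

    def complete(self, event):
        """Signal completion of an operation; prepared instructions fire at once."""
        for instruction in self._prepared.pop(event, []):
            self.sent.append(instruction)


nic = SimulatedNic()
nic.prepare("recv_from_G", "send intermediate result to R")

nic.complete("recv_from_C")   # nothing was prepared for this event
nic.complete("recv_from_G")   # the prepared instruction is transmitted
```

In this model the code that calls `prepare` plays the role of the CPU, which is involved only before the event; the transmission itself happens inside `complete` without further CPU work.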
Technologies of the related art are disclosed in Japanese Laid-open Patent Publication No. 2006-277635 and Japanese Laid-open Patent Publication No. 2012-252591.