1. Field of the Invention
The present invention relates to a communication control device, an information processing device, and a computer program product.
2. Description of the Related Art
In conventional technology, personal computer (PC) clusters and parallel computers perform communication by using a message passing scheme. In parallel applications in particular, widely used communication systems employ message-passing libraries, as represented by the Message Passing Interface (MPI), which is a library for message passing in distributed-memory parallel processing.
Various types of network interfaces for PC clusters, such as the InfiniBand HCA by Mellanox Technologies, Inc., Myrinet by Myricom, Inc., and QsNET II by Quadrics, Inc., have been used in combination with the MPI in various manners (for example, see "Performance comparisons of MPI implementations over InfiniBand, Myrinet and Quadrics", IEEE Proceedings of SC '03, November, 2003).
One prototype-level example reported in Japan implements the MPI on RHiNET-2 (for example, see "Performance evaluation of RHiNET-2 network for distributed parallel processing", Symposium on Advanced Computing Systems and Infrastructures (SACSIS) 2003, ISSN 1344-0640, May, 2003). Furthermore, it has been reported that the MPI is implemented on the DIMMnet-2, which is plugged into a memory slot (for example, see "Implementation of MPI-2 communication library on DIMMnet-2", study report by the Information Processing Society of Japan special interest group on computer architecture, ISSN 0919-6072, February, 2006).
The reported systems are configured so that data received from a network is stored temporarily either in a memory on a network interface board or in a buffer formed in a reserved, non-swappable area of a main memory in a host computer.
In a parallel system, there is no guarantee that a receiver always activates a corresponding receive function before a message arrives, or that the receiver always receives messages from a plurality of transmitters via a network in a desired order. Therefore, in message passing libraries such as the MPI, when the receive function is executed, a desired message is first retrieved from a buffer called the "unexpected message queue". Messages that do not correspond to the receive key are moved from the current buffer to a different buffer until the desired message is found.
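The retrieval described above can be illustrated with a minimal C sketch. The structure and function names below (`uq_entry`, `uq_find`) are hypothetical and do not come from any cited library; the sketch only models the linear scan of an unexpected message queue for an entry matching a receive key (here, a source/tag pair).

```c
#include <stddef.h>

/* Hypothetical entry in an unexpected message queue; the names and
 * fields are illustrative, not taken from any cited implementation. */
struct uq_entry {
    int source;               /* sending rank */
    int tag;                  /* message tag (receive key) */
    const char *payload;
    struct uq_entry *next;
};

/* Scan the queue for the first entry matching (source, tag).
 * Non-matching entries stay queued; a matching entry is unlinked
 * and returned, or NULL is returned if no entry matches. */
static struct uq_entry *uq_find(struct uq_entry **head, int source, int tag)
{
    struct uq_entry **pp = head;
    while (*pp) {
        if ((*pp)->source == source && (*pp)->tag == tag) {
            struct uq_entry *hit = *pp;
            *pp = hit->next;      /* unlink the hit from the queue */
            hit->next = NULL;
            return hit;
        }
        pp = &(*pp)->next;        /* linear scan: cost grows with queue depth */
    }
    return NULL;
}
```

The linear scan makes the cost of one receive proportional to the number of unexpected messages ahead of the desired one, which is the behavior the following paragraphs seek to improve.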
The DIMMnet-2 performs a first-in first-out (FIFO) operation by using an IPUSH mechanism, in which a pointer is controlled by hardware. Furthermore, the DIMMnet-2 writes data into receive buffers that are selectively used depending on the source group. With such an operation, the DIMMnet-2 avoids the interposition of firmware on the receiver side. Moreover, it has been reported that, in the DIMMnet-2, the communication latency of the MPI is reduced by improving the success probability of retrieval from the buffers (for example, see "Implementation of packet receiving mechanism supporting for message passing model", study report by the Information Processing Society of Japan special interest group on computer architecture, ISSN 0919-6072, November, 2005).
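The per-source-group FIFO write can be modeled in a few lines of C. This is only a toy software model under assumed parameters (group count, FIFO depth, and the names `ipush_rx` and `ipush_write` are all illustrative); in the actual mechanism the write pointer is advanced by hardware rather than by code.

```c
#include <stddef.h>

/* Toy model of an IPUSH-style receive path: each source group has its
 * own FIFO buffer and a write pointer advanced on every arrival, so no
 * receiver firmware intervenes. Group count and depth are assumptions. */
#define NGROUPS 4
#define FIFO_DEPTH 8

struct ipush_rx {
    int fifo[NGROUPS][FIFO_DEPTH];
    size_t wr[NGROUPS];           /* hardware-controlled write pointers */
};

/* Push arriving data into the FIFO selected by the source group;
 * returns 0 on success, -1 if that group's buffer is full. */
static int ipush_write(struct ipush_rx *rx, int group, int data)
{
    if (rx->wr[group] >= FIFO_DEPTH)
        return -1;                             /* buffer full */
    rx->fifo[group][rx->wr[group]++] = data;   /* FIFO order per group */
    return 0;
}
```

Because each group has its own buffer, messages from one source group cannot be interleaved with those of another, which raises the probability that a retrieval finds its message near the head of the relevant buffer.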
On the other hand, some conventional examples use hardware to speed up the message retrieval from the unexpected message queue in the MPI. Specifically, a large number of logic blocks, each called an "ALPU", which is a random-logic block including a comparator and a register, are connected like a shift register. With this arrangement, an entry that matches a key can be extracted from the middle of the unexpected message queue, thereby speeding up the message retrieval (for example, see "A hardware acceleration unit for MPI queue processing" by K. D. Underwood, K. S. Hemmert, A. Rodrigues, R. Murphy, and R. Brightwell, 19th International parallel and distributed processing symposium, IPDPS '05, April, 2005).
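A software analogue of such an arrangement can be sketched as follows. The names (`alpu_queue`, `alpu_extract`) and the fixed queue depth are assumptions for illustration only; in the hardware, every cell compares its key against the search key simultaneously, whereas this model performs the comparisons in one sequential pass and then shifts the trailing cells up, as a shift register would.

```c
#include <stddef.h>

/* Illustrative software model of an ALPU-style queue: each cell holds
 * one key; on a search, a hit anywhere in the queue is removed and the
 * trailing cells shift up. Names and sizes are assumptions, not the
 * cited hardware design. */
struct alpu_queue {
    int keys[16];
    size_t len;
};

/* Returns the matched key's former position, or -1 if no cell hit. */
static int alpu_extract(struct alpu_queue *q, int key)
{
    for (size_t i = 0; i < q->len; i++) {
        if (q->keys[i] == key) {               /* comparator hit */
            for (size_t j = i + 1; j < q->len; j++)
                q->keys[j - 1] = q->keys[j];   /* shift trailing cells up */
            q->len--;
            return (int)i;
        }
    }
    return -1;
}
```

In hardware, the comparisons and the shift each take roughly constant time regardless of where the hit occurs, which is the source of the speed-up over a purely software scan.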
Furthermore, it has been reported that, in an LHS mechanism, a first part (first half) and a second part (second half) of a message are stored separately in memories having different properties. The information required for matching in the MPI is likely to be contained in the first part of the message. By storing the first part separately in a specific memory, such information can be taken into a host device with a low latency, whereby the communication latency of the MPI is reduced (for example, see "Support function for MPI on DIMMnet-3 network interface", study report by the Information Processing Society of Japan special interest group on computer architecture, ISSN 0919-6072, July, 2006).
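The split storage can be sketched as below. The split point, buffer sizes, and names (`lhs_store`, `lhs_store_msg`, `LHS_HEAD`) are illustrative assumptions, not figures from the cited report; the sketch only shows the idea of routing the matching-relevant head of each message to a small low-latency memory and the remainder to a larger, slower one.

```c
#include <string.h>

/* Sketch of LHS-style split storage: the first part of each message
 * (carrying the MPI matching information) goes to a small low-latency
 * memory, the remainder to a larger memory. All sizes are assumptions. */
#define LHS_HEAD 8   /* bytes of each message kept in the fast memory */

struct lhs_store {
    char fast[64];    /* low-latency memory: first parts only */
    char bulk[1024];  /* larger, slower memory: second parts  */
    size_t nfast, nbulk;
};

/* Store one message; returns the bytes placed in the fast memory. */
static size_t lhs_store_msg(struct lhs_store *s, const char *msg, size_t len)
{
    size_t head = len < LHS_HEAD ? len : LHS_HEAD;
    memcpy(s->fast + s->nfast, msg, head);           /* matching info: fast path */
    s->nfast += head;
    memcpy(s->bulk + s->nbulk, msg + head, len - head);
    s->nbulk += len - head;
    return head;
}
```

Only the compact first parts need to be fetched by the host for key comparison, so the matching traffic over the slow path is avoided.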
However, the buffer-retrieval operations described above are accompanied by a large number of copy operations performed between memories by software, unless the operations are supported by hardware as described in "A hardware acceleration unit for MPI queue processing" by K. D. Underwood, K. S. Hemmert, A. Rodrigues, R. Murphy, and R. Brightwell. Therefore, unless the receiver receives the messages in the desired order, the latency of message reception increases.
If the memory arranged on the network interface board has a small capacity, as in Myrinet, for example, the memory on the board does not have the capacity to form all of the receive buffers of the MPI. Therefore, it is necessary to immediately send a message received from the network to the main memory of the host device via an I/O bus such as a PCI bus. If messages remain on the network interface board, they block subsequent messages, resulting in congestion of the network. Because the messages are continuously sent to the receive buffers of the MPI arranged in the main memory of the host device via the PCI bus, or the like, by repeating a direct memory access (DMA) transfer several times, the communication latency is increased.
If, like the DIMMnet-2, the memory arranged on the network interface board is a dynamic random access memory (DRAM) based memory with a capacity as large as that of the main memory of the host device, data to be remotely accessed from the network can be arranged in the memory on the network interface board. Furthermore, all of the receive buffers of the MPI can be arranged in the memory on the network interface board. Therefore, a received message can be stored in the MPI buffer arranged in the memory on the network interface board even when the receiver has not activated a corresponding receive function before the message arrives.
However, because it takes longer for the host device to access the memory on the network interface board than to access its own main memory, the retrieval time of a message corresponding to a receive key can be longer. Therefore, it is difficult to reduce the receive latency of the MPI.
Furthermore, when a circuit block called an "ALPU" is formed by random logic so that the retrieval in the buffers is supported by hardware, as described in "A hardware acceleration unit for MPI queue processing" by K. D. Underwood, K. S. Hemmert, A. Rodrigues, R. Murphy, and R. Brightwell, the size of the logic circuit increases. This causes adverse effects such as restrictions on other circuits in the large scale integration (LSI) or on the capacity of the buffer, a limitation on the operating frequency, or an increase in power consumption. Moreover, in a large-scale parallel system, because it is difficult to implement a sufficient number of ALPUs in the LSI, the shortfall needs to be covered by software, which results in performance degradation.
Moreover, in the LHS mechanism described in "Support function for MPI on DIMMnet-3 network interface", study report by the Information Processing Society of Japan special interest group on computer architecture, the first parts of messages are stored in the low-latency memory in the order of message reception. Therefore, if a large number of messages whose receive keys do not match the receive key specified by the receive function of the MPI are received before a message having a matching receive key is received, it is necessary to compare the messages a large number of times in the order of reception, starting from the leading message, which results in a significant performance degradation.