1. Field of Invention
The present invention relates in general to the computer systems field. More particularly, the present invention relates to improving the message handling performance in a computer system that utilizes a shared network device, such as a massively parallel computer system or a distributed computer system.
2. Background Art
Supercomputers continue to be developed to tackle sophisticated computing jobs. These computers are particularly useful to scientists for high performance computing (HPC) applications including life sciences, financial modeling, hydrodynamics, quantum chemistry, molecular dynamics, astronomy, space research and climate modeling. Supercomputer developers have focused on massively parallel computer structures to meet this demand for increasingly complex computing. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. In the Blue Gene computer systems, the system is generally organized into processing sets (referred to herein as “psets”) that contain one input/output (I/O) node and a number of compute nodes based on the configuration of the system. For each pset, the compute nodes and the I/O node communicate with each other by sending messages using a point-to-point feature of a collective network that connects each compute node with its I/O node.
In the Blue Gene computer systems, the I/O node of each pset generally has two main functions. First, the I/O node is used to control the compute nodes using a control message mode. Second, the I/O node is used to offload I/O operations from the compute nodes using a streaming message mode. The two message modes (i.e., control message mode and streaming message mode) have differing requirements with respect to message handling. In control message mode, the I/O node needs to receive a message from every compute node in its pset before sending reply messages back to the compute nodes. In streaming message mode, the I/O node needs to receive a request message, process the I/O request, and send the reply message before handling another message.
The Blue Gene computer systems communicate over several communication networks. The compute nodes are arranged into both a logical tree network and a logical 3-dimensional torus network. The logical tree network connects the compute nodes in a binary tree structure so that each node communicates with a parent and two children. Each compute node communicates with its I/O node through the tree network (also referred to herein as a “collective network”). The torus network logically connects the compute nodes in a lattice-like structure that allows each compute node to communicate with its closest six neighbors.
The Blue Gene/L system is a scalable system in which the current architected maximum number of compute nodes is 131,072 (1024 compute nodes per rack×128 racks), and the current maximum number of I/O nodes is 16,384 (128 I/O nodes per rack with an 8:1 compute node to I/O node ratio×128 racks). Each of the Blue Gene/L compute nodes consists of a single ASIC (application specific integrated circuit) with two CPUs and memory. Currently, the number of compute nodes in a pset can be 8, 16, 32, or 64 in the Blue Gene/L system. The full computer would be housed in 128 racks or cabinets, with thirty-two node cards or boards in each rack. Currently, the biggest Blue Gene/L system is 104 racks. The maximum number of compute nodes per node card is thirty-two. The maximum number of I/O nodes is 128 per rack (i.e., each rack has two midplanes and each midplane may contain 8-64 I/O nodes).
The Blue Gene/P system is a scalable system in which the current architected maximum number of compute nodes is 262,144 (1024 compute nodes per rack×256 racks), and the current maximum number of I/O nodes is 16,384 (64 I/O nodes per rack with a 16:1 compute node to I/O node ratio×256 racks). The Blue Gene/P compute nodes and I/O nodes each consist of a single ASIC with four CPUs and memory. Currently, the number of compute nodes in a pset can be 16, 32, 64, or 128 in the Blue Gene/P system. The full computer would be housed in 256 racks or cabinets, with thirty-two node cards or boards in each rack. The maximum number of compute nodes per node card is thirty-two, and the maximum number of I/O nodes per node card is two.
Generally, when receiving messages from a network device, packets (i.e., each message includes a plurality of packets) need to be received as quickly as possible for best performance. The network device is typically shared by two or more CPUs (also referred to herein as “processors”) and is managed by the operating system, so the network device can be shared by multiple users. Typically, this sharing of the network device requires receiving the packets into a temporary buffer and then copying the complete message to the user's buffer. This sequence of operations (also referred to herein as a “memory copy” and “data copying”) significantly reduces message handling performance but is typically required because the identity of the processor that is to receive the packets is indeterminate until all of the packets have been stored in the temporary buffer. A header may be utilized to identify the processor that is to receive the packets (e.g., each message may include a one-packet header), but because the packets are typically not delivered in order, the processor that is to receive the packets effectively remains unknown until all of the packets have been stored in the temporary buffer.
This performance-robbing sequence of operations is also typically required when the processors that share the network device can start a thread on another processor, for example, in symmetric multi-processing (SMP) mode. In systems with such additional threading capability, the sequence of operations is required because the identity of the processor running the thread that is to receive the packets is indeterminate until all of the packets have been stored in the temporary buffer.
On the Blue Gene/L system, each compute node has one collective network device that is shared by the compute node's two processors. The compute node kernel (CNK) running on the compute node processors uses the collective network device to send and receive messages from an I/O node daemon running on the I/O node. When an application is started on the compute nodes, control message mode is used to communicate with the I/O node. When the application is running on the compute nodes, streaming message mode is used to communicate with the I/O node.
IBM, “Method for optimizing message handling for streaming I/O operations”, IP.com no. IPCOM000146556D, IP.com Prior Art Database, Technical Disclosure, Feb. 16, 2007, discloses a method to dynamically switch between control message mode and streaming message mode to improve the message handling performance of streaming message mode. When submitting a job, control message mode is used to exchange control messages between the compute nodes and the I/O node in a pset. In the control message mode, a temporary buffer (i.e., a kernel buffer) is used. When running an application, the CNK switches to streaming message mode in which data can be put directly into the user's buffer without using a memory copy (i.e., receiving the packets into a temporary buffer and then copying the complete message to the user's buffer). However, the method disclosed in the above-noted IP.com Prior Art Database reference is directed to the Blue Gene/L system, which does not have additional threading capability (i.e., where the processors that share the network device can start a thread on another processor, for example, in symmetric multi-processing (SMP) mode), and does not address the performance-robbing need to use a memory copy in a system with additional threading capability.
In control message mode, as noted above, the I/O node receives the request messages from both processors on all of the compute nodes in its pset before sending any reply messages. The above-noted IP.com Prior Art Database reference discloses that during the control message mode, the CNK locks the collective network device, sends all of the packets in a request message to the I/O node, and unlocks the network device. Then the CNK waits for a reply message by locking the network device, checking for a packet, receiving one or more packets if available, and unlocking the network device. The CNK keeps checking for packets until a complete message has been received. In control message mode, it is possible for one processor to receive a packet intended for the other processor. For example, one processor may receive all of the packets of one reply message intended for that processor and all of the packets of another reply message intended for the other processor. Data in the packet header (i.e., there is a header on every packet) identifies which processor the reply message is intended for. Accordingly, each packet of the reply message is stored into a kernel buffer assigned to the processor in a shared data area of the compute node's memory.
In streaming message mode, as noted above, the I/O node receives a request message, processes the I/O request, and sends the reply message before handling another message. The above-noted IP.com Prior Art Database reference discloses that during the streaming message mode, the CNK locks the collective network device, sends all of the packets in a request message to the I/O node, receives all of the packets in the reply message, and unlocks the device. Since each processor has the collective network device locked for the complete exchange of the request and reply messages, the CNK knows that all of the packets in the reply message are for itself and the data can be put directly into the user's buffer. This method eliminates a memory copy of the user's data from a kernel buffer to the user's buffer. Even in the streaming message mode, as in the control message mode, there is a header on every packet. However, as noted above, the method disclosed in the IP.com Prior Art Database reference is directed to the Blue Gene/L system, which does not have additional threading capability (i.e., where the processors that share the network device can start a thread on another processor, for example, in SMP mode). The method disclosed in the IP.com Prior Art Database reference does not address the performance-robbing need to use such a memory copy in a system with additional threading capability.
On the Blue Gene/P system, each compute node has one collective network device that is shared by the compute node's four processors. The compute node kernel running on the compute node processors uses the collective network device to send and receive messages from an I/O node daemon running on the I/O node. The compute nodes in the Blue Gene/P system may be utilized in SMP mode, dual mode, or virtual node mode (VNM). There is no additional threading capability in VNM. However, both SMP mode and dual mode have additional threading capability. In SMP mode, for example, one of the processors runs a program's main process and the program can spawn up to three additional threads on the remaining processors.
The method disclosed in the IP.com Prior Art Database reference can be utilized to eliminate the use of a memory copy in VNM on the Blue Gene/P system because there is no additional threading capability in VNM. However, because both SMP mode and dual mode have additional threading capability, the method disclosed in the IP.com Prior Art Database reference cannot be utilized to eliminate the use of a memory copy in SMP mode or dual mode on the Blue Gene/P system. Consequently, in SMP mode and dual mode on the Blue Gene/P system, the packets must be stored into a temporary buffer and then the complete message is copied to the user's buffer. This sequence of operations significantly reduces message handling performance but is required in systems with additional threading capability, such as the Blue Gene/P system, because the identity of the processor running the thread that is to receive the packets is indeterminate until all of the packets have been stored in the temporary buffer.
It should therefore be apparent that a need exists for improved message handling performance in computer systems, such as massively parallel computer systems or distributed computer systems, having shared network devices.