1. Field of the Invention
The present invention generally relates to parallel computing. More specifically, the present invention relates to a low latency, high bandwidth, message passing protocol for use in a parallel computer system.
2. Description of the Related Art
Powerful computers may be designed as highly parallel systems where the processing activity of hundreds, if not thousands, of processors (CPUs) is coordinated to perform computing tasks. These systems are highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, prime number factoring, and image processing (e.g., CGI animations and rendering), to name but a few examples.
For example, one family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene®. The Blue Gene/L architecture provides a scalable, parallel computer that may be configured with a maximum of 65,536 (2^16) compute nodes. Each compute node includes a single application specific integrated circuit (ASIC) with 2 CPUs and memory. The Blue Gene/L architecture has been successful and on Oct. 27, 2005, IBM announced that a Blue Gene/L system had reached an operational speed of 280.6 teraflops (280.6 trillion floating-point operations per second), making it the fastest computer in the world at that time. Further, as of June 2005, Blue Gene/L installations at various sites world-wide accounted for five of the ten most powerful computers in the world.
IBM is currently developing a successor to the Blue Gene/L system, named Blue Gene/P. Blue Gene/P is expected to be the first computer system to operate at a sustained 1 petaflops (1 quadrillion floating-point operations per second). Like the Blue Gene/L system, the Blue Gene/P system is scalable, with a projected maximum of 73,728 compute nodes. Each compute node in Blue Gene/P is projected to include a single application specific integrated circuit (ASIC) with 4 CPUs and memory. A complete Blue Gene/P system is projected to include 72 racks with 32 node boards per rack.
In addition to the Blue Gene architecture developed by IBM, other highly parallel computer systems have been (and are being) developed. For example, a Beowulf cluster may be built from a collection of commodity off-the-shelf personal computers. In a Beowulf cluster, individual systems are connected using local area network technology (e.g., Ethernet) and system software is used to execute programs written for parallel processing on the cluster of individual systems.
The compute nodes in a parallel system communicate with one another over one or more communication networks. For example, the compute nodes of a Blue Gene/L system are interconnected using five specialized networks, and the primary communication strategy for the Blue Gene/L system is message passing over a torus network (i.e., a set of point-to-point links between pairs of nodes). This message passing allows programs written for parallel processing to use high level interfaces such as Message Passing Interface (MPI) and Aggregate Remote Memory Copy Interface (ARMCI) to perform computing tasks and to distribute data among a set of compute nodes. Other parallel architectures (e.g., a Beowulf cluster) also use MPI and ARMCI for data communication between compute nodes. Low level network interfaces communicate higher level messages using small messages known as packets. Typically, MPI messages are encapsulated in a set of packets which are transmitted from a source node to a destination node over a communications network (e.g., the torus network of a Blue Gene system).
A “message passing protocol” is a set of instructions specifying how to construct a message from a packet stream. Message passing protocols can transmit packets in different ways depending on the desired communication characteristics. Additionally, packets may be “fully described” in which part of the packet payload stores metadata describing the message or “partially described” in which most packet metadata is omitted from individual packets. Fully described packets may be transmitted at any time, and may be routed dynamically. In contrast, partially described packets require a communication context to be previously established between a message sender and receiver.
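By way of illustration only, the distinction between the two packet kinds may be sketched in Python as follows. The field names (message_id, total_length, offset, context_id) and the Receiver class are hypothetical and are not drawn from any Blue Gene interface. The sketch assumes that a partially described packet still carries a small context handle and an offset. It shows only that a fully described packet can be placed without prior state, whereas a partially described packet is rejected until a communication context exists.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class FullyDescribedPacket:
    # Hypothetical metadata fields carried in the packet payload.
    message_id: int
    total_length: int
    offset: int
    data: bytes


@dataclass
class PartiallyDescribedPacket:
    # Assumed minimal fields: a handle to receiver-side state plus an offset.
    context_id: int
    offset: int
    data: bytes


class Receiver:
    def __init__(self) -> None:
        # Maps a message or context identifier to its reassembly buffer.
        self.contexts: Dict[int, bytearray] = {}

    def establish_context(self, context_id: int, total_length: int) -> None:
        # Must be called before any partially described packet arrives.
        self.contexts[context_id] = bytearray(total_length)

    def accept(self, packet: Union[FullyDescribedPacket, PartiallyDescribedPacket]) -> None:
        if isinstance(packet, FullyDescribedPacket):
            # Self-describing: the buffer can be created on first sight.
            buf = self.contexts.setdefault(
                packet.message_id, bytearray(packet.total_length)
            )
        else:
            if packet.context_id not in self.contexts:
                raise KeyError("no communication context for this packet")
            buf = self.contexts[packet.context_id]
        # Copy the packet data into place; arrival order does not matter.
        buf[packet.offset:packet.offset + len(packet.data)] = packet.data
```

In this sketch, fully described packets may arrive in any order and at any time, which corresponds to the dynamic routing noted above, while a partially described packet depends on establish_context having been called first.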
On both a Blue Gene system and other parallel computing systems, low latency messaging is often implemented using a low latency protocol (sometimes called eager messages) and high bandwidth messaging is implemented using a high bandwidth protocol (sometimes called rendezvous messages). Which message passing protocol is used may depend on cutoffs based on message size.
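A size-based cutoff of this kind may be sketched as follows. The threshold value and the function name select_protocol are assumptions chosen purely for illustration; actual cutoffs vary by system and are typically tuned.

```python
# Assumed threshold in bytes, chosen only for illustration.
EAGER_CUTOFF_BYTES = 1024


def select_protocol(message_bytes: int, cutoff: int = EAGER_CUTOFF_BYTES) -> str:
    """Return 'eager' for short messages and 'rendezvous' for long ones."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    # Short messages favor low latency; long messages favor high bandwidth.
    return "eager" if message_bytes <= cutoff else "rendezvous"
```

Under these assumptions, select_protocol(64) selects the eager protocol, while a message larger than the cutoff selects the rendezvous protocol.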
To achieve low message latency, a low latency protocol may specify to send a fully described initial packet followed by partially described data packets and to route all packets deterministically to maintain packet order. Alternatively, such a protocol may specify to send only fully described packets and to route the packets dynamically. In either case, the low latency protocol provides low bandwidth due to the requirement that all packets be fully (or partially) described. This requirement limits the amount of message data that may be included in each individual packet. Further, because deterministically routed packets each take the same route from a source to a destination, there is no opportunity to “route around” any congested network segments.
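The effect of per-packet metadata on the number of packets required may be estimated with a short calculation. The packet size and metadata size used in the usage note below are assumed values for illustration, not figures taken from any particular network.

```python
def packets_needed(message_bytes: int, packet_bytes: int, metadata_bytes: int) -> int:
    """Number of packets required when each packet reserves metadata_bytes."""
    data_per_packet = packet_bytes - metadata_bytes
    if data_per_packet <= 0:
        raise ValueError("metadata leaves no room for message data")
    # Integer ceiling division: a partially filled final packet still counts.
    return -(-message_bytes // data_per_packet)
```

For instance, assuming 256-byte packets, a 1 MiB message requires packets_needed(1048576, 256, 16) = 4370 packets when 16 bytes of each packet hold metadata, compared with 4096 packets when no metadata is carried.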
In contrast, to achieve high message bandwidth, a message passing protocol may specify to transmit partially described packets and to have packets routed dynamically. This protocol maximizes both the amount of data transmitted in each packet and the number of packets transmitted per unit time. However, the high bandwidth protocol requires that a communication context be initialized between a source node and a destination node before the high level message (e.g., an MPI message) is sent. Typically, to establish the communication context, a source node transmits a “request to send” packet to a destination node. In response, the destination node sets up a communication context for the message and returns a “clear to send” message to the source node. During this initialization, no data packets are sent. Thus, high bandwidth protocols incur higher latency, as the communication context needs to be initialized before any data packets containing the actual message are transmitted.
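The handshake described above, and its latency cost, may be sketched as follows. The function names, the event strings, and the simple delay model (which ignores context setup time and assumes a uniform one-way network latency) are all assumptions made for illustration.

```python
from typing import List


def rendezvous_transfer(message: bytes, packet_bytes: int = 256) -> List[str]:
    """Return the ordered events of one rendezvous-style send."""
    events = [
        "source->dest: request to send",       # no message data yet
        "dest: set up communication context",
        "dest->source: clear to send",
    ]
    # Data packets flow only after the handshake has completed.
    for offset in range(0, len(message), packet_bytes):
        chunk = message[offset:offset + packet_bytes]
        events.append(f"source->dest: data offset={offset} len={len(chunk)}")
    return events


def first_data_delay(one_way_latency: float, protocol: str) -> float:
    """Time until the first data packet arrives, under the assumed model."""
    if protocol == "eager":
        return one_way_latency          # data is sent immediately
    if protocol == "rendezvous":
        return 3 * one_way_latency      # request, clear to send, then data
    raise ValueError(f"unknown protocol: {protocol}")
```

Under this model, the first data packet of a rendezvous transfer arrives after three one-way trips rather than one, which is the latency penalty noted above.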
While each of these message passing protocols functions as intended, maintaining both protocols requires separate code bases, multiplying the code that must be created, debugged, and maintained. Further, it is often desirable to have messages transmitted in a manner that is both low latency and high bandwidth, particularly in a highly parallel computing system where even very small amounts of overhead can have a substantial impact on program performance.
Accordingly, there remains a need for a latency hiding message passing protocol that provides both low latency and high bandwidth characteristics for passing messages between nodes of a parallel computing system.