1. Field of the Invention
The present invention generally relates to data processing and more particularly to a queuing injection strategy in a parallel computing system.
2. Description of the Related Art
Powerful computers may be designed as highly parallel systems where the processing activity of hundreds, if not thousands, of processors (CPUs) is coordinated to perform computing tasks. These systems are highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, prime number factoring, and image processing (e.g., CGI animations and rendering), to name but a few examples.
For example, one family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene®. The Blue Gene/L architecture provides a scalable, parallel computer that may be configured with a maximum of 65,536 (2^16) compute nodes. Each compute node includes a single application specific integrated circuit (ASIC) with two CPUs and memory. The Blue Gene/L architecture has been successful and on Oct. 27, 2005, IBM announced that a Blue Gene/L system had reached an operational speed of 280.6 teraflops (280.6 trillion floating-point operations per second), making it the fastest computer in the world at that time. Further, as of June 2005, Blue Gene/L installations at various sites world-wide accounted for five of the ten most powerful computers in the world.
IBM is currently developing a successor to the Blue Gene/L system, named Blue Gene/P. Blue Gene/P is expected to be the first computer system to operate at a sustained 1 petaflops (1 quadrillion floating-point operations per second). Like the Blue Gene/L system, the Blue Gene/P system is scalable with a projected maximum of 73,728 compute nodes. Each compute node in Blue Gene/P is projected to include a single application specific integrated circuit (ASIC) with four CPUs and memory. A complete Blue Gene/P system is projected to include 72 racks with 32 node boards per rack.
In addition to the Blue Gene architecture developed by IBM, other highly parallel computer systems have been (and are being) developed. For example, a Beowulf cluster may be built from a collection of commodity off-the-shelf personal computers. In a Beowulf cluster, individual computer systems are connected using local area network technology (e.g., Ethernet) and system software is used to execute programs written for parallel processing on the cluster.
The compute nodes in a parallel system communicate with one another over one or more communication networks. For example, the compute nodes of a Blue Gene/L system are interconnected using five specialized networks. The primary communication strategy for the Blue Gene/L system is message passing over a torus network (i.e., a set of point-to-point links between pairs of nodes). The torus network allows application programs developed for parallel processing systems to use high-level interfaces such as the Message Passing Interface (MPI) and the Aggregate Remote Memory Copy Interface (ARMCI) to perform computing tasks and to distribute data among a set of compute nodes. Other parallel architectures (e.g., a Beowulf cluster) also use MPI and ARMCI for data communication between compute nodes. Of course, other message passing interfaces have been (and are being) developed. Low-level network interfaces communicate higher-level messages using small messages known as packets. Typically, MPI messages are encapsulated in a set of packets which are transmitted from a source node to a destination node over a communications network (e.g., the torus network of a Blue Gene system).
A “message passing protocol” is a set of instructions specifying how to create a set of packets from a message and how to reconstruct the message from a packet stream. Message passing protocols may be used to transmit packets in different ways depending on the desired communication characteristics. In a parallel system where a compute node has multiple communication links to other nodes, each compute node can send a point-to-point message to any other node. Packets injected onto the network generally follow one of two types of routing: adaptive or deterministic.
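As an illustration only, the two halves of a message passing protocol (creating packets from a message, and reconstructing the message from a packet stream) can be sketched as follows. The `Packet` fields, the function names `packetize` and `reassemble`, and the 32-byte payload size are hypothetical choices made for this sketch; they do not describe the packet format of any particular system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet layout, assumed for illustration."""
    msg_id: int     # identifies the message this packet belongs to
    seq: int        # position of this packet within the message
    total: int      # number of packets in the whole message
    payload: bytes  # slice of the original message


def packetize(msg_id: int, message: bytes, payload_size: int = 32) -> List[Packet]:
    """Split a message into fixed-size packets tagged with sequence numbers."""
    chunks = [message[i:i + payload_size]
              for i in range(0, len(message), payload_size)] or [b""]
    return [Packet(msg_id, seq, len(chunks), chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(packets: Iterable[Packet]) -> bytes:
    """Rebuild the message from its packets, whatever order they arrived in."""
    ordered = sorted(packets, key=lambda p: p.seq)
    if not ordered or len(ordered) != ordered[0].total:
        raise ValueError("packet stream is empty or incomplete")
    return b"".join(p.payload for p in ordered)
```

Because each packet in this sketch carries its own sequence number, `reassemble` recovers the message even when the packets arrive in a different order than they were created. The sorting and buffering that this requires at the destination is the cost of tolerating out-of-order delivery.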
In “adaptive routing,” the network hardware makes a routing decision at each hop in the network, steering packets toward the least congested link. Packets may arrive at the destination out-of-order if one path is less congested than another. Another source of out-of-order delivery is the injection of packets into the network using multiple injection queues. As is known, multiple injection queues may drain packets onto the network at different rates depending on wire congestion from cut-through traffic or other network hot spots. If multiple queues are draining packets from the same message, the packets may be injected onto the network out of sequence and, therefore, may arrive at the destination out-of-order, even if the packets each use the same path between compute nodes.
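The effect of multiple injection queues can be seen in a small simulation. The following is a minimal sketch under stated assumptions: packets of one message are striped round-robin across the queues, and each queue drains one packet every fixed number of ticks. The function name `inject` and the `drain_periods` parameter are hypothetical, and the fixed drain periods merely stand in for the varying congestion a real queue would experience.

```python
from collections import deque
from typing import List


def inject(num_packets: int, drain_periods: List[int]) -> List[int]:
    """Return the order in which packet sequence numbers reach the network.

    drain_periods[i] is the number of ticks queue i needs per packet
    (an assumed model: 1 = uncongested, larger = more congested).
    """
    queues = [deque() for _ in drain_periods]
    # Stripe the packets of a single message across all injection queues.
    for seq in range(num_packets):
        queues[seq % len(queues)].append(seq)

    order = []
    tick = 0
    while any(queues):
        tick += 1
        for queue, period in zip(queues, drain_periods):
            # A queue injects its head packet only on ticks it is ready.
            if queue and tick % period == 0:
                order.append(queue.popleft())
    return order


print(inject(6, [1]))     # single queue: [0, 1, 2, 3, 4, 5]
print(inject(6, [1, 2]))  # two queues, second is slower: [0, 2, 1, 4, 3, 5]
```

With a single queue the packets enter the network in sequence. With two queues draining at different rates, packet 2 is injected ahead of packet 1 even though nothing in the model reroutes packets, which matches the observation that reordering can occur when all packets use the same path.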
To eliminate out-of-order delivery, every packet of a message must use the same software message queue and the same packet queue, and deterministic routing must be used. In deterministic routing, packets sent between any two nodes always traverse the same route. For example, for a parallel system linking compute nodes in three dimensions, packets may always be routed first in an x-dimension, then in a y-dimension, then in a z-dimension. Thus, to send a message from a compute node at position <0, 0, 0> to a compute node at position <5, 5, 5>, packets first travel along the x-dimension to <5, 0, 0>, then along the y-dimension to <5, 5, 0>, and finally along the z-dimension to the destination of <5, 5, 5>. Using deterministic routing allows packets to be delivered in order. However, achieving in-order delivery is not always desirable because the synchronized/ordered network delivery frequently leads to poor performance. Further, this approach does not effectively use the available communication links often present in a parallel system, and deterministic routing cannot avoid any localized network congestion encountered along the static route.
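The x, then y, then z ordering described above can be sketched in a few lines. This is an illustrative simplification, not a description of any actual routing hardware: the function name `dimension_order_route` is hypothetical, and the sketch treats the network as a plain mesh, ignoring the wraparound links that a torus network would also offer.

```python
from typing import List, Sequence, Tuple


def dimension_order_route(src: Sequence[int], dst: Sequence[int]) -> List[Tuple[int, ...]]:
    """List every node visited from src to dst, resolving one dimension at a time.

    Assumption: a mesh without wraparound links, one hop per step.
    """
    current = list(src)
    path = [tuple(current)]
    for axis in range(len(current)):  # x first, then y, then z
        step = 1 if dst[axis] > current[axis] else -1
        while current[axis] != dst[axis]:
            current[axis] += step
            path.append(tuple(current))
    return path


route = dimension_order_route((0, 0, 0), (5, 5, 5))
print(route[5], route[10], route[-1])  # (5, 0, 0) (5, 5, 0) (5, 5, 5)
```

Every call with the same source and destination yields the same 15-hop route, which is what makes in-order delivery possible. It is also why such a route cannot steer around a congested link along the way.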
Accordingly, there remains a need for an injection and queuing strategy that takes advantage of a network having multiple communication links or paths, but at the same time preserves higher-order message semantics such as in-order processing of message packets.