1. Field of the Invention
The present invention generally relates to parallel computing systems and, more particularly, to a novel technique for asynchronous broadcasts in which ordered delivery of broadcast messages is maintained between compute nodes in a parallel computing system where packet header space is limited.
2. Description of the Prior Art
To achieve high performance computing, multiple individual processors have been interconnected to work cooperatively on a computational problem, allowing for parallel processing. In a parallel computing system, multiple processors can be placed on a single chip, or several chips, each containing one or several processors, embedded DRAM, system-on-chip integration, and local cache memories, can form so-called “compute nodes,” which are interconnected to form a parallel computing system.
In designing high bandwidth/Floating point Operations Per Second (FLOP) parallel computing systems, such as IBM's Blue Gene/L™, it is sometimes desirable to provide a communication software architecture with low overhead for communications. For example, in Blue Gene/L™, the communication software architecture is divided into three layers: at the bottom is the packet layer, a thin software library that allows access to network hardware, and at the top is the Message Passing Interface (MPI) library, discussed below. Between the packet layer and the MPI library layer is a single layer, called the message layer, that glues together the Blue Gene/L™ system. To achieve speed and efficiency in the system, a restriction is placed on the length of the packet header. In IBM's Blue Gene/L™, a packet header can only be a multiple of 32 bytes and is limited to no more than 256 bytes. The message layer is an active message system built on top of the packet layer that allows the transmission of arbitrary buffers among compute nodes with the assistance of the MPI library layer.
Parallel computer applications often use message passing to communicate between processors. The Message Passing Interface (MPI) specification is widely used for solving significant scientific and engineering problems on parallel computers; it provides a simple communication API and eases the task of developing portable parallel applications. See Message Passing Interface Forum, “MPI: A Message-Passing Interface Standard,” University of Tennessee, 1995; http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html.
MPI supports two types of communication: 1) point-to-point and 2) collective. In point-to-point messaging, a processor sends a message to another processor that is ready to receive it. A point-to-point communication operation can be used to implement local and unstructured communications.
In a collective communication operation, many processors participate together in the communication operation. In other words, processors are collected into groups, and each data packet sent from one processor to another is sent in a specific order or context and must be received in the same order or context. According to the MPI forum, contexts are further defined as providing the ability to have separate safe “universes” of message passing in MPI. Hence, a context is akin to an additional tag that differentiates messages. The parallel computer system manages this differentiation process using the MPI library. A group and a context in MPI together form a communicator, which encapsulates internal communication structures in a parallel computer system into modules. A processor is identified by its rank in the group associated with a specific communicator. Examples of collective operations are broadcast, barrier, and all-to-all.
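The role of a context as an additional tag can be illustrated with a minimal Python sketch. It is a toy model, not the MPI library itself: the names `Communicator`, `Message`, `receive`, `context_id`, and the processor identifiers are all hypothetical and chosen only for illustration. Two communicators share the same group but carry different contexts, so a message sent in one “universe” is never matched by a receive posted in the other, even when the source rank and tag are identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Communicator:
    """Hypothetical model: a group of processors plus a context identifier."""
    context_id: int
    group: tuple  # global processor ids; the index in the tuple is the rank

    def rank_of(self, processor_id):
        # A processor is identified by its rank within the communicator's group.
        return self.group.index(processor_id)


@dataclass(frozen=True)
class Message:
    context_id: int
    source_rank: int
    tag: int
    payload: str


def receive(queue, comm, source_rank, tag):
    """Remove and return the first queued message whose context, source rank
    and tag all match; return None if no message matches."""
    for i, msg in enumerate(queue):
        if (msg.context_id, msg.source_rank, msg.tag) == (comm.context_id, source_rank, tag):
            return queue.pop(i)
    return None


# Two communicators over the same group, separated only by their context.
world = Communicator(context_id=0, group=(10, 11, 12, 13))
library = Communicator(context_id=1, group=(10, 11, 12, 13))

# Both messages come from rank 0 with tag 7; only the context differs.
queue = [
    Message(library.context_id, 0, 7, "library-internal"),
    Message(world.context_id, 0, 7, "application"),
]

first = receive(queue, world, source_rank=0, tag=7)
```

In this sketch the receive posted on `world` skips the earlier library message and returns the application message, which is the isolation that contexts are meant to provide.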
MPI implements a one-to-all broadcast operation whereby a single named processor (root) sends the same data to all other processors. In other words, MPI's broadcast operation provides a data movement routine in which all processors interact with a distinguished root processor so that each processor receives its data. At the time of the broadcast call, the data to be communicated are located in a buffer in the root processor. The root processor's broadcast call specifies three things about the data: its location, its type, and the number of elements to be sent to each destination. After the call, the data are replicated in the buffers of all processors in the communicator.
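The effect of the broadcast described above can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not an MPI program: the function `broadcast` and the list-of-lists `buffers` are hypothetical stand-ins for the per-processor buffers, and the element type is implicit in the Python lists rather than passed as an argument. In the standard C binding the corresponding routine is `MPI_Bcast`, which takes the root rank and the communicator in addition to the buffer, count, and datatype.

```python
def broadcast(buffers, root, count):
    """Simulate a one-to-all broadcast.

    buffers: one list per rank, standing in for each processor's buffer
    root:    rank whose buffer holds the data before the call
    count:   number of elements to replicate to every other rank
    """
    data = buffers[root][:count]
    for rank in range(len(buffers)):
        if rank != root:
            # Overwrite the first `count` elements of this rank's buffer.
            buffers[rank][:count] = data
    return buffers


# Four ranks; before the call only rank 1 (the root) holds the data.
buffers = [[0, 0, 0] for _ in range(4)]
buffers[1] = [3, 1, 4]

broadcast(buffers, root=1, count=3)
```

After the call every entry of `buffers` holds the same three elements as the root's buffer, mirroring the replication described in the text.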