Modern computer systems often run a number of computer applications, each having a number of processes. Each such process requires a number of resources including memory, input/output devices, microprocessor time, etc. Due to the high number of processes and the limited amount of available resources, it is often necessary to run a number of programs simultaneously, which is known as parallel processing.
In parallel processing mode, each of the various parallel programs consists of cooperating processes, each process having its own memory. These processes send data to one another in the form of messages, with each message having a tag that may be used to sort the messages. A common approach to parallel programming is to use a message passing library, where a process uses the library calls to exchange messages (information) with another process. Such message passing allows processes running on multiple processors to cooperate with each other.
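The role of the tag can be illustrated with a minimal sketch. It is not a message passing library; the `Mailbox` class and its `deliver` and `receive` methods are hypothetical names introduced here, and the whole exchange is simulated inside a single Python process.

```python
class Mailbox:
    """Incoming-message queue of one process (hypothetical, for illustration)."""

    def __init__(self):
        self._queue = []  # messages in arrival order

    def deliver(self, source, tag, payload):
        # Called on behalf of the sending process.
        self._queue.append((source, tag, payload))

    def receive(self, tag):
        """Remove and return (source, payload) of the oldest message with `tag`."""
        for index, (source, msg_tag, payload) in enumerate(self._queue):
            if msg_tag == tag:
                self._queue.pop(index)
                return source, payload
        return None  # nothing with this tag has arrived yet


def demo_tagged_exchange():
    inbox = Mailbox()
    inbox.deliver(source=0, tag="data", payload=[1, 2, 3])
    inbox.deliver(source=0, tag="control", payload="stop")
    # The receiver may pick messages by tag rather than by arrival order.
    first = inbox.receive("control")
    second = inbox.receive("data")
    return first, second
```

Here the receiver takes the "control" message first even though the "data" message arrived earlier, which is the kind of sorting by tag described above.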
Every time a message is received by a device, it generates an event for that device. For example, a central processing unit (CPU) of a computer may receive messages from a keyboard, a mouse, a modem, a process running on a math processor, etc. Each message generates an event for the CPU, and the CPU processes these events according to an event processing algorithm or system. In general, devices often use a special type of event processing system for managing the various messages they receive.
As a number of different applications provided by a number of different vendors may be running together in parallel mode, it is necessary that the different processes be able to communicate with each other. To ensure such communication between various processes, a standard known as the message passing interface (MPI), defined by a group of organizations including various vendors and researchers, is used. Generally speaking, MPI is an interface designed to allow a user to write code such that processors in a network can send and receive data and information. MPI is available on a wide variety of platforms ranging from massively parallel systems (Cray T3D, Intel Paragon, etc.) to networks of workstations (Sun4, etc.). Most of the commonly used MPI systems operate according to the MPI standard, which specifies point-to-point communication in the form of various send and receive calls from multiple fabrics to one or more applications.
One of the problems encountered by an MPI implementation that has to drive several communication fabrics at once (e.g., shared memory and one or more networks) is the so-called “multi-fabric MPI_ANY_SOURCE receive operation”. This term refers to a very common situation in which any fabric can serve as the source of an incoming message. It occurs whenever the wildcard MPI_ANY_SOURCE is used as the rank of the source process in the operations MPI_Recv and MPI_Irecv, and the respective MPI communicator spans several fabrics.
According to the MPI standard, the MPI implementation has to ensure at all times that the messages delivered from any one source MPI process to a given destination process remain ordered within communicator and tag scope (ordering requirement), and that any fabric can make progress on its respective messages despite any blocking possibly experienced by other fabrics (progress requirement). Achieving these goals in a multi-fabric case would be simple if it were not also imperative to leave the latency and bandwidth of the message passing as unaffected as possible.
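The interaction between the wildcard receive and the ordering requirement can be sketched as follows. This is a simplified model under stated assumptions, not an MPI implementation: `ANY_SOURCE` is a stand-in for MPI_ANY_SOURCE with an arbitrary value, `match_receive` is a hypothetical helper, and the pending list is assumed to already hold messages in their arrival order at the destination, regardless of which fabric carried them.

```python
ANY_SOURCE = -1  # stand-in for the MPI_ANY_SOURCE wildcard; the value is arbitrary here


def match_receive(pending, source, tag):
    """Remove and return the oldest pending message matching (source, tag).

    `pending` is a list of (source_rank, tag, payload) tuples in arrival order.
    Scanning from the front keeps messages from one source in order.
    """
    for index, (msg_source, msg_tag, _payload) in enumerate(pending):
        if msg_tag == tag and source in (ANY_SOURCE, msg_source):
            return pending.pop(index)
    return None


def demo_wildcard_order():
    # Assumed scenario: rank 1 is reached over one fabric, rank 2 over another.
    pending = [(1, 7, "a1"), (2, 7, "b1"), (1, 7, "a2"), (2, 9, "other")]
    received = []
    while True:
        message = match_receive(pending, ANY_SOURCE, 7)
        if message is None:
            break
        received.append(message)
    return received, pending
```

In this sketch the two messages from rank 1 are received in the order they were sent ("a1" before "a2"), while the message with a different tag is left pending. The difficulty described above arises because a real multi-fabric implementation has no single, already ordered list of this kind.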
There are at least two known methods that try to solve this problem. According to one method, one can poll all fabrics according to a certain strategy that ensures progress. In this case the polling operations must not block, and thus cannot use potentially more efficient and less CPU-intensive waiting mechanisms. In addition, some adaptive polling scheme must be introduced in order to accommodate possible differences in the typical latencies of the respective fabrics.
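The first method can be sketched as a simple round-robin loop. This is an illustration only: the `Fabric` class, its `try_receive` method, and `poll_any` are hypothetical names, each fabric is simulated by a fixed schedule of poll results, and no adaptive weighting of the fabrics is attempted.

```python
from collections import deque


class Fabric:
    """Simulated fabric whose polls return scripted results (hypothetical)."""

    def __init__(self, name, schedule):
        self.name = name
        self._schedule = deque(schedule)  # None means "nothing ready on this poll"
        self.polls = 0

    def try_receive(self):
        """Non-blocking probe: return a message if one is ready, else None."""
        self.polls += 1
        if self._schedule:
            return self._schedule.popleft()
        return None


def poll_any(fabrics, max_rounds=1000):
    """Poll every fabric in turn until one yields a message or rounds run out."""
    for _ in range(max_rounds):
        for fabric in fabrics:
            message = fabric.try_receive()
            if message is not None:
                return fabric.name, message
    return None


def demo_polling():
    shm = Fabric("shm", [None, None, None])
    net = Fabric("net", [None, "hello"])
    result = poll_any([shm, net])
    return result, shm.polls, net.polls
```

Because no call may block, the loop keeps spinning on the idle fabric while it waits for the other one, which is the CPU cost noted above.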
According to a second method, one can ask all fabrics to watch for a certain message, and then let the fabric that first detects a matching message proceed with the receive operation. In this case, one has to cancel the (potentially partially completed) receive requests in all other fabrics, which may or may not be possible. Note also that in this case some mutual exclusion scheme must be introduced if the respective fabrics are to be allowed to deliver data directly to the application memory.
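The second method, including the mutual exclusion it calls for, might look roughly like the sketch below. All names (`SharedReceive`, `Fabric`, `try_complete`, `on_message`) are hypothetical, the fabrics are simulated sequentially in one thread, and cancellation is assumed to always succeed, which the text above notes is not guaranteed in practice.

```python
import threading


class SharedReceive:
    """One logical wildcard receive posted to several fabrics (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()  # guards the claim on the application buffer
        self.winner = None
        self.buffer = None

    def try_complete(self, fabric_name, payload):
        """Let the first fabric to call claim the request; later callers lose."""
        with self._lock:
            if self.winner is not None:
                return False
            self.winner = fabric_name
            self.buffer = payload
            return True


class Fabric:
    def __init__(self, name):
        self.name = name
        self.posted = None
        self.cancelled = False

    def post(self, request):
        self.posted = request

    def cancel(self):
        # Assumption: cancellation always succeeds in this sketch.
        self.posted = None
        self.cancelled = True

    def on_message(self, payload):
        if self.posted is None:
            return False  # no receive posted here; the message stays unmatched
        return self.posted.try_complete(self.name, payload)


def demo_post_and_cancel():
    shm, net = Fabric("shm"), Fabric("net")
    request = SharedReceive()
    for fabric in (shm, net):
        fabric.post(request)
    net_won = net.on_message("from-net")  # the network fabric matches first
    for fabric in (shm, net):
        if fabric.name != request.winner:
            fabric.cancel()
    shm_won = shm.on_message("from-shm")  # arrives after the cancel
    return request.winner, request.buffer, net_won, shm_won, shm.cancelled
```

The lock ensures that only one fabric writes into the shared buffer, and the explicit cancel step corresponds to withdrawing the request from every fabric that did not win.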
To address the above problems, it is desirable to provide an MPI system that efficiently processes incoming messages from multiple fabrics, while meeting the MPI requirements of ordering and progress among the multiple fabrics and leaving the latency and bandwidth of the message passing as unaffected as possible.