The invention relates generally to the field of digital computer systems, and more specifically to robust systems and methods for facilitating communications among processes executed in a shared-memory computer system.
Computers typically execute programs in one or more processes or threads (generally xe2x80x9cprocessesxe2x80x9d) on one or more processors. If a program comprises a number of cooperating processes which can be processed in parallel on a plurality of processors, sometimes groups of those processes need to communicate to cooperatively solve a particular problem. Two basic architectures have been for multi-processor computer systems, namely, distributed memory systems and shared memory systems. In a computer system constructed according to the distributed memory architecture, processors and memory are allocated to processing nodes, with each processing node typically having a processor and an associated xe2x80x9cnode memoryxe2x80x9d portion of the system memory. The processing nodes are typically interconnected by a fast network to facilitate transfer of data from one processing node to another when needed for, for example, processing operations performed by the other processing node. Typically in a computer constructed according to the distributed memory architecture, a processor is able to access data stored in its node memory faster that it would be able to access data stored in node memories on other processing nodes. However, contention for the node memory on each processing node is reduced since there is only one processor, that is, the processor on the processing node, which accesses the node memory for its processing operations, and perhaps a network interface which can access the node memory to store therein data which it received from another processing node, or to retrieve data therefrom for transfer to another processing node.
Typically, in a computer system constructed according to the shared memory architecture, the processors share a common memory, with each processor being able to access the entire memory in a uniform manner. This obviates the need for a network to transfer data, as is used in a computer system constructed according to the distributed memory architecture; however, contention for the shared memory can be greater than in a computer system constructed according to the distributed memory architecture. To reduce contention, each processor can be allocated a region of the shared memory which it uses for most of its processing operations. Although each processor""s region is accessible to the other processors so that they (that is, the other processors) can transfer data thereto for use in processing by the processor associated with the respective region, typically most accesses of a region will be by the processor associated with the region.
A computer system can be constructed according to a combination of the distributed and shared memory architectures. Such a computer system comprises a plurality of processing nodes interconnected by a network, as in a computer system constructed according to the distributed memory architecture. However, each processing node can have a plurality of processors which share the memory on the respective node, in a manner similar to a computer constructed according to the shared memory architecture.
Several mechanisms have been developed to facilitate transfer of data among processors, or more specifically, between processing node memories, in the case of a computer system constructed according to the distributed memory architecture, and/or memory regions, in the case of a computer system constructed according to the shared memory architectures. In one popular mechanism, termed xe2x80x9cmessage passing,xe2x80x9d processors transfer information by passing messages thereamong. Several well-known message passing specifications have been developed, including MPI and PVM. Generally, in message passing, to transfer data from one processor to another, the transferring processor generates a message including the data and transfers the message to the other processor. On the other hand, when one processor wishes to retrieve data from another processor, the retrieving processor generates a message including a retrieval request and transfers the message to the processor from which the data is to be retrieved; thereafter, the processor which receives the message executes the retrieval request and transfers the data to the requesting processor in a message as described above.
The invention provides a new and improved system and method for facilitating communications among processes in a shared memory computer system.
In brief summary, the invention provides a communications arrangement for facilitating transfer of messages among a plurality of processes in a computer system. The communications arrangement comprises a channel data structure, a status daemon and an exit handler. The channel data structure includes a channel status flag normally having one of a plurality of conditions, and a plurality of storage locations each configured to receive message information. The status daemon is configured to determine the operational status of the processes. The exit handler is configured to, in response to the status daemon determining a predetermined condition in connection with at least one of the processes, condition the channel status flag to another of the conditions, thereby to indicate to the other processes a failure condition in connection with the communications arrangement.