This invention relates to message passing in a computer system with a plurality of asynchronous computing nodes interconnected for transmission of messages between threaded user tasks executing in ones of the computing nodes, and in particular, to a capability for user handling of one or more message packets transmitted from a source computing node (sender) to a receiver computing node (receiver) in a computing environment, wherein the receiver has a threaded message passing interface (MPI).
Technological advances have made it possible to interconnect many processors and memories to build powerful, cost effective computer systems. Distributing computation among the processors allows for increased performance due to improved parallel execution. The performance of a multi-node computing system, however, depends on many factors such as flow control mechanisms, scheduling, the interconnection scheme between the nodes of the system, and the implementation of inter task communication.
A multi-task or parallel application includes multiple user tasks running on multiple nodes of one computer system or multiple computer systems. The user tasks communicate with one another via a message passing interface on the nodes running the user tasks. Specifically, a message packet can be sent within a multi-node computer environment between user tasks executing in ones of the computing nodes. The message packet is transmitted from a source computing node (sender) to a receiver computing node (receiver). Conventionally, user tasks communicate with one another via a message passing mechanism, such as defined by the Message Passing Interface (MPI) Standard. The MPI Standard is described, for example, in message passing interface format materials entitled "MPI: A Message-Passing Interface Standard, Version 1.1," University of Tennessee, Knoxville, Tenn., Jun. 6, 1995, the entirety of which is hereby incorporated herein by reference.
Unfortunately, the MPI Standard provides no mechanism for asynchronous notification of receipt of a message or message packet at a user task, so programmers must devise other ways of receiving notification of events.
In one implementation of the MPI Standard, embodied in the IBM Parallel Environment for AIX (herein referred to as the "signal handling library"), arrival of a message packet at a receiving node may cause a UNIX SIGIO signal to be sent to the receiving process. When a process receives a signal, its normal instruction stream is interrupted, and the first instruction in the registered signal handler is executed. Control remains in the signal handler until it returns. The MPI library registers a signal handler to catch this signal. The MPI-library-supplied signal handler reads the packet into the user's memory, checks it for duplication and valid format, and matches it to its destination. Then it returns, causing the user's program to resume execution at the point of interruption.
Although the MPI library registers a signal handler for SIGIO, it is well-known that the user program may also register a signal handler for the same signal, in which case the user's signal handler gets control when a SIGIO signal is sent to a process. In this way, a user may obtain notification of a message packet arrival by intercepting the SIGIO signal intended for the MPI library.
The development of multi-processor computing nodes has been accompanied by the development of programming models that can exploit these hardware platforms. One such model is a "threads" model, which has recently been standardized by the POSIX Organization. Basic thread management under the POSIX Standard is described, for example, in a publication by K. Robbins and S. Robbins entitled Practical UNIX Programming: A Guide to Concurrency, Communication and Multi-Threading, published by Prentice Hall PTR (1996). Briefly described, when a program executes, the CPU uses the process program counter value to determine which instruction to execute next. The resulting stream of instructions is called the "program's thread of execution".
A natural extension of the process model is to allow multiple threads to execute within the same process. This extension provides an efficient way to manage threads of execution that share both code and data by avoiding context switches. Each thread of execution is associated with a "thread," i.e., an abstract data type representing flow of control within a process. A "thread" has its own execution stack, program counter value, register set, and state. By declaring many threads within the confines of a single process, a programmer can achieve parallelism at low overhead.
In an alternative implementation of the MPI Standard, embodied in the IBM Parallel Environment for AIX (herein referred to as "the threaded library"), arrival of a message packet at a receiving node may wake a thread in the receiving process, causing it to resume execution concurrently with other threads comprising the process. This thread is called an "interrupt service thread", and the thread is created when the MPI library is initialized. The interrupt service thread reads the packet into the user's memory, checks it for duplication and valid format, and matches it to the destination. Then it calls a function provided by the AIX kernel to sleep until the next message packet arrives. However, the user's program has no knowledge of the identity of the interrupt service thread, and hence has no way to obtain notification of packet arrival, or to take action based thereon.
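For illustration, the sleep-and-wake behavior of an interrupt service thread can be approximated in C with POSIX threads. This is a sketch under stated assumptions, not the threaded library's code: a POSIX condition variable stands in for the AIX kernel sleep function, a counter stands in for the packet queue, and all names (service_loop, simulate_packet_arrival, run_service_demo) are hypothetical. Note that nothing in the loop gives the user's program a hook, which is the limitation described above.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int pending = 0;        /* packets waiting to be serviced */
static int handled = 0;        /* packets the service thread has processed */
static int shutting_down = 0;  /* set to ask the service thread to exit */

/* Interrupt service thread: sleeps until a packet is pending, services it,
 * then sleeps again. Exits once shutdown is requested and nothing is left. */
static void *service_loop(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (pending == 0 && !shutting_down)
            pthread_cond_wait(&wake, &lock);   /* sleep until woken */
        if (pending == 0)
            break;                             /* shutdown, queue drained */
        pending--;
        handled++;   /* stand-in for: read packet, validate, match */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Stand-in for packet arrival: queue one packet and wake the thread. */
void simulate_packet_arrival(void)
{
    pthread_mutex_lock(&lock);
    pending++;
    pthread_cond_signal(&wake);
    pthread_mutex_unlock(&lock);
}

/* Starts the service thread, delivers `packets` arrivals, shuts the thread
 * down, and returns how many packets it handled (-1 if thread creation fails). */
int run_service_demo(int packets)
{
    pthread_t tid;
    int i;

    pending = 0;
    handled = 0;
    shutting_down = 0;
    if (pthread_create(&tid, NULL, service_loop, NULL) != 0)
        return -1;
    for (i = 0; i < packets; i++)
        simulate_packet_arrival();

    pthread_mutex_lock(&lock);
    shutting_down = 1;
    pthread_cond_signal(&wake);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return handled;
}
```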
Thus, a need exists in the art for a threaded MPI which allows threaded user tasks to be notified of an external event such as receipt of a message packet, and which allows the user task to take a predefined action in response to asynchronous arrival of the message packet at the receiver.
Briefly summarized, in one aspect a method for processing a message packet within a computer environment having a plurality of threaded computing nodes is provided. The computing nodes are interconnected for transmission of messages between threaded user tasks executing asynchronously in ones of the computing nodes. A message is transmitted as at least one message packet from a source computing node (sender) to a receiver computing node (receiver). The receiver includes a threaded message passing interface (MPI). The method includes: responsive to asynchronous arrival of the at least one message packet at the receiver, employing an interrupt service thread at the receiver to call a user-defined program; and, employing the user-defined program to take a predefined action in response to the asynchronous arrival of the at least one message packet at the receiver.
To restate, a technique for user handling of asynchronous message packets in a multi-node threaded computing environment is provided. The technique involves defining an interrupt service thread in the MPI library and giving it a means to invoke a user-predefined program on arrival of a message packet. The user-predefined program is then employed to take action in response to the asynchronous arrival of the at least one message packet at the receiver. This action will typically include initiating a receipt function to interpret the message packet at the threaded MPI. Thus, in accordance with the present invention, threaded user tasks are provided with the capability to respond to an asynchronous message arrival without requiring polling or creating additional threads. Further, the user-defined program, which is called by the interrupt service thread in accordance with the present invention, does not have to be signal safe (i.e., a program function which can be called without corrupting other data). One anticipated use is that the user's code will execute thread calls to wake other threads waiting on multiple events. Advantageously, in accordance with the present invention "signal handlers" are reinterpreted for a threaded environment to give a user task the same sort of capability in the threaded environment that the user task would have had in a signal handling environment.
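As a non-limiting illustration, the technique summarized above might be sketched in C with POSIX threads as follows. The registration function register_arrival_handler, the type arrival_handler_t, and every other name here are hypothetical; the text does not give the actual interface of the threaded MPI library. A loop in service_thread stands in for the interrupt service thread accepting packets. The user-defined handler takes a mutex and broadcasts on a condition variable, operations that would not be permitted in a signal handler but are acceptable here because the handler runs on an ordinary thread.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical type of the user-defined program invoked on packet arrival. */
typedef void (*arrival_handler_t)(void *user_data);

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static arrival_handler_t user_handler = NULL;
static void *user_handler_data = NULL;

/* Hypothetical registration call: installs (or clears, with NULL) the handler. */
void register_arrival_handler(arrival_handler_t fn, void *data)
{
    pthread_mutex_lock(&reg_lock);
    user_handler = fn;
    user_handler_data = data;
    pthread_mutex_unlock(&reg_lock);
}

/* Called by the interrupt service thread after it has accepted a packet. */
static void notify_user(void)
{
    arrival_handler_t fn;
    void *data;

    pthread_mutex_lock(&reg_lock);
    fn = user_handler;
    data = user_handler_data;
    pthread_mutex_unlock(&reg_lock);
    if (fn != NULL)
        fn(data);   /* runs on the service thread, outside reg_lock */
}

/* User-side event that other threads can wait on. */
struct event {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int             count;   /* arrivals seen so far */
};

/* User-defined handler: records the arrival and wakes all waiting threads. */
static void wake_waiters(void *p)
{
    struct event *ev = p;

    pthread_mutex_lock(&ev->m);
    ev->count++;
    pthread_cond_broadcast(&ev->cv);
    pthread_mutex_unlock(&ev->m);
}

/* Stand-in for the interrupt service thread: "receives" *arg packets. */
static void *service_thread(void *arg)
{
    int packets = *(int *)arg;
    int i;

    for (i = 0; i < packets; i++)
        notify_user();
    return NULL;
}

/* The calling thread plays a user thread that blocks until `packets`
 * arrivals have been reported. Returns the count seen, or -1 on error. */
int run_callback_demo(int packets)
{
    struct event ev;
    pthread_t tid;
    int seen;

    pthread_mutex_init(&ev.m, NULL);
    pthread_cond_init(&ev.cv, NULL);
    ev.count = 0;
    register_arrival_handler(wake_waiters, &ev);

    if (pthread_create(&tid, NULL, service_thread, &packets) != 0)
        return -1;

    pthread_mutex_lock(&ev.m);
    while (ev.count < packets)
        pthread_cond_wait(&ev.cv, &ev.m);   /* no polling */
    seen = ev.count;
    pthread_mutex_unlock(&ev.m);

    pthread_join(tid, NULL);
    register_arrival_handler(NULL, NULL);
    pthread_cond_destroy(&ev.cv);
    pthread_mutex_destroy(&ev.m);
    return seen;
}
```

In this sketch the waiting thread is woken by the handler rather than by polling, and the user creates no thread beyond the one already provided by the library.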