The invention relates to multiprocessor computers and more particularly to a message passing interface (MPI) application programming interface (API) for passing messages between multiple tasks or processes. The present invention is even more particularly related to message passing using a shared memory buffer.
S/390 and IBM are registered trademarks of International Business Machines Corporation, Armonk, N.Y., U.S.A. and Lotus is a registered trademark of its subsidiary Lotus Development Corporation, an independent subsidiary of International Business Machines Corporation, Armonk, N.Y. Other names may be registered trademarks or product names of International Business Machines Corporation or other companies.
Message Passing Interface (MPI) defines a standard application programming interface (API) for using several processes at one time to solve a single large problem, called a “job,” on a symmetric multiprocessor and often multi-node computer (i.e., commonly one process per node). Message passing is equally applicable, however, to a uniprocessor computer. Each job can include multiple processes. A process can also commonly be referred to as a task. Another software structure analogous to a task is a thread, which can be thought of as a small software component used in multi-tasking, multi-threaded software systems.
Each process, task or thread can compute independently except when it needs to exchange data with another task. When the process, task or thread needs to pass data from, e.g., one task to another, the operation is known as passing a “message.” Examples of symmetric multiprocessor computers include, e.g., an IBM RISC System 6000/SP available from IBM Corporation, Armonk, N.Y., and supercomputers available from Cray, Silicon Graphics, Hewlett Packard, Thinking Machines, and other computers from such companies as SUN Microsystems, Hewlett Packard, Intel, and the like.
Specifically, a programmer can use an explicit MPI_SEND to identify what data from the memory of a source task is to be sent as a given message. The programmer can also use an explicit MPI_RECV at a destination task to identify where the data is to be placed in the receiver memory.
To simplify the description which follows, sending of messages will be described although the same processing would apply to the receiving of messages. To send a message, data is gathered from memory and fed to a transport layer at the rate that the transport layer is able to accept. Bytes of a message are forwarded in chunks which can be known as packets and the transport layer can dictate the size of each chunk. When the transport layer is ready to accept N bytes, then N bytes are copied from the proper memory locations into a transport buffer which can be referred to as a “pipe.” The data gather logic delivers a specific number of bytes at each activation and then at the next activation, picks up where it left off to deliver more bytes.
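The resumable gather behavior described above can be illustrated with a minimal sketch. The class name GatherCursor, the method next_chunk, and the 8-byte chunk size are hypothetical names and values chosen for illustration only; they are not taken from the MPI standard or from any particular transport layer.

```python
class GatherCursor:
    """Hypothetical gather logic: feeds bytes from several memory regions
    to a transport in chunks, resuming where the last call stopped."""

    def __init__(self, regions):
        # regions: bytes-like objects, in the order they are to be sent
        self._regions = [memoryview(r) for r in regions]
        self._region = 0   # index of the region currently being read
        self._offset = 0   # read position inside that region

    def next_chunk(self, n):
        """Return up to n bytes; an empty result means the message is done."""
        out = bytearray()
        while len(out) < n and self._region < len(self._regions):
            view = self._regions[self._region]
            take = min(n - len(out), len(view) - self._offset)
            out += view[self._offset:self._offset + take]
            self._offset += take
            if self._offset == len(view):
                # Region exhausted: move on to the next one.
                self._region += 1
                self._offset = 0
        return bytes(out)


# The "transport" here accepts 8 bytes per activation (an assumed size).
cursor = GatherCursor([b"header", b"payload-bytes", b"end"])
pipe = []
while True:
    chunk = cursor.next_chunk(8)
    if not chunk:
        break
    pipe.append(chunk)
```

Each call to next_chunk corresponds to one activation of the gather logic; the cursor state is what allows the next activation to pick up where the previous one left off.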
Receiving a message is a mirror image of the sending of one. Some number of bytes becomes available from a pipe and must be distributed. It would be apparent to those skilled in the art that the concepts involved in sending and receiving are so closely related that to understand one is to understand the other.
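As a rough illustration of this mirror-image relationship, the following sketch distributes arriving chunks across destination buffers and remembers its position between activations. The class name ScatterCursor, the method deliver, and the buffer sizes are hypothetical and chosen only for this example.

```python
class ScatterCursor:
    """Hypothetical scatter logic: distributes bytes arriving from a pipe
    into several destination buffers, resuming between activations."""

    def __init__(self, buffers):
        # buffers: writable bytearrays, in the order they are to be filled
        self._buffers = [memoryview(b) for b in buffers]
        self._buffer = 0   # index of the buffer currently being filled
        self._offset = 0   # write position inside that buffer

    def deliver(self, chunk):
        """Copy chunk into the destinations; return the count of bytes
        that did not fit because all destinations are full."""
        data = memoryview(chunk)
        pos = 0
        while pos < len(data) and self._buffer < len(self._buffers):
            view = self._buffers[self._buffer]
            take = min(len(data) - pos, len(view) - self._offset)
            view[self._offset:self._offset + take] = data[pos:pos + take]
            pos += take
            self._offset += take
            if self._offset == len(view):
                # Buffer full: continue with the next destination.
                self._buffer += 1
                self._offset = 0
        return len(data) - pos


header = bytearray(6)
body = bytearray(10)
cursor = ScatterCursor([header, body])
leftover_first = cursor.deliver(b"headerpa")   # first chunk from the pipe
leftover_second = cursor.deliver(b"yload-by")  # second chunk from the pipe
```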
The MPI standard was designed for distant communication, i.e., for message passing between tasks executing on separate nodes. This type of message passing is referred to as internode message passing. When a sending and receiving task are located on the same node, message passing can be achieved by intranode message passing. Alternatively, for intranode communication, use of shared memory buffers for inter process or inter task communication is possible. Unfortunately, no provision exists for permitting transparent use of communication from both local sending tasks (intranode) and external sending tasks (internode) to a local receiving task.
U.S. Pat. No. 5,434,975 to Allen (“Allen”), the contents of which are incorporated herein by reference in their entireties, discloses a conventional IPC system. Conventionally, when a plurality of tasks, associated with a common processor node in, e.g., a uniprocessor or a symmetric multiprocessor computer system, sought to communicate with one another, various means of interprocess communication (IPC) could be used. For example, Allen describes a conventional shared memory (only) message passing system including a sender/receiver pair with message queues and “signaling” from the sender to receiver. Allen uses a Unix IPC called a semaphore. The receiver in Allen has only one source of messages, i.e., shared memory. Unfortunately, Allen does not teach or suggest a system that supports message passing where messages originate from two sources, one of which is not local. Such a system needs something other than a conventional signal, since the IPC signal can only be generated for a local connection.
U.S. Pat. No. 5,652,885 to Reed et al. (“Reed”), the contents of which are incorporated herein by reference in their entireties, discloses a system using a Unix datagram socket as a signaling mechanism, in which messages are expected to be communicated entirely via shared memory. The receiver waits either for a select or for a signal. Reed also does not address message passing where messages originate from non-local sources.
U.S. Pat. No. 5,835,764 to Platt (“Platt”), the contents of which are incorporated herein by reference in their entireties, discloses a “remote-procedure-call-like” mechanism in which various threads are suspended until their dependent (synchronous) functions are completed. Unfortunately, Platt also does not teach or suggest a system or method that handles a multiplicity of message source types (i.e., local and distant).
U.S. Pat. No. 5,469,549 to Simpson (“Simpson”), the contents of which are incorporated herein by reference in their entireties, discloses a system supporting communication via partitioned shared memory. Unfortunately, Simpson does not teach or suggest any external interfaces.
U.S. Pat. No. 5,313,638 to Ogle (“Ogle”), the contents of which are incorporated herein by reference in their entireties, discloses a system supporting UNIX semaphore synchronization, i.e., message passing into slots controlled by a semaphore. Unfortunately, Ogle does not teach or suggest any support for message passing from external device sources.
It is desired that an improved method be provided to permit transparent receipt of communications to local receiving tasks from local sending tasks and external sending tasks.
Briefly, the present invention provides a system, method and computer program product for transparently handling messages originating from local shared memory and from an external source. Conventional approaches supported either shared memory only or an external source only. The present invention includes a local sender task putting messages into shared memory, and a distant sender task sending messages via a communications link. The receiver task can initially be waiting for a packet arrival interrupt from the communication link. A hardware interrupt advantageously can call a software service notification function to wake the waiting thread of the receiver task (in one embodiment, this thread could be waiting on a Dijkstra semaphore). The software service notification function can be provided as part of an operating system (OS) by a kernel function, or more commonly, by a device driver which can support the communication link. The present invention can include adding an additional function to the device driver which can allow the local sender to identify and wake up the waiting receiver task thread, thereby simulating a packet arrival hardware interrupt. When the receiver task thread awakes, it can examine both shared memory and hardware message queues for work to do.
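A user-space sketch can illustrate the idea of a single wake-up path serving both message sources. It is not the disclosed driver code: the class Receiver and the names packet_arrival_interrupt, local_notify and wake_thread are hypothetical, and a Python counting semaphore stands in for the kernel wake mechanism.

```python
import threading
from collections import deque


class Receiver:
    """Hypothetical receiver task that sleeps until woken, then examines
    both the shared memory queue and the adapter (hardware) queue."""

    def __init__(self):
        self.shared_memory_queue = deque()   # filled by local senders
        self.adapter_queue = deque()         # filled by the simulated adapter
        self._wake = threading.Semaphore(0)  # counting semaphore, initially 0
        self.received = []

    def wake_thread(self):
        # The single wake-up path used by both message sources.
        self._wake.release()

    def packet_arrival_interrupt(self, packet):
        # Stands in for the interrupt path taken when a packet arrives.
        self.adapter_queue.append(packet)
        self.wake_thread()

    def local_notify(self):
        # Added service: lets a local sender simulate a packet arrival.
        self.wake_thread()

    def run(self, expected):
        while len(self.received) < expected:
            self._wake.acquire()             # sleep until some source wakes us
            for origin, queue in (("local", self.shared_memory_queue),
                                  ("external", self.adapter_queue)):
                while queue:
                    self.received.append((origin, queue.popleft()))


receiver = Receiver()
thread = threading.Thread(target=receiver.run, args=(2,))
thread.start()

# Local sender: place the message in shared memory, then ask for a wake-up.
receiver.shared_memory_queue.append(b"from local task")
receiver.local_notify()

# Distant sender: the message arrives over the link and raises an interrupt.
receiver.packet_arrival_interrupt(b"from distant task")

thread.join()
```

The receiver does not need to know which source woke it; it checks both queues on every wake-up, which is what makes the two sources transparent to the receiving task.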
In an example embodiment of the present invention, a method is disclosed for transparently handling message passing from a plurality of local and external source tasks, the method providing a software routine that simulates handling of an occurrence of a hardware interrupt, the method including the steps of copying a local source message into a shared memory buffer from a local sending task, invoking a local notify system service module including passing a window number identifying a receiving task, invoking a wake thread module awakening a thread associated with the window number, examining the shared memory buffer for receipt of the local source message, and copying the local source message from the shared memory buffer to the receiving task.
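The steps of this method can be sketched in user space as follows. This is a simplified illustration under stated assumptions: the names Window, WINDOWS, local_send, local_notify, wake_thread and receive, and the window number 7, are hypothetical, and a Python semaphore stands in for the wake mechanism that a device driver or kernel would provide.

```python
import threading


class Window:
    """Hypothetical per-receiver state, looked up by window number."""

    def __init__(self):
        self.wake = threading.Semaphore(0)  # receiver thread sleeps on this
        self.lock = threading.Lock()        # protects the shared buffer
        self.shared_buffer = []             # stands in for shared memory


# Assumed table mapping window numbers to receiving tasks.
WINDOWS = {7: Window()}


def wake_thread(window_number):
    # Awaken the thread associated with the window number.
    WINDOWS[window_number].wake.release()


def local_notify(window_number):
    # System service entry point: simulates handling of a hardware interrupt.
    wake_thread(window_number)


def local_send(window_number, message):
    window = WINDOWS[window_number]
    with window.lock:
        window.shared_buffer.append(bytes(message))  # copy into shared buffer
    local_notify(window_number)                      # pass the window number


def receive(window_number, inbox):
    window = WINDOWS[window_number]
    window.wake.acquire()                            # wait to be awakened
    with window.lock:
        while window.shared_buffer:                  # examine shared buffer
            inbox.append(window.shared_buffer.pop(0))  # copy to receiving task


inbox = []
receiver_thread = threading.Thread(target=receive, args=(7, inbox))
receiver_thread.start()
local_send(7, b"hello")
receiver_thread.join()
```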
In one embodiment of the present invention, the method can further include sending an external source message on a communications link by an external sending task, receiving the external source message from the communications link at a communication adapter, triggering a hardware interrupt by the communication adapter, copying the external source message into an external message buffer from the communication adapter, invoking a first level interrupt handler (FLIH) handling the hardware interrupt, invoking a second level interrupt handler (SLIH) handling the FLIH, invoking the wake thread module awakening a thread associated with the receiving task, examining the external message buffer for receipt of the external source message, and copying the external source message from the external message buffer to the receiving task.
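The external path can be illustrated with a small simulation that records the order in which each stage runs. Ordinary function calls stand in for the hardware interrupt and the handler dispatch, and the names adapter_receive, flih, slih, wake_thread and receiving_task are hypothetical; the sketch also assumes the adapter stores the message before the interrupt is raised.

```python
import threading

trace = []                     # order in which the stages run
external_message_buffer = []   # filled by the simulated adapter
delivered = []                 # memory of the receiving task
receiver_wake = threading.Semaphore(0)


def wake_thread():
    trace.append("wake_thread")
    receiver_wake.release()    # awaken the thread of the receiving task


def slih():
    trace.append("SLIH")       # second level interrupt handler
    wake_thread()


def flih():
    trace.append("FLIH")       # first level interrupt handler
    slih()


def adapter_receive(packet):
    trace.append("adapter")
    external_message_buffer.append(packet)  # store the external message
    flih()                     # stands in for the hardware interrupt


def receiving_task():
    receiver_wake.acquire()    # sleep until the wake thread module runs
    trace.append("receiver")
    while external_message_buffer:           # examine the external buffer
        delivered.append(external_message_buffer.pop(0))


receiver_thread = threading.Thread(target=receiving_task)
receiver_thread.start()
adapter_receive(b"packet from remote node")
receiver_thread.join()
```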
In an embodiment of the present invention, a system for message passing from a plurality of local and external sources, is disclosed including a message passing network, a plurality of nodes coupled to the message passing network, a plurality of tasks executing on the plurality of nodes, a local sending task of the plurality of tasks on a first of the plurality of nodes, a receiving task of the plurality of tasks on the first of the plurality of nodes, an external sending task on a second of the plurality of nodes, an external source message handling service associated with the first of the plurality of nodes that enables the external sending task to pass an external source message from the external sending task to the receiving task, wherein the external sending task sends an external source message over the message passing network to a communication adapter associated with the first of the plurality of nodes, wherein the communication adapter can store the external source message in an external message buffer and creates a hardware interrupt, wherein the hardware interrupt invokes a first level interrupt handler (FLIH), wherein the FLIH invokes a second level interrupt handler (SLIH), wherein the SLIH invokes a wake thread module that wakes a thread associated with the receiving task that causes the receiving task to examine the external message buffer to receive the external source message, and a local notify system service associated with the first of the plurality of nodes that enables the local sending task to pass a local source message from the local sending task to the receiving task, including a copier module that copies the local source message to a shared memory buffer and a notification module that invokes the wake thread module that awakens a thread associated with the receiving task notifying the receiving task of the local source message in the shared memory buffer.
In an embodiment of the invention, the local notify system service is part of a device driver. In another embodiment, the local notify system service is part of an operating system. In another embodiment, the local notify system service is part of a database management system. In another embodiment, the local notify system service is part of a middleware application.
In an example embodiment of the invention, a system is disclosed having a first level interrupt handler (FLIH) that is triggered by a device event, wherein the FLIH invokes a second level interrupt handler (SLIH), wherein the SLIH invokes a wakeup thread module that awakens a receiving task that is awaiting the device event, the system including a software component that allows a local sending task to invoke the wakeup thread module notifying the receiving task of an event.
In an embodiment, the device event includes a hardware device event including an arrival of an external message from an external source sending task via a communications adapter, an end of tape event from a tape drive, or a new track event from a compact disk read only memory (CD-ROM) drive.
In an embodiment, the event can include copying of a local source message into a shared memory buffer from a local source sending task, completion of a task, or killing a task.
In an embodiment, the system can implement the message passing interface (MPI) message passing standard.
In an embodiment, the system is implemented on a symmetric multiprocessor system that implements a network connection to a message passing network.
In an embodiment, the system is implemented on a uniprocessor computer system running a multi-tasking operating system wherein a sending process and a receiving process are managed by the uniprocessor system and the sending process and the receiving process communicate by message passing inter process communication.
In an example embodiment of the present invention, a device driver is disclosed for a system having an external source message handling service that enables an external sending task to pass an external source message from the external sending task to a receiving task, wherein the external sending task sends an external source message over a message passing network to a communication adapter associated with the receiving task, wherein the communication adapter stores the external source message in an external message buffer and creates a hardware interrupt, wherein the hardware interrupt invokes a first level interrupt handler (FLIH), wherein the FLIH invokes a second level interrupt handler (SLIH), wherein the SLIH invokes a wake thread module that wakes a thread associated with the receiving task that causes the receiving task to examine the external message buffer to receive the external source message, wherein the device driver includes a local notify system service that enables a local sending task to pass a local source message from the local sending task to the receiving task, including a copy module that copies the local source message to a shared memory buffer, and a notification module that invokes the wake thread module that awakens a thread associated with the receiving task, and notifies the receiving task of the local source message in the shared memory buffer.
In an example embodiment of the present invention, an operating system including a kernel having a plurality of system services for a system having an external source message handling service that enables an external sending task to pass an external source message from the external sending task to a receiving task, wherein the external sending task sends an external source message over a message passing network to a communication adapter associated with the receiving task, wherein the communication adapter stores the external source message in an external message buffer and creates a hardware interrupt, wherein the hardware interrupt invokes a first level interrupt handler (FLIH), wherein the FLIH invokes a second level interrupt handler (SLIH), wherein the SLIH invokes a wake thread module that wakes a thread associated with the receiving task that causes the receiving task to examine the external message buffer to receive the external source message, wherein the plurality of system services includes a local notify system service that enables a local sending task to pass a local source message from the local sending task to the receiving task, including a copy module that copies the local source message to a shared memory buffer, and a notification module that invokes the wake thread module that awakens a thread associated with the receiving task, and notifies the receiving task of the local source message in the shared memory buffer.
In an example embodiment of the present invention, a computer program product embodied on a computer readable media including program code means including program logic for transparently handling message passing from a plurality of internal and external sources, including program code means for enabling a computer to send an external source message on a communications link from an external sending task, program code means for enabling the computer to receive the external source message from the communications link at a communication adapter, wherein the communication adapter triggers a hardware interrupt upon receipt of the external source message, program code means for enabling the computer to copy the external source message into an external message buffer from the communication adapter, program code means for enabling the computer to invoke a first level interrupt handler (FLIH) handling the hardware interrupt, program code means for enabling the computer to invoke a second level interrupt handler (SLIH) handling the FLIH, program code means for enabling the computer to invoke a wake thread module awakening a thread associated with a receiving task, program code means for enabling the computer to examine the external message buffer for receipt of the external source message, program code means for enabling the computer to copy the external source message from the external message buffer to the receiving task, program code means for enabling the computer to copy a local source message into a shared memory buffer by a local sending task, program code means for enabling the computer to invoke a local notify system service module including passing a window number identifying the receiving task, program code means for enabling the computer to invoke the wake thread module awakening a thread associated with the window number, program code means for enabling the computer to examine the shared memory buffer for receipt of the local source message, and program code means for enabling the computer to copy the local source message from the shared memory buffer to the receiving task.
Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digits in the corresponding reference number.