This invention relates generally to communication mechanisms for computers. More particularly, this invention relates to a communication mechanism for shared memory multiprocessor computers such as a NUMA (non-uniform memory access) or UMA (uniform memory access) machine. The invention, however, also has applicability in single processor computers as a communication mechanism between multiple processes that can share the physical memory of the computer.
Multiprocessors are computers that contain multiple processors that can execute multiple parts of a computer program or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded or single-threaded programs faster than conventional single processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a program can be executed in parallel and the architecture of the particular multiprocessor at hand.
Multiprocessors may be characterized as tightly-coupled or loosely-coupled machines. A loosely-coupled machine is composed of multiple independent processor/memory systems or xe2x80x9cnodes.xe2x80x9d Access by one node to the memory of another requires that an explicit (program-initiated) message passing operation be accomplished. The physical address space of a loosely-coupled machine consists of multiple private address spaces (one such space per node) that are logically disjoint and cannot be directly addressed by a remote processor. In such machines, a processor can thus only directly access the memory of the node in which the processor resides. Loosely-coupled machines are sometimes referred to as message-passing machines or massively-parallel processors (MPPs). Programs for these machines are referred to as message-passing programs. In a tightly-coupled machine, in contrast, there is only one logical address space. All memory can be directly referenced by any program, subject to memory protection policies enforced by the machine""s operating system.
Tightly-coupled machines can be subdivided into two groups. Multiprocessors in which the physical memory is centralized and shared are called uniform memory access (UMA) machines because all memory accesses have the same performance cost. Multiprocessors in which the shared memory is distributed among separate nodes are called distributed shared memory (DSM) machines. The term shared memory refers to the fact that the physical memory of the machine is shared; that is, all processors (one or many) can directly access the entire shared physical memory of the machine without having to perform any message operations like those of loosely-coupled machines. In a loosely-coupled machine, in contrast, a processor on a node must use specialized mechanisms to copy data from a remote node to its own local memory or to the remote node from its own local memory. This is because the only memory a processor on a node in such a machine can directly access is its own local memory. DSM machines are also called NUMA (non-uniform memory access) machines, since the access time depends on the location of a data word in memory. Tightly-coupled machines are also called shared memory machines.
Message-passing machines (which have multiple separate physical address spaces, one per node and accessible only by processors on that node) and shared memory machines (which have a shared physical memory directly addressable by all processors in the machine) have different communication mechanisms. For a shared memory machine, the shared physical memory can be used to communicate data implicitly via processor instructions that write and read the shared memory; hence the name xe2x80x9cshared memoryxe2x80x9d for such machines. For a message-passing machine with its multiple separate per-node physical address spaces, communication of data is done by explicitly passing messages among the processors. For example, if one processor wants to access or operate on data in another node""s memory, it can send a message to request the data or to perform some operation on the data. In such cases, the message can be thought of as a remote procedure call (RPC). When the destination node receives the message, either by polling for it or via an interrupt, it performs the operation or access on behalf of the requesting processor on the requesting node and returns the results with a reply message.
Message passing machines have not been as commercially successful as shared memory machines for a number of reasons, including the time they require to pass large amounts of shared data from one node to another. Each message can only carry a limited amount of data, so sending a large amount of shared data can take many messages. In addition, the sending of messages between nodes in a message-passing machine requires entry into the operating system to program the message-passing hardware associated with each node. In return for this operating system overhead, however, a message-passing architecture avoids the need to keep a single shared memory image transparently coherent across all processors in the machine. Instead, the explicit message-passing calls that a message-passing machine requires for communication between its constituent nodes effectively tells the memory access hardware of a node to be careful about memory coherency. This notice occurs as the hardware sends or receives a message concurrently with the processor(s) on that node accessing the node""s memory.
This memory coherency is a requirement in a shared memory architecture in order to share data validly among the processors. Programs running on a shared memory machine rely on the underlying hardware to keep a coherent image of memory at all times, without special message-passing operations. If, for example, two or more processors (or threads or processes, etc. running on these processors) on a shared memory machine wish to communicate via shared memory, they can easily do so by arranging to share access to the same physical memory. Simple processor read and write operations to the memory so shared then allow the communication to directly occur. Consequently, it is possible to make a message-passing application, originally written for a message-passing machine, run on a shared memory machine by replacing its message-passing calls with a software layer that re-implements via shared memory the original message-passing layer of the application.
In the case of DSM machines, application software can often xe2x80x9chidexe2x80x9d or greatly reduce the performance impact of having non-uniform access to different portions of the machine""s physical memory. The software partitions itself across the different nodes and then communicates data between the nodes using a message-passing programming model that minimizes cross-node memory traffic. In this way, the software is able to maximize the performance of a DSM machine by using the transparent nature of the globally shared memory model to implement a light weight message-passing layer.
To support this light weight message-passing layer atop shared memory, however, there must be a means by which all of the cooperating communicating xe2x80x9cagentsxe2x80x9d (be they processors, threads running on processors, processes running on processors, etc.), are able to tell when a message has been completely sent (versus a partial message that is still being delivered). This means must also allow messages so sent to be received, and it must allow for the maximum possible concurrency of the senders and receivers. Traditional means include the use of synchronization mechanisms such as the acquisition of a mutual exclusion lock or xe2x80x9cmutexxe2x80x9d to guard the state of the shared memory in which the communication is occurring. For instance, a sending thread (or a process or equivalents thereof) acquires a mutex, deposits its message, and then releases the mutex in the case where multiple concurrent threads are communicating by using data structures in the same shared memory. Similarly, a receiving thread acquires the same mutex to check for the presence of messages in the shared memory. If one exists, the thread reads and consumes the message while holding the mutex. By doing so, the receiving thread prevents sending threads, which may be attempting concurrently to send messages to the shared memory, from interfering with the act of receiving.
The acquisition and holding of such mutexes, unfortunately, has several disadvantages. It limits the possible concurrency of message-passing software. It also naturally imposes overhead in the form of operating system calls to obtain and release such mutexes. Even worse, it does not handle preemption well. A first process (or thread) may find itself preempted by a second process (or thread) while holding the mutex that guards the shared memory in which messages are being passed. The mutex remains held while the first process is preempted, preventing other processes from accessing this shared memory. If the second process finds itself needing to acquire the mutex held by the preempted first process, then the second process is blocked until the first process is resumed and run to the point of releasing the mutex. If the two processes are running at different fixed priorities, and the operating system strictly enforces priority in its scheduling decisions, then deadlock can occur if the priority of the second process is higher than the first. To avoid this deadlock in the use of mutexes to guard shared data, the operating system must implement complex priority-inheritance mechanisms in which the first process is temporarily resumed with an elevated priority that allows it to complete its work to the point of releasing the mutex to the second process.
An objective of the invention, therefore, is to provide an improved communication mechanism for passing messages on a uniprocessor machine or on a shared memory multiprocessor such as a NUMA or UMA machine. Another objective is to provide a lock-free mechanism for successfully passing messages between processes even if a process is preempted while sending or receiving a message. Yet another objective of the invention is to provide such a mechanism that, where appropriate, adapts existing-message passing software to run on a shared memory machine.
For further information on multiprocessors and their different communication mechanisms, a number of works are instructive, including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy, and UNIX Internals: The New Frontier (1996), by Uresh Vahalia, both of which are incorporated by reference.
In accordance with the invention, a method for communicating messages between processes via memory of a computer comprises providing a mailbox data structure for a second process in memory accessible to a first process and a second process, the data structure having one or more message slots for storing messages sent to the data structure. For sending messages, the data structure includes an availability indicator for indicating if a message slot is available for receiving a message. Messages are sent from the first process to the second process via the mailbox data structure by obtaining, in an atomic operation, the present value of the availability indicator and changing the present value to a new value; determining from the present value of the indicator if a message slot is available for receiving a message; and sending the message to the data structure if a message slot is available. For receiving messages, the data structure includes a presence indicator for indicating if a message is present in a message slot. A message is received by the second process determining from the presence indicator that a message is present in a message slot; removing the message from the slot; and changing, in an atomic operation, the value of the presence indicator to indicate that the message is no longer present in the message slot.
In one aspect of the invention, the data structure may include a set of message slots storing messages sent to the data structure and a set of indicators manipulatable by an atomic operation for inserting messages into and removing messages from the message slots. The indicators may be a set of state variables comprising a first variable indicating if a message slot is available for storing a message; a second variable indicating a location of an available slot; a third variable indicating whether a sent message has been received in a message slot; and a fourth variable indicating a location of a slot containing a received message.
In another aspect of the invention, it may be implemented on a multiprocessor such as distributed shared memory machine with multiple nodes. A mailbox data structure is provided on each node to allow the nodes to pass messages to each other, using existing message-passing software adapted to run on such a machine.