The present invention relates generally to systems and methods for enabling a first computer to transmit messages and data to a second computer; and more particularly to a system and method for ensuring that each message is sent to the second computer once and only once while retaining a high level of message transmission reliability and using a "write only" message sending protocol to make such remote write operations efficient.
In many multiple-processor computer systems it is important for processes or tasks running on one computer node (sometimes called the sender) to be able to transmit a message or data to another computer node (sometimes called the receiver), and to do so with absolute reliability. Also, it is extremely important that transmitted messages have a property called "idempotency," which means that each message must be processed by the receiver exactly once. The reason message processing must be idempotent is best explained by example. If the message to be processed is "move the elevator up one floor," and it is processed the wrong number of times, the elevator will go up the wrong number of floors, or a failure condition may be generated if the elevator is ordered to go past a topmost or bottommost floor. If the message to be processed is "transfer $1000 from account A to Account B," and it is processed the wrong number of times, accounts A and B will have the wrong amount of money.
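The account transfer example can be made concrete with a minimal Python sketch. The message layout, the `apply_transfer` helper and the starting balances are illustrative assumptions and are not taken from any system described here.

```python
def apply_transfer(accounts, msg):
    """Apply one 'transfer' message to a dict of account balances."""
    accounts[msg["src"]] -= msg["amount"]
    accounts[msg["dst"]] += msg["amount"]

msg = {"src": "A", "dst": "B", "amount": 1000}

# The message is processed exactly once, as intended.
once = {"A": 5000, "B": 0}
apply_transfer(once, msg)

# The same message is delivered and processed twice.
twice = {"A": 5000, "B": 0}
apply_transfer(twice, msg)
apply_transfer(twice, msg)
```

After the duplicate delivery, account A has lost $2000 instead of $1000, which is the kind of error the exactly-once requirement is meant to rule out.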
Message transmission reliability can be improved using both hardware and software mechanisms. An example of a hardware mechanism for improving reliability is to provide two parallel communication links between network nodes (or between each network node and the network medium), instead of just one. An example of a software mechanism for improving reliability is to verify each remote message write operation by performing a synchronous remote read operation after the remote message write operation. Another software mechanism for improving message transmission reliability is for the receiving system to explicitly acknowledge receipt of every message sent to it. In this latter example, the sending system may process the message acknowledgments asynchronously, allowing other messages to be sent before the acknowledgment of a prior message is processed.
Generally, transmitting messages between computer nodes is expensive in terms of latency and resources used if the successful transmission of each message is verified by performing a remote read operation after each such remote message write operation.
Alternately, instead of using remote reads to verify the successful transmission of each message, in some prior art systems a message is written locally to a local buffer, and then a "cookie" (which is primarily a data structure pointing to the memory location or locations where the message is stored) or other notification message is sent to the receiving system. The receiving system then performs a remote read operation to read the message from the remote memory location indicated in the notification message. In another implementation of this same basic prior art technique, both the message and the cookie are stored locally in the sending system and only a trigger message is transmitted to the receiving system. The receiving system responds to the trigger message by performing a first remote read operation to read the cookie and a second remote read operation to read the message at the location indicated by the cookie.
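The two cookie-based variants can be sketched as follows. This is only an illustration: the dictionary standing in for the sender's memory, the `remote_read` helper, the addresses and the `COOKIE_SLOT` location are all assumptions.

```python
sender_memory = {}            # stands in for the sending system's local memory

def remote_read(addr):
    """Hypothetical synchronous remote read of the sender's memory."""
    return sender_memory[addr]

# Variant 1: the message stays in the sender's buffer and a cookie
# naming its location is sent to the receiver.
sender_memory[0x1000] = b"payload"
cookie = {"addr": 0x1000}
received_1 = remote_read(cookie["addr"])            # one remote read

# Variant 2: the cookie also stays with the sender; only a trigger is sent.
COOKIE_SLOT = 0x20            # assumed well-known location of the cookie
sender_memory[COOKIE_SLOT] = {"addr": 0x1000}
fetched_cookie = remote_read(COOKIE_SLOT)           # first remote read
received_2 = remote_read(fetched_cookie["addr"])    # second remote read
```

In both variants the receiver pulls the data, so every message costs at least one synchronous remote read.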
An advantage of the prior art techniques using remote read operations as an integral part of every message transmission is that remote reads are synchronous, and thus the system performing the remote read is notified immediately if the message transmission fails.
Another advantage of using remote read operations to transmit messages is that remote read operations make it relatively easy to ensure that each message is received and processed by the receiving system once and only once (i.e., idempotent). In most networked computer systems it is essential not to send the receiving system the same message twice. As already mentioned above, sending the same message twice could cause the receiving system to perform an operation twice that should only be performed once. Each message must be reliably received and processed by the receiving system exactly once to ensure proper system operation.
Remote write operations are relatively "inexpensive," compared to remote read operations, in terms of system latency and system resources used, because the receiving CPU does not need to be involved in completing the write operation.
Referring to FIG. 1, there is shown a highly simplified representation of two prior art computer nodes herein called Node A 50, and Node B 52. The computer at each node can be any type of computer. In other words, the particular brand, architecture and operating system are of no importance to the present discussion, so long as each computer node is configured to operate in a networked environment. Each computer node 50, 52 will typically include a central processing unit (CPU) 54, random access memory 56, an internal memory bus 58 and one or more communications interfaces 60, often called network interface cards (NICs). The computer nodes communicate with each other by transmitting messages or packets to each other via a network interconnect 62, which may include one or more types of communication media, switching mechanisms and the like.
Each computer node 50, 52 typically also has a non-volatile, random access memory device 64, such as a high speed magnetic disk, and a corresponding disk controller 66.
In this example, each computer node is shown as having two communications interfaces 60 for connecting that node to the network fabric. Providing two parallel communication links improves system reliability, since failure of a node's primary communication interface 60, or failure or disconnection of its cabling to the network interconnect, does not prevent the node from participating in network communications. In many systems, failure of a node's communication link is tantamount to failure of the entire node, because the node is essentially useless to the system without its network connection. Providing redundant network connections (herein called parallel links) is a well known strategy for addressing this problem.
A well known problem associated with the use of parallel links is that the link failover mechanism must either avoid resending messages that have already been received and processed by the receiving system(s), or it must provide some other mechanism for ensuring idempotency (e.g., providing a receiver side mechanism for recognizing and discarding duplicate messages). The idempotency problem is not created by or unique to systems using parallel links; rather, the problem is exacerbated because the use of parallel links introduces additional opportunities for inadvertent retransmission of messages. For example, a link may fail after a message has been successfully transmitted, but before the receiving system has had the opportunity to acknowledge receipt or processing of the message. Alternately, the receiving system may have transmitted a message acknowledgment, but the acknowledgment may be lost due to improper operation of a damaged link. The present invention solves the idempotency problem in a manner that addresses the link failure problem.
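A receiver side mechanism for recognizing and discarding duplicates, of the general kind mentioned above, might look like the following sketch. The `DedupReceiver` class and its message identifiers are hypothetical and are not the mechanism of the present invention.

```python
class DedupReceiver:
    """Hypothetical receiver that processes each message ID at most once."""

    def __init__(self):
        self.seen = set()        # IDs of messages already processed
        self.processed = []      # payloads, in processing order

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return False         # duplicate: discard without processing
        self.seen.add(msg_id)
        self.processed.append(payload)
        return True

rx = DedupReceiver()
first = rx.receive(1, "move the elevator up one floor")   # primary link
# The acknowledgment is lost when the primary link fails, so the sender
# retransmits the same message over the backup link.
second = rx.receive(1, "move the elevator up one floor")
```

The retransmitted copy is recognized by its identifier and dropped, so the elevator moves only one floor.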
FIG. 2 shows a simplified representation of a conventional communications interface (or NIC) 60, such as the ones used in the computer nodes of FIG. 1, showing only the components of particular interest. The NIC 60 typically includes two address mapping mechanisms: an incoming memory management unit (IMMU) 70 and an outgoing memory management unit (OMMU) 72. The purpose of the two memory management units is to map local physical addresses (PAs) in each computer node to global addresses (GAs) and back. Transport logic 74 in the NIC 60 handles the mechanics of transmitting and receiving message packets, including looking up and converting addresses using the IMMU 70 and OMMU 72.
The dashed lines between the memory bus 58 and the IMMU 70 and OMMU 72 represent CPU derived control signals for storing and deleting address translation entries in the two MMUs, typically under the control of a NIC driver program. The dashed line between the memory bus 58 and the transport logic 74 represents CPU derived control signals for configuring and controlling the transport logic 74.
Referring to FIGS. 3 and 4, the nodes in a distributed computer system (such as those shown in FIG. 1) utilize a shared global address space GA. Each node maps portions of its local address space into "windows" in the global address space. Furthermore, processes on each of the nodes map portions of their private virtual address space VA into the local physical address space PA, and can furthermore export a portion of the local physical address space PA into a window in the global address space GA. The process of "exporting" a portion of the local physical address space is also sometimes referred to as "exporting a portion of the local physical address to another node," because another computer node is given read and/or write access to the exported portion of the local physical address space via an assigned global address space range.
It should be noted that the local physical addresses (e.g., PA1 and PA2) shown in FIGS. 3 and 4 are physical bus addresses and are not necessarily memory location addresses. In fact, many physical addresses are actually mapped to devices other than memory, such as the network interface. For example, when physical memory on a first computer is exported to a second computer, the physical addresses used in the second computer to write to the exported memory are not mapped to any local memory; rather they are mapped to the second computer's network interface.
When data is written by a process in Node A 50 to a virtual address corresponding to a location in Node B 52, a series of address translations (also called address mapping translations) are performed. The virtual address VA1 from the process in node A is first translated by the TLB (translation lookaside buffer) 80-A in node A's CPU 54-A into a local (physical) I/O address PA1. The local (physical) I/O address PA1 is then translated by the outgoing MMU (OMMU) 72-A in node A's network interface 60-A into a global address GAx. When the data with its global address is received by node B (usually in the form of a message packet), the global address GAx is converted by the incoming MMU (IMMU) 70-B in node B's network interface 60-B into a local physical address PA2 associated with node B. The local physical address PA2 corresponds to a virtual address VA2 associated with a receiving process. A TLB 80-B in node B's CPU 54-B maps the virtual address VA2 to the local address PA2 where the received data is stored.
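The chain of translations VA1 to PA1 to GAx to PA2 can be illustrated with page-granular lookup tables. The page size, the table contents and the `translate` helper below are assumptions chosen only for illustration; real TLB and MMU entries carry more state than a page number.

```python
PAGE = 0x2000   # assumed page size, for illustration only

def translate(table, addr):
    """Translate addr through a page-granular table, keeping the page offset."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset

# Hypothetical table contents (source page number -> target page number).
tlb_a  = {0x10: 0x80}     # node A TLB:  VA1 page -> PA1 page (an I/O address)
ommu_a = {0x80: 0x400}    # node A OMMU: PA1 page -> GAx page
immu_b = {0x400: 0x33}    # node B IMMU: GAx page -> PA2 page
tlb_b  = {0x55: 0x33}     # node B TLB:  VA2 page -> PA2 page

va1 = 0x10 * PAGE + 0x24          # address written by the process on node A
pa1 = translate(tlb_a, va1)
gax = translate(ommu_a, pa1)
pa2 = translate(immu_b, gax)      # where node B's interface stores the data

va2 = 0x55 * PAGE + 0x24          # address read by the process on node B
```

Translating `va2` through node B's TLB yields the same `pa2`, which is how the receiving process finds the data that the network interface stored.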
It should be noted that the term "message transmission" is sometimes used to indicate or imply the use of a message transmission protocol in which the receiving system automatically processes the transmitted message, while the term "data transmission" simply indicates the writing or copying of data from one system to another. However, in this document, the terms message transmission and data transmission will be used interchangeably.
It should be noted here that TLBs generally only translate virtual addresses into local physical addresses, and not the other way around, and thus some of the arrows in FIG. 4 represent mappings rather than actual address translations. When the receiving process in node B reads a received message at address VA2, the TLB 80-B will translate that virtual address into the same local physical address PA2 determined by the network interface's IMMU 70-B as the destination address for the received message.
Address space ranges for receiving messages are pre-negotiated between the sending and receiving nodes using higher level protocols that typically use reserved address space, mailbox, or packet based communications that are set up for this purpose. The details of how windows in the global address space are assigned and how receiver side addresses are set up for receiving messages are beyond the scope of this document. Furthermore, the present invention does not require any changes in such communication setup mechanisms.
Receive buffers are allocated in conveniently sized chunks using a corresponding MMU entry. Larger receive buffers, or receive buffers of irregular size, may be constructed using multiple MMU entries by user level protocols. Once the receive buffers are allocated and the corresponding MMU mappings are established, user level programs can read and write to the receive buffers without kernel intervention. Many different kinds of user-level message passing "APIs" (application program interfaces) can be built on top of the basic receive buffer mechanism. This includes the send and receive Unix primitives, sockets, ORB (object request broker) transport, remote procedure calls, and so on. The basic message passing mechanism is designed to be as "light weight" and efficient as possible, so as to take as few processor cycles as possible.
The present invention utilizes the local physical address to global address mapping mechanisms discussed above.
FIG. 5 shows the conventional procedure for a process on node A to write a message into a receive buffer at node B. The first step is for Node A to send a request to Node B to set up a receive buffer (also called exporting memory) so that Node A can write a message into it (step 100).
Node B then sets up one or more receive buffers and "exports" the memory allocated to the receive buffer(s) to node A (step 101). In some implementations, this step may be performed in advance, because it is known in advance that Node A will be sending many messages to Node B. In other implementations, the memory exporting step is performed by a procedure in Node B that, before sending a method invocation message or the like to Node A, sets up a receive buffer to receive the results of the method invocation. The memory exporting step 101 is performed by creating an IMMU entry in Node B that maps the physical address range of a receive buffer in Node B's memory to a corresponding range of global addresses and also by setting up a corresponding virtual address to physical address mapping. As indicated above, Node B will typically have a range of global addresses preassigned to it for exporting memory to other nodes. However, other mechanisms for assigning global addresses would be equally applicable.
Next, at step 102, a memory export message is transmitted by Node B to Node A that specifies:
the destination node to which the message is being transmitted;
the source node from which the message is being sent;
the global address corresponding to the receive buffer being exported to Node A; and
other parameters, such as protocol parameters, not relevant here.
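The fields of the memory export message listed above could be represented as a small record. The class name, field names and example values below are hypothetical; the source does not specify an encoding.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryExportMessage:
    """Hypothetical layout of the memory export message of step 102."""
    dest_node: str                  # node to which the message is transmitted
    src_node: str                   # node from which the message is sent
    global_addr: int                # global address of the exported receive buffer
    params: dict = field(default_factory=dict)   # other protocol parameters

# Node B exports a receive buffer to Node A (the address is illustrative).
export_msg = MemoryExportMessage(dest_node="A", src_node="B",
                                 global_addr=0x800000)
```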
At Node A, when the memory export message is received, Node A's NIC driver sets up an OMMU entry to import the memory being exported by Node B and also sets up a corresponding virtual address to physical address mapping so that a process in Node A can write data into the receive buffer (step 104). The OMMU entry set up at step 104 maps the global address range specified in the received message to a corresponding range of physical memory in the server node. If necessary (e.g., if insufficient contiguous memory is available and/or the size of the mapped address range is not equal to 2^n pages), the server node will generate two or more OMMU entries so as to map the specified global address space to two or more local physical address ranges. The mapped local physical addresses in the first computer are not locations in that computer's memory; rather they are otherwise unused addresses that are mapped to the computer's network interface by the OMMU entry or entries.
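One way to cover a range whose size is not a power of two pages with several entries is to split it into power-of-two sized pieces, largest first. The following sketch shows only that arithmetic; the `split_into_pow2_chunks` helper is hypothetical, and the source does not say which splitting policy an actual NIC driver uses.

```python
def split_into_pow2_chunks(start_page, num_pages):
    """Split a page range into chunks whose sizes are powers of two.

    Returns a list of (first_page, page_count) tuples, largest chunk first.
    """
    chunks = []
    page = start_page
    remaining = num_pages
    while remaining > 0:
        size = 1 << (remaining.bit_length() - 1)   # largest power of two <= remaining
        chunks.append((page, size))
        page += size
        remaining -= size
    return chunks

# An 11-page range needs three entries: 8 + 2 + 1 pages.
entries = split_into_pow2_chunks(start_page=0x400, num_pages=11)
```

A range that is already a power of two pages long yields a single entry.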
Once the IMMU in node B and the OMMU in node A have been set up, node A can transmit a message to node B. The dashed line between steps 104 and 106 indicates that no particular assumptions are being made as to the timing relationship between steps 104 and 106 (i.e., one may closely follow the other, or they may be separated both in time and logistically).
Once node A is ready to send a message to node B, the message sending procedure in node A marshals the data to be sent to node B (step 106), which basically means that the data is formatted and stored in a send buffer in a predefined manner suitable for processing by an application procedure in node B.
Then a remote write is performed to copy the contents of the send buffer to the assigned global addresses (step 108). Writing data to a global address causes the sending node's communication interface to transmit the data being written to the node associated with those global addresses, as indicated in the sending node's OMMU entry (or entries) for those global addresses. This data transmission operation (step 108) may be performed under direct CPU control by "programmed I/O" instructions, or it may be performed by a communications interface (NIC) DMA operation (i.e., in which case the DMA logic in the communication interface handles the transfer of data from local physical memory to the communications network).
Some communication networks and interfaces utilize what is known as an RMO (relaxed memory order) memory model, and can reorder messages so as to optimize the use of available resources. Also, many communication systems do not guarantee delivery of all messages handed off to them. Thus, there is no assurance that, once a message is sent, it will actually be transmitted to the specified destination node, nor that it will be written into the receive buffer corresponding to the global addresses specified in the message. As a result, prior art computer systems are often designed to verify the transmission of each message before allowing any succeeding tasks to be performed. Such verification is typically achieved by performing a remote read (see step 110) so as to read at least a portion of the contents of the receive buffer in Node B, to determine whether or not the message was in fact written into the receive buffer.
Remote read operations are very expensive in terms of system latency and communication system usage, because the thread in the sending system performing the remote read must wait for a request to be sent to the other node and for the response to be sent back before the thread can resume further processing. The resulting delay includes transmission time to and from the receiving system, and access time on the remote system for accessing and invoking the procedure(s) needed to process the read request. Thus, remote reads tend to seriously degrade the performance of both the system performing the remote read and the communication system.
Remote write operations, on the other hand, are relatively inexpensive because the thread in the sending system performing the remote write simply delivers to its communication interface the data to be remotely written, and then proceeds with the next instruction in its instruction stream.
As indicated, after performing the remote write in step 108, the typical message transmission procedure will perform a remote read to verify transmission of the message to the receive buffer in Node B. If the remote read operation determines that the message was not successfully stored in the receive buffer, the remote write step (108) is repeated.
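The write-then-verify loop of steps 108 and 110 can be sketched with a simulated unreliable link. The `LossyLink` class, the `send_verified` function and the retry limit are assumptions made for illustration; they model only the control flow, not a real network interface.

```python
class LossyLink:
    """Hypothetical link that silently drops the first `drops` remote writes."""

    def __init__(self, drops):
        self.drops = drops
        self.remote_buffer = None    # stands in for the receive buffer in Node B
        self.remote_reads = 0

    def remote_write(self, data):
        if self.drops > 0:
            self.drops -= 1          # the write is lost; the sender is not told
            return
        self.remote_buffer = data

    def remote_read(self):
        self.remote_reads += 1       # each call models a full round trip
        return self.remote_buffer

def send_verified(link, data, max_attempts=5):
    """Write, read back, and repeat the write until the read matches."""
    for attempt in range(1, max_attempts + 1):
        link.remote_write(data)              # step 108
        if link.remote_read() == data:       # step 110
            return attempt
    raise RuntimeError("message could not be delivered")

link = LossyLink(drops=2)
attempts = send_verified(link, b"hello node B")
```

Even when no write is lost, every message pays for one synchronous remote read, which is the cost the preceding paragraphs describe.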
In some systems, once the remote write step 108 successfully completes, another remote write operation (followed by a corresponding remote read operation) may be performed to store a short message in a "received message queue" in Node B. The short message typically contains a "cookie" or other data structure that indicates the location of the main message transmitted at step 108.
Finally, a trigger message is sent to Node B's network interface (step 116), which triggers the execution of a procedure in Node B for processing received messages (e.g., by inspecting the received message queue to determine what new messages have been received, and so on).
At some point after the message has been sent and processed, the message sending thread in node A unexports the receive buffer it has used by tearing down or modifying the OMMU entry for the previously imported memory (step 118).
Node B responds to the receipt of the short message and/or the trigger message by processing the received short message and then the main data portion of a received long message, if any (step 120). In addition, or as part of step 120, Node B will also modify or tear down the IMMU entry for the receive buffer (step 122) so as to unexport the receive buffer and enable write access to the receive buffer by a message processing application program.
As indicated above, there is an alternate message sending technique in which a message is written locally to a local buffer, a "cookie" or other notification message is sent to the receiving system, and the receiving system then performs a remote read operation to read the message from the remote memory location indicated in the notification message. This message transmission technique has the same basic problems, due to the use of remote read operations, as the message sending technique described with respect to FIG. 5.
Of course, the prior art includes many variations on the sequence of operations described above with reference to FIG. 5 for performing remote write operations. However, the steps described are typical for distributed computer systems using UNIX™ (a trademark of SCO) type operating systems, such as Solaris™ (a trademark of Sun Microsystems, Inc.).
The present invention is a system and method for performing remote write operations, and for sending messages from one node to another in a distributed computer system. The distributed computer system typically has multiple computers or computer nodes, some of which may be part of a cluster of computer nodes that operate as a single server node from the viewpoint of computers outside the server cluster. At least some of the computers contain parallel communication links or interfaces for connecting those computers to other computers in the system.
A first computer sends a sequence of messages to a second computer using remote write operations to directly store each message in a receive FIFO in the second computer. Each message contains a semi-unique sequence number. The second computer, when processing the messages, retains information denoting the sequence numbers of the messages it has received and processed. The second computer also acknowledges each received message with an asynchronous acknowledgment message, and the first computer keeps track of which messages it has sent but for which it has not yet received an acknowledgment.
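The bookkeeping described in this paragraph can be sketched as follows. The `Sender` and `Receiver` classes are hypothetical, a `deque` stands in for the remotely written receive FIFO, and sequence number reuse (the reason the numbers are only semi-unique) is deliberately left out of this minimal sketch.

```python
from collections import deque

class Sender:
    """Hypothetical sender that remembers every unacknowledged message."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}                     # seq -> payload awaiting an ack

    def send(self, receive_fifo, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        receive_fifo.append((seq, payload))   # stands in for the remote write
        return seq

    def on_ack(self, seq):
        self.unacked.pop(seq, None)           # acks may arrive at any later time

class Receiver:
    """Hypothetical receiver that records the sequence numbers it processed."""

    def __init__(self):
        self.fifo = deque()
        self.processed_seqs = set()

    def process_all(self):
        acks = []
        while self.fifo:
            seq, _payload = self.fifo.popleft()
            self.processed_seqs.add(seq)
            acks.append(seq)                  # one acknowledgment per message
        return acks

tx, rx = Sender(), Receiver()
for payload in ("m0", "m1", "m2"):
    tx.send(rx.fifo, payload)
acks = rx.process_all()
for seq in acks[:2]:                          # the ack for message 2 has not arrived
    tx.on_ack(seq)
```

At this point the sender still holds message 2 as outstanding, even though the receiver has already processed it.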
Whenever the first computer determines that it has failed to receive a message acknowledgment from the second computer in a timely fashion, or it needs to reuse previously used message sequence numbers, the first computer undertakes remedial actions to resynchronize the first and second computers. The process begins by prompting the second computer to flush and process all the messages in its receive FIFO, and then comparing sequence number information recorded by the second computer with the sequence numbers of the outstanding, unacknowledged messages sent by the first computer. If the comparison indicates that any messages sent by the first computer were not received and processed by the second computer, those messages are re-transmitted. If necessary, during resynchronization the first computer will activate a different communication interface than the one previously used so as to establish a reliable connection to the second computer.
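The comparison step of the resynchronization can be sketched as a set difference between the sender's outstanding messages and the sequence numbers recorded by the receiver. The `resynchronize` function and its arguments are hypothetical, and the sketch assumes the receiver has already flushed and processed its receive FIFO before reporting its recorded sequence numbers.

```python
def resynchronize(unacked, receiver_seqs, retransmit):
    """Resend only the outstanding messages the receiver never processed.

    unacked:       dict of seq -> payload, sent but not yet acknowledged
    receiver_seqs: set of sequence numbers the receiver reports as processed
    retransmit:    callable(seq, payload) that resends one message
    Returns the list of sequence numbers that were retransmitted.
    """
    resent = []
    for seq in sorted(unacked):
        if seq not in receiver_seqs:
            retransmit(seq, unacked[seq])
            resent.append(seq)
    return resent

# Messages 7 and 9 arrived but their acknowledgments were lost;
# message 8 never arrived at all.
unacked = {7: "m7", 8: "m8", 9: "m9"}
receiver_seqs = {5, 6, 7, 9}
log = []
resent = resynchronize(unacked, receiver_seqs,
                       lambda seq, payload: log.append((seq, payload)))
```

Only message 8 is sent again, so messages 7 and 9 are not processed a second time.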
Once it is established that all previously sent messages have been received and processed by the second computer, normal "send only" message operation resumes. The resynchronization process ensures that each message is received and processed by the second computer once and only once. At predefined times, such as the successful conclusion of a resynchronization, the sequence number information retained by the second computer is cleared.