This invention relates to the field of computer systems. More specifically, a system, article of manufacture and methods are provided for ensuring delivery of messages between computer nodes of a multi-node environment.
Computer systems exchange communications for numerous reasons. One computer system (e.g., a web server) may be configured to service information requests from and provide data to another computer system (e.g., a client). In other environments, such as closely coupled configurations of computer systems, multiple nodes may be inter-connected and cooperate to share access to one or more common resources (e.g., a storage device, a communication device, a network service).
In virtually all environments in which computer systems communicate there is a need to ensure delivery of the communications. Systems in many environments, particularly those in which the systems are distributed (e.g., a network), employ standard communication protocols such as TCP/IP (Transmission Control Protocol/Internet Protocol) to maintain uniform communication formats and to provide flow control, error correction and other features. A standard communication protocol such as TCP/IP, however, often incorporates many features not needed in less-distributed, customized, or specialized environments.
For example, in a closely coupled environment such as a cluster, computer systems may be in proximity to one another and thus have no need for many of the services/benefits of a standard network communication protocol for all of their intra-cluster communications. A customized format may be more efficient, for example, when the nodes are directly connected to each other, wherein extraneous protocol data or features may be omitted for the sake of communicating more information in less time.
However, in any type of environment in which computers exchange communications it is still necessary to ensure that information, data requests and other messages sent from one node to another are successfully received. A standard protocol such as TCP may employ a timeout feature whereby a message is automatically re-sent after a certain period of time if a destination node does not acknowledge its receipt. This scheme may result in the destination node receiving multiple copies of a single message. This may decrease the efficiency of the communication medium and, because the destination node must process and act on each message, impact operations on the destination node. Additionally, it may be undesirable for the destination node to carry out, multiple times, whatever action may be required by the message.
Using timeouts as part of a method to ensure delivery of communications may be even more inefficient in specialized environments such as computer clusters. Communication links between nodes in a cluster are often relatively short in length and are frequently dedicated to a limited number of devices. Thus, applying a timeout feature on such a link would tend to have a negative effect on communication throughput. In addition, it may be even more critical in a cluster that a message only be received and applied once on a destination node. For example, in a cluster in which access to a resource is managed by one node, a request to alter data on the resource should only be applied once. Thus, re-sending a request numerous times would be detrimental, even in a situation in which the node controlling the resource failed and was replaced by another node.
Thus, it is important to ascertain the status of a message sent from one computer node to another, so that appropriate corrective steps may be taken if the message is lost. However, the transport mechanism of a computer node (e.g., a module that applies TCP) may be unable to accurately determine the status of a message and/or take the necessary steps to ensure its delivery. In particular, the transport mechanism or module in a specialized environment such as a cluster may be of a custom design and may not be configured to automatically re-send a message that may have been lost.
A transport mechanism of a computer node that originates a communication may be able to identify or report clear successes (e.g., to an originator of the communication), as when the receiving node acknowledges receipt of the communication. The transport mechanism may also be able to report clear failures, as when the mechanism fails to transmit the communication. However, the mechanism may be unable to characterize instances in which it transmits a communication but does not receive an acknowledgement. In these cases it may fall to some module above the transport mechanism (e.g., the originator) to determine if the communication was successfully received at the other node. It may also be advantageous for the originator of a communication to ensure its delivery in order to save the recipient of the communication from having to take action on several copies of the communication.
In some existing methods of communicating between computer nodes, an originating node may send multiple copies of a transmission to a destination node to ensure that at least one is received (e.g., particularly if an initial copy is lost). Systems employing these methods usually just discard extra copies at the destination node and it does not matter which copy is actually processed at the destination. However, some computing environments require communications between nodes to be highly reliable or accountable. For example, in an object-oriented computing environment in which references to an object are tracked or monitored, a node's resources may be allocated or tied up until all references (e.g., including communications to/from other nodes) to the object are resolved. In such an environment it is necessary to ensure that only one version of a communication that references a particular object is successfully sent to and received at a destination node, and to know which version of the communication was successful, so that a node's object references can be accurately managed.
Thus, what is needed is a system and method of actively ensuring delivery of a single message or communication from one node to another. In particular, such a system and method should be able to ensure delivery in situations in which a transport mechanism cannot assure the message originator that the message failed or succeeded. Such a system and method may be particularly suited to closely coupled and/or highly available computing environments in which it is desirable to avoid repeating the message, but would be useful in any computing environment in which a computing device's transport mechanism is unable to ensure delivery of a communication.
In one embodiment of the invention a system and methods are provided for ensuring a single communication or message from an object handler on one node is delivered to an object handler on a second node.
In this embodiment an object handler on one node receives an object reference from a higher-level service (e.g., a file system, a network service) concerning an object on a second node. The object handler generates a message or other communication concerning the reference, assigns it a unique identifier (e.g., a sequence number) and passes it to a transport module for delivery to the second node. The object handler maintains status indicators for the messages it sends to the second node and updates an indicator corresponding to the message just sent if the transport module reports that the message was successfully received by the second node.
If, however, the transport module cannot report a definite status (e.g., success or failure) of the message then the originating object handler takes additional action to determine the message status. In one embodiment it issues a query or management message to the object handler on the second node, which query message includes the identifier of the original message. If the destination object handler did not receive the message (and update its status indicator(s)) before the query is received, it informs the originating object handler that the message was not received. In this case both object handlers store or otherwise make note of the identifier of the original message, which may be lost. The originating object handler then sends a new or repeat version of the message, but with a different identifier. In one embodiment of the invention message identifiers are sequence numbers large enough in magnitude so that they rarely, if ever, repeat.
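By way of illustration only, the originating object handler's behavior described above might be sketched as follows. The class names (Status, Destination, Originator), the helper make_flaky_transport, the max_attempts bound, and the choice to retry after a clear transmission failure are all assumptions introduced for this sketch and are not taken from the embodiment; nodes are modeled as in-process objects rather than networked machines.

```python
from enum import Enum
from itertools import count


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"   # transmitted, but no acknowledgement was seen


class Destination:
    """Hypothetical stand-in for the object handler on the second node."""

    def __init__(self):
        self.received = set()   # identifiers of messages that arrived
        self.lost = set()       # identifiers noted as lost after a query

    def deliver(self, msg_id, payload):
        self.received.add(msg_id)

    def query(self, msg_id):
        """Answer a query message; note the identifier as lost if not received."""
        if msg_id in self.received:
            return True
        self.lost.add(msg_id)
        return False


class Originator:
    """Hypothetical stand-in for the object handler on the first node."""

    def __init__(self, transport, destination):
        self.transport = transport      # callable(msg_id, payload) -> Status
        self.destination = destination
        self.next_id = count(1)         # sequence numbers as identifiers
        self.status = {}                # msg_id -> "delivered" or "lost"

    def send(self, payload, max_attempts=5):
        """Deliver payload exactly once; return the identifier that succeeded."""
        for _ in range(max_attempts):
            msg_id = next(self.next_id)
            result = self.transport(msg_id, payload)
            if result is Status.UNKNOWN and self.destination.query(msg_id):
                # The query showed that the message did arrive after all.
                result = Status.SUCCESS
            if result is Status.SUCCESS:
                self.status[msg_id] = "delivered"
                return msg_id
            # Not received: note the identifier, then retry under a new one.
            # (Retrying after a clear FAILURE is an assumption of this sketch.)
            self.status[msg_id] = "lost"
        raise RuntimeError("message could not be delivered")


def make_flaky_transport(destination, drop_ids):
    """Build a transport that silently drops the listed identifiers."""
    def transport(msg_id, payload):
        if msg_id in drop_ids:
            return Status.UNKNOWN
        destination.deliver(msg_id, payload)
        return Status.SUCCESS
    return transport
```

In this sketch, a transport that drops the first message causes the originator to query the destination, record the first identifier as lost on both sides, and deliver the repeat version under the next sequence number.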
Each object handler compares the identifiers of messages it receives from a node against all identifiers (if any) of lost messages associated with the sending node. If an identifier of a received message matches a lost message identifier, the received message is discarded. Thus, if the destination object handler receives the original message after being queried, the message will be discarded. In addition, before a sending node assigns a sequence number or other identifier to an outgoing message, it first ensures that the identifier does not match any of the stored identifiers of lost messages.
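The two checks just described, discarding a received message whose identifier is marked lost and skipping lost identifiers when numbering outgoing messages, might be sketched as below. The class LostIdFilter, its method names, and the small wrap-around modulus are hypothetical and chosen only to keep the example short; the embodiment contemplates sequence numbers large enough that they rarely, if ever, repeat.

```python
class LostIdFilter:
    """Hypothetical per-peer record of message identifiers noted as lost."""

    def __init__(self):
        self.lost_ids = set()

    def mark_lost(self, msg_id):
        self.lost_ids.add(msg_id)

    def accept(self, msg_id):
        """Return True if a received message should be processed, not discarded."""
        return msg_id not in self.lost_ids

    def next_free_id(self, candidate, modulus):
        """Return the first identifier at or after candidate that is not lost.

        Identifiers wrap around at modulus (an assumption of this sketch).
        """
        for _ in range(modulus):
            if candidate not in self.lost_ids:
                return candidate
            candidate = (candidate + 1) % modulus
        raise RuntimeError("every identifier is marked lost")
```

A late copy of the original message therefore fails accept() at the destination, while the sender's next_free_id() avoids reusing an identifier that the destination would discard.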
In one embodiment of the invention a first node that sends communications or messages to a second node stores identifiers of its lost messages in a table (e.g., a table hashed by the corresponding identifiers). The first node may also store the identifiers of messages that could not be transmitted from the first node (e.g., because of a communication link failure or other hardware failure).
In this embodiment of the invention the second node employs multiple cooperating data structures to track and verify the status of the communications generated for the second node from the first node. In particular, more recent communications are reflected in a vector containing multiple entries, one entry per communication. Each vector entry includes two indicators, one to reflect whether the corresponding communication was received at the second node and another to reflect whether the first node rescinded the communication. In this embodiment a communication is considered rescinded if the second node receives a query message about an earlier communication and the earlier communication has not been received by the time the query message is received.
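As a rough sketch of such a vector, each entry below carries the two indicators, one for receipt and one for rescission. The names Entry and RecentVector, the use of consecutive integer identifiers as vector indices, and the treatment of a repeated identifier as a duplicate to be discarded are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    received: bool = False    # the communication arrived at the second node
    rescinded: bool = False   # the first node queried before it arrived


class RecentVector:
    """Hypothetical vector of recent communications, one entry per identifier."""

    def __init__(self, base_id=0):
        self.base_id = base_id   # identifier represented by entries[0]
        self.entries = []

    def _entry(self, msg_id):
        index = msg_id - self.base_id
        if index < 0:
            raise KeyError("identifier is older than the vector")
        while len(self.entries) <= index:
            self.entries.append(Entry())   # grow the vector on demand
        return self.entries[index]

    def on_message(self, msg_id):
        """Record an arriving communication; return True if it should be processed."""
        entry = self._entry(msg_id)
        if entry.rescinded or entry.received:
            # Rescinded, or (by assumption) a duplicate of one already seen.
            return False
        entry.received = True
        return True

    def on_query(self, msg_id):
        """Handle a query message; rescind the communication if it has not arrived."""
        entry = self._entry(msg_id)
        if not entry.received:
            entry.rescinded = True
        return entry.received
```

Here a query that arrives before its communication sets the rescinded indicator, so a late copy of that communication is refused by on_message().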
The second node tracks the status of older communications in another structure, such as a table. Each entry in the table corresponds to one communication addressed to the second node from the first node but which has not been received at the second node. As more and more communications are sent from the first node to the second node, older entries in the vector are removed and, for removed entries corresponding to non-received communications, entries are made in the table. In this embodiment of the invention the vector is extended or expanded when necessary by allocating additional memory space on the second node. Periodically, the first node may send to the second node the identifiers of the communications that it could not transmit, so that the second node may remove the corresponding entries in its table.
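One possible arrangement of the vector and the table is sketched below, purely as an illustration. The class ReceiverTracker, the fixed window that triggers removal of older vector entries, and the handling of a late arrival for an identifier already moved to the table are assumptions of the sketch; the embodiment does not prescribe them. For brevity, only the received indicator is kept per vector entry.

```python
class ReceiverTracker:
    """Hypothetical second-node tracker: a recent vector plus an older table."""

    def __init__(self, window=4):
        self.window = window        # assumed cap on the vector length
        self.base_id = 0            # identifier represented by recent[0]
        self.recent = []            # received indicators for recent identifiers
        self.not_received = set()   # table of older, non-received identifiers

    def note_received(self, msg_id):
        if msg_id < self.base_id:
            # Assumption: a late arrival clears its entry in the table.
            self.not_received.discard(msg_id)
            return
        index = msg_id - self.base_id
        if index >= len(self.recent):
            # Extend the vector so that it covers the new identifier.
            self.recent.extend([False] * (index + 1 - len(self.recent)))
        self.recent[index] = True
        self._age_out()

    def _age_out(self):
        """Remove the oldest vector entries, keeping non-received ones in the table."""
        while len(self.recent) > self.window:
            received = self.recent.pop(0)
            if not received:
                self.not_received.add(self.base_id)
            self.base_id += 1

    def forget_untransmitted(self, msg_ids):
        """Drop table entries for identifiers the first node never transmitted."""
        self.not_received.difference_update(msg_ids)
```

With a window of three and communications 0, 2, 3 and 5 received, identifier 1 ages out of the vector into the table, and is removed again once the first node reports that it was never transmitted.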