Distributed networks are utilized today in various contexts, for example, for file sharing or voice-over-IP. The distributed networks include Grid, Cloud, Distributed Database and Peer-to-Peer (P2P) systems. They demonstrate the power of decentralized and self-organized resource location and usage in a flexible way.
A distributed network 100 includes, as shown in FIG. 1, a large number of nodes 102 (compared to conventional client-server networks), each of which is inter-connected to other nodes. To maintain consistency and to provide a given application or service to a node, the other nodes usually need to know the availability of nodes, links and/or resources (applications or services). In such a distributed system, this availability information is currently provided by keep-alive (heartbeat) mechanisms, in which short messages are exchanged periodically among the nodes to detect the failure or availability of nodes or links.
In other words, a node 104 sends a keep-alive message 106 to a neighbour node 108. If no reply is received at node 104 from node 108, then node 104 assumes that node 108 is down (has failed). This is true for each node 102 of the network 100, i.e., each node of the network constantly probes the other nodes to which it is connected. An important characteristic of the keep-alive mechanisms, and the main reason why they are used in distributed networks, is that they allow a node or connection outage to be detected proactively, before the affected nodes and connections are needed by the underlying applications or services.
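By way of illustration only, the probing behaviour described above may be sketched as follows. The class name Node, its methods and the use of None to stand for a reply that never arrives are assumptions made for this sketch; no real network traffic or timer is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a node that answers keep-alive probes while it is up."""
    node_id: int
    alive: bool = True
    neighbours: list = field(default_factory=list)

    def handle_keep_alive(self):
        # A failed node never answers; None stands in for a missing reply.
        return "reply" if self.alive else None

    def probe_neighbours(self):
        """Probe every neighbour once and return the ids presumed to be down."""
        return [n.node_id for n in self.neighbours
                if n.handle_keep_alive() is None]


node_104 = Node(104)
node_108 = Node(108)
node_104.neighbours.append(node_108)

first_round = node_104.probe_neighbours()   # node 108 still answers
node_108.alive = False                      # node 108 fails
second_round = node_104.probe_neighbours()  # no reply: node 108 presumed down
```

In this sketch the failure of node 108 becomes known to node 104 at the next probing round, whether or not an application currently needs node 108.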
To enhance the availability of the nodes and/or services and to detect failures as fast as possible, keep-alive messages need to be exchanged with a high frequency in the existing distributed networks. However, in strongly inter-connected, large-scale distributed networks, the keep-alive mechanisms introduce heavy signalling and communication overhead among the nodes and, thus, limit the scalability of the network. Therefore, there is a need for an efficient keep-alive and failure detection mechanism for ever-growing distributed systems.
A couple of limitations of the existing mechanisms are now discussed. One mechanism used in the current distributed networks is the Basic Keep-alive (BK) mechanism as described by A. Rowstron and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” in IFIP/ACM Middleware, 2001, and Mahajan et al., “Controlling the Cost of Reliability in P2P Overlays,” Proc. IPTPS 2003. In this mechanism, a keep-alive query is sent from one node 104 to each neighbour node 108, 110, 112, 114 and 116 over the network, and a keep-alive reply message is sent back by each neighbour node 108, 110, 112, 114 and 116 to the querying node 104. When the keep-alive reply message arrives, the querying node 104 knows that the other node is still alive and that the link is functional. The keep-alive exchange is initiated periodically, in both directions, every k seconds, where k is called the keep-alive interval. The set of nodes (108, 110, 112, 114 and 116) directly connected to a node x (104) is called the neighbourhood set (N(x)) of node x.
With the BK mechanism, each node is managed independently of all other nodes in the system. For example, two nodes 104 and 120, both connected to a third node 108, do not share any information regarding their common node 108, so the keep-alive task must be performed twice, once by node 104 and once by node 120, to determine that node 108 is alive. This results in two keep-alive messages per k seconds arriving at node 108 from nodes 104 and 120. Of course, node 108 may be connected to other nodes that also send keep-alive messages, thus further increasing the number of messages received by node 108.
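The message load of the BK mechanism may be illustrated with a short counting sketch. The function name bk_load and the adjacency-dictionary representation are assumptions made for this example; the sketch further assumes that every link is probed in both directions and that every query is answered by exactly one reply.

```python
def bk_load(adjacency):
    """Return (incoming queries per node, total messages) for one interval k.

    `adjacency` maps each node to the set of its neighbours (hypothetical input
    format); every neighbour must itself appear as a key.
    """
    incoming = {node: 0 for node in adjacency}
    for sender, neighbours in adjacency.items():
        for target in neighbours:
            incoming[target] += 1       # one query from sender to target
    queries = sum(incoming.values())
    return incoming, 2 * queries        # each query is answered by one reply


# Nodes 104 and 120 are both connected to node 108, as in the example above.
adjacency = {104: {108}, 120: {108}, 108: {104, 120}}
incoming, total_messages = bk_load(adjacency)
```

Under these assumptions the number of queries arriving at a node in each interval equals the size of its neighbourhood set, so the load on a node grows with its inter-connection degree.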
Although the BK mechanism is intuitive and easy to implement, an increase in the system size or in the inter-connection degree introduces a large amount of additional keep-alive signalling traffic, which degrades the performance of the distributed system.
To address this limitation of the basic keep-alive mechanism, Dedinski et al. (“Cooperative Keep-Alives: An Efficient Outage Detection Algorithm for P2P Overlay Networks,” Peer-to-Peer Computing, 2007) have proposed a Cooperative Keep-alive (CK) mechanism. In this mechanism, all the nodes from the neighbourhood set of a target node continuously send keep-alive requests to the target node, and the target node is configured to reply to the nodes from the neighbourhood set to confirm that the target node is still alive. The requests are sent with a certain frequency, controlled by the target node. The goal of the target node is to ensure that the interval between incoming keep-alive requests stays close to the desired constant interval k, independently of the (usually changing) size of its neighbourhood set.
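One way in which a target node could keep the spacing of incoming requests close to k is sketched below. This is an assumed, simplified scheduling rule written for illustration, not the algorithm published by Dedinski et al.: the target hands out the next free time slot in each reply and advances that slot by k. The names ReceiverTask, handle_request and simulate are hypothetical.

```python
import heapq


class ReceiverTask:
    """Hypothetical receiver that spaces incoming requests k seconds apart."""

    def __init__(self, k, start=0.0):
        self.k = k
        self.next_slot = start

    def handle_request(self, now):
        """Reply with the time at which this sender should send its next request."""
        self.next_slot = max(self.next_slot, now) + self.k
        return self.next_slot


def simulate(num_neighbours, k, num_requests):
    """Return the arrival times of the first `num_requests` requests."""
    receiver = ReceiverTask(k)
    # Every neighbour sends its first request at time 0 (worst-case burst).
    pending = [(0.0, n) for n in range(num_neighbours)]
    heapq.heapify(pending)
    arrivals = []
    while len(arrivals) < num_requests:
        now, neighbour = heapq.heappop(pending)
        arrivals.append(now)
        next_time = receiver.handle_request(now)
        heapq.heappush(pending, (next_time, neighbour))
    return arrivals


arrivals = simulate(num_neighbours=3, k=2.0, num_requests=9)
```

After the initial burst, each neighbour in this sketch sends once every 3k seconds while the target receives one request every k seconds; with more neighbours, each neighbour simply sends less often.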
This target-controlled request rate is achieved by running two tasks at every node in the system, a sender task and a receiver task. Because every node in the network runs both tasks, the system is symmetric, i.e., there are no client or server roles. The main function of the sender task at a given node is to send keep-alive requests, at pre-set times, to the receiver tasks of the nodes in the neighbourhood of the given node, and to process the replies. The sender task has a timetable, called the sender schedule, in which the sending times are stored. The time for sending the next request to a particular neighbour node is extracted from the last keep-alive reply received from that neighbour node. If the sender task of the given node sends a request to another node and the given node does not receive a reply from that other node, the request is repeated, at most r times, where r is a pre-defined retry count. After r retries, the sender task of the given node detects an outage of the other node and broadcasts this information to all neighbours of the other node by a sequential flooding technique.
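The retry and outage-notification behaviour of the sender task may be sketched as follows. The function names and the callbacks send_request and notify are assumptions made for this example, and the sequential flooding technique is simplified here to notifying the known neighbours one after another.

```python
def probe_with_retries(send_request, r):
    """Send one request and repeat it at most r times.

    `send_request` is a hypothetical callable that returns True when a reply
    arrives. Returns True if any of the 1 + r attempts is answered.
    """
    for _attempt in range(1 + r):
        if send_request():
            return True
    return False


def sender_round(target, neighbours_of_target, send_request, notify, r=3):
    """Probe `target`; on outage, inform its known neighbours one by one."""
    if probe_with_retries(send_request, r):
        return "alive"
    for neighbour in neighbours_of_target:  # simplified sequential flooding
        notify(neighbour, target)
    return "outage"


attempts = []
notified = []


def dead_node():
    attempts.append(1)   # record the attempt; a failed node never replies
    return False


def record_notification(neighbour, failed):
    notified.append((neighbour, failed))


status = sender_round(108, [120, 110], dead_node, record_notification, r=3)
```

With r = 3, the sketch makes four attempts in total (the original request plus three retries) before declaring an outage and notifying nodes 120 and 110.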
Although the above mechanism reduces the number of keep-alive messages exchanged among the nodes compared with the basic keep-alive mechanism, it still has the following disadvantages. First, the keep-alive message is unidirectional, i.e., a node needs to actively send a request message to retrieve the status of each of its neighbours.
Second, after a node fails, the failure information is sent by the given (detecting) node to all known neighbour nodes of the failed node. However, when sending out such information, it is possible that the given node cannot directly communicate with all the known neighbour nodes of the failed node. In this case, the unreachable nodes do not receive the failure information and can only detect the failed node by themselves, which requires further messages. Thus, according to this mechanism, it will take longer for those nodes to detect the failed node. This problem is neither considered nor solved by the above mechanism.
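The second disadvantage may be illustrated with a small sketch that separates the neighbours of a failed node into those the detecting node can inform directly and those it cannot. The function name and the set-based representation of links are assumptions made for this example.

```python
def split_by_reachability(detector, detector_links, failed_neighbours):
    """Return (informed, uninformed) neighbours of the failed node.

    `detector_links` is the set of nodes the detecting node can reach directly;
    `failed_neighbours` is the known neighbourhood set of the failed node.
    """
    targets = set(failed_neighbours) - {detector}
    informed = targets & set(detector_links)
    uninformed = targets - informed   # must detect the failure on their own
    return informed, uninformed


# Node 104 detects that node 108 has failed. Node 104 has direct links to
# nodes 108, 110 and 112; node 108 was connected to nodes 104, 110 and 120.
informed, uninformed = split_by_reachability(
    detector=104,
    detector_links={108, 110, 112},
    failed_neighbours={104, 110, 120},
)
```

In this example node 110 learns of the failure from node 104, whereas node 120 receives no notification and discovers the outage only through its own keep-alive requests and retries.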
Thus, there is a need to develop new and efficient keep-alive and failure detection mechanisms that reduce the failure detection time and the signalling cost in large-scale distributed networks or systems and, at the same time, preserve the effectiveness and reliability of the basic keep-alive mechanism.