Distributed computing systems are designed in particular to enable parallel computation. A given application is seen as a topology of nodes, each of which executes tasks. Message passing between the nodes ensures that data flows through the topology and that input data are processed so as to produce the required output data. Examples of distributed computing systems are Hadoop, which is batch-oriented, and Storm, which is stream-oriented.
Most distributed computing systems have to be highly available, either for a certain time period during the day, e.g. a stock exchange ordering computing system, or even 24/7, e.g. a supercomputing cluster. Therefore, one of the major problems faced by distributed computing systems is node failure. Whenever a node fails, two actions are usually required: First, the status that the failed node had before the failure has to be restored on a new node. Second, the topology of the application has to be updated so that data can flow through the new node, i.e., the node that has replaced the one that failed. After completion of these two actions, computation can resume and continue.
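The two recovery actions can be illustrated with a minimal Python sketch. The dictionary-based topology, the `recover` function and the node names are assumptions made purely for illustration; they are not taken from any particular distributed computing system.

```python
def recover(topology, states, failed, replacement, saved_status):
    """Hypothetical helper: replace the node `failed` by the node `replacement`.

    topology maps each node to the list of its downstream nodes;
    states maps each node to its status.
    """
    # Action 1: restore the pre-failure status on the new node.
    new_states = {n: s for n, s in states.items() if n != failed}
    new_states[replacement] = saved_status

    # Action 2: update the topology so that data flows through the new node.
    def swap(node):
        return replacement if node == failed else node

    new_topology = {
        swap(node): [swap(d) for d in downstream]
        for node, downstream in topology.items()
    }
    return new_topology, new_states


if __name__ == "__main__":
    topology = {"a": ["b"], "b": ["c"], "c": []}
    states = {"a": {}, "b": {"count": 5}, "c": {}}
    print(recover(topology, states, "b", "b2", {"count": 5}))
```

The sketch simply assumes that the pre-failure status is available as `saved_status`. Where that status comes from, and at what cost, is what distinguishes the recovery techniques from one another.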
To overcome this problem, conventional solutions provide recovery from a node failure at the expense of increased resource usage, in terms of network input/output, central processing unit resources and memory resources, as well as spare nodes that have to be kept available.
These conventional techniques comprise, for example, instantiating a new topology and restarting the whole computation from the beginning on the initial input data. This is the default technique adopted by Storm. Another conventional technique applies redundant computation, such as active or passive standby. In active standby, for example, the distributed computing system launches the same task on multiple nodes, and the result of the task can be taken from a redundant node in case of a node failure. In passive standby, the status of a node, including the status of all its output queues, is copied to other backup nodes, which stay idle and eventually replace the node in case it leaves the distributed computing system, in particular in case of a failure.
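The difference between active and passive standby can be sketched in a few lines of Python. All names below (`active_standby`, `Node`, `PassiveBackup`) are hypothetical and chosen for illustration only; the sketch runs in a single process and does not model networking or failure detection.

```python
import copy


def active_standby(task, data, nodes, failed):
    """Launch the same task on every node and take the result from a live one."""
    # Every node performs the full computation; this is the redundancy cost.
    results = {node: task(data) for node in nodes}
    for node in nodes:
        if node not in failed:
            return node, results[node]
    raise RuntimeError("all redundant nodes failed")


class Node:
    """Hypothetical stateful node: a running total plus an output queue."""

    def __init__(self):
        self.total = 0
        self.output_queue = []

    def process(self, value):
        self.total += value
        self.output_queue.append(self.total)


class PassiveBackup:
    """Idle backup node that only stores copies of the primary's status."""

    def __init__(self):
        self.snapshot = None

    def checkpoint(self, node):
        # Copy the status, including the output queue, to the backup.
        self.snapshot = copy.deepcopy((node.total, node.output_queue))

    def promote(self):
        # Build the replacement node from the most recent copy.
        if self.snapshot is None:
            raise RuntimeError("no status has been copied yet")
        node = Node()
        node.total, node.output_queue = copy.deepcopy(self.snapshot)
        return node


if __name__ == "__main__":
    print(active_standby(sum, [1, 2, 3], ["n1", "n2"], failed={"n1"}))
    primary, backup = Node(), PassiveBackup()
    primary.process(1)
    primary.process(2)
    backup.checkpoint(primary)
    primary.process(3)  # not yet copied when the primary fails
    print(backup.promote().total)
```

In this sketch, active standby pays for the task once per node, whereas passive standby pays for copying the status and, as the last step of the usage example shows, loses whatever was processed after the most recent copy.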
A further conventional technique is the so-called upstream backup technique, adopted by the Borealis platform. Given the topology according to FIG. 1, in which a node a is connected to a node b and the node b is connected to a node c in a sequential manner, in the upstream backup technique the node a keeps the messages for the node b until the node c has received the output messages computed by the node b. The node b therefore maintains information about the relations of all <input, output> messages: the node b waits for an acknowledgement from the node c for a given output message that the node b sent to the node c before sending an acknowledgement for the corresponding input message to the node a. At this point, the node a removes the acknowledged message from its output queue. In case of a failure, a new node with a clean state/status takes over, and the status before the failure is recomputed by replaying all messages held by the upstream node a.
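The acknowledgement chain and the replay step of the upstream backup technique can be sketched as follows. This is a simplified, single-process Python illustration and not the Borealis implementation; the class names `Upstream`, `Middle` and `Sink` (standing for the nodes a, b and c), the running-total task and the one-output-per-input mapping are all assumptions.

```python
class Upstream:
    """Node a: keeps every sent message until it has been acknowledged."""

    def __init__(self):
        self.output_queue = {}  # message id -> payload, in sending order
        self._next_id = 0

    def send(self, payload, node_b):
        msg_id = self._next_id
        self._next_id += 1
        self.output_queue[msg_id] = payload
        node_b.receive(msg_id, payload)

    def ack(self, msg_id):
        # Acknowledged messages are removed from the output queue.
        self.output_queue.pop(msg_id, None)

    def replay(self, node_b):
        # Resend all messages that are still held, in their original order.
        for msg_id, payload in list(self.output_queue.items()):
            node_b.receive(msg_id, payload)


class Middle:
    """Node b: stateful task that tracks the <input, output> relations."""

    def __init__(self, node_a, node_c):
        self.node_a, self.node_c = node_a, node_c
        self.total = 0        # status of the node
        self.out_to_in = {}   # output id -> input id

    def receive(self, in_id, payload):
        self.total += payload
        out_id = in_id  # assumption: exactly one output per input
        self.out_to_in[out_id] = in_id
        self.node_c.receive(out_id, self.total, self)

    def ack(self, out_id):
        # Acknowledge the input only after node c acknowledged the output.
        self.node_a.ack(self.out_to_in.pop(out_id))


class Sink:
    """Node c: acknowledgements are released explicitly to mimic delay."""

    def __init__(self):
        self.received = []
        self.pending = []

    def receive(self, out_id, payload, sender):
        self.received.append(payload)
        self.pending.append((out_id, sender))

    def release_acks(self):
        for out_id, sender in self.pending:
            sender.ack(out_id)
        self.pending.clear()


if __name__ == "__main__":
    a, c = Upstream(), Sink()
    b = Middle(a, c)
    for value in (1, 2, 3):
        a.send(value, b)
    c.pending.clear()   # node b fails; acknowledgements addressed to it are lost
    b = Middle(a, c)    # new node with clean status takes over
    a.replay(b)         # status is recomputed from the messages held by node a
    c.release_acks()
    print(b.total, len(a.output_queue))
```

Note that in this sketch the replay delivers the outputs to node c a second time, so a practical system would additionally need some form of duplicate handling downstream.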
However, one of the drawbacks is that there is no way to pick up a computation right where it was left off in case of a failure without introducing expensive redundancy and strategies for choosing which tasks have to be duplicated.