Distributed data processing systems, particularly real-time streaming systems, are becoming increasingly popular. Modern real-time streaming systems, such as Storm, Pivotal Spring X, Spark Streaming, and Samza, among others, are widely applied to e-commerce, big data analysis, and data extracting, transforming and loading (ETL). It is both common and important to provide a reliable processing capability so that each message is guaranteed to be processed even if a node or network connection fails. One of the key challenges for such a distributed system is how to detect a failure in the system efficiently, with the lowest cost and performance impact, particularly in a large system with thousands of nodes and interconnections.
The prior art generally addresses the problems above in the following ways. One method is for each processing node to report, to a tracking task, a status for each message it issues. The tracking task then maintains that status by tracking each issued message and the relationships between nodes. Each derived message is expected to be processed within a set timeout. This method is straightforward but inefficient. For each input message, each node incurs additional reporting traffic, and the logic of the tracking task becomes quite complicated, so that it may consume considerable central processing unit (CPU) and memory resources. An improved method, described below with reference to FIG. 2, is referred to as an XOR-based algorithm. This algorithm can dramatically lower the complexity of the tracking task and the memory resources it consumes, but it still has various problems, such as limited scalability, considerable network traffic overhead, and end-to-end delay.
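For illustration only, the XOR-based tracking idea can be sketched as follows. This is a minimal sketch loosely modeled on the acknowledgement scheme publicly documented for Storm, not a reproduction of the algorithm of FIG. 2. The class name `XorTracker`, the method names, and the small integer message identifiers are hypothetical; a real system would typically use random 64-bit identifiers. The tracking task keeps a single accumulator per input message, every report is XORed into it, and the accumulator returns to zero once every issued message has also been reported as processed.

```python
class XorTracker:
    """Hypothetical tracking task: one XOR accumulator per input message."""

    def __init__(self):
        # Maps the id of an input (root) message to the running XOR of all
        # message ids reported so far for that input message.
        self._pending = {}

    def report(self, root_id, xor_value):
        """Fold a node's report into the accumulator.

        xor_value is the XOR of the id of the message the node finished
        processing and the ids of any derived messages it issued.
        Returns True once the accumulator reaches zero, meaning every
        issued message has also been reported as processed.
        """
        value = self._pending.get(root_id, 0) ^ xor_value
        if value == 0:
            self._pending.pop(root_id, None)
            return True
        self._pending[root_id] = value
        return False

    def is_pending(self, root_id):
        """True while some message derived from root_id is unprocessed."""
        return root_id in self._pending


# Usage: message a is issued; one node processes a and issues b and c;
# two further nodes process b and c respectively.
a, b, c = 0x1, 0x2, 0x4          # illustrative ids (assumed, not random)
tracker = XorTracker()
tracker.report("m1", a)          # source issues a
tracker.report("m1", a ^ b ^ c)  # a processed, b and c issued
tracker.report("m1", b)          # b processed
done = tracker.report("m1", c)   # c processed, accumulator is now zero
```

The memory held per input message is a single integer regardless of how many derived messages exist, which is why the tracking task stays simple. Every node must still send a report to the tracking task for each message it handles, so the network traffic overhead remains.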
Therefore, a more effective and scalable method is desirable in the field to solve the problems above.