Distributed computing technologies continue to extend into new application areas. Distributed data stream processing systems are widely used in many fields, such as financial management, network monitoring, communication data management, web applications, and sensor network data processing. A distributed stream processing system is a network software system that uses a distributed hardware system to process data stream services. Fault tolerance of such a system is its capability to keep providing a correct service to the external environment even when a fault occurs inside the system. A fault tolerance method is therefore a main means of enhancing the reliability and availability of the distributed stream processing system: when some of the working nodes inside the system fail, the system automatically recovers from the failure, and normal operation of the application system that relies on the distributed system is not severely affected.
Conventionally, a distributed stream processing system uses one of the following fault tolerance methods:
(1) Centralized data backup. Data is backed up on a source node. After the stream processing network of the system is recovered, the source node resends a segment of the data that it sent before the failure, and each working node in the system again receives and processes the data resent by the source node or by its upstream working node.
(2) Distributed data backup. Each working node in the system backs up the data that it processed during a previous time period. After the stream processing network of the system is recovered, each working node resends its backup data to each of its downstream working nodes. Each downstream working node again receives and processes the data resent by its upstream working node, and sends a data processing completion message to the upstream working node after successfully receiving the resent data. The upstream working node deletes the backup data after receiving the data processing completion message from the downstream working node.
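By way of illustration only, the two conventional methods above may be sketched as follows. The sketch assumes a linear chain of working nodes and in-memory message passing. All names in it (SourceNode, WorkingNode, run_pipeline, alive, and so on) are hypothetical and are not taken from any particular system.

```python
from collections import deque


def run_pipeline(item, pipeline):
    """Pass one item through every working node's operation, in order."""
    for op in pipeline:
        item = op(item)
    return item


class SourceNode:
    """Method (1): only the source node keeps a backup of recently sent data."""

    def __init__(self, backup_size):
        # Assumed bounded buffer holding the most recently sent items.
        self.backup = deque(maxlen=backup_size)

    def send(self, item, pipeline):
        self.backup.append(item)
        return run_pipeline(item, pipeline)

    def replay(self, pipeline):
        # After recovery, every backed-up item is processed again by
        # every working node in the pipeline.
        return [run_pipeline(item, pipeline) for item in self.backup]


class WorkingNode:
    """Method (2): each node backs up what it sent until it is acknowledged."""

    def __init__(self, name, op, downstream=None):
        self.name = name
        self.op = op
        self.downstream = downstream
        self.alive = True   # assumed failure flag, for illustration only
        self.backup = {}    # sequence number -> output awaiting completion
        self.results = []   # outputs collected by the last node in the chain

    def process(self, seq, item, upstream=None):
        if not self.alive:
            return  # data is lost and no completion message is sent
        out = self.op(item)
        if self.downstream is not None:
            self.backup[seq] = out
            self.downstream.process(seq, out, upstream=self)
        else:
            self.results.append((seq, out))
        if upstream is not None:
            # Data processing completion message to the upstream node.
            upstream.on_completion(seq)

    def on_completion(self, seq):
        # The upstream node deletes its backup once completion is reported.
        self.backup.pop(seq, None)

    def resend_backup(self):
        # After recovery, resend everything that was never acknowledged.
        for seq, out in sorted(self.backup.items()):
            self.downstream.process(seq, out, upstream=self)
```

In this sketch, a SourceNode with a backup size of 2 that has sent the items 1, 2, and 3 through a two-stage pipeline replays only the items 2 and 3 after recovery, and both stages process them again. A WorkingNode whose downstream neighbour is marked as not alive keeps the unacknowledged output in its backup, and the backup is emptied only after resend_backup() delivers the output and the completion message comes back.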
In the foregoing method (1), after the stream processing network of the system is recovered, the entire network of the system needs to be rolled back: the source node resends data, and every other working node again receives and processes the data resent by the source node or by an upstream working node. This reduces the data processing efficiency of the entire network and wastes node resources. In the foregoing method (2), each working node needs to back up the data that it has processed, which results in large storage overheads. In addition, frequent interaction is required between an upstream working node and a downstream working node, which ultimately leads to low data processing efficiency.
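The overheads described above can be made concrete with a simple counting model. The following is a rough sketch only, under assumptions that are not stated in the original text: the working nodes form a linear chain, every item passes through every node, and one completion message is sent per item per link. The function names and returned fields are hypothetical.

```python
def centralized_backup_overhead(num_working_nodes, replayed_items):
    """Method (1): the source alone stores the backup, but after recovery
    every working node processes every replayed item again."""
    return {
        "backup_items_stored": replayed_items,
        "items_reprocessed": num_working_nodes * replayed_items,
        "completion_messages": 0,
    }


def distributed_backup_overhead(num_working_nodes, items_per_period):
    """Method (2) on a linear chain: every node that has a downstream
    neighbour stores each item it processed during the period and
    receives one completion message per item."""
    nodes_with_downstream = max(num_working_nodes - 1, 0)
    return {
        "backup_items_stored": nodes_with_downstream * items_per_period,
        "completion_messages": nodes_with_downstream * items_per_period,
    }
```

Under these assumptions, with 5 working nodes and 100 items, method (1) stores 100 items but causes 500 item-processing operations to be repeated, whereas method (2) stores 400 items across the nodes and exchanges 400 completion messages.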