1. Technical Field
The present disclosure relates to data streaming and, more particularly, to methods and systems for fault-tolerant distributed stream processing.
2. Discussion of Related Art
Real-time data processing systems have been developed for a variety of uses and applications. Distributed computer systems have been widely employed to run jobs on multiple machines or processing nodes, which may be co-located within a single cluster or geographically distributed over wide areas, where some jobs run over long periods of time. During the lifetime of a job, machines and infrastructure, e.g., various hardware and software, can arbitrarily fail, yet many stream processing applications require results to be continually produced, which means that the system may need to continue operation and make forward progress in the computation even after one or more components and/or processes have failed. In spite of advances in middleware, data processing on highly distributed, and often faulty, infrastructure can be challenging. In principle, the underlying middleware should restart or reschedule tasks after transparently recovering from the failure events such that processing continues where the failed job left off.
A variety of fault-tolerant techniques and systems for processing data streams, even in the face of high and variable input data rates, have been developed, e.g., to meet the demands of real-time applications. Some systems implementing fault-resilient processing break up a computational job into many tasks which can be processed independently to achieve desired reliability. Each task has a defined input set and generates an output set. A job coordinator ensures that all tasks are executed at least once and all results are computed before the next, dependent job gets started. An example of a system and method for large-scale data processing including operations for automatically handling fault-recovery is MapReduce, disclosed in U.S. Pat. No. 7,650,331, entitled “System and method for efficient large-scale data processing”.
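By way of illustration only, the task-based approach described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of the referenced patent: the names `run_job`, `make_flaky_task`, and `max_attempts` are hypothetical, and a node failure is simulated by raising a `RuntimeError`.

```python
from typing import Callable, Dict, List


def make_flaky_task(data: List[int], failures: int) -> Callable[[], List[int]]:
    """Hypothetical task: fails `failures` times, then squares its input set."""
    state = {"remaining": failures}

    def task() -> List[int]:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated node failure")
        return [x * x for x in data]

    return task


def run_job(tasks: Dict[str, Callable[[], List[int]]],
            max_attempts: int = 3) -> Dict[str, List[int]]:
    """Hypothetical coordinator: re-executes each independent task until it
    succeeds, and returns only after every result has been computed, so that
    a dependent job could safely start afterwards."""
    results: Dict[str, List[int]] = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = task()
                break  # task completed; move on to the next one
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up once the retry budget is exhausted
    return results
```

Because each task has a defined input set and is independent of the others, a failed task can simply be executed again without affecting the remaining tasks.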
In a streaming environment with a distributed computational model, data is continuously injected into a set of operators (e.g., filters, aggregates, and correlations) which then produce result sets that are sent either to applications or to other nodes for additional processing. When a stream goes from one node to another, the nodes are generally referred to as upstream and downstream neighbors. Typically, for efficiency reasons, data cannot be persisted to disk in a manner that would allow all operations to be restarted. In the event of a machine or process failure, the current state of the operator is lost and needs to be recovered through means other than persistent storage in order to continue processing.
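For illustration, an upstream filter operator feeding a downstream aggregate operator may be sketched as follows. The names `filter_op` and `RunningSum` are hypothetical; the sketch assumes a single process and is intended only to show that operator state held in memory does not survive a restart.

```python
from typing import Callable, Iterable, Iterator


def filter_op(stream: Iterable[int],
              predicate: Callable[[int], bool]) -> Iterator[int]:
    """Upstream operator: forwards only the items that satisfy the predicate."""
    for item in stream:
        if predicate(item):
            yield item


class RunningSum:
    """Downstream aggregate operator whose state is held only in memory."""

    def __init__(self) -> None:
        self.total = 0  # operator state; nothing is persisted to disk

    def process(self, stream: Iterable[int]) -> Iterator[int]:
        for item in stream:
            self.total += item
            yield self.total
```

If the process hosting `RunningSum` fails, a newly created instance starts again with `total == 0`, so the accumulated state would have to be recovered by some other means.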
A common approach to achieving fault tolerance involves the introduction of a redundant operator, which must be located on a distinct machine to survive a node failure. Using this approach, all data pushed from operator to operator must be replicated, and successful receipt of data must be confirmed to achieve fault tolerance. A replication-based approach to fault-tolerant distributed stream processing generally requires at least twice the computational resources and network bandwidth in the common mode of operation. The job scheduler needs to be aware of the peers and must not place the peers on the same node, which generally increases the complexity of the job scheduler. In the case of large systems, a replication-based approach scales poorly with increasing cluster size due to limitations of the switching fabric. Frequently-communicating nodes should be placed in proximity, and yet, for reliability reasons, should use different switching infrastructure. Satisfying both requirements under a full replication approach places a substantial burden on the switching infrastructure.
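The cost of the replication-based approach described above may be illustrated with the following sketch. The names `ReplicaOperator`, `make_replica_pair`, and `replicated_push` are hypothetical, acknowledgement is modeled as a Boolean return value, and node placement is modeled as a simple string comparison; an actual job scheduler would be considerably more involved.

```python
from typing import Iterable, List


class ReplicaOperator:
    """Hypothetical operator replica bound to a named node."""

    def __init__(self, node: str) -> None:
        self.node = node
        self.received: List[int] = []

    def receive(self, item: int) -> bool:
        self.received.append(item)
        return True  # acknowledgement of successful receipt


def make_replica_pair(primary_node: str,
                      secondary_node: str) -> List[ReplicaOperator]:
    """Placement constraint: the peers must not share a node."""
    if primary_node == secondary_node:
        raise ValueError("peers must be placed on distinct nodes")
    return [ReplicaOperator(primary_node), ReplicaOperator(secondary_node)]


def replicated_push(items: Iterable[int],
                    replicas: List[ReplicaOperator]) -> int:
    """Push every item to every replica and require an acknowledgement.
    Returns the number of messages sent."""
    messages_sent = 0
    for item in items:
        for replica in replicas:
            acked = replica.receive(item)
            messages_sent += 1
            if not acked:
                raise RuntimeError("receipt of item was not confirmed")
    return messages_sent
```

In this sketch, pushing three items to a pair of replicas sends six messages, which reflects the doubling of network traffic noted above.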
In distributed stream processing systems, a node failure or switch outage may cause failure of one or more operators in the stream, which, in turn, may cause the stream process to fail. The system must consider all streams of a node in the event of a node failure. One common way of solving the reliability issue is by introducing operator redundancy, such as with a primary and a secondary operator. If the primary operator fails, then the secondary operator takes over the operation and the infrastructure creates a new backup replica. If the secondary operator fails, then the system simply creates a fresh replica. The system needs to keep the primary and secondary operators in lock-step to allow take-over in case of failure.
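The primary and secondary arrangement described above may be sketched as follows. The names `CountingOperator` and `RedundantOperator` are hypothetical, and the sketch assumes that a new replica is initialized by copying the state of the surviving operator, which is one possible choice and is not specified above.

```python
class CountingOperator:
    """Hypothetical stateful operator that counts the items it has seen."""

    def __init__(self, count: int = 0) -> None:
        self.count = count

    def process(self, item: object) -> int:
        self.count += 1
        return self.count

    def clone(self) -> "CountingOperator":
        # Assumption: a fresh replica is seeded from the survivor's state.
        return CountingOperator(self.count)


class RedundantOperator:
    """Keeps a primary and a secondary operator in lock-step."""

    def __init__(self) -> None:
        self.primary = CountingOperator()
        self.secondary = CountingOperator()

    def process(self, item: object) -> int:
        result = self.primary.process(item)
        self.secondary.process(item)  # lock-step: same input, same order
        return result

    def fail_primary(self) -> None:
        # The secondary takes over and a new backup replica is created.
        self.primary = self.secondary
        self.secondary = self.primary.clone()

    def fail_secondary(self) -> None:
        # Only the backup is lost, so a fresh replica is created.
        self.secondary = self.primary.clone()
```

Because both operators process every item, the secondary operator holds the same state as the primary operator at the moment of take-over, at the cost of performing every computation twice.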