The present invention relates generally to component-based applications, and relates more specifically to fault tolerance techniques for stream processing applications, which are one type of component-based application.
Stream processing applications have emerged as a paradigm for analyzing streaming data (e.g., audio, video, sensor readings, and business data) in real time. Stream processing applications are typically built as data-flow graphs comprising interconnected stream operators that implement analytics over the incoming data streams. Each of these operators is a component.
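By way of illustration only, the data-flow structure described above can be sketched in Python as a chain of operators, each of which consumes one stream and produces another. This is a minimal sketch, not an implementation of any particular stream processing system: the names `scale`, `threshold`, and `connect` are hypothetical, and a linear chain is only the simplest case of a data-flow graph.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical operator type: a component that maps an input stream
# to an output stream.
Operator = Callable[[Iterator[float]], Iterator[float]]


def scale(factor: float) -> Operator:
    """Build an operator that multiplies every tuple by `factor`."""
    def op(stream: Iterator[float]) -> Iterator[float]:
        for value in stream:
            yield value * factor
    return op


def threshold(limit: float) -> Operator:
    """Build an operator that forwards only tuples greater than `limit`."""
    def op(stream: Iterator[float]) -> Iterator[float]:
        for value in stream:
            if value > limit:
                yield value
    return op


def connect(source: Iterable[float], operators: List[Operator]) -> Iterator[float]:
    """Interconnect operators so that each one feeds the next."""
    stream = iter(source)
    for op in operators:
        stream = op(stream)
    return stream


# Assumed sample data standing in for sensor readings.
readings = [1.0, 2.5, 4.0, 0.5]
results = list(connect(readings, [scale(2.0), threshold(4.0)]))
print(results)
```

Because each operator only sees its input stream, an operator can in principle be replaced or restarted without changing its neighbors, which is what makes the failure of a single operator a meaningful unit of analysis.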
During operation of a stream processing application, a stream operator may fail (i.e., stop executing its operations or responding to other operators) for any one or more of several reasons, including, but not limited to: a heisenbug (i.e., a computer bug that disappears or alters its characteristics when an attempt is made to study it) in the stream operator code (e.g., a timing error), a node failure (e.g., a power outage), a kernel failure (e.g., a device driver crashes and forces a machine reboot), a transient hardware failure (e.g., a memory error corrupts an application variable and causes the stream processing application to crash), or a network failure (e.g., the network cable gets disconnected, and no other node can send data to the operator).
Fault tolerance techniques of varying strictness are used to ensure that stream processing applications generate semantically correct results even in the presence of failures. For instance, sensor-based patient monitoring applications require rigorous fault tolerance, since data loss or computation errors may lead to catastrophic results. By contrast, an application that discovers caller/callee pairs by data mining a set of Voice over Internet Protocol (VoIP) streams may still be able to infer the caller/callee pairs despite packet loss or user disconnections (although with less confidence). Applications of this second type, which can tolerate some data loss, are referred to as “partial fault tolerant.” Moreover, in some stream processing applications, it is better to produce partial results sooner than to produce complete results later.
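The caller/callee example can be made concrete with a small Python sketch. It is an assumption-laden illustration rather than the method of any real VoIP analysis application: the packet fields `src` and `dst` are hypothetical, packet loss is simulated by dropping every other packet, and the number of supporting packets is used as a crude stand-in for confidence.

```python
from collections import Counter


def infer_pairs(packets):
    """Count how many observed packets support each (caller, callee) pair.

    The count is a simple proxy for confidence (an assumption of this sketch).
    """
    return Counter((p["src"], p["dst"]) for p in packets)


# Assumed sample stream: four packets for one call, two for another.
packets = (
    [{"src": "alice", "dst": "bob"}] * 4
    + [{"src": "carol", "dst": "dave"}] * 2
)

full = infer_pairs(packets)
lossy = infer_pairs(packets[::2])  # simulate loss of every other packet

print(full)
print(lossy)
```

In this sketch the same pairs are still discovered after the simulated loss, but each pair is supported by fewer packets, which mirrors the notion of a correct result obtained with less confidence.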