The present invention relates generally to component-based applications, and relates more specifically to the deployment of fault tolerance techniques in stream processing applications (a particular type of component-based application).
The stream processing paradigm is employed to analyze streaming data (e.g., audio, sensor readings, news feeds, financial transactions, and events from manufacturing plants, telecommunications plants, or water distribution systems) in real time. An example of a stream processing system is the INFOSPHERE STREAMS middleware commercially available from International Business Machines Corporation of Armonk, N.Y., which runs applications written in the Streams Processing Language (SPL).
High availability is critical to stream processing systems, since they process continuous live data. Developers build streaming applications by assembling stream operators into data flow graphs, which can be distributed over a set of nodes to achieve high performance and scalability. Because incoming streams typically arrive at high data rates, a fault in a computing node or in a stream operator can result in massive data loss.
While many fault tolerance techniques for stream computing guarantee no data loss, partial fault tolerance techniques assume that a certain amount of stream data loss and duplication (i.e., multiple delivery of the same data item) between stream operators is acceptable under faulty conditions. This assumption allows such techniques to reduce the performance impact of the additional logic required to ensure application reliability. Partial fault tolerance techniques avoid full replication of the stream processing graph, either by replicating only some of its components or by not checkpointing the whole state of the application (i.e., the internal state of the stream operators and the state of the communication channels). The rationale is that many streaming applications tolerate data imprecision by design and, as a result, can still operate under data loss or duplication.
Although more resource-efficient than techniques that guarantee no data loss, partial fault tolerance is not viable without a clear understanding of the impact of faults on the application output.
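By way of illustration only, the effect of data loss on an application output can be sketched with a toy example. The following Python sketch is not taken from any stream processing system; the operator `tumbling_average`, the helper `drop_range`, and the input values are hypothetical and are chosen solely to show how an outage in an upstream operator can change the results emitted downstream.

```python
from statistics import mean


def tumbling_average(stream, window_size):
    """Hypothetical stream operator: emit the mean of each full,
    non-overlapping (tumbling) window of data items."""
    window = []
    results = []
    for item in stream:
        window.append(item)
        if len(window) == window_size:
            results.append(mean(window))
            window = []
    # Items left in an incomplete window are never emitted.
    return results


def drop_range(stream, start, stop):
    """Hypothetical fault model: data items whose position falls in
    [start, stop) are lost, as if an upstream operator were down."""
    return [x for i, x in enumerate(stream) if not (start <= i < stop)]


readings = list(range(1, 13))  # illustrative sensor readings 1..12

# Fault-free run: three full windows of four items each.
baseline = tumbling_average(readings, 4)

# Faulty run: the items at positions 4 and 5 (values 5 and 6) are lost.
faulty = tumbling_average(drop_range(readings, 4, 6), 4)

print(baseline)  # [2.5, 6.5, 10.5]
print(faulty)    # [2.5, 8.5]
```

In this sketch, losing two data items shifts the window boundaries, so the second result changes from 6.5 to 8.5 and the third result is never emitted. Whether a deviation of this kind is acceptable depends on the application, which is why the impact of faults on the output must be understood before partial fault tolerance is adopted.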