1. Technical Field
The present invention relates to failure management in distributed systems and, more particularly, to systems and methods for using continuous failure predictions to provide proactive failure management for distributed cluster systems.
2. Description of the Related Art
Many emerging applications, such as sensor data analysis and network traffic monitoring, call for sophisticated real-time processing over dynamic data streams. In these applications, data streams from external sources flow into a data stream management system where they are processed by different continuous query (CQ) operators (e.g., join, union, aggregation). Distributed cluster systems have been developed to achieve scalable CQ processing. However, cluster systems are vulnerable to various software and hardware failures, and any node failure may stall the query processing. For example, in a deployed cluster system consisting of several hundred processing nodes, the system log recorded 69 significant failure incidents during a one-month period.
At the same time, fault tolerance is of critical importance for CQ processing, which requires continuous system operation. System administrators are often overwhelmed by the task of recovering the system from failures under time pressure. Thus, it is imperative to provide autonomic failure management for large-scale cluster systems to achieve fault-tolerant CQ processing.
Previous work on fault-tolerant cluster systems has focused on developing specific failure management schemes. The Flux operator in Shah et al., “Highly-available, fault-tolerant, parallel dataflows.” Proc. of SIGMOD, 2004, encapsulates coordination logic for fail-over and recovery into an opaque operator used to compose parallelized dataflows. Flux achieves highly available dataflow processing by allowing the processing of unaffected partitions to continue while recovering the operations of failed partitions. The Borealis project described in Abadi et al., “The Design of the Borealis Stream Processing Engine.” Proc. of CIDR, 2005, also extensively addressed the problem of fault-tolerant stream processing. Balazinska et al., in “Fault-Tolerance in the Borealis Distributed Stream Processing System.” Proc. of SIGMOD, 2005, proposed a replication-based approach to fault-tolerant distributed stream processing, which allows a user to trade availability for consistency.
Hwang et al., in “High Availability Algorithms for Distributed Stream Processing.” Proc. of ICDE, 2005, studied various failure recovery approaches and quantitatively characterized the tradeoffs between their runtime overhead and recovery time. Hwang et al., in “A Cooperative, Self-Configuring High-Availability Solution for Stream Processing.” Proc. of ICDE, 2007, also proposed a parallel backup and recovery approach that performs cooperative checkpointing and failure recovery for one query using a set of servers.
Failure prediction has been studied in different contexts. The SMART project described in Murray et al., “Comparison of machine learning methods for predicting failures in hard drives.” Journal of Machine Learning Research, 2005, studied different nonparametric statistical approaches to disk failure prediction. Since prediction errors for disk failures incur significant financial penalties, the focus of SMART is on selecting proper features to avoid false alarms as much as possible. Software rejuvenation, as described in Vaidyanathan et al., “Analysis and Implementation of Software Rejuvenation in Cluster Systems.” Proc. of SIGMETRICS, 2004, is another proactive technique that periodically stops running software, cleans its internal state, and restarts it to prevent unexpected failures due to software aging.
Cai et al., in “Utility-Driven Proactive Management of Availability in Enterprise-Scale Information Flows.” Proc. of Middleware, 2006, propose a utility-driven approach to augmenting the active-passive scheme with tunability between normal operation cost and recovery.