1. Technical Field
The present invention relates to fault-tolerance of distributed data-processing systems and applications, and more particularly to a system and method for dependent failure-aware allocation of distributed data-processing systems.
2. Description of the Related Art
In networked infrastructures such as large clusters, distributed computing testbeds or the Internet, a large group of machines can all fail or lose connectivity together due to anomalies such as network disconnection, power failures, or internet-scale viruses and worms. Given that the distributed system/application has to run in this environment, the assignment of individual components of the distributed application to resources across the network has to be determined so that when groups of resources fail together, the output of the application is minimally affected. Here, the output of the system or application is the set of results produced by the data processing transformations applied by its components.
Hwang et al., in “High-Availability Algorithms for Distributed Stream Processing”, The 21st International Conference on Data Engineering (ICDE 2005), Tokyo, address the issue of fault recovery in stream processing systems by using techniques such as storing data and checkpointing. Research in stream processing systems has addressed resource allocation mainly for the purpose of making optimal use of available resources. Pietzuch et al., in “Network-Aware Operator Placement for Stream-Processing Systems”, Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), 2006, have devised an algorithm to allocate resources in the wide area to optimize network usage of a stream processing system.
Cetintemel et al., in “Providing resiliency to load variations in distributed stream processing”, Proceedings of the International Conference on Very Large Databases, 2006, describe a method to allocate resources in stream processing systems that tries to maximize system usage while not overloading individual machines. While the works of Pietzuch et al. and Cetintemel et al. are both relevant in that they address resource allocation in stream processing environments, neither considers the possibility of failure in its allocation strategy.
Rhee et al., in “Optimal fault-tolerant resource allocation in dynamic distributed systems”, Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing, 1995, describe an algorithm to allocate components to resources for high fault-tolerance in message passing systems. The algorithm considers a number of application/system components that are all vying for the same set of resources. There is an inherent assumption here that the number of resources is smaller than the number of components, and that the components have to share the resources in a time-ordered fashion. The algorithm described in Rhee et al. minimizes the number of components waiting for a given resource to be freed.
The Phoenix recovery system described in Junqueira et al., “Surviving Internet Catastrophes”, Proceedings of USENIX Annual Technical Conference, May 2005, shows that individual “processes” can be grouped into “clusters” based on which processes tend to fail together, all processes that can fail together being grouped into one cluster. The authors use this model to build a replication strategy for large-scale systems that preserves the system in the face of internet-wide virus and worm attacks. In particular, they form replica sets consisting of processes picked from many different heterogeneous clusters. This and other work on fault-tolerance in distributed systems assumes an all-or-nothing failure model for applications. That is, if any of the application components fails, the whole application fails and its output is zero.
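The cluster-based failure model and the all-or-nothing output assumption described above can be illustrated with a minimal Python sketch. This is not the algorithm of Junqueira et al.; the function names, the cluster labels, and the rule of taking the first process from each cluster are hypothetical choices made purely for illustration.

```python
from collections import defaultdict


def group_by_failure_cluster(process_cluster):
    """Group processes by the (assumed) label of the cluster they fail with."""
    clusters = defaultdict(list)
    for process, cluster in process_cluster.items():
        clusters[cluster].append(process)
    # Sort members so the result is deterministic.
    return {cluster: sorted(members) for cluster, members in clusters.items()}


def pick_replica_set(clusters, size):
    """Pick `size` processes, each taken from a different failure cluster."""
    if size > len(clusters):
        raise ValueError("not enough distinct failure clusters")
    # Illustrative rule only: first process of the first `size` clusters.
    return [clusters[name][0] for name in sorted(clusters)[:size]]


def all_or_nothing_output(components_alive):
    """All-or-nothing model: output is zero if any component has failed."""
    return 1.0 if all(components_alive) else 0.0


# Hypothetical processes labelled by the cluster they tend to fail with.
clusters = group_by_failure_cluster(
    {"p1": "cluster_a", "p2": "cluster_a", "p3": "cluster_b", "p4": "cluster_c"}
)
replicas = pick_replica_set(clusters, 2)
```

Because every replica in the selected set comes from a different cluster, a failure that takes down one whole cluster removes at most one replica. Under the all-or-nothing model, by contrast, the loss of a single component reduces the modeled output to zero regardless of how many other components survive.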