This invention relates, in general, to distributed processing, and in particular, to providing fault tolerance in distributed systems.
Fault-tolerant and dependable large-scale distributed systems are difficult to build because they employ multiple components or network services, and a local failure at a particular component of a given service may be very disruptive to the whole system. This is particularly true for middleware that aims to simplify the construction of large-scale distributed applications, ranging from low-level infrastructure, such as MPI (Message Passing Interface) and PVM (Parallel Virtual Machine), to WebSphere and web-services-based architectures.
To carry out an operation in a large distributed system, typically a chain of activity is triggered across several tiers of distributed components (e.g., from the web front-end to a database system to a credit card clearinghouse component, and so on).
Each component exposes interfaces that other components can invoke remotely. These inter-component operations may be idempotent, in that repeated invocations of the same operation leave the component in the same state as a single invocation, or non-idempotent, in that the operation may change the state of the component each time it is invoked.
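The distinction can be illustrated with a minimal sketch. The `InventoryComponent` class and its `set_stock` and `reserve` operations are hypothetical names chosen for illustration only and do not come from any particular system; idempotent is taken here in the usual sense that repeating a call leaves the component in the same state as a single call. The sketch assumes a caller that retries an operation after a lost reply, so each operation is invoked twice.

```python
class InventoryComponent:
    """Hypothetical component exposing two remotely invocable operations."""

    def __init__(self, stock: int) -> None:
        self.stock = stock

    def set_stock(self, value: int) -> int:
        # Idempotent: repeating the call leaves the same state as one call.
        self.stock = value
        return self.stock

    def reserve(self, quantity: int) -> int:
        # Non-idempotent: every invocation changes the state again.
        self.stock -= quantity
        return self.stock


component = InventoryComponent(stock=10)

# A retried idempotent operation is harmless: the state is 8 either way.
component.set_stock(8)
component.set_stock(8)
stock_after_idempotent_retry = component.stock

# A retried non-idempotent operation is applied twice: 8 - 2 - 2.
component.reserve(2)
component.reserve(2)
stock_after_non_idempotent_retry = component.stock
```

In this sketch the duplicated `reserve` call leaves the component with two fewer units than the caller intended, which is the kind of state change that a failure-handling technique has to account for.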
In the current state of the art, one technique for dealing with a failure in one component that results from a non-idempotent inter-component operation requires rollback operations in one or more components. This technique is cumbersome at best and impossible to use in some cases (e.g., some components may not have the ability to roll back at all). Other approaches rely heavily on the existence of reusable replicas, which raise a set of complicated problems in terms of distributed state consistency.