The present application relates generally to an improved data processing apparatus and method, and more specifically, to mechanisms for recovering a set of distributed clustered systems to a normal state.
Distributed clustered systems are deployed on more than one host machine or node, each machine or node having one or more processors, memory, and (optionally) persistent storage, such as a hard disk, solid-state drive, or the like. The machines may be physical machines, virtual machines, Linux® containers, or the like. The machines or nodes are connected over a network, such as a physical network, virtual network, software-defined network, or the like. Hadoop is one example of such a distributed clustered system, as the Hadoop processes are distributed over multiple machines or nodes, i.e., a cluster of machines or nodes.