Increasingly, complex large-scale computing environments are composed of multiple systems. These systems are independently developed and interact with each other as well as with users. Each system in the environment manages a great number of resources, which may include hardware devices, applications, and a range of other components and capabilities needed for system and resource management. Techniques such as load balancing may be employed to ensure that no particular resource or resource group is over-used or too heavily relied upon; however, this may lead to difficulties in highly available systems.
A challenge in highly available systems is not only to mitigate the impact of physical failures and maintenance, but also to defend against software, application, and resource management failures.
A highly available system may, in the case of a complete or partial failure, system corruption, or crash, immediately “fail over” or otherwise move or re-allocate processes and workloads from failed or compromised computing resources to healthy ones. Where a crash, failure, or other system compromise arises from a software bug or an otherwise faulty or corrupt system state, such a “fail over” may trigger a cascading failure of the entire system.
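The cascading behavior described above can be illustrated with a minimal sketch. It assumes a naive failover policy that immediately retries a workload on the next healthy node, and a fault that travels with the workload itself. The names `Node`, `run_with_failover`, and `is_poisonous` are hypothetical and introduced only for illustration; they do not describe any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical computing resource that can host workloads."""
    name: str
    healthy: bool = True
    workloads: list = field(default_factory=list)


def run_with_failover(nodes, workload, is_poisonous):
    """Place `workload` on the first healthy node, failing over on each crash.

    `is_poisonous` is an assumed predicate that marks a workload whose
    software fault crashes whichever node runs it.
    Returns the names of the nodes that crashed, in order.
    """
    crashed = []
    for node in nodes:
        if not node.healthy:
            continue  # skip nodes that have already failed
        node.workloads.append(workload)
        if is_poisonous(workload):
            # The fault moves with the workload, so this node crashes too.
            node.healthy = False
            node.workloads.remove(workload)
            crashed.append(node.name)
            continue  # naive failover: try the next healthy node at once
        return crashed  # the workload is running on a healthy node
    return crashed  # no healthy node could run the workload
```

With a benign workload and one node already down from a physical failure, the workload simply lands on the next healthy node and nothing else crashes. With a workload whose fault travels with it, each failover takes down another node until none remain, which is the cascade described above.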