The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for recovering from a fault in a system environment with different interdependent software and hardware layers.
Due to the stateful nature of current software, such software is vulnerable to failures in the associated underlying technology stack, such as hardware, operating system, middleware, or the like. That is, if the underlying hardware fails, the corresponding state information may be lost. This is normally disruptive to users working with an application that utilizes the hardware. Such issues become even more challenging in multi-tenant cloud computing environments, where many users run on a shared hardware and software infrastructure. In the event of a failure in one of the components within the overall technology stack, such as hardware, hypervisor, operating system, middleware, application, or the like, corrective action must be taken to minimize the number of users affected and to reduce any impact.
Current approaches address such failures by establishing high availability clusters. However, such approaches are typically complex and costly because state is distributed across many nodes. Therefore, issues remain with efficiently handling incidents and failures occurring in a shared infrastructure. Specifically, the shared nature of cloud computing environments, where many users share a technology stack consisting of servers, storage, network, operating systems, middleware, applications, or the like, raises new challenges.