It is now common for a wide variety of software applications to run as services on cloud platforms comprising a distributed network of servers, or in on-premises enterprise datacenters. These services are required to maintain high availability to customers and tenants. Satisfying this requirement is a complex problem due to the distributed nature of cloud-based applications and the non-trivial inter-dependencies among the services' components.
A common approach for testing the availability of services in a datacenter is to manually create fault models for a service and then analyze the impact of various component failures. This approach has several drawbacks. Creating accurate fault models takes time and becomes prohibitively expensive if the functionality, architecture, and/or dependencies change often. When many factors affect the functioning of a complex, distributed system, manually created fault models are likely to miss many combinations of such factors. Human error and a lack of knowledge of all the dependencies for each component are likely to result in important failures having high customer impact being omitted from the fault models. Additionally, independently created fault models for different components that are not updated often enough may not detect new dependencies between separate services and will likely miss many failure scenarios.
Accordingly, the rapid development and deployment of modern software, wherein dependencies are unknowingly added and removed, makes the above-mentioned approach impractical.