Distributed systems are becoming more complex, both in their components and in the interactions between those components. For example, a distributed system that provides a web interface for its users may include a front-end system that receives requests and sends responses, a back-end system that handles the requests and generates the responses, and a database system that stores and retrieves user data and data of the distributed system itself. Each of these systems may have many components. For example, a front-end system may include a load-balancing component, a Representational State Transfer (“RESTful”) interface, a Simple Object Access Protocol interface, an Electronic Data Interchange interface, performance monitors, security components, and so on.
In addition, the number of users of a distributed system can be very large—in some cases over one billion users. Because of the large number of users, such a distributed system may need to be deployed on thousands of computers located at data centers throughout the world. In addition to the systems and components described above, a distributed system may also include systems to automatically allocate additional computational resources as needed, deploy updates to the components, implement failover systems in case of failure, and so on.
The developers of these distributed systems go to great lengths to ensure that the distributed systems are resilient to failures. A failure of even a single component can cause a cascade of failures in other components of the distributed system. For example, a failure of a load balancer of a front-end system can cause all traffic to be routed through a small number of computers, which may cause the back-end system that handles the requests from those computers to become overloaded and fail, and so on. A distributed system is considered to be resilient to a failure when the distributed system can take steps to counteract the failure with little or no perceptible impact on system performance. For example, if the front-end system detected the failure of the load balancer and automatically routed network traffic through a backup load balancer, the distributed system would be considered resilient to the failure of the primary load balancer. Because of the complexities of these distributed systems, it is virtually impossible to ensure that they will be resilient to all types of possible failures.
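For illustration only, the load balancer failover described above might be sketched as follows. The LoadBalancer and FrontEnd classes, their methods, and the use of an exception to signal failure are all hypothetical and are not taken from any particular system; a real front-end system would typically rely on health checks and timeouts rather than a simple flag.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for a load balancer; not a real API."""
    name: str
    healthy: bool = True

    def route(self, request: str) -> str:
        # A failed balancer is modeled (by assumption) as raising an error.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FrontEnd:
    """Routes through the active balancer and fails over to a backup."""

    def __init__(self, primary: LoadBalancer, backup: LoadBalancer):
        self.active = primary
        self.backup = backup

    def handle(self, request: str) -> str:
        try:
            return self.active.route(request)
        except ConnectionError:
            # Failure detected: switch to the backup and retry once.
            # If the backup is also down, the error propagates to the caller.
            self.active = self.backup
            return self.active.route(request)
```

In this sketch, a caller of handle() sees a normal response even after the primary load balancer fails, which corresponds to the "little or no perceptible impact" criterion for resilience.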
To help ensure that a distributed system is resilient, various approaches to testing resiliency have been used. These approaches generally test a distributed system while it is in production, that is, while it is processing real data for users. In one approach, the provider of the distributed system manually generates failure scenarios in which the distributed system may fail. The provider then tests these failure scenarios to verify that the distributed system is resilient. A disadvantage of this approach is that generating the failure scenarios can be very time-consuming. As a result, the testing may be less than comprehensive. Furthermore, the failure scenarios may need to be modified whenever the configuration of the distributed system changes. In another approach, a provider may test a failure scenario (e.g., loss of power of a machine) on a randomly selected percentage of machines to verify that the distributed system is resilient. A disadvantage of this approach is that simply varying the percentage of machines may not detect failures that depend, for example, on different intensities of the failure scenario on different machines.
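The limitation of the percentage-based approach can be made concrete with a small sketch. The function names, the 0-100 percentage scale, and the representation of intensity as a number between 0.1 and 1.0 are assumptions made for illustration and do not describe any particular testing tool. The first plan varies only which machines are affected; the second shows the kind of per-machine variation that a percentage-only plan cannot express.

```python
import random


def pick_machines(machines, percentage, rng):
    """Choose a random `percentage` (0-100) of machines to receive the fault."""
    count = round(len(machines) * percentage / 100)
    return rng.sample(machines, count)


def uniform_plan(machines, percentage, rng):
    """Percentage-only approach: every chosen machine gets the full fault."""
    return {m: 1.0 for m in pick_machines(machines, percentage, rng)}


def varied_plan(machines, percentage, rng):
    """Assumed variant: each chosen machine gets its own intensity
    between 0.1 and 1.0 (the scale is hypothetical)."""
    chosen = pick_machines(machines, percentage, rng)
    return {m: rng.uniform(0.1, 1.0) for m in chosen}
```

With ten machines and a percentage of 30, uniform_plan always yields three machines at intensity 1.0, however many times it is run; only the identity of the machines changes. A failure that appears only when, say, one machine is degraded slightly while another is degraded severely would not be exercised by such a plan.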