Services (e.g., web services, database services, etc.) are often implemented through large-scale systems with many servers. The large number of hardware and software components on these servers presents many potential points of failure. For example, a disk or network card could stop working, or a program could crash. A provider of such a service may want to evaluate how various kinds of failures affect the availability of the service. For example, one type of failure may cause a 0.05% drop in availability of the service, while another type may cause a 50% drop. If these failures occur at the same time, then it makes sense to direct resources to fixing the second failure before fixing the first. Simulating failures in a test environment provides information from which this type of decision can be made when a failure occurs organically.
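The prioritization described above can be sketched in a few lines: given measured availability impacts for each simulated failure, remediation effort is directed at the most severe failure first. This is only an illustration; the failure names and impact figures below are hypothetical, not drawn from any real system.

```python
# Hypothetical failure records: each entry pairs a failure type with the
# availability drop measured for it during failure simulation.
failures = [
    {"name": "disk_failure", "availability_drop_pct": 0.05},
    {"name": "db_primary_crash", "availability_drop_pct": 50.0},
]

def prioritize(failures):
    """Return failures ordered by severity (largest availability drop first),
    so resources can be directed at the worst failure before the others."""
    return sorted(failures, key=lambda f: f["availability_drop_pct"], reverse=True)

for f in prioritize(failures):
    print(f"{f['name']}: {f['availability_drop_pct']}% drop")
# db_primary_crash (50% drop) is listed before disk_failure (0.05% drop)
```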
An administrator, or other person, can cause a failure to occur while load runs against the system, thereby providing some information about how the system would react in the event that a component fails. In general, the execution of these test failures, and recovery from them, is a manual process. For example, an administrator could cause a specific failure and observe system performance during the failure. However, these manual techniques generally take an ad hoc approach to failure simulation, and make it difficult to orchestrate a complex set of failures, or to measure accurately how these failures affect the availability of the service that the system implements.
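As a minimal sketch of the manual process described above, the following toy experiment drives synthetic load, injects a failure partway through, restores the component later, and reports availability as the fraction of successful requests. All names and numbers here are assumptions for illustration; a real test environment would kill actual processes or disable hardware rather than toggle a flag.

```python
def inject_failure(component):
    """Placeholder for a real failure (e.g., stopping a process or
    disabling a network card); here it only marks the component failed."""
    component["failed"] = True

def restore(component):
    """Placeholder for manual recovery of the failed component."""
    component["failed"] = False

def serve_request(components):
    """A request succeeds only if no required component has failed."""
    return all(not c["failed"] for c in components)

def run_experiment(components, target, total_requests=1000,
                   fail_at=300, recover_at=700):
    """Run synthetic load, injecting a failure at request `fail_at` and
    recovering at `recover_at`; return the measured availability."""
    ok = 0
    for i in range(total_requests):
        if i == fail_at:
            inject_failure(target)
        if i == recover_at:
            restore(target)
        if serve_request(components):
            ok += 1
    return ok / total_requests

components = [{"name": "web", "failed": False},
              {"name": "db", "failed": False}]
availability = run_experiment(components, target=components[1])
print(f"measured availability: {availability:.1%}")  # prints "measured availability: 60.0%"
```

Even this toy version shows why the manual approach scales poorly: orchestrating several overlapping failures, and attributing the observed availability drop to each one, quickly becomes unmanageable by hand.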