The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
A distributed software system is a system that comprises many components running tasks independently while appearing to the end user as one system. An example of a distributed system is a collection of cloud services responsible for storing user information, providing users with endpoints to communicate, and keeping the user information updated in order to authenticate and authorize a user to perform tasks in other systems. Since a distributed software system depends on many components, a failure in one of the components can lead to a failure of the whole system. To make such distributed systems more resilient, the following approaches are commonly used.
For example, in one approach, a fault model for a given system is developed. In this approach, system owners manually generate a list of scenarios that can result in the failure of the system. Once the list is complete, the system owners inject faults according to the fault model and verify whether the system is resilient to the various failure scenarios. Another approach involves chaotically injecting faults into a system component. In this approach, system owners inject faults that impact a random percentage of machines and monitor the system behavior to verify whether the system is resilient.
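By way of illustration only, the two approaches described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from the disclosure: the names `Scenario`, `is_resilient`, `run_fault_model`, and `chaotic_injection`, the replica map, and the rule that a system is healthy while every component retains at least one live replica are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One manually listed entry of a fault model (hypothetical shape)."""
    name: str
    failed_machines: frozenset


def is_resilient(replicas, failed):
    """Toy health check (assumption): every component keeps a live replica."""
    return all(set(hosts) - set(failed) for hosts in replicas.values())


def run_fault_model(replicas, scenarios):
    """Model-based approach: inject each listed scenario, record the verdict."""
    return {s.name: is_resilient(replicas, s.failed_machines) for s in scenarios}


def chaotic_injection(replicas, percent, rng):
    """Chaotic approach: fail a random percentage of machines, check health."""
    machines = sorted({m for hosts in replicas.values() for m in hosts})
    count = int(len(machines) * percent / 100)
    failed = set(rng.sample(machines, count))
    return failed, is_resilient(replicas, failed)


# Hypothetical topology: two components, two machines each.
replicas = {"auth": ["m1", "m2"], "storage": ["m3", "m4"]}
scenarios = [
    Scenario("one auth replica down", frozenset({"m1"})),
    Scenario("all storage replicas down", frozenset({"m3", "m4"})),
]
report = run_fault_model(replicas, scenarios)
failed, healthy = chaotic_injection(replicas, 50, random.Random(0))
```

In this sketch the model-based run covers only the scenarios that were listed by hand, whereas the chaotic run chooses its own set of machines but varies nothing other than which machines, and how many, are affected.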
A disadvantage of both of these approaches is that the system owners have to spend a significant amount of time verifying whether the system is resilient to the various failure scenarios. Further, in the case of the model-based approach, designing the fault model is very tedious. This can make the model-based approach less effective, as the system itself might have changed during the process of generating the fault model. For example, the changes may include, but are not limited to, code and topology changes. In the case of the chaotic injection approach, only the number of machines running a component is varied. This may not be enough to determine all the failure scenarios, since some failure scenarios may depend on the intensity of a fault as well.
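The dependence of some failure scenarios on fault intensity may be illustrated with the following minimal sketch. All values are hypothetical and chosen only for illustration: a base latency of 100 ms, a client timeout of 500 ms, a tolerated error rate of 0.2, and the assumption that requests are spread evenly across machines so that the error rate equals the fraction of machines whose requests time out.

```python
def error_rate(fraction_affected, added_latency_ms,
               base_latency_ms=100, timeout_ms=500):
    """Share of requests that time out under an injected latency fault.

    Assumption: load is spread evenly, so the error rate equals the fraction
    of affected machines whenever the injected latency exceeds the timeout.
    """
    times_out = base_latency_ms + added_latency_ms > timeout_ms
    return fraction_affected if times_out else 0.0


def failing_cells(fractions, latencies_ms, max_error_rate=0.2):
    """Return every (fraction, latency) pair that breaches the tolerance."""
    return [
        (f, latency)
        for f in fractions
        for latency in latencies_ms
        if error_rate(f, latency) > max_error_rate
    ]


# Varying only the fraction of machines at a mild 200 ms fault finds nothing.
fraction_only = failing_cells([0.1, 0.5], [200])

# Varying the intensity as well exposes a failing combination.
fraction_and_intensity = failing_cells([0.1, 0.5], [200, 800])
```

Under these assumptions, a sweep over the fraction of machines alone reports no failure, while a sweep over both the fraction and the injected latency reports a failure at 50 percent of machines with 800 ms of added latency.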