Regular upgrades to computer systems are critical to survival and success of an enterprise. Upgrades can include upgrades in software and infrastructure. While upgrades are critical, incorporating updates imposes risk. In some cases, an upgrade can be incompatible with un-upgraded features of a computer system. Therefore, users of computer systems may avoid implementing upgrades in order to assure system stability.
A distributed software solution serving some business functions will, sooner or later, fail because of a variety of reasons. When failure happens, a series of activities is usually followed to prevent the failure from happening again. The drawback with this approach is that a failure must happen first, which could result in some undesirable effects such as a financial loss or customer complaints. Efforts followed to identify and test every possible failure scenario in a test environment usually fall short. This is because a test environment is often different from the corresponding production environment.
Chaos and latency monkeys have been used to address failures production environment. The idea behind the chaos and latency monkeys is to actually inject a failure in the production environment by shutting or slowing down a service during business hours, not holidays or weekends. One problem with such an approach is that is that the production requests and possibly service level agreements can actually be impacted in a negative way.