Most enterprises host their applications in a co-located or cloud datacenter. Typically, these are complex distributed applications that, in addition to comprising multiple components (e.g., modules or micro-services), may require complex interactions between those components. Furthermore, these applications may rely on specific infrastructure and middleware components provided by the cloud provider itself. It is vital to business operations that these cloud-hosted distributed applications are constantly available, because the cost of downtime can be significant. It is not hyperbole to state that a single hour of downtime can cost a retail business tens of thousands of dollars.
The cost of downtime is not limited to lost revenue; in fact, the true cost can be much higher. The true cost can include, for example, lost or dissatisfied customers, damage to a company's reputation, lost employee productivity, and even devaluation of the business (e.g., falling stock prices). A large number of non-malicious failures occur during routine maintenance (e.g., uninterruptible power supply (UPS) replacement, failure of a machine hard disk, adding new machines to the cluster, or decommissioning old machines from it).