The promise of “cheaper” and “faster” IT services encourages enterprises to adopt cloud computing. System reliability remains important, but it is now achieved without dedicated hardware. Instead, availability goals rely on commodity cloud components that may be operated by third parties and may individually be less reliable. As such, cloud-based systems must be aware of potential issues and seamlessly distribute services and data to maintain availability guarantees. In a cloud-based system, the cycle of failure detection and recovery becomes the norm, and the components of the architecture must adapt and remain resilient throughout it. During periods of limited capability, not all requests can be accommodated at the usual level of service.
Traditional approaches to maintaining system reliability are highly engineered, brittle solutions (e.g., provisioning a known, fixed set of resources ahead of time) that use expensive specialized hardware (e.g., hardware load-balancing equipment and high-speed interconnects). When a traditional architecture experiences request demand that overwhelms the capacity of the existing services, the experience of all requests degrades equally because there is no mechanism to consider the importance (priority) or differentiated needs of service requests, sessions, or users. Traditional approaches redirect sessions equally when a site is down and may thereby overwhelm the remaining services, resulting in poor quality of service (QoS) for all. Traditional approaches use static routing, so that when a computing node is lost, all subsequent requests are routed through a particular site and their performance and QoS suffer equally. When a traditional architecture experiences high demand in a service tier (e.g., the static web page server, application-logic server, or database server tier), all service requests directed to that tier suffer similar service degradation. Traditional architectures often use a two-site configuration (e.g., hot-hot or master-slave) that maintains consistency (e.g., between a primary system and a mirror system) via high-speed data connections. Due to cost, network bandwidth, and network latency issues, such redundant configurations are often limited to a metro cluster (e.g., sites within 100 km of each other or within a 5 millisecond communication delay). Unfortunately, traditional two-site configurations do not protect against geographical events (e.g., an earthquake causing a widespread service outage across a disaster zone with a 100 km radius).
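The equal-degradation effect described above can be illustrated with a small calculation. The following Python sketch is only an illustration under simplifying assumptions: the function name, the site names, and the capacity and load figures are hypothetical, and load is assumed to be spread evenly with no notion of request priority.

```python
def per_request_service_fraction(site_capacity, site_load, failed_sites):
    """Fraction of each request's demand that can be served when the load
    of failed sites is redirected equally to the surviving sites.

    Hypothetical model: every request is treated identically, so any
    shortfall in capacity is shared by all requests in the same proportion.
    """
    surviving = [s for s in site_capacity if s not in failed_sites]
    total_load = sum(site_load.values())
    if total_load == 0:
        return 1.0  # nothing to serve, so nothing degrades
    if not surviving:
        return 0.0  # total outage
    total_capacity = sum(site_capacity[s] for s in surviving)
    return min(1.0, total_capacity / total_load)


# Assumed figures: two sites, each sized for 100 requests/s, each at 70.
capacity = {"site_a": 100, "site_b": 100}
load = {"site_a": 70, "site_b": 70}

normal = per_request_service_fraction(capacity, load, failed_sites=set())
degraded = per_request_service_fraction(capacity, load, failed_sites={"site_a"})
print(normal, round(degraded, 2))  # 1.0 0.71
```

In this assumed scenario, losing one site leaves every request with roughly 71% of the service it needs, regardless of how important the request is, because nothing in the model distinguishes one request from another.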
In today's cloud-based architectures, all users and service requests experience the same slow service and outages when demand is high or capability degrades (e.g., performance degradation). The cloud model does not provide a way to implement specialized designs and/or hardware (e.g., a load balancer) directly within each service so that users and transaction types can be prioritized and service can degrade gracefully.