Cloud computing is the provision of dynamically scalable and often virtualized resources as a service over the Internet on a utility basis. Users need not have any knowledge of, expertise in, or control over the technology infrastructure in the “cloud” that supports them. Cloud computing services often provide common business applications online that are accessed from a web browser, while the software and data are stored on servers.
Cloud computing customers do not generally own the physical infrastructure serving as host to the software platform in question. They typically consume resources as a service and pay only for resources that they use. Most cloud computing infrastructures include services delivered through data centers and built on servers with different levels of virtualization technologies. The services are accessible from various locations that provide access to networking infrastructure. Clouds often appear as single points of access for all consumers' computing needs.
Cloud computing is quickly becoming the platform of choice for businesses that want to reduce operating expenses and be able to scale resources rapidly. Eased automation, flexibility, mobility, resiliency, and redundancy are several other advantages of moving resources to the cloud. On-premise private clouds permit businesses to take advantage of cloud technologies while remaining on a private network. Public clouds permit businesses to make use of resources provided by third-party vendors. Hybrid clouds combine the public and private cloud computing models and permit businesses to draw on the advantages of both. Many organizations are being introduced to cloud computing by building an on-premise Infrastructure-as-a-Service (IaaS) cloud, which delivers computing, storage, and networking resources to users. Some organizations utilize cloud computing technology in an evolutionary way that leverages and extends their existing infrastructure and maintains portability across different technology stacks and providers.
One or more virtual machines (VMs) may be employed in a cloud. Each VM may function as a self-contained platform, running its own operating system (OS) and software applications (processes). Typically, a virtual machine monitor (VMM) manages allocation and virtualization of computer resources and performs context switching, as may be necessary, to cycle between various VMs. Virtualization systems provide a potential means to access computing resources in a confidential and anonymous way.
High availability, when applied to computer systems in general and cloud computing systems in particular, refers to the application of well-known techniques to improve availability (A), defined by the equation A = MTBF/(MTTR + MTBF), where MTBF refers to mean time between failures and MTTR refers to mean time to recovery. MTBF is the predicted elapsed time between inherent failures of a system during operation. MTTR is the average time that a device takes to recover from a failure. Reducing MTTR may include automating manual activities such as, but not limited to, fault detection, fault isolation, fault recovery, and administrative repair.
For software, increasing MTBF may include, but is not limited to, technical source code reviews, high-quality automated validation, minimizing complexity, and employing software engineers with a mix of experience levels. For hardware, increasing MTBF may include, but is not limited to, using higher-quality components, preemptively replacing hardware components prior to predicted wear-out, and employing a sufficient burn-in period to remove infant mortalities from a product delivery stream.
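As an illustration of the availability equation, the following minimal Python sketch computes A from MTBF and MTTR and converts it to expected downtime per year. The function names, the choice of hours as the unit, and the 8760-hour year are assumptions made for this example only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return A = MTBF / (MTTR + MTBF); both inputs use the same time unit."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


def annual_downtime_hours(a: float, hours_per_year: float = 8760.0) -> float:
    """Expected unavailable hours per year for availability a (assumes a 365-day year)."""
    return (1.0 - a) * hours_per_year


if __name__ == "__main__":
    # Illustrative figures only: halving MTTR raises availability
    # without any change to MTBF.
    for mttr in (1.0, 0.5):
        a = availability(999.0, mttr)
        print(f"MTTR={mttr}h  A={a:.6f}  downtime={annual_downtime_hours(a):.2f}h/yr")
```

With an MTBF of 999 hours and an MTTR of 1 hour, A is 0.999. Halving MTTR raises A without changing MTBF, which is why automating fault detection and recovery is a common route to higher availability.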
Current cloud high availability solutions focus on passive monitoring of a virtual machine. If the infrastructure (e.g., the hypervisor or virtual machine monitor) returns an indicator that the virtual machine has in some way failed, the virtual machine is restarted. In case of an infrastructure-related problem, the virtual machine is restarted continuously.
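The passive monitoring behavior described above can be sketched as a simple polling loop. This is a hypothetical model, not the interface of any particular hypervisor: `vm_has_failed` and `restart_vm` are assumed callbacks standing in for the infrastructure's failure indicator and restart operation, and `polls` bounds the loop so the example terminates.

```python
from typing import Callable


def passive_monitor(
    vm_has_failed: Callable[[], bool],  # hypothetical: infrastructure's failure indicator
    restart_vm: Callable[[], None],     # hypothetical: restart operation
    polls: int,
) -> int:
    """Poll the failure indicator and restart on every reported failure.

    Returns the number of restarts issued. There is no escalation: the only
    recovery action is another restart.
    """
    restarts = 0
    for _ in range(polls):
        if vm_has_failed():
            restart_vm()
            restarts += 1
    return restarts


if __name__ == "__main__":
    # A persistent infrastructure fault reports failure on every poll,
    # so the VM is restarted on every poll.
    print(passive_monitor(lambda: True, lambda: None, polls=5))
```

When the indicator keeps reporting failure because the fault lies in the infrastructure rather than in the virtual machine, the loop issues a restart on every poll, which corresponds to the continuous restarting noted above.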
Existing bare-metal high availability products execute recovery escalation of a cluster node by turning off and on power to that node (power fencing). If the node fails repeatedly, no further escalation takes place (e.g., permanently terminating the power to the node until an operator intervenes). Some attempts have been made to escalate failures from a lower-level component to a higher-level component of a bare-metal cluster of computer nodes.
In a conventional bare-metal systems model, a “service unit” includes multiple software applications. If an application fails to meet a user-defined policy, the service unit is failed. If a service unit fails repeatedly, a higher-level component called a service group may be restarted. If the service group fails repeatedly, no further escalation is taken.
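The two-level escalation of this conventional model can be sketched as follows. The class, the method name, and the failure threshold of three are hypothetical and chosen only for illustration. The source does not specify how "repeatedly" is measured.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServiceGroup:
    """Hypothetical model of a service group containing service units."""
    name: str
    unit_failure_limit: int = 3  # assumed threshold for "fails repeatedly"
    unit_failures: Dict[str, int] = field(default_factory=dict)
    group_restarts: int = 0

    def report_application_failure(self, unit: str) -> str:
        """An application in `unit` violated its policy, so the unit is failed.

        Returns the recovery action taken. Once a unit reaches the limit,
        the whole group is restarted and the counters are reset.
        """
        count = self.unit_failures.get(unit, 0) + 1
        if count >= self.unit_failure_limit:
            self.unit_failures.clear()
            self.group_restarts += 1
            # Top of the escalation ladder: nothing above the group exists,
            # so repeated group failures only produce more group restarts.
            return "restart_service_group"
        self.unit_failures[unit] = count
        return "fail_service_unit"


if __name__ == "__main__":
    group = ServiceGroup("example-group")
    for _ in range(6):
        print(group.report_application_failure("unit-a"))
```

In this sketch a persistently failing unit causes the group to be restarted again and again, with `group_restarts` growing without bound, since the model defines no level above the service group.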