Cloud computing is the provision of dynamically scalable and often virtualized resources as a service over the Internet on a utility basis. Users need not have any knowledge of, expertise in, or control over the technology infrastructure in the “cloud” that supports them. Cloud computing services often provide common business applications online that are accessed from a web browser, while the software and data are stored on servers.
Cloud computing customers generally do not own the physical infrastructure that hosts the software platform in question. They typically consume resources as a service and pay only for the resources that they use. Most cloud computing infrastructures include services delivered through data centers and built on servers with different levels of virtualization technologies. The services are accessible from various locations that provide access to networking infrastructure. Clouds often appear as single points of access for all consumers' computing needs.
Cloud computing is quickly becoming the platform of choice for businesses that want to reduce operating expenses and scale resources rapidly. Easier automation, flexibility, mobility, resiliency, and redundancy are several other advantages of moving resources to the cloud. On-premises private clouds permit businesses to take advantage of cloud technologies while remaining on a private network. Public clouds permit businesses to make use of resources provided by third-party vendors. Hybrid clouds combine the benefits of both public and private cloud computing models. Many organizations are being introduced to cloud computing by building an on-premises Infrastructure-as-a-Service (IaaS) cloud, which delivers computing, storage, and networking resources to users. Some organizations adopt cloud computing technology in an evolutionary way that leverages and extends their existing infrastructure and maintains portability across different technology stacks and providers.
One or more physical host machines or virtual machines (VMs), hereinafter referred to as “nodes”, may be employed in a cloud. Each VM may function as a self-contained platform, running its own operating system (OS) and software applications (processes). Typically, a virtual machine monitor (VMM) manages allocation and virtualization of computer resources and performs context switching, as may be necessary, to cycle between various VMs. Virtualization systems provide a potential means to access computing resources in a confidential and anonymous way.
High availability, when applied to computer systems in general and cloud computing systems in particular, refers to the application of well-known techniques to improve availability (A) as defined by the equation A = MTBF / (MTTR + MTBF), where MTTR refers to mean time to recovery and MTBF refers to mean time between failures. MTBF is the predicted elapsed time between inherent failures of a system during operation. MTTR is the average time that a device may take to recover from any failure. Reducing MTTR may include automating manual operations such as, but not limited to, fault detection, fault isolation, fault recovery, and administrative repair.
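As a brief illustration of the availability equation above, the following Python sketch computes A from MTBF and MTTR. The function name `availability` and the sample figures (an MTBF of 1000 hours, an MTTR of 1 hour versus 0.1 hour) are assumptions chosen only for illustration and do not come from any particular system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return A = MTBF / (MTTR + MTBF) for times given in the same unit."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    if mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / (mttr_hours + mtbf_hours)


# Hypothetical figures: the same MTBF with a manual and an automated recovery.
manual_recovery = availability(mtbf_hours=1000.0, mttr_hours=1.0)
automated_recovery = availability(mtbf_hours=1000.0, mttr_hours=0.1)

print(f"MTTR 1.0 h: A = {manual_recovery:.6f}")
print(f"MTTR 0.1 h: A = {automated_recovery:.6f}")
```

With MTBF held fixed, any reduction in MTTR raises A, which is why automating detection, isolation, and recovery improves availability.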
For software, increasing MTBF may include, but is not limited to, technical source code reviews, high-quality automated validation, minimizing complexity, and employing software engineers with a mix of experience levels. For hardware, increasing MTBF may include, but is not limited to, using higher-quality components, preemptively replacing hardware components prior to predicted wear-out, and employing a sufficient burn-in period to remove infant mortalities from a product delivery stream.
In current cloud computing systems, a management component of the cloud computing system typically polls for data concerning the health of managed components from one centralized location. These managed components may be nodes, which may include one or more virtual machines in a network infrastructure. The centralized management component may periodically poll a node over the network for state information, such as how much memory is consumed, how much disk space is consumed, the system load, or other details. The management component then applies a policy to the data returned by the node to detect whether the node is faulty (e.g., the memory consumed is greater than 98%).
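The centralized polling arrangement described above can be sketched roughly as follows. This is a minimal, non-authoritative illustration: the `NodeState` fields, the `is_faulty` policy, the `poll_once` function, and the node names are all hypothetical, and the network round trip to each node is stood in for by a plain Python callable. Only the 98% memory threshold is taken from the example in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NodeState:
    """State a node reports when polled (hypothetical field names)."""
    memory_used_pct: float
    disk_used_pct: float
    system_load: float


# Threshold taken from the example policy in the text.
MEMORY_FAULT_THRESHOLD_PCT = 98.0


def is_faulty(state: NodeState) -> bool:
    """Example policy: a node is faulty if memory use exceeds the threshold."""
    return state.memory_used_pct > MEMORY_FAULT_THRESHOLD_PCT


def poll_once(nodes: Dict[str, Callable[[], NodeState]]) -> List[str]:
    """Poll every node once from the central manager; return faulty node names."""
    faulty = []
    for name, fetch_state in nodes.items():
        state = fetch_state()  # stands in for a request/response over the network
        if is_faulty(state):
            faulty.append(name)
    return faulty


# Simulated nodes; a real system would query each node over the network.
nodes = {
    "node-a": lambda: NodeState(45.0, 60.0, 0.7),
    "node-b": lambda: NodeState(99.1, 30.0, 3.2),
}

print(poll_once(nodes))
```

In this sketch all state travels to the manager and the policy is evaluated centrally, so a fault on a node is noticed no sooner than the next poll of that node.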
Periodically polling nodes for state information and having each node transmit its state back to the management component may consume significant network resources and increase the time required to detect a failure. The slower detection time results in a higher MTTR and, in turn, lower availability.
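The trade-off between polling traffic and detection time can be illustrated with a rough model. The sketch below rests on several simplifying assumptions that are not stated in the text: a failure occurs at a uniformly random moment within a polling interval (so the expected wait until the next poll is half the interval), each poll costs one request and one response per node, and MTTR is modeled as detection delay plus a fixed repair time. All function names and figures are hypothetical.

```python
def expected_detection_delay_s(poll_interval_s: float) -> float:
    """Assumed model: a failure waits, on average, half an interval to be polled."""
    return poll_interval_s / 2.0


def poll_messages_per_hour(node_count: int, poll_interval_s: float) -> float:
    """Assumed cost: one request plus one response per node per poll."""
    return 2 * node_count * (3600.0 / poll_interval_s)


def availability_with_polling(mtbf_s: float, repair_s: float,
                              poll_interval_s: float) -> float:
    """A = MTBF / (MTTR + MTBF), with MTTR = detection delay + repair time."""
    mttr_s = expected_detection_delay_s(poll_interval_s) + repair_s
    return mtbf_s / (mttr_s + mtbf_s)


# Hypothetical cluster: 1000 nodes, MTBF of 1,000,000 s, 60 s repair time.
for interval_s in (10.0, 300.0):
    a = availability_with_polling(1_000_000.0, 60.0, interval_s)
    msgs = poll_messages_per_hour(1000, interval_s)
    print(f"interval {interval_s:>5.0f} s: A = {a:.6f}, messages/hour = {msgs:.0f}")
```

Under these assumptions, shortening the polling interval lowers the detection delay but raises message volume in proportion, while lengthening it does the reverse.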