In the world of virtual computing, virtual machines (VMs or guests) can be instantiated at a software level on physical computers (host computers or hosts). In various virtualization scenarios, a software component often called a hypervisor can act as an interface between the guests and the host operating system for some or all of the functions of the guests. In other virtualization implementations, there is no underlying host operating system running on the physical, host computer. In those situations, the hypervisor acts as an interface between the guests and the hardware of the host computer. Even where a host operating system is present, the hypervisor sometimes interfaces directly with the hardware for certain services. In some virtualization scenarios, the host itself is in the form of a guest (i.e., a virtual host) running on another host. The services performed by a hypervisor are, under certain virtualization scenarios, performed by a component with a different name, such as “supervisor virtual machine,” “virtual machine manager (VMM),” “service partition,” or “domain 0 (dom0).” (The name used to denote the component(s) performing this functionality can vary between implementations, products and/or vendors.) In any case, just as server level software applications such as databases, enterprise management solutions and e-commerce websites can be run on physical computers, so too can server applications be run on virtual machines.
High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support running server applications with a minimum of down-time. A high-availability cluster uses groups of redundant computing resources in order to provide continued service when individual system components fail. More specifically, high-availability clusters eliminate single points of failure by providing multiple servers, multiple network connections, redundant data storage, etc. Absent clustering, if a server running a particular application fails, the application would be unavailable until the server is restored. In high-availability clustering, the failure of a server (or of a specific computing resource used thereby such as a network adapter, storage device, etc.) is detected, and the application that was being run on the failed server is automatically restarted on another computing system (i.e., another node of the cluster). This process is called “failover.” As part of this process, clustering software can configure the node to which the application is being moved, for example mounting a filesystem used by the application, configuring network hardware, starting supporting applications, etc. High-availability clusters typically use a heartbeat private network connection to monitor the status of each node in the cluster. High-availability clusters are often used for critical server applications such as enterprise databases, important business applications, electronic commerce websites, etc.
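The heartbeat-based detection and failover described above can be sketched as follows. This is a minimal illustration only, not the behavior of any particular clustering product. The `Node` type, the `fail_over` function, the five-second timeout and the least-loaded placement rule are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node, tracked by the time of its last heartbeat."""
    name: str
    last_heartbeat: float
    apps: list = field(default_factory=list)


def fail_over(nodes, now, timeout=5.0):
    """Move applications off nodes whose heartbeat is older than `timeout`.

    Each application is restarted on the healthy node currently running the
    fewest applications (an assumed placement policy). Returns a list of
    (application, source node, destination node) tuples.
    """
    dead = [n for n in nodes if now - n.last_heartbeat > timeout]
    dead_names = {n.name for n in dead}
    healthy = [n for n in nodes if n.name not in dead_names]

    moves = []
    for src in dead:
        # If no healthy node remains, the applications stay where they are.
        while src.apps and healthy:
            app = src.apps.pop(0)
            dst = min(healthy, key=lambda n: len(n.apps))
            dst.apps.append(app)
            moves.append((app, src.name, dst.name))
    return moves
```

In an actual cluster, the clustering software would also prepare the destination node (mounting filesystems, configuring network hardware, starting supporting applications) before restarting the application. Those steps are omitted here.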
In cloud-based computing environments, computing resources such as processing power, storage and software applications are provided as services to users over a network (e.g., the internet). In cloud computing, the use of virtual machines is common to isolate specific computing resources within the cloud for specific users (e.g., different organizations or enterprises that are receiving computing services from the cloud). For example, running a plurality of virtual machines on one or more underlying physical computers lends itself well to partitioning computing resources to different organizational users over the cloud, while keeping the resources of the different users separate, private and secure.
In a private cloud, a set of computing resources is operated for a single organizational user, and made available to that organization over a network. Virtual machines are commonly used in private cloud environments too. For example, because virtual machines can be suspended and restarted on different hosts, the use of virtual machines in a private cloud provides mobility.
In order to provide an application with high availability in a cloud environment (private or otherwise), the application can be run on a virtual machine which is in turn running on a high-availability cluster. The virtual machine provides the desired mobility and isolation of the application, whereas the underlying high-availability cluster provides the highly available computing infrastructure. It is important to understand that in this scenario there are separate levels of availability: the availability of the application is dependent on the virtual machine being available, and the availability of the virtual machine is dependent upon the underlying physical computing infrastructure being available, i.e., the infrastructure of the high-availability cluster on which the virtual machine is running.
The administrator of the application running on the virtual machine is responsible for ensuring that the application is available to the organizational user according to a service level agreement (SLA). An SLA specifies the level of availability which the organization is to be provided, which is typically tied to an amount paid by the organization. For example, an SLA could specify a specific number of nodes within the cluster that are available to the virtual machine on which the application runs for failover in the event that the host crashes. An SLA can also specify that in the event of a cluster failure, one or more other clusters are available to the application for disaster recovery. At a high level, an SLA can be thought of as specifying the nodes within a given cluster as well as the other clusters (if any) available to an application in the case of infrastructure failure.
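The high-level view of an SLA given above, as a number of failover nodes within a cluster plus zero or more disaster recovery clusters, can be represented as a small data structure. The sketch below is illustrative only. The `SLA` type, its field names and the `failover_targets` helper are hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLA:
    """Hypothetical SLA record for one application."""
    failover_nodes: int       # nodes in the current cluster available for failover
    dr_clusters: tuple = ()   # other clusters available for disaster recovery


def failover_targets(sla, cluster_nodes, current_host):
    """Return the nodes the virtual machine may fail over to under the SLA.

    The current host is excluded, and the result is capped at the number of
    failover nodes that the SLA grants.
    """
    candidates = [n for n in cluster_nodes if n != current_host]
    return candidates[:sla.failover_nodes]
```

For example, an SLA granting two failover nodes and one disaster recovery cluster could be written as `SLA(failover_nodes=2, dr_clusters=("dr-east",))`, where the cluster name is a placeholder.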
Software tools exist to monitor the health and status of an application, including an application running on a virtual machine. Such tools run on the computer (virtual or physical) on which the application runs, and automatically detect any application level failure. These tools can also automatically restart the application on the computer (e.g., the virtual machine). Thus, so long as the virtual machine on which an application is running is available and capable of running the application, any application level crashes can be managed without moving the virtual machine to a different node or cluster. For example, an application high availability tool running on the virtual machine can detect the failure of the application, and automatically restart the application on the virtual machine.
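The application-level monitoring described above can be sketched as a small supervisor that restarts the application in place when it is found to have exited. This is a simplified illustration under stated assumptions: the `AppMonitor` class and its methods are hypothetical, the application is modeled as a single child process, and a real tool would typically also check application health rather than only process liveness.

```python
import subprocess


class AppMonitor:
    """Hypothetical monitor running on the same machine as the application."""

    def __init__(self, command):
        self.command = command   # argument list used to launch the application
        self.process = None
        self.restarts = 0

    def start(self):
        """Launch the application as a child process."""
        self.process = subprocess.Popen(self.command)

    def check(self):
        """Restart the application if it is not running.

        Returns True if a restart was performed, False if the application
        was still running.
        """
        if self.process is not None and self.process.poll() is None:
            return False
        self.start()
        self.restarts += 1
        return True
```

A monitor of this kind would call `check()` periodically. Because it only restarts the application on the machine where it already runs, it does not help when the virtual machine or its host becomes unavailable.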
Software tools also exist to monitor the health and status of virtual machines. These tools run on the host on which the virtual machine runs, and automatically detect any failure of the virtual machine. These tools can also automatically reboot the virtual machine on the host. Thus, so long as the underlying host on which the virtual machine is running is itself available and capable of running the virtual machine, any virtual machine level crashes can be managed without moving the virtual machine to a different node or cluster.
The administrator of an application being served from a cluster to an organizational user over a network (e.g., as a cloud service) only has control over the application and the virtual machine on which the application is running. Thus, the application administrator can configure the application and virtual machine to address application level failure and virtual machine level failure, for example through the use of tools such as those described above, or by manually configuring or restarting the application and/or the virtual machine. Since the application administrator has access to the application and virtual machine, the application administrator can configure these components to manage failures that occur at these levels.
However, the application administrator does not have control over or access to the infrastructure of the high-availability cluster. Thus, the application administrator cannot configure the infrastructure to address failures at a node or cluster level. Providing application availability according to an SLA can require configuration of the infrastructure for failover and disaster recovery, in the event that a failure occurs at an infrastructure level, as opposed to an application or virtual machine level. For example, suppose an application is running on a virtual machine, and the physical host on which the virtual machine runs fails. In this case, the virtual machine (along with the application it is running) would need to be failed over to another host to keep the application available. In a case where the whole cluster fails, the virtual machine would need to be moved to another cluster to remain available. Because the application administrator only has access to the application and the virtual machine it runs on, but not to the cluster infrastructure, the application administrator is not able to configure the infrastructure to support moving the virtual machine between nodes in the cluster or between clusters in order to keep the virtual machine and its application available in the event of infrastructure level failure.
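The relationship between the level at which a failure occurs, the recovery it requires, and whether the application administrator is able to configure that recovery can be summarized in a short sketch. The names `FailureLevel`, `recovery_action` and `configurable_by_app_admin` are hypothetical and serve only to restate the scenario described above.

```python
from enum import Enum


class FailureLevel(Enum):
    APPLICATION = 1
    VIRTUAL_MACHINE = 2
    HOST = 3
    CLUSTER = 4


# Recovery required at each level, as described in the text above.
RECOVERY = {
    FailureLevel.APPLICATION: "restart the application on the same virtual machine",
    FailureLevel.VIRTUAL_MACHINE: "reboot the virtual machine on the same host",
    FailureLevel.HOST: "fail the virtual machine over to another node in the cluster",
    FailureLevel.CLUSTER: "move the virtual machine to another cluster",
}


def recovery_action(level):
    """Return the recovery required for a failure at the given level."""
    return RECOVERY[level]


def configurable_by_app_admin(level):
    """Return True if the application administrator can configure the recovery.

    Only the application and the virtual machine are within the application
    administrator's control; host and cluster recovery require access to the
    infrastructure.
    """
    return level in (FailureLevel.APPLICATION, FailureLevel.VIRTUAL_MACHINE)
```

The two levels for which `configurable_by_app_admin` returns False are the ones that require configuration of the cluster infrastructure.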
An infrastructure administrator who is logged into the high-availability cluster and has infrastructure level access can configure the infrastructure to support failover of virtual machines between nodes and clusters. However, an infrastructure administrator may or may not be present or available when an application administrator wishes or needs to configure an application being run on a virtual machine hosted on the infrastructure to be highly available according to an SLA which specifies failover between nodes or disaster recovery between clusters. This interferes with the high availability and mobility of the application, both of which are important within the context of high-availability clusters and cloud computing environments such as private clouds.
It would be desirable to address these issues.