Cloud services are well known as on-demand services provided to end-users over the Internet and hosted by cloud service providers. Cloud services are typically deployed as sets of applications/processes running on one or more Virtual Machines (“VMs”). Such VMs serve as emulations of a particular computer system which run on the hardware of a cloud service provider's data center.
Many organizations are increasingly embracing cloud services for outsourcing their Information Technology departments, enhancing flexibility, and improving efficiency. But because such cloud services are typically hosted by third-party cloud service providers and accessed over the Internet, an organization's ability to access and use cloud services may be affected by conditions outside of the organization's control, such as anomalous circumstances (e.g., natural disasters, power failures, cybersecurity attacks, etc.) that may unexpectedly disrupt the availability of the cloud service provider's servers or network connectivity.
Because virtualization is at the core of cloud infrastructure, however, portability of server operating environments is inherently enabled. Indeed, as VMs are not bound to particular hardware, they may be migrated from one server to another. While current implementations of live migration have focused on the migration of single or multiple VMs, what has not been addressed is the migration of complete services, which requires the migration of a cluster of VMs as they continue to operate together to provide cloud services. Such migration may allow for complete services (as opposed to just a VM that provides a portion of a service) to remain available to an organization even if access to, or the availability of, a cloud service provider's initial host server is compromised.
Generally, VM migration may be either a live migration, which keeps the VMs running and their services in operation during migration, or a non-live migration, which temporarily shuts down the VMs for migration from an initial host server and restarts them once their memory state has reached a destination server. Live migration of VMs has been used in various tasks, including IT maintenance (e.g., by transparently migrating VMs off of a host which will be brought down for maintenance), load balancing (e.g., by migrating VMs off of a congested host to a machine with a lower CPU or I/O load), power management (e.g., by migrating VMs from multiple servers onto fewer servers in order to reduce data center power consumption), and development-to-operations support (e.g., by migrating VMs residing in the development environment over to the test environment, and then to the operational environment).
As such, it is contemplated that live migration may be used as a mechanism for improving resilience/availability of cloud services. Assuming a cloud infrastructure as a service (IaaS) model where cloud service providers manage virtual machine instances and offer them as a service to customers, it follows that when there is an anomaly in a cloud infrastructure that can result in disruption of the cloud (i.e., the cloud servers are no longer functional), VMs will need to be migrated to preserve the availability of the services they are providing.
A problem that still exists is that migrating a large number of VMs can take a long time, which users may not have when the cloud is under disruption. For instance, migrating a single 2 GB VM from a source host to a destination host in the same subnet with reasonable bandwidth can take tens of seconds. To address this problem, VMs need to be filtered for migration based on how valuable they are to their owner. In other words, the priorities of cloud services should be used in determining which ones should be migrated so as to maximize the availability of the highest priority services.
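The "tens of seconds" figure can be checked with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the 1 Gbit/s link speed, and the `efficiency` factor are assumptions, not values given above, and the estimate ignores the re-sending of memory pages that change while a live migration is in progress.

```python
def estimate_transfer_seconds(size_bytes, bandwidth_bits_per_s, efficiency=1.0):
    """Lower-bound time to copy a VM's memory image over a network link.

    efficiency is a hypothetical fraction (0 < efficiency <= 1) of the
    nominal bandwidth that is actually usable for the transfer.
    """
    if bandwidth_bits_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    return (size_bytes * 8) / (bandwidth_bits_per_s * efficiency)


# A 2 GB VM (decimal gigabytes) over an assumed 1 Gbit/s link.
one_vm = estimate_transfer_seconds(2e9, 1e9)
print(f"one VM: {one_vm:.0f} s")

# Fifty such VMs copied one after another over the same link.
print(f"fifty VMs: {50 * one_vm / 60:.1f} min")
```

Under these assumptions a single VM takes on the order of 16 seconds, and a cluster of fifty such VMs moved sequentially takes more than ten minutes, which may exceed the time available during a disruption.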
Accordingly, a need remains for a system and method which can manage live migrations of VMs to maximize the availability of high priority cloud services when the cloud is under disruption. Notably, when a set of VMs is to be migrated from one host to another due to a disruption that may cause a potential service interruption, it is important to realize that the interruption will place a limit on the time available for the migration to complete. Thus, it is contemplated that in order to adequately address this need, such a system and method will need to manage the live migration by automatically identifying which set of VMs to migrate, where to migrate them, and under what conditions the migration should happen.
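To make the "which set of VMs to migrate" question concrete, the following is a minimal sketch of one possible selection policy, not the method of any particular system. It assumes, hypothetically, that each service is a cluster of VMs with a numeric priority (higher is more important), that VMs are copied sequentially over a single link, and that a service is only worth migrating if its whole cluster fits in the remaining time. All names (`Service`, `select_services`, and so on) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    priority: int        # assumed convention: larger value = more important
    vm_sizes_gb: tuple   # memory size of each VM in the service's cluster


def migration_seconds(service, bandwidth_gbit_s):
    """Rough time to copy every VM of a service, one after another."""
    return sum(service.vm_sizes_gb) * 8 / bandwidth_gbit_s


def select_services(services, time_budget_s, bandwidth_gbit_s):
    """Greedily pick whole services in priority order within the time budget."""
    chosen = []
    remaining = time_budget_s
    for svc in sorted(services, key=lambda s: s.priority, reverse=True):
        cost = migration_seconds(svc, bandwidth_gbit_s)
        if cost <= remaining:   # migrate the complete cluster or skip it
            chosen.append(svc.name)
            remaining -= cost
    return chosen


if __name__ == "__main__":
    services = [
        Service("billing", priority=3, vm_sizes_gb=(2, 2)),
        Service("analytics", priority=1, vm_sizes_gb=(4,)),
        Service("web", priority=2, vm_sizes_gb=(2,)),
    ]
    # 50 s until the expected interruption, 1 Gbit/s link (both assumed).
    print(select_services(services, time_budget_s=50, bandwidth_gbit_s=1))
```

A greedy pass like this is simple but not guaranteed to be optimal; choosing the best subset under a time limit resembles a knapsack problem, and a real system would also need to decide the destination host and the conditions that trigger migration.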