Virtualization management software allows multiple virtual machines (VMs) to execute on a single hardware computing platform. Each VM is an abstraction of a physical computing system and executes a “guest” operating system. Virtualization management software also manages how hardware computing resources are allocated to each VM. A group of hardware computing platforms may be organized as a cluster to provide the hardware computing resources for VMs. In a data center, it is common to see hundreds, even thousands, of VMs running on multiple clusters of host servers.
When a server cluster at one location fails, the virtual infrastructure at that location may be recovered at a remote location through a disaster recovery process. Such disaster recovery restarts the entire data center (or a portion thereof) at the remote location by replicating the virtual infrastructure at the remote location. Commercially available disaster recovery products include VMware® vCenter™ Site Recovery Manager™.
In some disaster recovery products, site recovery manager (SRM) servers provide disaster recovery services to VMs managed by a VM management server. SRM servers work in pairs: one SRM server in the pair is registered to (i.e., works with) a VM management server at the “protected” site, and the other SRM server in the pair is registered to a VM management server at the “recovery” site. Notably, multiple SRM servers at the recovery site that communicate with corresponding SRM servers at the protected site may be registered to a single VM management server. Such a configuration may be employed in an N-to-1 disaster recovery setup.
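The pairing and registration relationships described above can be pictured as a small data model. The following is a minimal illustrative sketch only, not the data model of any actual product: the class `SrmServer`, the function `registrations_by_vm_server`, and every server and site name are hypothetical. It shows a 2-to-1 arrangement in which two recovery-site SRM servers are registered to the same VM management server.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SrmServer:
    """Hypothetical record for one SRM server."""
    name: str
    site: str            # site at which the SRM server runs
    registered_to: str   # VM management server it is registered to


def registrations_by_vm_server(servers):
    """Group SRM server names by the VM management server they use."""
    grouped = defaultdict(list)
    for server in servers:
        grouped[server.registered_to].append(server.name)
    return dict(grouped)


# Hypothetical 2-to-1 setup: two protected sites, one shared recovery site.
servers = [
    SrmServer("srm-a", "protected-1", "vc-a"),
    SrmServer("srm-b", "protected-2", "vc-b"),
    SrmServer("srm-a-dr", "recovery", "vc-dr"),
    SrmServer("srm-b-dr", "recovery", "vc-dr"),
]

# Each pair links a protected-site SRM server to its recovery-site peer.
pairs = [("srm-a", "srm-a-dr"), ("srm-b", "srm-b-dr")]

shared = registrations_by_vm_server(servers)
```

In this sketch, `shared["vc-dr"]` lists both recovery-site SRM servers, which is the N-to-1 registration pattern in miniature.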
Efficiently configuring the VM management servers and SRM servers (e.g., optimizing pairings, registrations, etc.) requires understanding and manipulating the topology of the servers deployed across the sites. Traditional approaches to configuring the VM management servers and SRM servers focus on a single registration or pairing in isolation. Such a myopic approach facilitates neither efficient re-configuration in the event of errors nor optimization of recovery workflows. For example, in some disaster recovery products, when one of the SRM servers in a pair is unavailable, the traditional approaches do not facilitate identification of holistic options, such as other SRM servers that are reachable from the available SRM server in the pair. Instead, understanding the broader topology and then identifying candidates for re-assignment and/or re-pairing is a tedious manual process that can degrade both the user experience and the efficient execution of disaster recovery workflows.
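To make the notion of "reachable" SRM servers concrete, the topology can be treated as an undirected graph whose edges are pairings and registrations. The following is a minimal sketch under that assumption, using a plain breadth-first search; it is not a method taken from any product, and the function `reachable_srm_servers` and all server names are hypothetical.

```python
from collections import deque


def reachable_srm_servers(start, edges, srm_names, unavailable=frozenset()):
    """Return SRM servers reachable from `start`, skipping unavailable ones.

    `edges` holds undirected links (pairings and registrations); treating
    both kinds of link as traversable is an assumption of this sketch.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        # Sort neighbors so the result order is deterministic.
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor in seen or neighbor in unavailable:
                continue
            seen.add(neighbor)
            queue.append(neighbor)
            if neighbor in srm_names:
                found.append(neighbor)
    return found


# Hypothetical topology: two SRM pairs sharing one recovery-site VM server.
edges = [
    ("srm-a", "srm-a-dr"), ("srm-b", "srm-b-dr"),   # pairings
    ("srm-a", "vc-a"), ("srm-b", "vc-b"),           # protected registrations
    ("srm-a-dr", "vc-dr"), ("srm-b-dr", "vc-dr"),   # recovery registrations
]
srm_names = {"srm-a", "srm-b", "srm-a-dr", "srm-b-dr"}

# "srm-a" is down; look for candidates reachable from its surviving peer.
candidates = reachable_srm_servers("srm-a-dr", edges, srm_names, {"srm-a"})
```

Here the surviving server `srm-a-dr` reaches `srm-b-dr` through the shared VM management server and then `srm-b` through the pairing, so both appear as candidates. A real tool would also need to decide which of those candidates are valid targets for re-pairing, which this sketch does not attempt.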