Communication networks are shifting from the physical to the virtual. In the past, a communication network was built on a physical infrastructure dedicated to that network. Increasingly, the infrastructure is becoming virtual: instead of building a dedicated physical network, or sharing a general-purpose network designed with no specific application in mind, operators deploy virtual networks. A virtual network gives its user the appearance of a dedicated network, with specific, customized protocols, yet is built on top of a shared physical substrate. The virtual network is a private network for its (virtual) operator, while the underlay is shared amongst different operators.
Virtualization has quickly transformed the way physical resources are utilized. Originally designed to isolate servers and share resources on a single physical machine, virtualization enables fast, agile deployment and migration by allowing a server to be defined entirely in software. This turns computing into an elastic resource, a model that commercial providers are adopting rapidly. The virtualization paradigm extends to networking: for instance, it enables multiple research groups to run overlay testbeds in separate virtual slices of a planetary-scale network. Indeed, the agility and flexibility brought by virtualization have led researchers to argue that the next-generation Internet can be de-ossified through a clean separation between infrastructure providers and service providers, where a virtualized infrastructure is offered as a service.
One key aspect of such a virtualized architecture is the proper assignment of underlay resources to the virtual networks on top. Since the resources are virtualized, they can be placed at different locations in the physical underlay, and careful allocation of virtual resources to physical ones is critical to the performance of the network. When done properly, each virtual network performs better and the utilization of the physical underlay increases (thus reducing its cost).
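To make the allocation problem concrete, the following is a minimal greedy sketch of mapping virtual nodes onto physical nodes. The data structures and heuristic are illustrative assumptions, not the embedding algorithm of any specific system; a real embedding must also map virtual links onto substrate paths and respect bandwidth constraints.

```python
# Sketch of greedy virtual network embedding (node mapping only).
# All node names and capacities below are hypothetical examples.

def greedy_node_embedding(physical_cpu, virtual_demand):
    """Map each virtual node to the physical node with the most spare CPU.

    physical_cpu: dict physical node -> available capacity
    virtual_demand: dict virtual node -> required capacity
    Returns a virtual->physical mapping, or None if some demand cannot fit.
    """
    spare = dict(physical_cpu)
    mapping = {}
    # Place the largest demands first so big virtual nodes are not stranded.
    for v, need in sorted(virtual_demand.items(), key=lambda kv: -kv[1]):
        host = max(spare, key=spare.get)
        if spare[host] < need:
            return None  # no feasible placement under this heuristic
        mapping[v] = host
        spare[host] -= need
    return mapping

# Example: two physical servers, three virtual nodes.
mapping = greedy_node_embedding({"p1": 10, "p2": 6}, {"a": 5, "b": 4, "c": 3})
```

Spreading load onto the least-utilized host, as above, is one common heuristic; cost-minimizing formulations instead pack virtual nodes tightly to leave whole physical servers free.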
With infrastructure rapidly becoming virtualized, shared, and dynamically changing, it is essential to make the physical infrastructure strongly reliable, since a single physical server or link failure affects several virtualized entities sharing it. Reliability is traditionally provided through redundancy at the physical layer: resources are duplicated, and the failure of a physical component is handled by bringing up a spare physical element. In a virtualized infrastructure, it is virtual elements that must be backed up: the failure of a physical component implies the disappearance of some virtual components, which then have to be relocated onto other physical components.
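The relocation step can be sketched as follows. This is a toy first-fit re-placement under assumed data structures; a real system would also have to restore the state of each displaced virtual node and re-route its virtual links.

```python
# Sketch: when a physical host fails, every virtual node it carried must be
# re-placed on a surviving host with enough spare capacity. The mapping,
# capacities, and host names are hypothetical illustrations.

def relocate_after_failure(mapping, spare_cpu, demand, failed_host):
    """Return an updated virtual->physical mapping after failed_host dies,
    or None if some displaced virtual node cannot be re-placed."""
    spare = {h: c for h, c in spare_cpu.items() if h != failed_host}
    new_mapping = dict(mapping)
    displaced = [v for v, h in mapping.items() if h == failed_host]
    for v in displaced:
        # First-fit: pick any surviving host with enough spare capacity.
        host = next((h for h, c in spare.items() if c >= demand[v]), None)
        if host is None:
            return None
        new_mapping[v] = host
        spare[host] -= demand[v]
    return new_mapping

# Host p1 fails; virtual node "a" must move to a surviving host.
updated = relocate_after_failure(
    {"a": "p1", "b": "p2"}, {"p1": 0, "p2": 1, "p3": 4}, {"a": 3, "b": 2}, "p1")
```

Note that relocation only succeeds if spare capacity was provisioned somewhere in advance, which is exactly the redundant-resource allocation question raised above.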
Providing reliability is often linked with over-provisioning computational, network, and storage capacities, and with employing load balancing for additional robustness. Such high-availability systems suit applications that can tolerate large discontinuities, e.g. restarting network flows while rerouting around link or node failures, or partially restarting jobs after node failures. A higher level of fault tolerance is required for applications where failures have a substantial impact on the current state of the system. For instance, virtual networks with servers that perform admission control, scheduling, load balancing, bandwidth brokering, AAA, or other NOC operations maintain snapshots of the network state and cannot tolerate total failures. In master-slave/worker architectures, e.g. MapReduce or PVM, a failure at the master node wastes the resources of the slaves/workers.
Network virtualization is a promising technology for reducing the operating costs and management complexity of networks, and it is receiving increasing research interest. Reliability is bound to become an increasingly prominent issue as infrastructure providers move toward virtualizing their networks over simpler, cheaper commodity hardware.
Others have considered the use of a “shadow VNet”, namely a parallel virtualized slice, to study the reliability of a network. However, such a slice is used not as a back-up, but as a monitoring tool and a way to debug the network in case of failure.
Meanwhile, some works target node fault tolerance at the server virtualization level. At least one introduces fault tolerance at the hypervisor: two virtual slices residing on the same physical node can be made to operate in synchronization through the hypervisor. However, since the slices reside on the same node, this provides reliability against software failures at most.
Others have made progress toward duplicating virtual slices and migrating them over a network. Various duplication techniques and migration protocols have been proposed for different types of applications (web servers, game servers, and benchmarking applications). Another system allows state synchronization between two virtual nodes over time. It is thus practically possible to distribute redundant virtual nodes over a network for reliability. However, these solutions do not address how compute capacity is allocated to the redundant nodes residing elsewhere in the network.
At a fundamental level, there are methods to construct topologies for redundant nodes that address both node and link reliability. Starting from an input graph, additional links (or bandwidth reservations) are introduced so that the fewest possible are needed. However, this line of work was designed for fault tolerance in multiprocessor systems, which are mostly stateless. A node failure, in that setting, triggers migrations or rotations among the remaining nodes to preserve the original topology. This may not be suitable in a virtualized network, where migrations can disrupt parts of the network unaffected by the failure.
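The flavor of such constructions can be shown with a deliberately simple example: augmenting a ring of working nodes with one spare connected to every node, so that any single node failure can be masked by substituting the spare. This toy augmentation is an assumption for illustration only, not the minimal-link construction the literature computes.

```python
# Toy illustration of topology augmentation for single-node fault tolerance:
# an n-node ring plus one spare node (labeled n) linked to all ring nodes.

def augment_ring_with_spare(n):
    """Return the edge set of an n-node ring plus a fully connected spare."""
    ring = {(i, (i + 1) % n) for i in range(n)}
    spare_links = {(i, n) for i in range(n)}
    return ring | spare_links

def survives_single_failure(n, edges, failed):
    """Check the spare can take over `failed`'s ring role: both ring
    neighbours of the failed node must already have a link to the spare."""
    left, right = (failed - 1) % n, (failed + 1) % n
    return {(left, n), (right, n)} <= edges

edges = augment_ring_with_spare(5)
ok = all(survives_single_failure(5, edges, f) for f in range(5))
```

Here the spare needs n extra links; the optimal constructions in the literature aim to achieve the same coverage with far fewer added links, at the cost of the post-failure migrations discussed above.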
Fault tolerance is also provided in data centers, where redundancy takes the form of a large excess of nodes and links. Some protocols are defined for failure recovery, but they offer few guarantees on reliability.