Some current data centers run server virtualization software on compute nodes. These compute nodes, also known as hypervisor nodes, generate a large amount of network traffic, which includes traffic originating from the virtual machines as well as a large amount of infrastructure traffic. Infrastructure traffic is traffic that originates from the hypervisor layer rather than the virtual machines. The end point IP addresses in infrastructure traffic are the hypervisor addresses. Some examples of the different kinds of traffic originated at the hypervisor are: management traffic (i.e., network traffic used to manage the hypervisors); virtual machine migrator traffic (i.e., network traffic generated when a virtual machine is moved from one host to another host); storage traffic (i.e., network traffic generated when a virtual machine accesses its virtual disk hosted on a network share, for example Network Attached Storage (NAS) such as Network File System (NFS), or Direct Attached Storage (DAS) such as Virtual Storage Area Network (VSAN)); and virtual machine traffic encapsulated by the hypervisor (i.e., network traffic between virtual machines that is encapsulated using technologies such as a Virtual Extensible Local Area Network (VXLAN)).
In some current systems, flows of these different traffic types are segregated at the level of the network fabric using virtual local area networks (VLANs). In some cases, the flows are segregated to provide isolation, both for security and for quality of service. Under such a scheme, the hypervisor host at which the traffic originates is typically responsible for adding the corresponding VLAN tags as the packets leave the host. To achieve this, a hypervisor host typically maintains one or more virtual network interfaces (such as eth[0 . . . n] on Linux or vmk[0 . . . n] on ESX) for each of the VLANs. In the presence of multiple IP interfaces on different VLANs, a sender application does one of two things when sending out a packet on the host. First, the sender application may explicitly specify the virtual interface on which the packet should egress. This is useful for cases where the sender application wants to implement a multi-pathing type of send behavior.
Implementations that explicitly specify the egress interface have the following disadvantages: (a) the intelligence as to which interface to use has to be built into each application that uses VLAN interfaces; and (b) in some ways, such an implementation bypasses the IP routing table and as such can have issues when the application's logic for working with a routing table is not consistent with the routing table of the underlying TCP/IP stack processor. Second, the sender application may rely on the hypervisor's TCP/IP stack processor to make a decision based on the routing table on the host. This relies on standard routing table behavior where typically each VLAN is assigned a different subnet address, and based on the destination IP address, the system determines which interface to use.
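The routing-table-based choice described above can be sketched as follows. This is only a minimal illustration, assuming one subnet per VLAN; the interface names, VLAN IDs, subnets, and the function name select_interface are hypothetical and do not come from any particular hypervisor.

```python
import ipaddress

# Hypothetical per-VLAN interfaces: name -> (VLAN ID, subnet).
# All values are illustrative assumptions.
INTERFACES = {
    "vmk0": (10, ipaddress.ip_network("10.0.10.0/24")),  # management
    "vmk1": (20, ipaddress.ip_network("10.0.20.0/24")),  # VM migration
    "vmk2": (30, ipaddress.ip_network("10.0.30.0/24")),  # storage
}


def select_interface(dst):
    """Return (interface name, VLAN ID) for the subnet containing dst.

    Returns None when no local subnet contains the destination, which is
    the case where the stack would need a gateway.
    """
    addr = ipaddress.ip_address(dst)
    for name, (vlan, subnet) in INTERFACES.items():
        if addr in subnet:
            return name, vlan
    return None
```

In this sketch the application supplies only a destination address; the interface and the VLAN tag follow from the subnet match rather than from logic built into each application.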
Some systems operate differently depending on whether or not the destination IP address of the hypervisor for a flow is directly reachable via a Layer 2 (L2) network. If the destination hypervisor for that flow is directly reachable via an L2 network (i.e., the source and destination are on the same subnet), the sender's TCP/IP stack processor does not have to use the default gateway route, and routing is straightforward. However, if the destination hypervisor is not directly reachable via an L2 network (i.e., the source and destination are on different subnets), the sender's TCP/IP stack processor has to rely on a gateway for sending packets to the destination subnet. This is especially important when the destination hypervisor is reachable via a long-distance network connection where routers and gateways of Layer 3 (L3) networks are the norm.
Since a TCP/IP stack processor of current systems supports only one default gateway, the gateway for the management traffic takes that spot in current systems. However, as explained above, other flows may not be able to reach their gateway address, if the gateway is on a different subnet/VLAN.
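The limitation can be illustrated with a small sketch. It assumes a host with three VLAN interfaces and a single default gateway that sits on the management subnet; the addresses, interface names, and the function next_hop are hypothetical.

```python
import ipaddress

# Hypothetical local subnets, one per VLAN interface (illustrative values).
LOCAL_SUBNETS = {
    "vmk0": ipaddress.ip_network("10.0.10.0/24"),  # management
    "vmk1": ipaddress.ip_network("10.0.20.0/24"),  # VM migration
    "vmk2": ipaddress.ip_network("10.0.30.0/24"),  # storage
}

# The only default gateway; assumed to live on the management subnet.
DEFAULT_GATEWAY = "10.0.10.1"


def next_hop(dst):
    """Return ("direct", interface) for on-link destinations,
    otherwise ("gateway", DEFAULT_GATEWAY)."""
    addr = ipaddress.ip_address(dst)
    for iface, subnet in LOCAL_SUBNETS.items():
        if addr in subnet:
            return "direct", iface
    return "gateway", DEFAULT_GATEWAY
```

With these assumed values, a storage flow to a remote storage subnet such as 10.1.30.5 is handed to the management gateway at 10.0.10.1 rather than to a gateway on the storage VLAN, which is the situation the paragraph above describes.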
One method of addressing this issue in current systems is to use multiple non-default gateway addresses in the IP routing table of a single TCP/IP stack processor. However, the current approach of managing static routes to add non-default gateways suffers from the following issues: (1) it is cumbersome and error-prone; and (2) it is seen as a security risk, so many entities that use data centers and enterprise networks do not implement static routes.
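A static-route workaround of this kind can be sketched as a routing table with longest-prefix matching. This is a simplified model, not the behavior of any specific TCP/IP stack; the class name RoutingTable, its methods, and all addresses are assumptions made for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: one default route plus manually added static routes."""

    def __init__(self, default_gateway):
        # The default route matches every IPv4 destination.
        self.routes = [(ipaddress.ip_network("0.0.0.0/0"), default_gateway)]

    def add_static_route(self, prefix, gateway):
        """Add a non-default gateway for one destination prefix."""
        self.routes.append((ipaddress.ip_network(prefix), gateway))

    def lookup(self, dst):
        """Return the gateway of the most specific (longest-prefix) match."""
        addr = ipaddress.ip_address(dst)
        matches = [(net, gw) for net, gw in self.routes if addr in net]
        return max(matches, key=lambda route: route[0].prefixlen)[1]


table = RoutingTable(default_gateway="10.0.10.1")
# One entry per remote subnet has to be maintained by hand on every host.
table.add_static_route("10.1.30.0/24", "10.0.30.1")
```

Every remote subnet needs its own entry on every host, which gives a sense of why the text calls this approach cumbersome and error-prone.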
The consequence of not having multiple non-default gateways in current systems is that sender applications that rely on an L3 gateway to reach their counterpart on another hypervisor cannot function. As a result, virtual machine migrators, storage, and similar hypervisor services do not work across Layer 3 (L3) boundaries in current systems. This is especially relevant when these services are expected to work over long distances or in a spine-leaf network topology.
Spine-leaf is a well-understood network topology that provides for maximum utilization of network links in terms of bandwidth. The idea is to define an access switch layer of Top of Rack (ToR) switches that connect to hypervisors on the south side and to a layer of aggregate switches on the north side. The aggregate layer switches form the spine. The access layer switches and the hypervisors form the leaves of the network. The key aspect of this topology is that the access switches define the L2 network boundary on the south side. In other words, they terminate VLANs. To reach from one access switch to another access switch, some systems rely on L3 network routing rather than extending the L2 network fabric. This puts many of the hypervisor services, such as virtual machine migrators and storage, at risk since they rely on L2 network connectivity.
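The L2 boundary in such a topology can be modeled in a few lines. The sketch below assumes that each hypervisor attaches to exactly one ToR switch and that VLANs terminate at that switch; the host names, switch names, and the function connectivity are hypothetical.

```python
# Hypothetical mapping of hypervisor hosts to their ToR (leaf) switch.
HOST_TO_TOR = {
    "hv-a1": "tor-a",
    "hv-a2": "tor-a",
    "hv-b1": "tor-b",
}


def connectivity(src_host, dst_host):
    """Return "L2" when both hosts share a ToR switch, else "L3".

    Because the access switches terminate VLANs, traffic between hosts
    under different ToR switches has to be routed through the spine.
    """
    if HOST_TO_TOR[src_host] == HOST_TO_TOR[dst_host]:
        return "L2"
    return "L3"
```

Under these assumptions, a migration from hv-a1 to hv-b1 crosses an L3 boundary, so a service that depends on L2 connectivity would not work between those two hosts.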
In some current systems, multiple network applications run on a hypervisor host. Each of these applications can be very network intensive, can consume resources from the underlying TCP/IP stack processor, and can leave other applications without resources. In the worst cases, a user cannot use secure shell (SSH) to reach the hypervisor host because the heap space is used up completely by one of the other applications.
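The starvation scenario can be illustrated with a toy model of a heap shared by all applications on one stack. The class name SharedStackHeap, the capacity, and the application names are assumptions for illustration only; real stacks manage memory in far more detail.

```python
class SharedStackHeap:
    """Toy model of one heap shared by every application using the stack."""

    def __init__(self, capacity):
        self.capacity = capacity  # total bytes available to all applications
        self.used = 0

    def allocate(self, app, nbytes):
        """Reserve nbytes for app; return False when the shared heap is full."""
        if self.used + nbytes > self.capacity:
            return False
        self.used += nbytes
        return True


heap = SharedStackHeap(capacity=1000)
storage_ok = heap.allocate("storage", 1000)  # one application takes everything
ssh_ok = heap.allocate("ssh", 1)             # nothing is left for SSH
```

Because there is no per-application limit in this model, the first application to ask for the whole heap gets it, and the later SSH request fails.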
In some current systems, if a hypervisor is hosting workload/virtual machines of multiple tenants, security is of paramount importance. At the network level, putting the different tenants on different VLANs or physical network fabric provides security/isolation. However, in current systems, at each hypervisor host, there is one TCP/IP stack processor providing transport for all these different tenants and flows. This is potentially a gap in the security model, since the flows can mix at the level of the hypervisor.
Data is sent on networks as individual packets. One type of packet is an Internet protocol (IP) packet. Data is generated by processes on a machine (e.g., a host machine). The data is then sent to a TCP/IP stack processor to transform the data into packets addressed to the destination of the data. A TCP/IP stack processor is a series of networking protocols that transform data from various processes into IP packets capable of being sent over networks such as the Internet. Each packet includes at least a header, with a source and destination address, and a body of data. As a data packet is transformed by each layer of a TCP/IP stack processor, the protocols of the layers may add or remove fields from the header of the packet. The end result of the transformation by the TCP/IP stack processor is that a data payload is encapsulated in headers that allow the packet to traverse an IP network.
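The layering described above can be sketched by wrapping a payload first in a transport header and then in an IPv4 header. The sketch uses UDP rather than TCP only because its header is shorter; the addresses, ports, TTL, and function names are illustrative assumptions, and the UDP checksum is left at zero (permitted over IPv4) for brevity.

```python
import socket
import struct


def ipv4_checksum(header):
    """Ones' complement sum of the header's 16-bit words (IPv4 header checksum)."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def encapsulate(payload, src, dst, sport, dport):
    """Wrap payload in a UDP header, then in a 20-byte IPv4 header."""
    # Transport layer: source port, destination port, length, checksum (0 = unused).
    udp = struct.pack("!HHHH", sport, dport, 8 + len(payload), 0) + payload
    # Network layer: version/IHL, TOS, total length, ID, flags/fragment offset,
    # TTL, protocol (17 = UDP), checksum placeholder, source, destination.
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(udp), 0, 0, 64, 17, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    checksum = ipv4_checksum(header)
    # Write the computed checksum into bytes 10-11 of the header.
    header = header[:10] + struct.pack("!H", checksum) + header[12:]
    return header + udp


packet = encapsulate(b"hello", "10.0.10.5", "10.0.20.7", 1234, 5678)
```

Each layer only prepends its own header, so the original payload sits unchanged at the end of the resulting packet.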
Data centers and enterprise networks with multiple hosts implement a single TCP/IP stack processor on each host to handle the creation of IP packets, outside of virtual machines on the host, for sending on IP networks. The single TCP/IP stack processor also parses IP packets that are received from other processes on the host and from machines and processes outside of the host.
The single TCP/IP stack processor of existing networks provides IP packet creation and parsing for a wide variety of processes operating on the host. However, there are disadvantages to using a single TCP/IP stack processor for all processes operating on a host outside of virtual machines on the host. For example, it is possible for one process to use all the available IP packet bandwidth and/or resources of the TCP/IP stack processor, leaving other processes unable to communicate with machines and processes outside the host through IP packets. Furthermore, a single TCP/IP stack processor is limited to a single default gateway for sending packets with destination addresses that are not in routing tables of the TCP/IP stack processor.