1. Field of the Invention
This invention relates to communication networks. More particularly, this invention relates to congestion control in communication network virtualization.
2. Description of the Related Art
The meanings of certain acronyms and abbreviations used herein are given in Table 1.
TABLE 1
Acronyms and Abbreviations
ARP      Address Resolution Protocol
CNM      Congestion-notification message
DCTCP    Datacenter TCP
ECN      Explicit Congestion Notification
IB       InfiniBand
IETF     Internet Engineering Task Force
IP       Internet Protocol; Also Internet Protocol Address
NIC      Network Interface Card
NVGRE    Network Virtualization using Generic Routing Encapsulation
OSI      Open Systems Interconnection
QCN      Quantized Congestion Notification
QP       Queue Pair (a transmit queue and a receive queue)
RFC      Request for Comments
SRIOV    Single-Root I/O virtualization
TCP      Transmission Control Protocol
UDP      User Datagram Protocol
VIOC     Virtual Input-Output Connection
VM       Virtual Machine
VXLAN    Virtual eXtensible Local Area Network
Network virtualization involves creating virtual OSI Layer-2 and/or Layer-3 topologies on top of an arbitrary physical (Layer-2 or Layer-3) network. Network virtualization decouples virtual networks and addresses from the physical network infrastructure, providing isolation and concurrency between multiple virtual networks on the same physical network infrastructure. Such virtualized networks can be used, for example, in data centers and cloud computing services. Virtualized networks of this sort are commonly referred to as “overlay networks” or “tenant networks”.
Congestion can arise in a network during communication. The following are typical scenarios:
(1) Multiple senders send traffic to the same receiver.
(2) A receiver is slower than the originating traffic sender.
(3) Multiple flows share the same link inside the network fabric.
In the event of such congestion, network switches can buffer the traffic up to a limit, after which they either cause backpressure or drop part of the traffic. Both of these actions can harm performance—dropping traffic will cause a retransmission, while applying backpressure and stopping the originating switch will cause congestion to spread, possibly slowing down the entire network.
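The two switch behaviors described above can be illustrated with a minimal sketch. The class name SwitchPort, the capacity parameter, and the lossless flag are hypothetical names introduced for illustration only; a real switch implements these decisions in hardware and signals backpressure with link-level flow control rather than a boolean field.

```python
from collections import deque


class SwitchPort:
    """Illustrative egress queue with a fixed buffer limit (assumed model)."""

    def __init__(self, capacity, lossless):
        self.queue = deque()
        self.capacity = capacity      # buffer limit, in packets (assumed unit)
        self.lossless = lossless      # True: apply backpressure; False: drop
        self.dropped = 0              # packets discarded after the limit was hit
        self.paused_upstream = False  # True while the upstream sender is stopped

    def enqueue(self, packet):
        """Buffer the packet if there is room; otherwise drop or push back."""
        if len(self.queue) < self.capacity:
            self.queue.append(packet)
            return True
        if self.lossless:
            # Backpressure: the upstream device must stop, so congestion spreads.
            self.paused_upstream = True
        else:
            # Drop: the sender will eventually have to retransmit.
            self.dropped += 1
        return False

    def dequeue(self):
        """Forward one packet and release backpressure once there is room."""
        packet = self.queue.popleft() if self.queue else None
        if len(self.queue) < self.capacity:
            self.paused_upstream = False
        return packet
```

In this sketch a lossy port counts discarded packets, while a lossless port sets paused_upstream until a packet is forwarded and buffer space becomes available again.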
Some communication networks apply congestion-control mechanisms for mitigating traffic congestion in the network. For example, congestion control for InfiniBand™ networks is specified in InfiniBand Architecture Specification Volume 1, release 1.2.1, Annex A10, November, 2007, pages 1650-1697, which is incorporated herein by reference.
As another example, congestion control for Ethernet™ networks is specified in IEEE Standard 802.1Qau-2010, entitled IEEE Standard for Local and Metropolitan Area Networks-Virtual Bridged Local Area Networks; Amendment 13: Congestion Notification, Apr. 23, 2010, which is incorporated herein by reference.
Another approach to congestion control is disclosed in RFC 2581, dealing with TCP window control. The sender or source host keeps track of the number of packets sent to the network that are unacknowledged by the receiving host. The number of packets that are allowed to be in flight, and not acknowledged, has a limit that depends upon estimation by the source host of the congestion situation in the network. The source host treats packet loss or increase in the round trip time as a signal for congestion, while successfully acknowledged packets and decreasing or stable round trip time are treated as indicating a lack of congestion.
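The window mechanism described above can be sketched as follows. This is a simplified illustration and not the algorithm of RFC 2581, which counts bytes and distinguishes slow start from congestion avoidance. The class name WindowedSender and the specific growth and reduction steps (add one on acknowledgement, halve on loss) are assumptions made for this example.

```python
class WindowedSender:
    """Illustrative sender that limits unacknowledged packets in flight."""

    def __init__(self, cwnd=4):
        self.cwnd = cwnd        # congestion window, in packets (assumed unit)
        self.in_flight = set()  # sequence numbers sent but not yet acknowledged

    def can_send(self):
        """A new packet may be sent only while the window is not full."""
        return len(self.in_flight) < self.cwnd

    def send(self, seq):
        """Record the packet as in flight; return False if the window is full."""
        if not self.can_send():
            return False
        self.in_flight.add(seq)
        return True

    def on_ack(self, seq):
        """A successful acknowledgement is treated as a lack of congestion."""
        self.in_flight.discard(seq)
        self.cwnd += 1  # assumed growth step

    def on_loss(self, seq):
        """Packet loss is treated as a signal for congestion."""
        self.in_flight.discard(seq)
        self.cwnd = max(1, self.cwnd // 2)  # assumed reduction: halve the window
```

With a window of two packets, a third send is refused until an acknowledgement arrives, and a detected loss shrinks the window so that fewer packets may be outstanding.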
Both Explicit Congestion Notification (ECN), defined in RFC 3168, and DCTCP congestion control also use a congestion window, but decide upon the congestion window size using explicit marking. A switch, instead of dropping packets, marks the packets as “congestion encountered”, using a special bit in the packet header. The receiving host uses a special field in the acknowledgement packets it sends to the source to indicate that it received a packet marked as congestion encountered.
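The marking and echo path described above can be sketched in three steps: the switch marks, the receiver echoes, and the source adjusts its window. The names Packet, Ack, MARK_THRESHOLD, and the three functions are hypothetical. The marking threshold and the halving of the window are assumptions for illustration; DCTCP, for instance, reduces the window in proportion to the fraction of marked packets rather than by a fixed factor.

```python
from dataclasses import dataclass

MARK_THRESHOLD = 4  # queue depth above which the switch marks packets (assumed)


@dataclass
class Packet:
    seq: int
    congestion_encountered: bool = False  # the "special bit" in the header


@dataclass
class Ack:
    seq: int
    congestion_echo: bool  # the "special field" echoed back to the source


def switch_forward(packet, queue_depth):
    """Mark the packet instead of dropping it when the queue is deep."""
    if queue_depth > MARK_THRESHOLD:
        packet.congestion_encountered = True
    return packet


def receiver_ack(packet):
    """Echo the congestion mark back to the source in the acknowledgement."""
    return Ack(seq=packet.seq, congestion_echo=packet.congestion_encountered)


def sender_on_ack(cwnd, ack):
    """Shrink the window on an echoed mark, otherwise grow it by one packet."""
    if ack.congestion_echo:
        return max(1, cwnd // 2)  # assumed reaction: halve the window
    return cwnd + 1
```

No packet is lost in this exchange: the source learns of congestion from the echoed mark alone and adjusts its window accordingly.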
Commonly assigned U.S. Pat. No. 8,705,349 to Bloch et al., which is herein incorporated by reference, describes regulation of the transmission rate of packets selectively, based on destination address. A source network interface identifies the destination address of packets that triggered a notification and were marked by the network element. The source network interface then regulates the transmission rate of subsequent packets that are addressed to the identified destination address, e.g., by forcing a certain inter-packet delay between successive packets.
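Per-destination regulation by a forced inter-packet delay can be illustrated with the sketch below. It is not the implementation of the cited patent; the class name DestinationRateLimiter, its methods, and the use of seconds as the time unit are assumptions introduced for this example.

```python
class DestinationRateLimiter:
    """Illustrative per-destination pacing by a forced inter-packet delay."""

    def __init__(self):
        self._delay = {}      # destination -> forced inter-packet delay (seconds)
        self._next_send = {}  # destination -> earliest permitted send time

    def on_notification(self, destination, delay):
        """Start regulating traffic addressed to the notified destination."""
        self._delay[destination] = delay

    def schedule(self, destination, now):
        """Return the time at which a packet to this destination may be sent."""
        send_at = max(now, self._next_send.get(destination, now))
        # Destinations without a notification have zero delay and are unaffected.
        self._next_send[destination] = send_at + self._delay.get(destination, 0.0)
        return send_at
```

Only packets addressed to the identified destination are spaced apart; traffic to other destinations continues to be scheduled immediately.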
A number of protocols have been developed to support network virtualization. For example, Sridharan et al. describe the NVGRE virtualization protocol in an Internet Draft entitled “NVGRE: Network Virtualization using Generic Routing Encapsulation,” draft-sridharan-virtualization-nvgre-01 (Jul. 9, 2012), published by the Internet Engineering Task Force (IETF). Another network virtualization protocol is VXLAN (Virtual eXtensible Local Area Network), which is described by Mahalingam et al. in an Internet Draft entitled “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” published by the IETF as draft-mahalingam-dutt-dcops-vxlan-02 (Aug. 22, 2012). These protocols are directed to virtualization and encapsulation of Layer 2 communications (such as Ethernet™ links) over Internet Protocol (IP) networks.
Attempts have also been made to provide a framework for encapsulation and transmission of various protocols over InfiniBand™ (IB) networks. For example, Chu and Kashyap describe a method for encapsulating and transmitting IP and Address Resolution Protocol (ARP) packets over IB in “Transmission of IP over InfiniBand (IPoIB),” published in 2006 as IETF RFC 4391. This document specifies the link-layer address to be used when resolving IP addresses in IB subnets and the setup and configuration of IPoIB links.
The congestion control protocols mentioned above assume that all parties follow them, meaning that a misbehaving party could abuse the system and take an unfair share of the available bandwidth. This becomes especially problematic in a cloud environment, where the tenants are untrusted, while the service provider wants to provide well-behaved clients with a lossless network.
For example, U.S. Pat. No. 8,201,168 describes the use of virtual input-output connections for machine virtualization. A virtual computer system includes at least one virtual or physical compute node, which produces data packets having respective source attributes. At least one virtual input-output connection (VIOC) is uniquely associated with the values of the source attributes. The virtual computer system is implemented on a physical computer system, which includes at least one physical packet switching element. The physical packet switching element is configured to identify the data packets whose source attributes have the values that are associated with the VIOC and to perform operations on the identified data packets so as to enforce a policy with regard to the VIOC.
An additional challenge in this respect is the fact that SRIOV acceleration, which is used in some virtualized environments, prevents the hypervisor (which is a trusted entity) from seeing and being able to control most of the sent traffic, as the guest virtual machine, which is not necessarily trusted, is allowed to communicate directly with the hardware.
Currently, to ensure service level, a cloud provider might define a limit for the bandwidth a virtual machine (VM) is consuming, and track for each user the number of transferred bytes. Such tracking may disregard physical or logical distance between a transmitter and receiver, or may employ a metric of a physical or logical distance, or both, that bytes travel. While most modern network hardware supports priority definition between a small number of traffic classes, this feature is not commonly used in cloud computing solutions, due to the highly limited number of queues, which imposes an extremely low limit on the number of traffic classes.
Some congestion control protocols also suffer from issues regarding fairness in allocation of resources among different flows, with some of the flows ending up getting a share that is either considerably bigger or smaller than what the user is expecting. For example, in FIG. 1, circumstances might result in one source host sending at 60% line rate, a second source host sending at 30% line rate and a third source host sending at 10% line rate.
Another challenge is the convergence time—the amount of time it takes for the congestion control protocol to reach a stable state after a change in the network topology or flow patterns.
Finally, a congestion control protocol might suffer from oscillations (lack of stability), where the transmission rate or window is repeatedly increased and decreased with significant amplitude.
Some existing congestion control protocols use additive increase, multiplicative decrease rate-control schemes, where a value is added to the rate after a certain amount of time has passed without congestion notification, and upon receiving a congestion notification the rate is decreased by a specific multiplier.
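An additive increase, multiplicative decrease scheme of this kind can be sketched as follows. The class name AimdRateController and the default values for the increase step, the decrease multiplier, and the minimum rate are assumptions chosen for illustration and are not taken from any particular protocol.

```python
from dataclasses import dataclass


@dataclass
class AimdRateController:
    """Illustrative additive-increase, multiplicative-decrease rate state."""

    rate: float                   # current sending rate, e.g. in Mb/s
    max_rate: float               # line rate; the rate never exceeds this
    min_rate: float = 1.0         # floor so the rate never reaches zero (assumed)
    increase_step: float = 5.0    # value added after a quiet interval (assumed)
    decrease_factor: float = 0.5  # multiplier applied on notification (assumed)

    def on_quiet_interval(self):
        """Called when a set time has passed without a congestion notification."""
        self.rate = min(self.max_rate, self.rate + self.increase_step)
        return self.rate

    def on_congestion_notification(self):
        """Called when a congestion notification is received."""
        self.rate = max(self.min_rate, self.rate * self.decrease_factor)
        return self.rate
```

Starting from a rate of 100 with the assumed defaults, a single notification lowers the rate to 50, and each subsequent quiet interval raises it by 5 until the line rate is reached again.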