Field of Invention
The present invention generally relates to software defined networks (SDNs), and particularly relates to a system and method for improved congestion control in SDNs.
Discussion of Related Art
Any discussion of the prior art throughout the specification should in no way be considered as an admission that such prior art is widely known or forms part of common general knowledge in the field.
The Transmission Control Protocol (TCP) is a core protocol of the Internet protocol suite. Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides a reliable, ordered and error-checked delivery of a stream of bytes between applications running on hosts communicating over an IP network. Almost all major Internet applications such as the Web, email, and video transmission rely on TCP. It is known as a reliable stream delivery service, which guarantees that all bytes received will be identical with bytes sent and in the correct order. Since packet transfer over many networks is not reliable, a technique known as positive acknowledgment with retransmission is used to guarantee reliability of packet transfers. This fundamental technique requires the receiver to respond with an acknowledgment message (ACK) as it receives data packets. The sender keeps a record of each packet it sends. The sender also maintains a timer from when the packet was sent, and retransmits a packet if the timer expires before the message has been acknowledged with an ACK. The timer is needed in case a packet gets lost or corrupted. TCP is considered to be a reliable transport mechanism because it requires the receiving computer to acknowledge not only the receipt of data but also its completeness and sequence.
While IP handles the actual delivery of the data, TCP keeps track of the individual units of data transmission, called segments, into which a message is divided for efficient routing through the network. TCP accepts data from a data stream, divides it into chunks, and adds a header to each chunk, creating a so-called TCP segment, which is then encapsulated in an Internet Protocol (IP) datagram and exchanged with peers. The TCP header is at least 20 bytes long and contains 10 mandatory fields and an optional extension field. The data section follows the header. Its contents are the payload data carried for the application.
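The header layout described above can be illustrated with a minimal Python sketch. It assumes the standard 20-byte fixed header with no options handling beyond skipping them; the names `TcpHeader` and `parse_tcp_segment` are hypothetical and serve only this illustration.

```python
import struct
from typing import NamedTuple, Tuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int  # header length in 32-bit words (5 means 20 bytes)
    flags: int        # low 8 control bits (CWR, ECE, URG, ACK, PSH, RST, SYN, FIN)
    window: int       # advertised receiver window (rwnd), unscaled
    checksum: int
    urgent: int


def parse_tcp_segment(segment: bytes) -> Tuple[TcpHeader, bytes]:
    """Split a raw TCP segment into its fixed header fields and its payload."""
    src, dst, seq, ack, off_flags, window, csum, urg = struct.unpack_from(
        "!HHIIHHHH", segment
    )
    data_offset = off_flags >> 12   # top 4 bits of the 16-bit word
    flags = off_flags & 0x00FF      # bottom 8 bits of the 16-bit word
    header = TcpHeader(src, dst, seq, ack, data_offset, flags, window, csum, urg)
    # Options, if any, sit between byte 20 and data_offset * 4; they are skipped here.
    return header, segment[data_offset * 4:]
```

The payload starts at `data_offset * 4` bytes, which is how a parser steps over the optional extension field without interpreting it.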
TCP uses a sliding window flow-control mechanism to control the throughput over wide-area networks between end-users. As the receiver acknowledges initial receipt of data, it advertises how much data it can handle, called its receiver window size (rwnd). The rwnd changes in time and depends on how many segments can be processed by the available free buffer space in the receiver. The sender can transmit multiple packets, up to rwnd, before it stops and waits for an ACK. The sender tries to fill up the pipe, waits for an ACK, and then fills up the pipe again up to rwnd. Therefore, the basic TCP flow control mechanism (between end-users) is the sliding window superimposed on a range of bytes beyond the last explicitly acknowledged byte. Its sliding operation limits the amount of unacknowledged transmissible data that a TCP sender can emit.
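As a rough illustration of the sliding window limit described above, the following sketch computes how many more bytes a sender may emit before it must wait for an ACK. The function name and parameters are hypothetical, and sequence-number wraparound is deliberately ignored.

```python
def sendable_bytes(last_acked: int, next_to_send: int, rwnd: int) -> int:
    """Bytes the sender may still emit before it must stop and wait for an ACK.

    last_acked   -- highest byte explicitly acknowledged by the receiver
    next_to_send -- next byte the sender would transmit
    rwnd         -- receiver window most recently advertised, in bytes
    """
    in_flight = next_to_send - last_acked  # sent but not yet acknowledged
    return max(0, rwnd - in_flight)
```

For example, with a 4096-byte rwnd and 2000 bytes in flight, the sender may emit 2096 more bytes; once the in-flight data reaches rwnd, the result is zero and the sender waits.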
The sliding window flow control mechanism works in conjunction with the Retransmission Timeout (RTO) mechanism, which is a timeout that prompts the retransmission of an unacknowledged segment. The timeout length is calculated based on a running average of the Round Trip Time (RTT) for ACK receipt, i.e., if an acknowledgment is not received within (typically) the smoothed RTT plus four times the mean deviation, then packet loss is inferred and the segment pending acknowledgment is retransmitted. Therefore, rwnd and RTT are the two key parameters of TCP flow-control.
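The timeout computation described above can be sketched as follows. This is a simplified illustration that uses the smoothing constants of RFC 6298 (1/8 for the smoothed RTT, 1/4 for the deviation, multiplier 4); it omits the clock-granularity term and the minimum RTO bound of that RFC, and the class name `RtoEstimator` is hypothetical.

```python
class RtoEstimator:
    """Smoothed RTT plus four times the mean deviation (simplified)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT deviation
    K = 4          # deviation multiplier

    def __init__(self) -> None:
        self.srtt = None    # smoothed RTT, seconds
        self.rttvar = None  # smoothed mean deviation, seconds

    def update(self, rtt_sample: float) -> float:
        """Fold one RTT measurement into the estimate and return the new RTO."""
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # The deviation is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(
                self.srtt - rtt_sample
            )
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_sample
        return self.srtt + self.K * self.rttvar
```

With a first sample of 100 ms, the sketch yields an RTO of 300 ms; a second sample of 200 ms raises it to 362.5 ms, because both the smoothed RTT and the deviation grow.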
TCP contains four intertwined algorithms for congestion control: slow-start, congestion avoidance, fast retransmit, and fast recovery [see Allman et al., “TCP Congestion Control,” RFC 5681, 2009.]. In addition, senders can employ an RTO that is based on the estimated RTT between the sender and receiver. The behavior of this timer is specified in [see Paxson et al., “Computing TCP's Retransmission Timer,” RFC 6298, 2011.]. There are several prior art algorithms for estimation of RTT. Congestion can occur when data arrives on a big pipe (a fast LAN) and gets sent out a smaller pipe (a slower WAN). Congestion can also occur when multiple input streams arrive at a router whose output capacity is less than the sum of the input capacities.
Congestion avoidance is a way to react to signs of congestion, which may be lost packets, measured packet delay, or network-supported Explicit Congestion Notification (ECN). Different variants of TCP have different procedures and behaviors. In the loss-based algorithm, for example, there is no explicit signaling about congestion. Therefore, an assumption is made that the loss of a packet signals congestion somewhere in the network between the sender and receiver. There are two indications of packet loss: a timeout occurring on an ACK, which triggers slow-start, and the receipt of duplicate ACKs (dupACK), which triggers congestion avoidance. In the delay-based algorithm, congestion avoidance and slow-start are both triggered by monitoring packet delays and reacting to increases in delay in an attempt to avoid network congestion. Congestion avoidance and slow start are two independent algorithms with different objectives. However, when congestion occurs, TCP must slow down its transmission rate of packets into the network and then invoke slow start to get things going again. In practice, they are implemented together.
In the classical loss-based algorithms, congestion avoidance and slow start require that two variables be maintained for each connection: a congestion window, cwnd, of the sender and a slow start threshold, ssthresh. Slow start has cwnd begin at one segment, and be incremented by one segment every time an ACK is received. This opens the window exponentially: send one segment, then two, then four, and so on. Congestion avoidance dictates that cwnd be incremented by only a fraction of a segment (approximately 1/cwnd, with cwnd measured in segments) each time an ACK is received. This is a linear growth of cwnd, compared to slow start's exponential growth. The increase in cwnd should be at most one segment each round-trip time (regardless of how many ACKs are received in that RTT), whereas slow start increments cwnd by the number of ACKs received in a round-trip time. TCP may generate an immediate acknowledgment (a duplicate ACK) when an out-of-order segment is received. This duplicate ACK should not be delayed. The purpose of this duplicate ACK is to let the other end know that a segment was received out of order, and to tell it what sequence number is expected.
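The difference between the two growth regimes can be seen in a small simulation. The sketch below measures cwnd in segments, assumes one ACK per segment and no losses, and uses the common per-ACK increment of 1/cwnd for congestion avoidance; the function names are hypothetical.

```python
from typing import List


def on_ack(cwnd: float, ssthresh: float) -> float:
    """Return the new congestion window (in segments) after one ACK."""
    if cwnd < ssthresh:
        return cwnd + 1.0      # slow start: one segment per ACK
    return cwnd + 1.0 / cwnd   # congestion avoidance: about one segment per RTT


def simulate_rounds(rounds: int, ssthresh: float) -> List[float]:
    """Window size after each round trip, starting from one segment."""
    cwnd = 1.0
    history = []
    for _ in range(rounds):
        # One ACK arrives for every full segment sent in this round trip.
        for _ in range(int(cwnd)):
            cwnd = on_ack(cwnd, ssthresh)
        history.append(cwnd)
    return history
```

With `ssthresh` set to 8, the window doubles each round trip (2, 4, 8) until it reaches the threshold and then grows by slightly less than one segment per round trip.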
Since TCP does not know whether a dupACK is caused by a lost segment or just a reordering of segments, it waits for a small number of dupACKs to be received. It is assumed that if there is just a reordering of the segments, there will be only one or two duplicate ACKs before the reordered segment is processed, which will then generate a new ACK. If three or more duplicate ACKs are received in a row, it is a strong indication that a segment has been lost. TCP then performs a retransmission of what appears to be the missing segment, without waiting for a retransmission timer to expire. This is the fast retransmit algorithm. After fast retransmit sends what appears to be the missing segment, congestion avoidance, but not slow start, is performed. This is the fast recovery algorithm. It is an improvement that allows high throughput under moderate congestion, especially for large windows. The reason for not performing slow start in this case is that the receipt of the duplicate ACKs tells TCP more than just that a packet has been lost. Since the receiver can only generate the duplicate ACK when another segment is received, that segment has left the network and is in the receiver's buffer. That is, there is still data flowing between the two ends, and TCP does not want to reduce the flow abruptly by going into slow start.
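The duplicate ACK rule described above can be sketched as a small counter. The class name `DupAckDetector` is hypothetical, and the sketch covers only the detection step that triggers fast retransmit, not the window adjustments of fast recovery.

```python
class DupAckDetector:
    """Signal fast retransmit when the third duplicate ACK in a row arrives."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.last_ack = None  # most recent acknowledgment number seen
        self.dup_count = 0    # consecutive duplicates of last_ack

    def on_ack(self, ack_no: int) -> bool:
        """Return True exactly when the missing segment should be retransmitted."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            # Fire once, on the ACK that reaches the threshold.
            return self.dup_count == self.threshold
        # A new acknowledgment number resets the count.
        self.last_ack = ack_no
        self.dup_count = 0
        return False
```

One or two duplicates followed by a new ACK (the reordering case) never fire the detector; only the third consecutive duplicate does.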
In summary, TCP's slow-start algorithm attempts to take full advantage of the network capacity. While the flow-control is typically controlled by the receiver-side window, rwnd, the congestion-control is controlled by the sender-side window, cwnd.
Note that these mechanisms are designed to operate between the sender and receiver (end-to-end), assuming that the network plays no role in adjusting or interfering with the TCP behavior. In conclusion, the pace of a TCP sender is controlled by cwnd, RTT, and the pace at which ACKs are received, while the upper bound is always rwnd.
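The relationship between the two windows and the sender's pace can be approximated as below. This is a back-of-the-envelope sketch that assumes a full window is sent once per round trip; the function names are hypothetical.

```python
def effective_window(cwnd_bytes: int, rwnd_bytes: int) -> int:
    """The sender may have at most min(cwnd, rwnd) unacknowledged bytes in flight."""
    return min(cwnd_bytes, rwnd_bytes)


def approx_rate_bps(cwnd_bytes: int, rwnd_bytes: int, rtt_s: float) -> float:
    """Approximate sending rate in bits per second: one effective window per RTT."""
    return 8 * effective_window(cwnd_bytes, rwnd_bytes) / rtt_s
```

For instance, with cwnd at 64000 bytes, rwnd at 32000 bytes, and a 100 ms RTT, the sender is bounded by rwnd and paces at roughly 2.56 Mbit/s. This is why lowering the advertised rwnd or stretching the apparent RTT slows a sender without any change to its TCP stack.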
One of the key observations in TCP networks is a phenomenon called bufferbloat [see Nichols, “Controlling Queue Delay: A modern AQM is just one piece of the solution to bufferbloat,” NETWORKS, May 6, 2012.]. It is latency caused within a TCP network by persistently full buffers or queues. Such queues are called ‘bad queues’. Typically, queues may fill up because of traffic bursts, but they eventually clear up (within a few RTTs, after TCP flow control and congestion control slow down the traffic). Bad queues do not clear up. They remain full, causing all traffic passing through them to slow down significantly. The minimum packet sojourn time (the minimum time, over a period of observation, that a packet spends between entering and leaving the queue) in a normal queue falls to zero after a few RTTs. In a bad queue, however, it remains at a fixed nonzero value. Packet sojourn times become a primary contributor to delay in the network when there are bad queues. One of the goals of this invention is to define a method to detect and remove bad queues from the network and, in doing so, to significantly reduce the congestion on certain flows.
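The distinction between a normal queue and a bad queue can be illustrated with a minimal check on the minimum sojourn time over one observation window. This is a hypothetical sketch, not the detection method of the invention; the function name and the 5 ms default target are assumptions chosen only for illustration.

```python
from typing import Sequence


def is_bad_queue(sojourn_times_s: Sequence[float], target_s: float = 0.005) -> bool:
    """Flag a queue whose minimum sojourn time never drops below the target.

    sojourn_times_s -- per-packet queueing delays observed in one window, seconds
    target_s        -- assumed tolerance for standing delay (hypothetical default)
    """
    if not sojourn_times_s:
        return False  # nothing was queued, so nothing to flag
    # A burst drains: at least one packet sees (almost) no queueing delay.
    # A standing queue does not: even the luckiest packet waits.
    return min(sojourn_times_s) > target_s
```

A window of samples such as 40 ms, 20 ms, 0 ms, 1 ms reflects a burst that drained and is not flagged, while 30 ms, 25 ms, 40 ms indicates a standing queue and is flagged.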
Software defined networking (SDN) is a recent programmable networking paradigm and a strong candidate to become the architecture of the future Internet. Fundamentally, the key concepts of SDN offer the basis for the system and method of this invention. A typical SDN is decoupled into two planes: a data plane comprised of ‘switches’, which perform data forwarding, and a control plane connecting all switches to a ‘controller’, which calculates routing (or flow) tables and sends them to the switches. In this way, the packet forwarding and route calculation tasks are decoupled. The switches perform fast packet forwarding while the controller performs fast calculation of routes. Switches are mainly special-purpose hardware devices designed for packet switching, while the controller is software based and logically centralized. In an SDN, the controller sends forwarding rules to the network switches using a southbound interface such as OpenFlow [see McKeown et al., “OpenFlow: enabling innovation in campus networks,” SIGCOMM Computer Communication Review, April 2008.], generally to specify or modify the path of the data packets, or sometimes to alter the packet header fields.
The SDN controller has global visibility of the network, meaning that it collects real-time data from switches about the network topology, traffic performance, and volume of data flows. Accordingly, it can modify the traffic distribution within the network to optimize the network utilization. The fact that TCP relies solely on end-to-end measurements of packet loss or packet delay as the only sources of feedback from the network means that TCP has a very limited view of the network state, such as the trajectory of available bandwidth, congested links, network topology, and traffic volumes. Thus, our question is: Can we build a system that observes the state of the end-to-end TCP path, and even considers the general dynamics of an overall SDN, and changes TCP's behavior accordingly? The answer is yes. We can simply tune different TCP parameters (cwnd, rwnd, RTT and ACK pace) according to network conditions, using feedback coming from the state of the network. When the SDN controller has visibility of network queue fullness and potential bad queues in the network, it can take proper actions to reduce traffic, relieve bad queues, and eliminate bufferbloat.
According to an aspect of this invention, the controller can be provided with information about which flows are large and potentially more important under congestion. For example, some video streaming flows may be using UDP instead of TCP, which means that, under congestion, packet loss becomes inevitable. This will cause significant quality degradation perceived at the receiver side. If video streaming uses TCP, on the other hand, congestion will cause a drastic slowdown, which results in delay in getting video frames at the receiver side. In order to prevent congestion from impacting such flows, the controller can force flow-control on other flows sharing the same network resources with the flows carrying video streams. When these flows slow down, the bursts will be smoothed and bufferbloat in network switches will be eliminated. The net effect will be reduced congestion specifically on video streams. Since the receivers (hosts) will most likely have large buffers (typically the case, except for mobile hosts), they will not trigger flow-control themselves.
According to an aspect of this invention, when one or more bad queues are detected in the network switches, the network switches will capture ACK messages coming from the receivers and either slow down their pace or modify the ACK header by reducing the rwnd according to an estimated (or artificial) RTT, thereby forcing some of the packet flows to reduce their rate.
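To make the ACK header modification concrete, the following Python sketch lowers the 16-bit window field of a captured TCP header in place and patches the checksum incrementally (in the manner of RFC 1624). It is an illustration under stated assumptions, not the implementation of the invention: the function name `clamp_ack_window` is hypothetical, the byte offsets are those of the standard TCP header, and window scaling negotiated at connection setup is ignored, so `max_window` is expressed in the same unscaled units as the header field.

```python
import struct

WINDOW_OFFSET = 14    # byte offset of the 16-bit window field in the TCP header
CHECKSUM_OFFSET = 16  # byte offset of the 16-bit checksum field


def clamp_ack_window(tcp_header: bytearray, max_window: int) -> int:
    """Reduce the advertised window to at most max_window; return the new value.

    The header is modified in place. The window is never increased, so the
    receiver's own flow-control limit is still respected.
    """
    max_window = max(0, max_window)
    (old_window,) = struct.unpack_from("!H", tcp_header, WINDOW_OFFSET)
    new_window = min(old_window, max_window)
    if new_window != old_window:
        (old_csum,) = struct.unpack_from("!H", tcp_header, CHECKSUM_OFFSET)
        # Incremental ones' complement update: remove the old word, add the new.
        total = (~old_csum & 0xFFFF) + (~old_window & 0xFFFF) + new_window
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
        struct.pack_into("!H", tcp_header, WINDOW_OFFSET, new_window)
        struct.pack_into("!H", tcp_header, CHECKSUM_OFFSET, ~total & 0xFFFF)
    return new_window
```

Because the checksum is updated incrementally, the sketch does not need the IP pseudo-header or the payload; only the two affected 16-bit words are touched.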
It is key to keep the behavior of the TCP stack in the end-user's host unchanged. Even if adding a new feature to the end-user's TCP stack were an option, it would not be feasible, since the number of devices connected to the Internet globally reached 10 billion in 2015. Although a proposal involving a change in host behavior is provided in [see Ghobadi et al., “Rethinking end to end Congestion Control in Software Defined Networks,” Proceedings of the 11th ACM Workshop on Hot Topics in Networks, 2012.], such TCP stack changes are neither practical nor globally implementable.
Embodiments of the present invention are an improvement over prior art systems and methods.