The present invention relates generally to network congestion, and, more particularly, to resolving network congestion.
Data Center Ethernet (DCE) is an emerging industry standard which proposes modifications to existing networks, in an effort to position Ethernet as the preferred convergence fabric for all types of data center traffic. A recent study has found that Ethernet is the convergence fabric, with I/O consolidation in a Data Center as shown in FIG. 1. This consolidation is expected to simplify platform architecture and reduce overall platform costs. More details of proposals for consolidation are described in “Proposal for Traffic Differentiation in Ethernet Networks,” which may be found at http://www.ieee802.org/1/files/public/docs2005/new-wadekar-virtual%20-links-0305.pdf.
Major changes have been proposed for DCE (also referred to as enhanced Ethernet and low latency Ethernet), including the addition of credit based flow control at the link layer, congestion detection and data rate throttling, and the addition of virtual lanes with quality of service differentiation. It is important to note that these functions do not affect Transmission Control Protocol/Internet Protocol (TCP/IP), which exists above the DCE level. It should also be noted that DCE is intended to operate without necessitating the overhead of TCP/IP. This offers a much simpler, low cost approach that does not require offload processing or accelerators.
An existing method for backward congestion control management may be understood with reference to FIG. 2. In the system shown in FIG. 2, a source node 220 transmits data packets to a destination node 230 via a switch 210. The packets are transmitted via a virtual lane. When congestion occurs on the virtual lane for a given port, the switch 210 detects the congestion using a threshold comparison. This may involve measuring the volume of packets accumulated in the buffer of the switch and measuring an arrival rate of packet sequence numbers. Further details of packet sequence numbers are provided in commonly assigned U.S. patent application Ser. Nos. 11/847,965 and 11/426,421, herein incorporated by reference.
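The threshold comparison may be illustrated with a minimal Python sketch. It assumes, for illustration only, that congestion is judged by buffer occupancy alone and that the packet arriving at a congested lane is the one marked. The names `Packet`, `VirtualLaneQueue`, and `threshold` are hypothetical and do not come from the referenced applications, and the arrival-rate measurement is omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int           # packet sequence number
    ecn: bool = False  # explicit congestion notification bit


class VirtualLaneQueue:
    """Hypothetical per-port, per-virtual-lane buffer inside the switch."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # assumed occupancy limit, in packets
        self.buffer = deque()

    def enqueue(self, pkt: Packet) -> Packet:
        self.buffer.append(pkt)
        # Threshold comparison: mark the packet once occupancy exceeds the limit.
        if len(self.buffer) > self.threshold:
            pkt.ecn = True
        return pkt

    def dequeue(self) -> Packet:
        # Forward the oldest buffered packet downstream.
        return self.buffer.popleft()
```

With a threshold of two packets, the third and later packets queued without an intervening dequeue would carry the ECN bit downstream.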
When congestion is detected by the switch 210, the switch sets an explicit congestion notification (ECN) bit in the header of the appropriate data packet being transmitted downstream towards the destination node 230. Upon receipt of a data packet with the ECN bit on, the destination node 230 sets a backward explicit congestion notification (BECN) bit on in the data packet header and transmits the data packet upstream towards the source node 220. When the source node 220 receives the data packet with the BECN bit on, it reduces the injection rate of data packets sent downstream. The source node 220 maintains a table 225 of inter-packet delays used for injection rate control. Each time a data packet is received at the source node 220 with the BECN bit on, the source node reduces the injection rate to the next lower rate in the table, i.e., the source node uses the next longer delay in the table for injection rate control. Recovery is driven by a timer: the table index is decremented whenever the timer expires without receipt of any additional data packets with the BECN bit on, which allows the source data rate to increase by one increment. Eventually, if no more data packets are received with the BECN bit on, the injection rate fully recovers.
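The source-node behavior described above may be sketched as follows. This is an illustrative model only, not the actual implementation of table 225: the class name `SourceRateControl`, the delay values, and the `recovery_timeout` parameter are assumptions. The table is assumed to be ordered from shortest to longest delay, so that a BECN moves the index up and a timer expiry moves it down.

```python
class SourceRateControl:
    """Hypothetical model of a source node's inter-packet delay table."""

    def __init__(self, delays_us, recovery_timeout):
        # Index 0 holds the shortest delay, i.e., the full injection rate.
        self.delays_us = sorted(delays_us)
        self.index = 0
        self.recovery_timeout = recovery_timeout  # assumed timer period
        self.last_change = 0.0

    @property
    def inter_packet_delay_us(self):
        return self.delays_us[self.index]

    def on_becn(self, now):
        # A packet with BECN on: step to the next longer delay (lower rate).
        if self.index < len(self.delays_us) - 1:
            self.index += 1
        self.last_change = now  # restart the recovery timer

    def on_tick(self, now):
        # Timer expired with no new BECN: step back toward the full rate.
        if self.index > 0 and now - self.last_change >= self.recovery_timeout:
            self.index -= 1
            self.last_change = now


rc = SourceRateControl(delays_us=[0, 10, 20, 40], recovery_timeout=1.0)
rc.on_becn(now=0.0)
rc.on_becn(now=1.0)   # two BECNs: delay is now 20 microseconds
rc.on_tick(now=2.0)   # one quiet timer period: delay returns to 10
```

Each BECN costs one table step immediately, while each step of recovery requires a full quiet timer period, so in this model the rate falls faster than it recovers.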
One problem with this approach is that the source data rate is only decreased after data packets including the ECN and the BECN have made almost one full round trip through the network. In order to accommodate the resulting travel time delays, the nodes must be designed with sufficiently large buffers. The buffer size, and associated node cost, can become quite large when it is desirable for the network to operate at extended distances (e.g., tens of kilometers, which is typical for disaster recovery applications). This type of congestion management places a fundamental limit on the maximum distance that can be achieved in a DCE network. Since DCE networks are not necessarily point-to-point, but may involve numerous hops and switch cascades, long cable distances may accumulate even if the source and destination are not geographically far from each other.
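The dependence of buffer size on distance may be estimated with a rough calculation. The sketch below assumes the buffer must absorb one round trip of data at the full line rate, and that signals propagate through optical fiber at roughly 5 microseconds per kilometer. The function name, the 10 Gb/s link rate, and the omission of switching and queuing delays are illustrative assumptions.

```python
def required_buffer_bytes(link_rate_bps, distance_km, propagation_us_per_km=5.0):
    """Estimate the data in flight during one round trip (assumed model)."""
    # Round-trip time in seconds: out and back over the cable length.
    round_trip_s = 2 * distance_km * propagation_us_per_km * 1e-6
    # Bits sent at line rate during one round trip, converted to bytes.
    return link_rate_bps * round_trip_s / 8


# Assumed example: a 10 Gb/s link over 10 km needs on the order of
# 125 kilobytes per congested flow, and the figure grows linearly with distance.
estimate = required_buffer_bytes(link_rate_bps=10e9, distance_km=10)
```

Under these assumptions, extending the same link to 50 km raises the estimate fivefold, which illustrates why the round-trip feedback delay limits the achievable distance.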
Furthermore, the conventional congestion control approach is essentially a long-delay feedback loop. Because it takes so long for the ECN/BECN notifications to reach their destinations, the feedback loop is not able to stay current with the state of the network. In fact, the feedback loop may lag behind other forms of credit-based flow control at the link layer.
For at least these reasons, it is desirable to reduce the response time of a backward congestion control scheme and thereby allow for longer distance links without excessively large buffer credits.