The present invention relates generally to computer networks, and more particularly to traffic congestion management in networks.
Congestion issues are common in networks, and particularly storage networks, due to the large data flows that they must support. In Fiber Channel networks, for example, congestion is typically managed through the use of link-based flow control mechanisms. Since there is no end-to-end flow control, head-of-line blocking of storage traffic is a common, anticipated phenomenon. Because the size of a typical Fiber Channel network is small in comparison to typical IP (Internet Protocol) networks, the impact and consequences of congestion and head-of-line blocking are limited and usually considered of minor significance.
However, with the introduction of iSCSI and iFCP technologies comes the potential to significantly scale the size of storage networks. Rather than the 3-4 switches typical of storage networks in the past, iSCSI and iFCP allow practically unlimited scaling in the size of storage networks. In a large IP storage network consisting of hundreds of switches, a congestion issue has the potential to negatively impact the performance and reliability of a greater number of storage devices.
In addition, the use of IP introduces a greater number of link-level transports available to carry storage data, including, for example, Gigabit Ethernet, SONET, ATM, PPP, and DWDM. With the increase in types of physical transports comes a much wider range of link speeds at which storage data is carried, leading to potential mismatches that may compound the impact of congestion issues. In particular, congestion caused by a relatively slow link such as a T-1 or T-3 link can cause rippling effects on the efficiency and utilization of adjacent gigabit-speed links, even if the low-speed link is rarely utilized.
Head-of-line blocking is an issue for any network technology that exclusively uses link-based flow control mechanisms to manage congestion for session-based network traffic. When triggered, such mechanisms can impact sessions that are neither utilizing the congested link nor contributing to the congestion in any way.
FIG. 1 illustrates a basic example of head-of-line blocking in a network. As is shown, sessions to device C from devices A and D cause congestion in switch 10 when devices A and D attempt to send data at a rate which is higher than device C is able to receive. Assume that the links between the switch and each device shown in FIG. 1 are capable of carrying 1 Gbps (gigabits per second) of data. Assume that devices A and D are each attempting to send 600 Mbps (megabits per second) of data to device C. Switch 10 will be forced to buffer some of the data since it will be receiving 1200 Mbps of data from devices A and D but can only forward data at a rate of 1000 Mbps to device C. Thus, switch 10 will accumulate data until its internal buffering is exhausted, at which point the link-level flow control mechanism between switch 10 and devices A and D will be invoked to slow the combined data rate to 1000 Mbps. Link-level flow control thus prevents the internal buffer of switch 10 from overflowing, thereby preventing loss of data within the network. Assuming the ports are treated fairly, devices A and D will each be limited to a 500 Mbps data rate. However, the link-level flow control mechanisms have no knowledge of which sessions (i.e., those directed to device C) are causing the flow control/congestion problem. Thus, other traffic that is not involved with the congestion, such as traffic from device A to device B, is affected by the link-level flow control.
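The arithmetic of the FIG. 1 example can be sketched in a few lines of Python. This is only an illustrative model, not part of the described system: the function names are hypothetical, the fair split is modeled as a max-min division of the egress capacity, and the effect on unrelated traffic assumes strict FIFO ordering on the throttled port, so that every session on that port slows by the same factor. The 400 Mbps figure for traffic from device A to device B is an assumed value chosen for illustration.

```python
def fair_port_limits(offered_mbps, egress_mbps):
    """Divide the egress capacity max-min fairly among the ingress ports.

    offered_mbps maps an ingress port name to the rate it tries to send
    toward the congested egress port.
    """
    limits = {}
    remaining = float(egress_mbps)
    pending = sorted(offered_mbps.items(), key=lambda item: item[1])
    while pending:
        share = remaining / len(pending)
        port, demand = pending[0]
        if demand <= share:
            # This port fits within its share and is not throttled.
            limits[port] = float(demand)
            remaining -= demand
            pending.pop(0)
        else:
            # Every remaining port wants more than the share: cap them all.
            for port, _ in pending:
                limits[port] = share
            break
    return limits


def fifo_session_rates(sessions_mbps, congested_dest, allowed_mbps):
    """Assumed FIFO model: all sessions on one port slow by the same factor."""
    factor = min(1.0, allowed_mbps / sessions_mbps[congested_dest])
    return {dest: rate * factor for dest, rate in sessions_mbps.items()}


# Devices A and D each offer 600 Mbps toward device C over a 1000 Mbps link.
limits = fair_port_limits({"A": 600, "D": 600}, 1000)

# Assumed: device A also offers 400 Mbps to the uncongested device B.
rates_from_a = fifo_session_rates({"C": 600, "B": 400}, "C", limits["A"])
print(limits, rates_from_a)
```

Under these assumptions, the session from device A to device B is slowed in the same proportion as the session to device C, even though device B and its link are idle.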
All Fiber Channel fabrics rely exclusively on the Fiber Channel link-level buffer-to-buffer credit mechanism, and are thus susceptible to head-of-line blocking issues. Until recently, Fiber Channel links were exclusively 1.0625 Gbps in throughput, and the uniformity in high-speed link throughput limited the occurrence of head-of-line blocking to those situations involving multiple session streams.
Internet Protocol (IP) can be used to internetwork many link-level networking protocols, each characterized by different link speeds. For example, IP allows Ethernet networks to be internetworked with other protocols such as ATM, Token Ring, SONET, PPP, etc. IP is “link-neutral”, meaning it doesn't care what link technology is used. Due to the heterogeneity of IP transports, an end-to-end flow control mechanism such as Transmission Control Protocol (TCP) is recommended, and a heavy reliance on link-level flow control is recognized as having unintended side-effects.
The introduction of IP-based transports for connecting Fiber Channel devices or interconnecting Fiber Channel networks introduces serious congestion management issues. For example, since Class 3 Fiber Channel does not have an end-to-end flow control mechanism, it must rely on link-level flow control to manage congestion and reduce packet loss. Unfortunately, this potentially raises serious head-of-line blocking issues when used with IP, since many link-level technologies used for IP are relatively slow in their link throughput compared to native Fiber Channel. Unless an end-to-end flow control mechanism is introduced, a single storage session can result in serious head-of-line blocking effects that may affect traffic in the local fabric.
FIG. 2 illustrates the head-of-line blocking phenomenon as a result of a slow WAN link. Storage traffic simply backs up when it must enter a slow-speed WAN link. For example, the introduction of slow-speed IP links, such as T1 (1.544 Mbps), can have rippling effects on congestion. As shown in FIG. 2, for example, a storage session from Device A to Device D is encapsulated in IP datagrams for transmission over a slower WAN link 20. Switch #2 receives data at a faster rate than it can send over WAN link 20, and thus initiates flow control to the upstream Switch #1. Doing so, however, affects other non-related traffic that flows across the inter-switch link (ISL) between Switch #1 and Switch #2, such as traffic from Device B to Device C, which competes with the WAN session traffic for available bandwidth on the ISL. Thus, the triggered link-level flow control impacts not only traffic destined for WAN link 20, but also the local high-speed traffic from Device B to Device C. As a result, traffic from Device B to Device C will most likely have a throughput similar to that achieved over the slower WAN link 20.
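The FIG. 2 behavior can be reproduced with a small discrete-time simulation. The sketch below is a simplified model under stated assumptions rather than a description of any actual switch: Switch #2 is assumed to hold a single FIFO buffer on its ISL port, Switch #1 is assumed to interleave WAN-bound and local frames one for one, one tick is the time to move one frame over a gigabit link, and the WAN link accepts one frame every wan_period ticks. The function name and all parameters are hypothetical.

```python
from collections import deque


def simulate_isl(ticks, wan_period, buffer_frames):
    """Count frames delivered by Switch #2 under link-level flow control."""
    queue = deque()            # FIFO buffer on the ISL port of Switch #2
    delivered = {"wan": 0, "local": 0}
    next_kind = "wan"          # Switch #1 alternates WAN-bound and local frames
    wan_ready_at = 0           # first tick at which the WAN link is free again

    for t in range(ticks):
        # Egress is strictly FIFO: only the head of the queue may leave.
        if queue:
            if queue[0] == "local":
                queue.popleft()
                delivered["local"] += 1
            elif t >= wan_ready_at:
                queue.popleft()
                delivered["wan"] += 1
                wan_ready_at = t + wan_period
            # Otherwise the head is a WAN frame waiting for the slow link,
            # and every frame queued behind it is blocked as well.

        # Ingress: Switch #1 may send only while buffer space (credit) remains.
        if len(queue) < buffer_frames:
            queue.append(next_kind)
            next_kind = "local" if next_kind == "wan" else "wan"

    return delivered


# A gigabit link is roughly 648 times faster than a 1.544 Mbps T1 link.
result = simulate_isl(ticks=64800, wan_period=648, buffer_frames=16)
print(result)
```

In this model the local frames leave Switch #2 at essentially the same rate as the WAN frames, regardless of the buffer size, because each local frame spends most of its time queued behind a WAN frame that is waiting for the slow link.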
Congestion caused by head-of-line blocking may also result when Storage Networks using link level flow control are connected using high speed links such as Gigabit Ethernet or 10 Gigabit Ethernet when the protocol used on the high speed links is TCP/IP. TCP (Transmission Control Protocol) includes congestion control mechanisms as part of the protocol which dynamically change the rate at which data may be transmitted. Therefore, a high speed link connected to an IP network using TCP may operate at a relatively low speed depending on the characteristics of the IP network. The data rates which can be transmitted can vary widely from the full link bandwidth (e.g. 1 Gbps or 10 Gbps) down to a few Kbps (kilobits per second).
FIG. 13 shows a SAN 300 which uses an IP network 310 to interconnect Local SAN A 360, Local SAN B 365 and an initiator system or device 350 which uses iSCSI as a storage protocol. The various devices in each of the local SANs are interconnected with a link level flow control based network such as Fiber Channel. The protocols used to interconnect the local SANs to the IP network 310 may be any TCP/IP based storage protocol such as, for example, iFCP, FCIP and iSCSI. The same congestion problems which can occur when 2 local SAN networks are interconnected with a slow speed WAN link can also occur in the SAN 300 shown in FIG. 13 because the high speed links connecting switches B and C to the IP network 310 may have only a fraction of their bandwidth utilized. The usable bandwidth on the high speed links may be limited by the TCP/IP protocol which reduces the bandwidth used when it detects frames being discarded in the IP network. The used bandwidth will gradually be increased until frames are once again dropped. However, the recovery from dropped frames by TCP often results in no data being transmitted for periods of 1 or more seconds.
The introduction of data links that have a high latency, such as IP-based WAN links, can result in a significant degradation in write performance. Read performance can also be negatively impacted but typically to a lesser extent than write performance. The drop in performance is typically due to handshaking within the protocol used to carry the SCSI commands. FIG. 9 shows an example of a SCSI read command using FCP (Fiber Channel Protocol) in a low latency network. In this example, an initiator 335 issues a read command (FCP_CMD) to a target device 345 requesting that the target return a specific group of data. The target returns the requested data in FCP_DATA packets on the network followed by the command status (e.g., in an FCP_RSP frame). In a low latency network, the time required for the read command to complete is dominated by the time required by the target to process the command, retrieve the requested data from memory media (e.g., disk drives, magnetic tape, etc.) and transmit the data and status in packets to the initiator. The addition of latency in the network increases the time required to complete a read command by the total round trip time (RT) of the network (a network with a latency of 5 ms in each direction has an RT of 10 ms). FIG. 10 shows how the read command is affected by network latency in a high latency network. The time required for the read command to complete is increased by RT. For example, if a read command would normally complete in 10 ms (milliseconds) in a network with no latency, the same command would require 60 ms to complete in a network with an RT of 50 ms.
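The read-command timing described above reduces to a one-line calculation. The following sketch restates it in Python using the figures from the text; the function names are hypothetical and the model ignores everything except the base completion time and the round trip time.

```python
def round_trip_ms(one_way_latency_ms):
    """Round trip time (RT) of a network with equal latency in each direction."""
    return 2 * one_way_latency_ms


def read_completion_ms(base_ms, rt_ms):
    """A read pays one round trip: the command out, the data and status back."""
    return base_ms + rt_ms


print(round_trip_ms(5))            # 10
print(read_completion_ms(10, 50))  # 60
```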
FIG. 11 shows an example of a SCSI write command performed using FCP in a low latency network. The initiator 335 issues the write command request (FCP_CMD). The target 345 receives the write command and returns an FCP_XFER_RDY frame when it is ready to accept the data for the write command. The target indicates in the FCP_XFER_RDY frame the amount of the command write data it is requesting from the initiator. The target may request any amount of the data. The initiator sends the requested data to the target in FCP_DATA frames. When the target receives all of the requested data, it either requests additional data by issuing another FCP_XFER_RDY frame, if all of the data for the command has not yet been received, or returns the SCSI status in an FCP_RSP frame, completing the SCSI command. For example, if the initiator issued a 256 KB (kilobyte) write command, the target could return an FCP_XFER_RDY frame requesting 64 KB of the data. When this data was transferred, the target could request another 64 KB, then another 64 KB until all of the data was transferred. When all of the data has been transferred, the target would reply with an FCP_RSP frame indicating the status for the SCSI command. Alternatively, the target could have issued a single request for 256 KB of data or for any combination of requests which summed to 256 KB. Note that the target cannot issue another request for data until it has received the data from an earlier data request.
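The write handshake described above can be traced with a short Python sketch. It is a simplified illustration only: the function name is hypothetical, the target is assumed to request a fixed burst size each time, and each burst of FCP_DATA frames is represented by a single entry even though it would be carried in many frames on the wire.

```python
def write_exchange(total_bytes, burst_bytes):
    """List the (sender, frame type, byte count) steps of one write command."""
    steps = [("initiator", "FCP_CMD", total_bytes)]
    remaining = total_bytes
    while remaining > 0:
        request = min(burst_bytes, remaining)
        # The target asks for the next burst only after the previous one
        # has been fully received.
        steps.append(("target", "FCP_XFER_RDY", request))
        steps.append(("initiator", "FCP_DATA", request))
        remaining -= request
    # With all data received, the target returns the SCSI status.
    steps.append(("target", "FCP_RSP", 0))
    return steps


KB = 1024
for sender, frame, size in write_exchange(256 * KB, 64 * KB):
    print(sender, frame, size)
```

For the 256 KB write requested in 64 KB bursts, the trace contains four FCP_XFER_RDY entries. Calling write_exchange(256 * KB, 256 * KB) instead models the single-request alternative and contains one.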
FIG. 12 shows the effect of performing write commands over a network with high latency. The network latency has a greater effect on the time required to complete write commands than for read commands due to the additional handshakes between the initiator 335 and target 345. The write command completion time will be delayed an additional N*RT, where N = (the number of FCP_XFER_RDY frames issued by the target) + 1. For example, assume a write command would complete in 10 ms in a network with no latency. The same command on a network with a 50 ms RT would require 110 ms to complete if the target issues a single FCP_XFER_RDY frame. The required time could be much higher if the target issues multiple FCP_XFER_RDY frames. For example, if the target issued 4 FCP_XFER_RDY frames, the completion time would increase to 260 ms.
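The N*RT relationship can be checked against the figures in the text with a brief Python sketch. The function name is hypothetical, and the model assumes that the only cost of latency is one round trip for the command and its final status plus one round trip for each FCP_XFER_RDY handshake.

```python
def write_completion_ms(base_ms, rt_ms, xfer_rdy_count):
    """Completion time of a write: base time plus N round trips.

    N is the number of FCP_XFER_RDY frames issued by the target, plus one.
    """
    n = xfer_rdy_count + 1
    return base_ms + n * rt_ms


print(write_completion_ms(10, 50, 1))  # 110
print(write_completion_ms(10, 50, 4))  # 260
```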
It is therefore desirable to provide congestion management systems, methods and software that avoid or significantly reduce the effects of head-of-line blocking and network latency. Such technologies should allow for the full and efficient utilization of slow-speed and/or high latency links within a network, e.g., storage area network (SAN), without impacting the performance of upstream high-speed links.