1. Field of the Invention
The present invention generally relates to the field of network switches. More particularly, the present invention relates to a system and method that allows a network switch coupled to an arbitrated loop such as a Fibre Channel Arbitrated Loop (FC-AL) to transmit queued packets to a device on the arbitrated loop when opened by the device, and thus to utilize the arbitrated loop in full-duplex mode when possible.
2. Description of the Related Art
In enterprise computing environments, it is desirable and beneficial to have multiple servers able to directly access multiple storage devices to support high-bandwidth data transfers, system expansion, modularity, configuration flexibility, and optimization of resources. In conventional computing environments, such access is typically provided via file system level Local Area Network (LAN) connections, which operate at a fraction of the speed of direct storage connections. As such, access to storage systems is highly susceptible to bottlenecks.
Storage Area Networks (SANs) have been proposed as one method of solving this storage access bottleneck problem. By applying the networking paradigm to storage devices, SANs enable increased connectivity and bandwidth, sharing of resources, and configuration flexibility. SANs are typically implemented using Fibre Channel devices and Fibre Channel switches. Fibre Channel is a serial data transfer architecture designed for mass storage devices and other peripheral devices that require very high bandwidth.
Fibre Channel defines three topologies, namely Point-to-Point, Arbitrated Loop, and Fabric. Fibre Channel Arbitrated Loop (FC-AL) has become the dominant Fibre Channel topology. FC-AL is capable of connecting up to 127 ports in a single network without the need for a fabric switch (also referred to herein as a network switch). However, a network switch may be installed at a port of an FC-AL (typically port 0) to interface the FC-AL to other FC-ALs, fabrics, etc. in a SAN. In an FC-AL, unlike the other two topologies, the media is shared among the devices, limiting each device's access. Unlike token-passing schemes, there is no limit on how long a device may retain control of an FC-AL. This demonstrates the “channel” aspect of Fibre Channel. There is, however, an optional Access Fairness Algorithm, which prohibits a device from arbitrating again until all other devices have had a chance to arbitrate.
Like most ring topologies, devices in an FC-AL may be connected to a central hub or concentrator. The cabling is easier to deal with, and the hub can usually determine when to insert or de-insert a device. Thus, a “bad” device or broken fiber (e.g. fiber optic cable) won't keep the entire network down.
Before an FC-AL is usable, it must be initialized so that each port obtains an Arbitrated Loop Physical Address (AL_PA), a dynamically assigned value by which the ports communicate. The AL_PA is a 1-byte value used in the Arbitrated Loop topology to identify Loop Ports (L_Ports). L_Port is a generic term for any Fibre Channel port that supports the Arbitrated Loop topology. During initialization, a Loop master is selected that will control the process of AL_PA selection. If a network switch is present on the FC-AL, it will become Loop master; otherwise, the port with the numerically lowest Port Name will be selected as Loop master. Ports arbitrate for access to the Loop based on their AL_PA. Ports with lower AL_PAs have higher priority than those with higher AL_PAs.
In an FC-AL, when a device is ready to transmit data, it first must arbitrate and gain control of the Loop. It does this by transmitting an Arbitrate primitive signal, which includes the Arbitrated Loop Physical Address (AL_PA) of the device. Once a device receives its own Arbitrate primitive signal, it has gained control of the Loop and can now communicate with other devices by transmitting an Open primitive signal to a destination device. Once this happens, there exists a point-to-point communications channel between the two devices. All other devices in between the two devices simply repeat (e.g. retransmit) the data.
Fibre Channel flow control is based on a credit methodology in which a source port must have a positive credit before transmitting a packet. The scheme works as follows when connected to an arbitrated loop. An arbitrated loop port receives a BB_CREDIT value from, and provides a BB_CREDIT value to, each device that it logs in to. This BB_CREDIT value represents the number of buffers that the port will have available when a new circuit is established. Upon establishing a new circuit, a port is allowed to transmit the number of data frames defined by BB_CREDIT without receiving R_RDY primitives. However, the port must then wait until the number of R_RDY primitives received equals the number of data frames transmitted. The port may then transmit a data frame only if the port has received more R_RDY primitives than it has transmitted data frames.
Note that a value of 0 is allowed for BB_CREDIT, which indicates that the port cannot transmit more data frames than R_RDY primitives received. When a port supplies a positive value of BB_CREDIT, the port is guaranteeing that BB_CREDIT buffers will be available when the circuit is established. For a nonzero value, this implies that the circuit will not be closed unless there are BB_CREDIT buffers available, ensuring that if another circuit is established immediately, the port will not be short of buffers.
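The credit rule described above can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated in this section, tracked per circuit; the class name CreditTracker and its method names are hypothetical and are not taken from any Fibre Channel implementation.

```python
class CreditTracker:
    """Per-circuit credit bookkeeping following the rule stated in the text.

    Hypothetical helper for illustration only.
    """

    def __init__(self, bb_credit):
        self.bb_credit = bb_credit      # login BB_CREDIT advertised by the peer
        self.frames_sent = 0            # data frames sent on this circuit
        self.r_rdy_received = 0         # R_RDY primitives received on this circuit

    def can_send(self):
        # Initial burst: up to BB_CREDIT frames may go out without any R_RDY.
        if self.frames_sent < self.bb_credit:
            return True
        # After the burst: send only while R_RDYs received exceed frames sent.
        return self.r_rdy_received > self.frames_sent

    def send_frame(self):
        if not self.can_send():
            raise RuntimeError("no credit available")
        self.frames_sent += 1

    def receive_r_rdy(self):
        self.r_rdy_received += 1
```

With BB_CREDIT of 0, can_send() stays false until the first R_RDY arrives, which matches the zero-credit case noted above.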
FIG. 1A is a block diagram illustrating an exemplary topology of a Fibre Channel Arbitrated Loop (FC-AL) 702 coupled to a network 700 (e.g. SAN) via network switch 710. The connection to network 700 is typically to an FC point-to-point, FC fabric, or another FC-AL, which in turn may link to other FC topologies or alternatively may be bridged to other data transports (e.g. Ethernet, SCSI) that together make up the SAN. Six devices, including network switch 710 and devices 712A–712E, are shown in the FC-AL 702. Data flows in only one direction on the FC-AL 702, as illustrated by the direction of the arrows connecting the devices in the loop. Data sent from one device to another device on the FC-AL 702 must pass through any and all devices between the two devices in the downstream direction. For example, if device 712C needs to send data to device 712E, the data is first passed to device 712D, which retransmits the data to device 712E. Also note that the network switch may have other connections that are not shown.
FIG. 1B is a flow diagram illustrating packet flow in an FC-AL 702, and shows a hub 714 used to interconnect the devices at port 0 through port 5. In this example, a network switch at port 0 couples the FC-AL 702 to the network 700. Note that data on the FC-AL 702 as illustrated in FIG. 1B may flow in only one direction on the FC-AL 702, as illustrated by the direction of the arrows connecting the devices to the hub 714. Data sent from one port to a second port on the FC-AL 702 must pass through any and all ports between the two ports in the downstream direction. For example, if port 0 needs to send data to port 3, it first arbitrates to gain control of the loop, then opens the device at port 3, and then transmits the data (through the hub 714) to port 1. The data is then retransmitted through the hub to port 2, and then finally to port 3, which receives the data (without retransmitting).
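The one-directional flow described above can be expressed as a short Python sketch. It assumes, for illustration only, that ports are numbered in downstream order (0 through 5, as in the FIG. 1B example); the function name loop_path is hypothetical.

```python
def loop_path(src, dst, num_ports=6):
    """Return the ports a frame visits, in order, travelling downstream.

    Assumes ports are numbered 0..num_ports-1 in downstream order.
    """
    if not (0 <= src < num_ports and 0 <= dst < num_ports):
        raise ValueError("port out of range")
    path = [src]
    port = src
    while port != dst:
        port = (port + 1) % num_ports   # next port downstream, wrapping around
        path.append(port)
    return path


print(loop_path(0, 3))   # [0, 1, 2, 3]
print(loop_path(4, 1))   # [4, 5, 0, 1]
```

Every port strictly between the first and last entries of the returned list only repeats the data.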
Referring again to FIG. 1A, only one device can gain control of and hold the FC-AL 702 at a time. A device first arbitrates for the FC-AL 702. When the device gains control of the loop, it opens a second device. The first device may then send frames of data (also referred to as packets) to the second device. In some instances, if the second device has packets for the first device, it may send the packets to the first device via FC-AL 702 after being opened by the first device and while receiving packets from the first device. When two devices are transmitting to each other simultaneously, the FC-AL is operating in full-duplex mode. When a first device is transmitting to a second device, and the second device is not transmitting, the FC-AL is operating in half-duplex mode. Obviously, for maximizing bandwidth utilization of the fibre, it is advantageous for the FC-AL 702 to operate in full-duplex mode as much as possible.
Network switch 710 serves as an interface between FC-AL 702 and network 700. Network switch 710 may receive FC packets from a device 712 on the FC-AL 702 that are destined for one or more devices on network 700, and then may retransmit the packets on network 700 to the one or more devices. Network switch 710 may also receive packets from a device on network 700 and then route the packets to the destination device 712 of the packets on the FC-AL 702.
In connecting to devices on the FC-AL 702, network switch 710 behaves similarly to the other devices 712 on the FC-AL. Switch 710 must arbitrate for the loop and, when it gains control, open a device 712 to transmit to. Likewise, a device 712 may open network switch 710 after gaining control of the loop. Since network switch 710 may have to wait to gain control of the FC-AL 702 to transmit packets to a device 712, or conversely may have to wait to transmit packets from a device 712 on FC-AL 702 to a device on network 700, network switch 710 typically includes buffer memory for storing packets waiting to be transmitted.
FIG. 2 is a data flow diagram illustrating a prior art network switch 710 opening a device 712N on an FC-AL. At 730, network switch 710 first arbitrates for and gains control of the FC-AL, and then opens device 712N to begin transmitting incoming packet(s) 720 to the device. Packets 720 may have been previously received by network switch 710 from a source device on network 700. When network switch 710 opens device 712N, the device may have data to send to switch 710. Device 712N may transmit the data to switch 710 in outgoing packet(s) 722 while receiving the incoming packet(s) 720 from switch 710. Thus, the FC-AL may be utilized in full-duplex mode when network switch 710 opens a device 712.
FIG. 3 is a data flow diagram illustrating a prior art network switch being opened by a device. At 732, device 712N on an FC-AL first arbitrates for and gains control of the FC-AL, and then opens the network switch 710 to begin transmitting outgoing packet(s) 722 to network switch 710.
Network switch 710 may have data queued for device 712N when opened by the device. However, when opened by device 712N, network switch 710 is not able to determine whether it has queued data for the device 712N, or to transmit the queued data to the device 712N concurrently with receiving outgoing packets 722 from the device. Prior art network switches, when operating in full-duplex mode, may be blocked from sending data because data for another device on the loop is “blocking” access, thus limiting the efficiency of bandwidth use on the FC-AL in full-duplex mode.
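The blocking behavior described above can be illustrated with a minimal Python sketch that contrasts a single output FIFO with per-device queues. The class names SingleFifo and PerDeviceQueues and the method frames_for_opener are hypothetical and are used only for illustration; they do not represent any particular switch design.

```python
from collections import deque


class SingleFifo:
    """One output queue: only the head of the queue can be sent."""

    def __init__(self):
        self.q = deque()

    def enqueue(self, dest, frame):
        self.q.append((dest, frame))

    def frames_for_opener(self, opener):
        # A frame for another device at the head blocks everything behind it.
        out = []
        while self.q and self.q[0][0] == opener:
            out.append(self.q.popleft()[1])
        return out


class PerDeviceQueues:
    """One queue per destination: the opener's frames are found directly."""

    def __init__(self):
        self.queues = {}

    def enqueue(self, dest, frame):
        self.queues.setdefault(dest, deque()).append(frame)

    def frames_for_opener(self, opener):
        q = self.queues.get(opener)
        if not q:
            return []
        out = list(q)
        q.clear()
        return out
```

If a frame for device B is queued ahead of a frame for device A and device A opens the switch, the single FIFO yields nothing for A, while the per-device arrangement yields A's frame immediately.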
Frame Ordering and Network Switch Performance on an Arbitrated Loop
An arbitrated loop may generally be defined as a set of devices that are connected in a ring topology as in the example FC-AL shown in FIG. 1A. The arbitrated loop protocol requires all devices on the loop to arbitrate for control of the loop. A device will arbitrate for control of the loop when it has data frames it wishes to send to another device on the loop. The device, when it wins arbitration, will then establish a connection to the device to which it wishes to transfer data. After all desired data frames are transferred, the loop is “closed”. The device that controls the loop may then give up the loop for arbitration or open another device to transfer data frames. The following summarizes the arbitrated loop process:
a) Arbitrate for control of the loop.
b) Wait to win arbitration.
c) Open a connection with the destination device when arbitration is won.
d) Exchange data frames with the destination device.
e) Close the connection.
f) Release the loop for arbitration OR repeat steps c–e
The loop is utilized for transferring data only during step d). The remaining steps represent protocol overhead that tends to reduce the overall usable bandwidth on the arbitrated loop.
Prior art network switches typically have a single queue for holding frames to be output to the arbitrated loop. The order of frames on the queue determines the order in which frames are output to the arbitrated loop and hence the ordering of arbitration-open-close cycles which need to be performed. In some conditions, loop utilization may be less than optimal. For example, if there are frames in the queue for two or more devices and the frames from the devices are interleaved, the overhead for opening and closing devices may reduce the utilization of the loop bandwidth by an amount that may depend on average frame sizes and on the order of the frames on the queue.
For example, consider the case where the frames are ordered as shown in FIG. 4A. In this figure, the letters A and B represent frames on the queue for devices A and B on the loop. The ordering of frames in the queue of FIG. 4A forces the switch to transfer only one frame per each establishment of a connection. Processing of the frames may be as follows (assuming the switch holds the loop for an extended period of time before allowing arbitration to occur):
a) Arbitrate
b) Open Device A
c) Transfer Data Frame
d) Close Device A
e) Open Device B
f) Transfer Data Frame
g) Close Device B
h) Repeat b-d
i) Repeat e-g
j) Continue until the queue is empty or the maximum time the loop can be held is reached.
The loop utilization in this example may thus be less than optimal. The overhead for opening and closing devices may reduce the utilization of the loop bandwidth, for example, by 10–30% depending on average frame sizes.
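The effect of frame ordering on overhead can be made concrete with a small Python sketch. The alternating ordering "ABABABAB" is assumed here to stand in for an interleaved queue, and the time values are arbitrary illustrative units rather than measured figures; the function names are hypothetical.

```python
def count_opens(queue):
    """Number of open/close cycles needed to drain a single FIFO in order."""
    opens = 0
    current = None
    for dest in queue:
        if dest != current:     # destination changes: close and open again
            opens += 1
            current = dest
    return opens


def loop_utilization(queue, frame_time, open_close_time):
    """Fraction of loop time spent moving data frames."""
    if not queue:
        return 0.0
    data_time = len(queue) * frame_time
    overhead = count_opens(queue) * open_close_time
    return data_time / (data_time + overhead)


print(count_opens("ABABABAB"))   # 8
print(count_opens("AAAABBBB"))   # 2
```

For the same eight frames, the interleaved ordering needs four times as many open/close cycles, so loop_utilization returns a lower value for it whenever open_close_time is positive.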
FIG. 4B illustrates a frame ordering that is closer to optimal than that of FIG. 4A and may reduce loop overhead, since the switch may send multiple frames each time a device is opened. However, the frame transmit scheduling logic used in network switches and other devices that carry IP (Internet Protocol) traffic is typically designed to generate traffic (e.g. packet or frame flow) with low jitter. As used herein, the term “jitter” relates to the transmission of frames from a source to a destination. “Low jitter” includes the notion of frames being transmitted and received in a steady flow, and implies that the temporal spacing between the frames at the receiver remains as constant as possible. Thus, prior art network switches typically use a low-jitter scheduling algorithm that attempts to interleave traffic from different sources as much as possible. This interleaving may result in the frames typically arriving at the network switch in a less than optimal ordering (e.g. more like FIG. 4A than FIG. 4B). Therefore, it may be desirable to implement a scheduling algorithm for a network switch specifically for interfacing an arbitrated loop such as an FC-AL with an IP network that carries low-jitter traffic.
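One simple way to turn an interleaved arrival order into a batched one is a stable regrouping by destination, sketched below in Python. This is offered only as an illustration of the ordering difference discussed above, not as the scheduling algorithm of any particular switch; the function name group_by_destination and the (destination, payload) tuple layout are assumptions.

```python
def group_by_destination(frames):
    """Batch frames per destination while keeping each destination's order.

    frames: iterable of (destination, payload) tuples in arrival order.
    Destinations are served in order of first appearance.
    """
    groups = {}
    for dest, payload in frames:
        groups.setdefault(dest, []).append((dest, payload))
    return [frame for batch in groups.values() for frame in batch]


arrivals = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]
print(group_by_destination(arrivals))
# [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
```

Such batching trades jitter for fewer open/close cycles: frames for device B wait until all queued frames for device A have been sent.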
Transfer Ready (XFER_RDY) Delay and Write Performance
In a Storage Area Network (SAN), a host bus adapter, e.g. a Fibre Channel host bus adapter, may be connected to a network switch while performing a mixture of read/write transfers to multiple disk drives. Under some conditions, read performance is typically as expected, while write performance may be considerably lower than expected. When only write operations are performed, write performance is typically as expected. The reduced write performance during combined read and write operations may be the result of a large buffer within the network switch that causes the delivery of transfer ready (XFER_RDY) frames to be delayed when both write and read operations are being performed.
To understand the implication of delaying the delivery of XFER_RDY frames, it is necessary to understand the protocols for read and write operations by devices using FCP (Fibre Channel Protocol for SCSI). FCP uses several frame sequences to execute a SCSI command between the initiator of a command (the initiator) and the target of the command (the target). An example of an initiator is a host bus adapter such as a Fibre Channel host bus adapter and an example of a target is a storage device such as a disk drive. The initiator and target communicate through the use of information units (IUs), which are transferred using one or more data frames. Note that an IU may consist of multiple data frames but may be logically considered one information unit. The IUs for FCP may include, but are not limited to, the following:
FCP_CMND: The FCP_CMND IU is sent from an initiator to a target and contains either a SCSI command or a task management request to be executed by the target.
FCP_XFER_RDY: The FCP_XFER_RDY IU is sent from a target to an initiator for write operations and indicates that the target is ready to receive part or all of the data for a write command.
FCP_DATA: The FCP_DATA IU is sent from an initiator to a target for write commands and from targets to initiators for read commands. An FCP_DATA IU consists only of the actual SCSI command data.
FCP_RSP: The FCP_RSP IU is sent from a target to an initiator and contains the SCSI status, Sense information (if any), protocol status and completion status of task management functions.
FCP_CONF: The FCP_CONF IU is sent from an initiator to a target and provides confirmation that the initiator received the FCP_RSP IU. This IU is optional.
FIG. 5 shows an example of the processing of an FCP Read command. The initiator 200 sends the read command in an FCP_CMND IU to the target 210. When the target 210 has the data available, it returns the data to the initiator 200 in one or more FCP_DATA IUs. When all of the data has been transmitted, the target 210 sends an FCP_RSP IU with the command status information. The initiator 200 may optionally send an FCP_CONF IU to the target 210 indicating that the FCP_RSP IU was received. When an initiator 200 issues the read command, it must be prepared to receive all of the data indicated by the command (i.e. buffer(s) must be available for the returned data).
FIG. 6 shows an example of an FCP write command. The initiator 200 sends the write command to the target 210 in an FCP_CMND IU. The target 210 responds with an FCP_XFER_RDY IU indicating the data it is ready to accept. The initiator 200 then sends the data to the target 210 in a single FCP_DATA IU. After all of the data requested by the target 210 has been transferred, the target 210 will either send another FCP_XFER_RDY IU requesting additional data or send an FCP_RSP IU containing the command status information. The initiator 200 may optionally send an FCP_CONF IU to the target 210 indicating that the FCP_RSP IU was received. (Note that the FCP_DATA IU may consist of multiple data frames but is logically considered one information unit.)
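The read and write exchanges described above can be written out as ordered lists of (sender, IU) pairs. The following Python sketch is illustrative only; the function names and the parameters data_ius, xfer_rdy_rounds and confirm are hypothetical.

```python
def fcp_read_sequence(data_ius=1, confirm=False):
    """IU order for a read: command, data from the target, then response."""
    seq = [("initiator", "FCP_CMND")]
    seq += [("target", "FCP_DATA")] * data_ius
    seq.append(("target", "FCP_RSP"))
    if confirm:
        seq.append(("initiator", "FCP_CONF"))   # optional confirmation
    return seq


def fcp_write_sequence(xfer_rdy_rounds=1, confirm=False):
    """IU order for a write: each FCP_DATA IU waits for an FCP_XFER_RDY."""
    seq = [("initiator", "FCP_CMND")]
    for _ in range(xfer_rdy_rounds):
        seq.append(("target", "FCP_XFER_RDY"))  # target says what it can accept
        seq.append(("initiator", "FCP_DATA"))   # initiator sends that data
    seq.append(("target", "FCP_RSP"))
    if confirm:
        seq.append(("initiator", "FCP_CONF"))   # optional confirmation
    return seq
```

The write sequence makes the dependency visible: no FCP_DATA IU leaves the initiator until the matching FCP_XFER_RDY IU has arrived, whereas a read has no such gating step.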
Preferably, when an initiator 200 issues a write command, the FCP_DATA IU can be returned as soon as the initiator 200 receives the FCP_XFER_RDY IU from the target 210. If an initiator 200 is performing overlapping write commands (i.e. there are multiple outstanding write commands), it can maintain a constant flow of FCP_DATA IU frames as long as it has received at least one XFER_RDY IU for which it has not yet transmitted the data. However, if the FCP_XFER_RDY IU is delayed, the initiator 200 will not maintain a constant flow of output data when it is waiting for an XFER_RDY IU to transmit data.
When only write operations are performed, the XFER_RDY IUs see little delay because only FCP_RSP and FCP_XFER_RDY IUs are being sent from the targets to the initiator. The FCP_RSP IUs have little effect on the FCP_XFER_RDY latency because only one FCP_RSP IU is received per SCSI command and the FCP_RSP IUs are small. However, when read and write operations are performed simultaneously, the initiator 200 will also be receiving FCP_DATA IUs from the target(s) 210. For typical SCSI commands (e.g. 8 Kbyte to 64 Kbyte commands), there can be many FCP_DATA frames waiting in network switch queues to be forwarded to the initiator 200. As a result, the XFER_RDY IU may be significantly delayed due to queuing of data frames by network switches, and write performance can be degraded significantly when performing a combination of read and write commands. In larger networks, write performance may also be degraded when XFER_RDY IUs are delayed due to other traffic; the write performance degradation therefore may not be limited to instances where an initiator 200 is performing both read and write operations.
FIG. 7 illustrates how XFER_RDY IUs can be delayed due to network switch queuing. The amount of switch queuing 300 may affect the latency of XFER_RDY IUs being returned to an initiator 200. Network switches with small amounts of buffer memory (i.e. small queues 300) may experience fewer problems than network switches with larger amounts of buffer memory (i.e. larger queues 300) because the XFER_RDY IUs may be delayed less within a switch with a small queue 300. Prior art Fibre Channel switches typically have small amounts of buffer memory and therefore this problem may not appear in these switches. Network switches that support multiple network protocols may be more susceptible because they contain more buffering to support the other protocols. For example, a network switch that supports Fibre Channel and Ethernet may have buffering for 512 frames per port while prior art Fibre Channel-only switches may have buffering for only 16 to 32 frames.
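The relationship between queue depth and XFER_RDY latency can be estimated with simple arithmetic, sketched below in Python. The figures used are assumptions for illustration and do not come from this description: every queued frame is taken to be a full-size frame of 2148 bytes, and the link is taken to deliver a nominal 100 Mbytes per second. The function name is hypothetical.

```python
FRAME_BYTES = 2148               # assumed full-size frame, for illustration
LINK_BYTES_PER_SEC = 100e6       # assumed nominal link throughput


def worst_case_xfer_rdy_delay(queued_frames,
                              frame_bytes=FRAME_BYTES,
                              link_rate=LINK_BYTES_PER_SEC):
    """Seconds an XFER_RDY waits behind queued_frames full-size data frames."""
    return queued_frames * frame_bytes / link_rate


for depth in (16, 32, 512):
    delay_ms = worst_case_xfer_rdy_delay(depth) * 1000
    print(f"{depth:4d} frames queued: about {delay_ms:.2f} ms")
```

Under these assumptions, an XFER_RDY IU behind 16 queued frames waits roughly 0.34 ms, while one behind 512 queued frames waits roughly 11 ms, a factor of 32 that follows directly from the queue depths.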