In FC SANs, hosts (typically implemented as servers) perform all I/O operations through a host bus adapter (“HBA”) and an associated driver installed on the host. Applications running in user space on the host use the storage stack to perform data I/O operations. A typical storage stack of the host operating system (“OS”) includes a Virtual File System (“VFS”)/file system layer in the kernel and a data block handling layer below it. Below the data block handling layer is a Small Computer Systems Interface (“SCSI”) mid-layer that translates block requests into SCSI commands. The SCSI commands are then passed to an OS-specific FC driver, provided by the HBA vendor, that hooks into generic interfaces of the SCSI mid-layer to perform read and write operations using FC transport protocol semantics. The FC driver provides all FC Layer 2 (“L2”) functionalities, such as Exchange management. The driver usually also downloads firmware into flash memory of the HBA during installation. The firmware handles all FC Layer 0 and Layer 1 functionalities, such as FC framing, signaling, and transmission (“Tx”)/receipt (“Rx”) of frames through a small form-factor pluggable (“SFP”)/transceiver in the HBA. The firmware interrupts the driver for events such as frame Rx and loss of signal.
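As an illustration of the translation step performed by the SCSI mid-layer, the following minimal sketch packs a block request into a 10-byte SCSI READ(10) or WRITE(10) command descriptor block. The function name `build_rw10_cdb` and its parameters are hypothetical and chosen only for this example; a real mid-layer also handles flags, larger CDB formats, and queuing, none of which is shown here.

```python
import struct

READ_10 = 0x28   # SCSI READ(10) operation code
WRITE_10 = 0x2A  # SCSI WRITE(10) operation code


def build_rw10_cdb(lba, blocks, write=False):
    """Pack a block request into a 10-byte READ(10)/WRITE(10) CDB.

    Hypothetical helper for illustration only; flag bits, group number,
    and the control byte are all left at zero.
    """
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in the 32-bit field of a 10-byte CDB")
    if not 0 <= blocks <= 0xFFFF:
        raise ValueError("transfer length does not fit in 16 bits")
    opcode = WRITE_10 if write else READ_10
    # Layout: opcode, flags, LBA (4 bytes, big-endian), group number,
    # transfer length (2 bytes, big-endian), control.
    return struct.pack(">BBIBHB", opcode, 0, lba, 0, blocks, 0)


if __name__ == "__main__":
    print(build_rw10_cdb(lba=0x1000, blocks=8).hex())
```

The FC driver would then carry a CDB of this kind to the target inside an FC Exchange; that transport step is outside the scope of this sketch.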
FC is a credit-based network with no-drop characteristics, modelled specifically for storage transport, and host storage stacks typically demand this no-drop behavior. However, issues in the I/O path, such as faulty or incorrectly configured hardware, incorrect firmware settings, or software bugs, can cause frame drops. FC HBA drivers provide a configurable number of retries per SCSI command, after which the I/O operation is returned to the host stack as a failure. This condition initially shows up as slow application response times on the hosts and, if sustained, can have detrimental effects such as application stalls caused by exhaustion of the resources that cache outstanding I/O on the host. Some host stacks may simply crash as the I/O queue builds up with slow or missing responses. While some hosts handle the condition gracefully by logging the event in server logs, a restart of the application or of the entire OS may be required to recover from a stuck or slow I/O condition.
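The retry behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not any vendor's driver logic: `submit_with_retries`, `make_flaky_send`, and the `max_retries` parameter are hypothetical names, and a dropped frame is modelled simply as a send attempt that returns `False`.

```python
def submit_with_retries(send_command, max_retries=3):
    """Issue a command, retrying up to max_retries times after the first attempt.

    send_command is a hypothetical callable that returns True when a response
    arrives and False when the attempt times out (for example, a dropped frame).
    Returns a (status, attempts) tuple.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if send_command():
            return ("ok", attempts)
    # Retries exhausted: the I/O is handed back to the host stack as a failure.
    return ("failed", attempts)


def make_flaky_send(drops):
    """Build a fake send function that loses the first `drops` attempts."""
    remaining = [drops]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return send


if __name__ == "__main__":
    print(submit_with_retries(make_flaky_send(2)))   # succeeds on the third attempt
    print(submit_with_retries(make_flaky_send(10)))  # fails after all retries are used
```

Each failed attempt in a real system also costs a timeout interval, which is why sustained drops first appear as slow application response times before any I/O is reported as failed.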
To handle this type of eventuality without negatively impacting applications, FC SANs built for five 9s reliability are designed with multiple redundant paths (multiple ports in the HBA or dual HBAs). The host multipathing software initiates a path failover when it detects a slow or stuck I/O path. After the path failover, the host moves all I/O to a different path (typically attached to a different port in a different SAN) so that the application can quickly recover from the stuck or slow I/O condition. However, the SAN remains vulnerable while a failed path exists in the system. So that network resiliency is not compromised for a long period of time, the previously active path should be debugged and the faulty component replaced as quickly as possible. Additionally, before new FC devices are added to a production SAN, last-minute “on-switch” component testing may be desirable to sanity-check the HBA, switch port, SFPs, and cable.
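The failover behavior can be illustrated with a small model. This is only a sketch under assumptions: the `Path` and `MultipathSelector` classes, the path names, and the `timeout_threshold` policy (fail over after a fixed number of consecutive timeouts) are hypothetical and do not represent any particular multipathing product.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    healthy: bool = True
    consecutive_timeouts: int = 0


class MultipathSelector:
    """Toy path selector: fails over after N consecutive timeouts on the active path."""

    def __init__(self, paths, timeout_threshold=3):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)
        self.timeout_threshold = timeout_threshold
        self.active = self.paths[0]

    def record_result(self, ok):
        """Record the outcome of one I/O on the active path and return the active path."""
        if ok:
            self.active.consecutive_timeouts = 0
            return self.active
        self.active.consecutive_timeouts += 1
        if self.active.consecutive_timeouts >= self.timeout_threshold:
            self.active.healthy = False
            self._failover()
        return self.active

    def _failover(self):
        # Move all I/O to the first remaining healthy path.
        for path in self.paths:
            if path.healthy:
                self.active = path
                return
        raise RuntimeError("no healthy path remaining")

    def degraded(self):
        """True while any path is marked failed, i.e. redundancy is reduced."""
        return any(not p.healthy for p in self.paths)


if __name__ == "__main__":
    sel = MultipathSelector([Path("hba0-port0"), Path("hba1-port0")], timeout_threshold=2)
    sel.record_result(False)
    sel.record_result(False)
    print(sel.active.name, sel.degraded())
```

The `degraded()` check reflects the point made above: after a failover the application recovers, but the system keeps running with reduced redundancy until the failed path is repaired.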