The present invention relates to computer storage, and more specifically, to multipath driver cognitive coordination to mollify the impact of storage area network (SAN) recurring intermittent failures.
Complex SAN configurations have become prevalent in many computer systems. SANs enable a large number of servers to access shared storage via a switched network, often Fibre Channel. The switches used to interconnect servers to shared storage are critical to the reliable operation of the system. The network is often architected and configured to be fully fault tolerant with a high degree of redundancy, so that both solid and intermittent path failures can be detected and automatic in-line recovery can be initiated to remedy the problem or to reroute data packets around faults, preventing system outage and loss of access to data.
Traditionally, faults are viewed as falling into one of two categories: permanent and temporary. Solid faults resulting from complete failure of a hardware component are perhaps the easiest to understand. These failures are normally permanent; fault tolerance and recovery are accomplished via redundancy, with alternate paths through different hardware in the network used to circumvent the fault. A solid failure in a network is typically recovered by taking the failed path offline and channeling packets through a redundant path interconnecting a server to storage.
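The recovery from a solid failure described above can be illustrated with a minimal Python sketch. The `Path` and `MultipathDevice` classes and the path names are hypothetical and are not taken from any particular multipath driver; the sketch only shows the idea of taking a failed path offline and routing I/O over a remaining redundant path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Path:
    """One server-to-storage route (hypothetical model for illustration)."""
    name: str
    online: bool = True


@dataclass
class MultipathDevice:
    """A storage device reachable over several redundant paths."""
    paths: List[Path] = field(default_factory=list)

    def mark_failed(self, name: str) -> None:
        # A solid fault is handled by taking the failed path offline.
        for path in self.paths:
            if path.name == name:
                path.online = False

    def select_path(self) -> Path:
        # Route I/O over the first path that is still online.
        for path in self.paths:
            if path.online:
                return path
        raise RuntimeError("no online path to storage")


device = MultipathDevice([Path("hba0-switchA"), Path("hba1-switchB")])
device.mark_failed("hba0-switchA")
print(device.select_path().name)  # prints "hba1-switchB"
```

This approach works well when the fault is permanent and confined to one path; it assumes the remaining paths do not share the failing component.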
The second type of traditional fault is referred to as a temporary intermittent failure, as it is temporary and transient in nature. These types of failures can arise from numerous sources, including bit flips in electronics due to alpha particles or cosmic rays, electrical noise from intermittent contacts, fiber optic transceivers (e.g., small form-factor pluggable or SFP transceivers) starting to lose light intensity, or code defects, to name a few. These sources produce temporary faults that are normally viewed as one-time incidents and can be remedied via a single recovery action such as a path retry operation. For both expansion ports (E-ports) and fabric ports (F-ports), intermittent errors can cause many different events, such as state changes, protocol errors, link resets, invalid words, cyclic redundancy check (CRC) errors, and class-3 discards (C3TX_TO), as well as other conditions. Since data packets flowing from servers to shared storage traverse a large number of switches and the links interconnecting the switches and devices, the precise component(s) associated with the faults may not be apparent.
The underlying problem in a SAN often does not produce a red light error indication, so the symptoms of a failure may be limited to, for example, a small computer system interface (SCSI) command time-out visible at the server. Since the paths from servers to shared storage do not have a permanent affinity with specific switches and links between switches, the failure may surface only intermittently even in the presence of a recurring link failure. Even a specific path may fail only intermittently, because paths share inter-switch links (ISLs) and the switches select among different ISLs dynamically. An intermittent network failure can be elusive and difficult to pinpoint and isolate when a command time-out reported by higher level software is the only indication.
Thus, a third category of failures, pervasive intermittent faults, has surfaced as difficult to isolate and resolve, since the underlying problem cannot be contained within the network itself and in most cases the network is not capable of producing actionable fault indications that would enable prompt response and resolution from the network administrator. As the transmission rates and complexity of high speed networks have continued to increase over time, this third type of fault has become more common and problematic in SAN operation. Pervasive intermittent faults are not one-time events, but they are normally not solid component failures either. These faults are intermittent in nature, but recur soon after recovery is believed to have been completed successfully. This can put the system into a recurring recovery loop known as a recovery storm, placing repeated back pressure on the network. Repeatedly stopping or slowing network traffic can eventually cause application level performance issues and even application failure. It can also cause false triggers that result in servers failing over unnecessarily to backup servers.
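The distinction among the three fault categories can be made concrete with a short Python sketch. It is illustrative only: the function name `classify_fault`, the 300-second recurrence window, and the repeat threshold are assumptions introduced here and do not come from the description above. The sketch treats a fault that never recovers as solid, a fault that recurs repeatedly soon after apparently successful recovery as pervasive intermittent, and anything else as a one-time temporary fault.

```python
from typing import List

# Hypothetical thresholds chosen for illustration only.
RECURRENCE_WINDOW_S = 300.0  # a repeat within this window counts as "soon after recovery"
STORM_MIN_REPEATS = 2        # this many quick recurrences suggests a recovery storm


def classify_fault(failure_times: List[float], recovered: bool) -> str:
    """Classify one path's fault history.

    failure_times: timestamps (in seconds) of failures observed on the path.
    recovered: whether the most recent recovery action (e.g. a retry) succeeded.
    """
    if not failure_times:
        return "healthy"
    if not recovered:
        # Recovery does not succeed: treat as a solid (permanent) fault.
        return "solid"
    times = sorted(failure_times)
    # Count failures that followed the previous failure within the window.
    quick_repeats = sum(
        1 for prev, cur in zip(times, times[1:])
        if cur - prev <= RECURRENCE_WINDOW_S
    )
    if quick_repeats >= STORM_MIN_REPEATS:
        return "pervasive intermittent"
    return "temporary intermittent"


print(classify_fault([10.0], recovered=True))
print(classify_fault([0.0, 60.0, 120.0, 200.0], recovered=True))
print(classify_fault([0.0], recovered=False))
```

In practice, suitable thresholds would depend on the workload and the fabric, and a timestamp-based rule like this one cannot by itself identify which switch or link is at fault.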