1. Field of the Invention
The invention applies to any networking architecture where isolating error occurrences is critical to correctly identifying faulty hardware in the network environment.
2. Description of the Prior Art
As networks continue to become increasingly sophisticated and complex, qualifying fault indications and isolating their sources is becoming a vexing problem. Some devices have services that indicate faults, either ones occurring in the device itself or ones observed by the device as occurring elsewhere. Other devices, however, may not indicate faults, due to poor design, prioritizing schemes, pass-thru mechanisms that do not permit the discovery of faults that occurred elsewhere, etc. This is further complicated by the wide variety of devices, vendors, models, hardware versions, software versions, classes, etc. The unfortunate result is that there is no viable way to evaluate fault indications in hierarchical or canonical heterogeneous optical networks to determine their operational relevance and root sources.
FIG. 1 (background art) is a block diagram depicting a generalized storage network infrastructure. This network 10{XE “network 10”} includes blocks representing switch groups 12{XE “switch groups 12”}, hosts 14{XE “hosts 14”}, and storage enclosures 16{XE “storage enclosures 16”}. In a switch group 12{XE “switch group 12”} there can be any number of switches, from 1 to n, containing any number of ports, 1 to m. In some cases these may include a director class switch that all of the other switches are directly connected to, or there may be multiple switches cascaded together to form a pool of user ports, with some ports used for inter-switch traffic and routing (described presently). The hosts 14{XE “hosts 14”} can be of any type from any vendor and having any operating system (OS), and with any number of network connections. The storage enclosures 16{XE “storage enclosures 16”} can be anything from a tape library to a disk enclosure, and are usually the target for input and output (I/O) in the network 10{XE “network 10”}.
Collectively, a single switch group 12{XE "switch group 12"} with hosts 14{XE "hosts 14"} and storage enclosures 16{XE "storage enclosures 16"} constitutes a set of "local devices" that are either logically or physically grouped together at a locality 18{XE "locality 18"}. Some of the devices at a locality 18{XE "locality 18"} may be physically located together, and others may be separated physically within a building or a site.
The hosts 14{XE “hosts 14”} are usually the initiators for I/O in the network 10{XE “network 10”}. For communications within a locality 18{XE “locality 18”}, the hosts 14{XE “hosts 14”} and storage enclosures 16{XE “storage enclosures 16”} are connected to the switch group 12{XE “switch group 12”} via local links 20{XE “local links 20”}. For more remote communications, the switch groups 12{XE “switch groups 12”} are connected via remote links 22{XE “remote links 22”}.
In FIG. 1, three localities 18{XE “localities 18”} are shown, each having a switch group 12{XE “switch group 12”}. These localities 18{XE “localities 18”} can be referenced specifically as localities 18a-c{XE “localities 18a-c”}. As can be seen, communications from locality 18a{XE “locality 18a”} to locality 18c{XE “locality 18c”} must go via locality 18b{XE “locality 18b”}, hence making the example network 10{XE “network 10”} in FIG. 1 a multi-hop storage network.
All of the devices in the network 10{XE "network 10"} are ultimately connected, in some instances through optical interfaces in the local links 20{XE "local links 20"} and the remote links 22{XE "remote links 22"}. The optical interfaces include multi-mode or single-mode optical cable, which may have repeaters, extenders, or couplers. The optical transceivers include devices such as Gigabit Link Modules (GLM) or GigaBaud Interface Converters (GBIC).
In Fibre Channel Physical and Signaling Interface (FC-PH) version 4.3 (an ANSI standard for gigabit serial interconnection), the minimum standard that an optical device must meet is no more than 1 bit error in 10^12 bits transmitted. Based on 1 Gbaud technology, this is approximately one bit error every fifteen minutes. In 2 Gbaud technology, this drops to 7.5 minutes, and in 10 Gbaud technology, to 1.5 minutes. If improvements to the transceivers are made so that the calculation assumes one bit error in every 10^15 bits, at 2 Gbaud this is approximately one bit error every week. Also, optical fiber in an active connection is never without light, so bit errors can occur inside or outside of a data frame, and each optical connection has at least two transceiver modules, which again doubles the probability of a bit error. Furthermore, each interface, junction, coupler, repeater, or extender has the potential of being unreliable, since there are dB and mode losses associated with these connections that degrade the integrity of the optical signal and may result in data transmission losses due to the increased cumulative error probabilities.
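The error-interval figures above follow directly from the line rate and the bit error rate (BER). A minimal sketch of that arithmetic, assuming the nominal Fibre Channel line rates of 1.0625, 2.125, and 10.51875 Gbaud (an assumption for illustration; the text quotes only rounded values):

```python
def mean_seconds_between_errors(line_rate_baud: float, ber: float = 1e-12) -> float:
    """Expected seconds per bit error at a given line rate and bit error rate."""
    return 1.0 / (line_rate_baud * ber)

# Nominal Fibre Channel line rates for 1, 2, and 10 Gbaud technology.
for rate in (1.0625e9, 2.125e9, 10.51875e9):
    minutes = mean_seconds_between_errors(rate) / 60.0
    print(f"{rate / 1e9:.5f} Gbaud: one bit error about every {minutes:.1f} minutes")
```

At a BER of 10^-12 this yields roughly 15.7, 7.8, and 1.6 minutes respectively, matching the approximate figures quoted; substituting `ber=1e-15` at 2.125 Gbaud gives about 5.4 days, i.e., roughly one error per week.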
Unfortunately, determining the sources of errors, and thus determining where corrective measures may be needed if too many errors are occurring in individual sources, can be very difficult. In storage network environments that use cut-through routing technology, an I/O frame with a bit, link, or frame level error that has a valid address header can be routed to its destination, forcing an error counter to increment at each hop in the route that the frame traverses. Attempting to isolate where this loss occurred in a network that may have hundreds of components is difficult and is usually a manual task.
All the losses that have been described herein are also “soft” in nature, meaning that, from a system perspective, no permanent error has occurred and there may not be a record of I/O operational errors in a host or storage log. The only information available then is the indication of an error with respect to port counter data, available at the time of the incident.
As networks evolve, the ability to isolate faults in these networks must evolve just as quickly. The ability to adjust to this change in storage networking environments needs to come from an external source and to be applied to the network without interruption by the monitoring system that is employed.
FIG. 2 (background art) is a block diagram depicting the generalized multi-hop network 10{XE “network 10”} of FIG. 1 with errors. An error event has occurred on the remote link 22{XE “remote link 22”} shown emphasized in FIG. 2. This could have been a CRC error or other type of optical transmission error. The error here was reported on the two hosts 14{XE “hosts 14”} and the one storage enclosure 16{XE “storage enclosure 16”} which are also shown as emphasized in FIG. 2.
What is needed is a system able to correlate these three separately recorded events in the network 10{XE "network 10"} as all having been caused by a single event and, if the event continues, to notify a user that it was not a host 14{XE "host 14"} or the storage enclosure 16{XE "storage enclosure 16"} that was faulting but rather one of the paths in the remote link 22{XE "remote link 22"} in the network 10{XE "network 10"}, as opposed to the hardware at the endpoints within the localities 18{XE "localities 18"}. The proposed system therefore needs to take fault indications and isolate them to the faulting link. A link is defined as the relationship between two devices and is shown in FIG. 3.
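The correlation called for above can be sketched as a set intersection: each device that reported an error sits on a path through the network, and a link common to every affected path is the most likely fault source. The following is a minimal illustration only; the link labels and `isolate_faulty_links` helper are hypothetical, not part of the disclosed system:

```python
from functools import reduce

def isolate_faulty_links(paths_of_reporters: list[set[str]]) -> set[str]:
    """Return the links shared by every path on which an error was reported."""
    return reduce(lambda a, b: a & b, paths_of_reporters)

# Hypothetical example: two hosts and a storage enclosure each recorded an
# error; every affected path traverses remote link "22b", so it is the suspect.
reports = [
    {"20a", "22a", "22b"},  # path from host 1 to the enclosure
    {"20b", "22b"},         # path from host 2 to the enclosure
    {"22b", "20c"},         # path observed at the enclosure side
]
print(isolate_faulty_links(reports))  # → {'22b'}
```

A real implementation would also have to weigh counter timestamps and hop order, since cut-through routing increments counters at every hop the errored frame traverses.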
FIG. 3 (background art) is a block diagram depicting a single optical link, comprising two optical transceivers 24{XE "transceivers 24"} and the local link 20{XE "local link 20"} or remote link 22{XE "remote link 22"} connecting them. The cable is depicted as twisted to represent that the transmitter 26{XE "transmitter 26"} of one optical transceiver is connected directly to the receiver 28{XE "receiver 28"} of an opposing optical transceiver. All of the hosts 14{XE "hosts 14"}, storage enclosures 16{XE "storage enclosures 16"}, and switch groups 12{XE "switch groups 12"} have optical transceivers 24{XE "transceivers 24"} terminating the local links 20{XE "local links 20"} and remote links 22{XE "remote links 22"}. There can be any number of paths in these links 20, 22{XE "links 20, 22"}, with each path having two directions. For each direction there is one transmitter 26{XE "transmitter 26"} and one receiver 28{XE "receiver 28"}, as represented in FIG. 3.
It is, therefore, an object of the present invention to provide a system for fault isolation in a storage area network. Other objects and advantages will become apparent from the following disclosure.