1. Field of the Invention
This invention relates to a method and apparatus for improving reliability of networked subsystems. In particular, the invention relates to improved error avoidance in redundant data path subsystems, such as dual-loop Fibre Channel Arbitrated Loops. The invention could equally apply to other redundant data path subsystems.
2. Description of the Related Art
For brevity and clarity, the invention and its preferred embodiments will be described against a background of the Fibre Channel Arbitrated Loop architecture, but it will be clear to one skilled in the art that they are not limited to use in such an environment. In particular, it will be readily apparent to one skilled in the art that many other redundant data path subsystems exist in the field of data processing, and that the systems and methods of the present disclosure would have equal applicability in those subsystems.
Fibre Channel Arbitrated Loop (FC-AL) architecture is a member of the Fibre Channel family of ANSI standard protocols. FC-AL is typically used for connecting together computer peripherals, in particular disk drives. The FC-AL architecture is well-known to those skilled in the art, and need not be described in great detail herein.
Electronic data systems can be interconnected using network communication systems. Area-wide networks and channels are two technologies that have been developed for computer network architectures. Area-wide networks (e.g. LANs and WANs) offer flexibility and relatively large distance capabilities. Channels, such as the Small Computer System Interface (SCSI), have been developed for high performance and reliability. Channels typically use dedicated short-distance connections between computers or between computers and peripherals.
Fibre Channel technology was originally developed for optical point-to-point communication between two systems, or between a system and a subsystem. It has evolved to include electronic (non-optical) implementations and can connect many devices, including disk drives, at relatively low cost. This addition to the Fibre Channel specifications is called Fibre Channel Arbitrated Loop (FC-AL).
Fibre Channel technology consists of an integrated set of standards that defines new protocols for flexible information transfer using several interconnection topologies. Fibre Channel technology can be used to connect large amounts of disk storage to a server or cluster of servers. Compared to SCSI, Fibre Channel technology supports greater performance, scalability, availability, and distance for attaching storage systems to network servers.
Fibre Channel Arbitrated Loop (FC-AL) is a loop architecture as opposed to a bus architecture like SCSI. FC-AL is a serial interface, where data and control signals pass along a single path rather than moving in parallel across multiple conductors as is the case with SCSI. Serial interfaces have many advantages including: increased reliability due to point-to-point use in communications; dual-porting capability, so data can be transferred over two independent data paths, enhancing speed and reliability; and simplified cabling and increased connectivity which are important in multi-drive environments. As a direct disk attachment interface, FC-AL has greatly enhanced I/O performance.
Devices are connected to a FC-AL using hardware which is termed a “port.” A device which has connections for two loops has two ports or is “dual-ported.”
In one embodiment, the operation of FC-AL involves a number of ports connected such that each port's transmitter is connected to the next port's receiver, and so on, forming a loop. Each port's receiver has an elasticity buffer that captures the incoming FC-AL frames or words and is then used to regenerate each FC-AL word as it is retransmitted. This buffer exists to absorb the slight clocking variations that occur between ports. Each port receives a word and then transmits that word to the next port, unless the port itself is the destination of that word, in which case the port consumes the word. The nature of FC-AL is therefore such that each intermediate port between the originating port and the destination port gets to ‘see’ each word as it passes around the FC-AL loop. There exist also well-known alternative embodiments, such as those using Fibre Channel switches instead of FC-bypassable transceivers.
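The forwarding behaviour described above can be illustrated with a minimal sketch. It is a simplified model only, not part of the FC-AL standard or of the invention: the function name `send_word` and the port names are hypothetical, and elasticity buffers, arbitration and framing are deliberately ignored.

```python
from typing import List, Optional


def send_word(ports: List[str], source: str, destination: str) -> Optional[List[str]]:
    """Forward one word around a unidirectional loop (illustrative model).

    `ports` lists the ports in loop order: each port's transmitter feeds the
    receiver of the next entry, and the last entry feeds the first.
    Returns the intermediate ports that retransmitted (and so 'saw') the word,
    or None if the word travelled back to the source without being consumed.
    """
    start = ports.index(source)
    seen: List[str] = []
    for hop in range(1, len(ports)):
        port = ports[(start + hop) % len(ports)]
        if port == destination:
            return seen          # the destination consumes the word
        seen.append(port)        # an intermediate port retransmits the word
    return None                  # no port on the loop claimed the word
```

For a loop ordered as host adapter, then three drives, a word sent from the adapter to the third drive passes through the first two drives, which is the sense in which every intermediate port sees the traffic.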
FC-AL architecture may be in the form of a single loop. Often two independent loops are used to connect the same devices in the form of dual loops. The aim of these loops is to provide an alternative path to devices on a loop should one loop fail. A single fault should not cause both loops to fail simultaneously. More than two loops can also be used.
FC-AL devices typically have two ports allowing them to be attached to two FC-ALs. Thus, in a typical configuration, two independent loops exist and each device is physically connected to both loops. When the system is working optimally, there are two possible loops that can be used to access any dual-ported device.
A FC-AL can incorporate bypass circuits with the aim of making the FC-AL interface sufficiently robust to permit devices to be removed from the loop without interrupting throughput and sacrificing data integrity. If a disk drive fails, port bypass circuits attempt to route around the problem so all the other disk drives on the loop remain accessible. Without port bypass circuits a fault in any device will break the loop.
In dual loops, port bypass circuits are provided for each loop and these provide additional protection against faults. A device can be bypassed on one loop while remaining active on the other loop.
A typical FC-AL may have one or two host bus adapters (HBAs) and a set of six or so disk drive enclosures or drawers, each of which may contain a set of ten to sixteen disk drives. There is a physical cable connection between each enclosure and the HBA in the FC-AL. Also, there is a connection internal to the enclosure or drawer, between the cable connector and each disk drive in the enclosure or drawer, as well as other components within the enclosure or drawer, such as an SES device (SCSI Enclosure Services node) or other enclosure services devices.
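The interaction between port bypass circuits and dual loops can be sketched as follows. This is a hedged illustration under simplifying assumptions: the names `LoopPort`, `Loop` and `usable_loops` are hypothetical, a loop is treated as intact whenever every faulty port on it has been bypassed, and no real bypass hardware or detection logic is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LoopPort:
    healthy: bool = True     # False models a fault at this port
    bypassed: bool = False   # True models the port bypass circuit being engaged


@dataclass
class Loop:
    ports: Dict[str, LoopPort] = field(default_factory=dict)

    def is_intact(self) -> bool:
        # A faulty port breaks the loop unless it has been bypassed.
        return all(p.healthy or p.bypassed for p in self.ports.values())

    def can_reach(self, device: str) -> bool:
        # A device is reachable only through a healthy, non-bypassed port
        # on a loop that is itself intact.
        port = self.ports[device]
        return self.is_intact() and port.healthy and not port.bypassed


def usable_loops(loops: Dict[str, Loop], device: str) -> List[str]:
    """Return the names of the loops over which `device` can be accessed."""
    return [name for name, loop in loops.items() if loop.can_reach(device)]
```

In this model, a fault on one port of loop A makes the whole of loop A unusable until that port is bypassed; once it is bypassed, the other devices regain both paths, while the affected device remains reachable over loop B only.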
An SES device is an example of an enclosure service device which manages a disk enclosure and allows the monitoring of power and cooling in an enclosure. The SES device also obtains information as to which slots in an enclosure are occupied. The SES device accepts a limited set of SCSI commands. SCSI Enclosure Services are well-known to those skilled in the art and need not be described further here.
SES devices may be dedicated SES nodes on the loop, or a disk drive may also support Enclosure Services Interface (ESI) communication to the enclosure processor. For the purposes of this disclosure, either type of device will be referred to as an SES device.
Having described the general background of the invention, a more detailed description now follows of some particular problems frequently encountered by users of redundant data path subsystems.
In subsystems such as FC-AL that contain redundant data paths, when it is necessary to add another enclosure concurrently with normal subsystem operations, there is always a possibility that the new enclosure that is being added has an internal failure. If such an enclosure is attached, both of the interfaces might be disabled or rendered dysfunctional when the new enclosure is connected to the existing, functioning subsystem. The points of failure may lie in the existing subsystem's interfaces, or in the newly-attached enclosure's interfaces.
It is known in the art, for example, from U.S. Pat. No. 5,890,214, to address a similar problem by requesting, over a separate (non data-path) channel, that a device return its own status. However, this only allows the device to return the status of which it is “aware”, which may be incorrect. Also, the status of the interfaces is not thereby tested, as a completely separate communications channel has been used to request and receive the status.
It would therefore be desirable to automate the attaching and checking process, to offer protection against the problems described, and to reduce the potential for human error, without adding extra devices or channels.