This invention relates to the detection of equipment failures on a disk array connected over a loop such as a Fibre Channel loop.
Subsystems comprising disk arrays, i.e., groups of small, independent disk drive modules used to store large quantities of data, have been developed and found to possess many advantages over a single large disk drive. For example, the individual modules of a disk array typically take up very little space, use less power, and cost less than a single large disk drive, yet, when grouped together in an array, provide the same data storage capacity as a single large disk drive. In addition, the small disks of an array retrieve data more quickly than does a single large disk drive because, with a small disk drive, there is less distance for the actuator to travel and less data per individual disk to search through. The greatest advantage of small disk drives, however, is the boost they give to I/O performance when configured as a disk array subsystem.
On a disk array system, a failure in any one disk drive requires identifying which of the many disk drives in the array caused the problem. With the advent of communication loops for connecting the disk drives of an array, the need to identify and remove a faulty drive becomes particularly pressing, because communications pass through the receiver and transmitter of each disk drive on the loop. Arbitrated loop protocols such as Fibre Channel are becoming popular for providing high speed communications in a disk array. A difficulty on a Fibre Channel disk array system is that the standard failure identification is by target, in this case the logical unit (LUN) that received the I/O request from the host, whereas any non-target drive on the Fibre Channel loop may actually have caused the error. More generally, in sending a data word from a host to one of the disks on the loop, the word must pass through the receivers and transmitters of each disk drive that lies electrically between the host and the target disk drive. If any of those receivers or transmitters corrupts the word along the way, the error is reported by the target disk drive. While the system is aware of the error, it typically cannot determine which of the disk drives on the loop caused it. Trial-and-error diagnostics must then be run to locate the faulty equipment.
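To illustrate why the reporting drive need not be the faulty one, the following minimal sketch lists the ports a word passes through on its way to the target. It assumes a hypothetical list, `loop_order`, giving every port in electrical order around the loop; the function name and the port labels are illustrative only and are not taken from any Fibre Channel interface.

```python
def ports_traversed(loop_order, initiator, target):
    """Return the ports a word reaches after leaving `initiator`, ending at `target`.

    `loop_order` is a hypothetical list of port names in electrical order.
    """
    n = len(loop_order)
    start = loop_order.index(initiator)  # raises ValueError if not on the loop
    end = loop_order.index(target)
    hops = (end - start) % n             # number of receivers the word enters
    return [loop_order[(start + k) % n] for k in range(1, hops + 1)]


loop = ["host", "disk0", "disk1", "disk2", "disk3"]
print(ports_traversed(loop, "host", "disk2"))  # ['disk0', 'disk1', 'disk2']
```

In this example a word corrupted in disk0 or disk1 would still be reported by disk2, the target, which is the ambiguity described above.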
Requests are made to each of the disk drives on a loop of disk drives for a count of errors so that an increase in the number of errors may be detected and reported. Detection of an invalid transmission word can take place at intermediate disk drives between an initiator sending the data word and the target drive. As such, detection of occurrences of an invalid transmission word can be used to identify faulty equipment, either receivers or transmitters, in disk drives that are located on a loop.
A loop of disk drives, such as a Fibre Channel loop, typically permits disk drives to initiate a loop initialization protocol (LIP). Loop initialization protocols are typically initiated upon adding a disk drive to a loop, upon power up, or for error recovery. To help identify failed equipment on a loop of disk drives, a count of LIPs initiated and received by each disk drive is requested from the disk drives on the loop. The occurrences of LIP initiations and LIP receptions are synchronized with disk drive error requests and compared to identify disparities indicative of a failure on the loop. Also, any initiation of a certain type of LIP, which we shall refer to as an “error-indicating LIP”, is indicative of a failure in a disk drive or possibly its electrical predecessor. In accordance with a particular embodiment of the invention, initiation of any LIP by a disk drive is indicative of a failure in that disk drive or its electrical predecessor on the loop. Furthermore, when LIP receptions are identified at disk drives on a loop but no corresponding LIP initiation is identified, the failure might lie not in a disk drive but in other equipment on the loop, such as a host bus adapter.
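The comparison of LIP initiations with LIP receptions might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `loop_order` is a hypothetical list of every port on the loop in electrical order (including the host bus adapter, here labeled "hba"), and the two dictionaries hold the hypothetical increase in each drive's LIP counts since the previous request. The sketch follows the embodiment in which any LIP initiation is treated as an indication of failure.

```python
def classify_lip_activity(loop_order, lip_initiated, lip_received):
    """Classify LIP count changes into a coarse diagnosis.

    lip_initiated / lip_received: {drive_id: increase since last request}.
    All names here are hypothetical and for illustration only.
    """
    initiators = [d for d in loop_order if lip_initiated.get(d, 0) > 0]
    receivers = [d for d in loop_order if lip_received.get(d, 0) > 0]

    if initiators:
        # A drive that initiated a LIP is suspect, as is its electrical
        # predecessor, whose transmitter feeds the initiator's receiver.
        suspects = []
        for drive in initiators:
            predecessor = loop_order[loop_order.index(drive) - 1]
            suspects.append((drive, predecessor))
        return ("drive_or_predecessor", suspects)

    if receivers:
        # LIPs were received but no drive reports initiating one, so the
        # source may be non-drive equipment such as a host bus adapter.
        return ("non_drive_equipment", [])

    return ("no_lip_activity", [])


loop = ["hba", "disk0", "disk1", "disk2"]
print(classify_lip_activity(loop, {"disk1": 1}, {"disk0": 1, "disk2": 1}))
print(classify_lip_activity(loop, {}, {"disk0": 1, "disk1": 1, "disk2": 1}))
```

The first call names disk1 and its predecessor disk0 as suspects; the second reports that the failure may lie outside the disk drives.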
In accordance with an embodiment of the invention, the error count may include both the number of invalid transmission words and the number of loop initialization protocols initiated and received. All such counts may be requested over the Fibre Channel loop from each disk drive on the loop. A baseline count is obtained in a first request. A second request for the counts permits the detection of changes in the counts on the disk drives in the loop. If no LIPs have occurred, the change in error count is used to identify a suspect disk drive. The electrical predecessor of that drive on the loop is also recorded, since an error may have been caused by either the transmitter of the predecessor or the receiver of the error-detecting disk drive. When LIPs are detected, they are used to help locate the source of the errors. The methods of the present invention may be embodied on a computer program product for use on a computer system.
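The two-request procedure for the case in which no LIPs have occurred can be sketched as below. The `Counts` record, the `predecessor_of` mapping, and the function names are assumptions introduced for illustration; how the counts are actually read from a drive is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical per-drive snapshot returned by one request."""
    invalid_words: int
    lips_initiated: int
    lips_received: int


def delta(first, second):
    """Change in each count between the baseline and the second request."""
    return Counts(
        second.invalid_words - first.invalid_words,
        second.lips_initiated - first.lips_initiated,
        second.lips_received - first.lips_received,
    )


def suspects_when_no_lips(predecessor_of, first_poll, second_poll):
    """Return (drive, predecessor) pairs, or None if any LIP occurred.

    predecessor_of: hypothetical {drive_id: electrical predecessor} mapping.
    """
    changes = {d: delta(first_poll[d], second_poll[d]) for d in predecessor_of}
    if any(c.lips_initiated or c.lips_received for c in changes.values()):
        return None  # LIP counts should guide the diagnosis instead
    return [
        (drive, predecessor_of[drive])
        for drive, change in changes.items()
        if change.invalid_words > 0
    ]


predecessor_of = {"disk0": "hba", "disk1": "disk0", "disk2": "disk1"}
first = {d: Counts(0, 0, 0) for d in predecessor_of}
second = dict(first, disk1=Counts(5, 0, 0))
print(suspects_when_no_lips(predecessor_of, first, second))  # [('disk1', 'disk0')]
```

Recording the pair rather than the single drive reflects the point that either the predecessor's transmitter or the detecting drive's receiver may be at fault.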
Embodiments of the invention advantageously achieve early and quick detection of failed equipment on the loop. The LIP counts may advantageously identify a non-disk drive error and thus save the time and effort of running a trial-and-error diagnostic at each of the disk drives in the loop. Furthermore, by making the initiation of any LIP indicative of an equipment failure in a disk drive or its electrical predecessor, earlier detection of failed equipment is made possible.
Other objects and advantages of the invention will become apparent during the following description of the presently preferred embodiments of the invention taken in conjunction with the drawings.