Embodiments of the present disclosure generally relate to storage technology and, more specifically, to a method and an apparatus for failure detection in a storage system.
In some kinds of storage systems, hardware storage devices such as disks and cabinets may be connected together by a plurality of switches so as to form a storage network. In such a storage network, a data input/output (I/O) path may involve a plurality of switches. It is known that when a software module in a storage system fails, the failure source may be located relatively easily through various software analysis and reproduction technologies. By contrast, a failure of a hardware device such as a switch is typically harder to locate.
During operation of a storage system, a switch may fail due to equipment aging, power supply issues (e.g., voltage instability), or environmental factors (e.g., temperature, humidity, etc.). When this happens, data I/O operation errors may be observed in the storage system, for example, loss of data format, check errors, etc. A traditional solution then needs to check, one by one, every switch in the I/O path that might have caused the error, which is time-consuming and troublesome.
Some known solutions perform failure detection using a checking technology. If a check error occurs in data received by one switch in the I/O path, the upstream switch that sent the data to that switch is determined to be a failing device. However, this method lacks accuracy. It would be appreciated that the occurrence of a check error does not necessarily mean that a switch has failed. In many cases, a check error might be caused by a software module, a link, or even some random or unknown reason. Additionally, when a plurality of switches in the I/O path detect a check error in incoming data, the traditional method will decide that all of these switches are failing devices. However, this is often not the case.
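By way of illustration only, the check-based rule described above may be sketched as follows. This is a minimal sketch and not an implementation taken from any known solution. The function name naive_blame, the switch names sw0 to sw3, and the assumption that corrupted data continues to propagate downstream are all hypothetical. The sketch applies the stated rule literally, blaming the upstream neighbor of every switch that detects a check error.

```python
from typing import List


def naive_blame(path: List[str], check_errors: List[bool]) -> List[str]:
    """Apply the check-only rule: blame the upstream neighbor of each
    switch that saw a check error on its incoming data.

    path[i] is the i-th switch along the I/O path (hypothetical names);
    check_errors[i] is True if path[i] detected a check error.
    """
    blamed: List[str] = []
    for i in range(1, len(path)):
        upstream = path[i - 1]
        if check_errors[i] and upstream not in blamed:
            blamed.append(upstream)
    return blamed


# Hypothetical scenario: only sw0 corrupts data, but the corruption
# propagates, so every downstream switch reports a check error.
path = ["sw0", "sw1", "sw2", "sw3"]
errors = [False, True, True, True]
print(naive_blame(path, errors))
```

Under these assumptions, the rule reports sw0, sw1, and sw2 as failing even though only sw0 is at fault, which illustrates the accuracy deficiency noted above.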