1. Field of the Invention
The present invention relates to a method, system, and article of manufacture for synchronizing device error information among nodes.
2. Description of the Related Art
Host systems in a storage network may communicate with a storage controller through multiple paths. The storage controller may comprise separate storage clusters or nodes, where each storage cluster is capable of accessing the storage and provides redundancy for access to the storage. Hosts may access the attached storage through either cluster. If a storage cluster fails, then the host may fail over to the other storage cluster to access the storage.
In redundant storage controller environments, it is common for each storage node or cluster to establish ownership of certain external resources, such as network adaptors and Input/Output (I/O) device adaptors. If a node in the system fails, other nodes in the system can take ownership of the resources that were owned by the failing node. If an external resource in the system starts reporting errors, the owning node begins counting these errors against thresholds and takes appropriate system recovery actions based on the number of detected errors. If, during this process, the owning node fails, another available node takes ownership of the external resource, but may have no knowledge of the previous errors that were recorded by the failing node. As a result, the new owning node treats the next error on the external resource as if it were the first error.
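By way of illustration only, the loss of error history on failover described above may be sketched as follows. This is a minimal sketch, not an implementation from any particular storage controller: the names `Node`, `report_error`, `simulate_failover`, the resource identifier `adapter0`, the threshold value of 3, and the `retry` and `fence` recovery actions are all hypothetical and chosen only to make the behavior concrete.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the number of errors at which the owning node
# escalates from a simple retry to fencing the resource.
ERROR_THRESHOLD = 3


@dataclass
class Node:
    name: str
    # Error counts are kept only in the owning node's local memory,
    # keyed by resource identifier.
    error_counts: dict = field(default_factory=dict)

    def report_error(self, resource: str) -> str:
        """Record one error on a resource and return the recovery action."""
        count = self.error_counts.get(resource, 0) + 1
        self.error_counts[resource] = count
        return "fence" if count >= ERROR_THRESHOLD else "retry"


def simulate_failover():
    node_a = Node("A")
    node_b = Node("B")

    # Node A owns the resource and records two errors.
    actions = [node_a.report_error("adapter0") for _ in range(2)]

    # Node A fails here. Node B takes ownership, but its own count for the
    # resource starts at zero because node A's counts were never shared.
    actions.append(node_b.report_error("adapter0"))
    return actions, node_b.error_counts["adapter0"]
```

In this sketch the third error on the resource is handled by node B as a first error, so the escalated recovery action that node A would have taken at the threshold is not taken.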
Further, if the multiple errors reported by the external resource caused the previous owning node to fail, then the new owning node will go through the same actions as the previous node, which could result in the new owning node failing in the same way. If other nodes in the system continue to take ownership of the resource, all nodes could fail, causing the customer to lose access to data. Even if no further node fails, restarting the recovery operation from a zero error count may cause the overall system recovery (the actions taken by the previous owning node together with those taken by the new owning node) to take so long that the host system times out and the customer loses access to data.
There is a need in the art for improved techniques to maintain error information for shared devices accessed by multiple nodes.