The present invention relates to a technique for enhancing the reliability and robustness of a cluster.
A cluster consists of one or more hosts and one or more I/O (input/output) units containing I/O controllers, with the hosts and I/O controllers connected by a common interconnection fabric. A common problem in such an arrangement is that the cluster often contains single points of failure. Adding redundant elements to the cluster may eliminate some single points of failure, but some present operating systems do not give a cluster an opportunity to recover from an element failure by substituting a redundant element. Furthermore, a failure in the cluster may be either transient or permanent, and present-day arrangements do not enable a cluster to automatically recover from a transient failure.