In high-availability clustered software, a number of nodes collaborate to deliver a service to users. To deliver its features, such software relies on a body of replicated state, or metadata, held across the cluster. For the product to operate correctly, it is critical that this cluster state be internally consistent. By internally consistent we mean, for example, that different layers in the software hold the same count of objects.
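As an illustration of the kind of internal consistency meant here, the following minimal sketch compares the object counts reported by several software layers. The `LayerState` type, the `find_count_mismatches` function, and the layer names and counts are all hypothetical, chosen only for this example; they are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class LayerState:
    """Hypothetical view of one software layer's replicated metadata."""
    name: str
    object_count: int


def find_count_mismatches(layers):
    """Return (name, count) pairs whose count differs from the first layer's count."""
    if not layers:
        return []
    reference = layers[0].object_count
    return [
        (layer.name, layer.object_count)
        for layer in layers
        if layer.object_count != reference
    ]


# Illustrative data only: three made-up layers, one of which disagrees.
layers = [
    LayerState("cache", 128),
    LayerState("virtualization", 128),
    LayerState("replication", 127),
]
mismatches = find_count_mismatches(layers)
```

An empty result means that the layers agree on the count. Any entry in the result identifies a layer whose view of the cluster state has diverged from the reference layer.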
In high-availability clustered software, it is desirable to maintain 100% availability, or as close to that target as possible. However, the code may contain software defects. High-availability clustered software therefore provides error recovery procedures, which allow the cluster to recover if a software failure occurs and which are designed to ensure that the internal state remains consistent. Unfortunately, software errors can result in inconsistencies in cluster state, which can in turn provoke further software failures. Such errors are sometimes discovered only when another failure occurs, and this can lead to extensive, and expensive, downtime in production environments. When these errors occur, fixes are applied to patch the identified error in the cluster state. However, it is not possible to guarantee that no further, undiscovered inconsistencies remain in the cluster state.
In order to guarantee that no inconsistencies remain after a previous cluster recovery, it is necessary to reinstall the storage virtualization software and re-initialise the cluster state to its initial conditions, which is a disruptive procedure. One option for a reinstall is to run a Tier 3 recovery procedure (restoration of data from archive storage), which is likewise disruptive. Another option is to build a new cluster, configure it identically to the original cluster, and transfer the data to it (by using, for example, host mirroring). In some systems this can be done without stopping I/O. The disadvantage of this solution is that it is expensive: additional hardware is required (twice as many nodes, additional storage), and migrating to the new cluster consumes considerable resources. The new hardware also introduces the risk of hardware faults that could compound the problem.
It would thus be desirable to have a technological means for recovering from errors in high-availability clustered software in a manner that is non-disruptive and that does not depend on additional hardware or on additional resources in the form of system and storage administrator time and effort.