In a typical network, a plurality of servers is linked via a switch to block storage. The servers run different applications or service different clients, and each server has exclusive access to the block storage data for its own clients or applications.
Servers may be arranged in pairs or clusters that are ‘resilient’, i.e., each server is aware of the status or operation of the other servers and can take over from another server in the event of failure. When such resilience operates, it is essential that only one server at a time attempts to access the data, to avoid corruption. Therefore, when failure of a server is detected and its functions are assumed by another server, it is usual for the failed server to be powered down and thereby effectively permanently removed from the cluster.
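The conventional takeover just described can be illustrated with a minimal sketch. It assumes failure is inferred from a missed heartbeat, which the text above does not specify, and every name in it (`Server`, `fail_over`, `check_cluster`, the `timeout` value) is hypothetical and chosen only for illustration. The failed server is powered down before its volumes are reassigned, so that no volume has two writers at any point.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Hypothetical model of a cluster member; not taken from any real product.
    name: str
    powered_on: bool = True
    last_heartbeat: float = 0.0
    volumes: set = field(default_factory=set)


def fail_over(failed, survivor):
    """Conventional takeover: power down the failed server, then move its volumes."""
    failed.powered_on = False            # failed server can no longer access the data
    survivor.volumes |= failed.volumes   # survivor assumes the failed server's functions
    failed.volumes = set()               # each volume now has exactly one owner


def check_cluster(servers, now, timeout=5.0):
    """Treat a missed heartbeat as failure (an assumption); return names removed."""
    removed = []
    for s in servers:
        if s.powered_on and now - s.last_heartbeat > timeout:
            # Pick any other server that is powered on and still sending heartbeats.
            survivor = next(
                (o for o in servers
                 if o is not s and o.powered_on
                 and now - o.last_heartbeat <= timeout),
                None,
            )
            if survivor is not None:
                fail_over(s, survivor)
                removed.append(s.name)
    return removed
```

In this sketch, nothing ever sets `powered_on` back to `True`. A server whose heartbeat was merely delayed is removed just as permanently as one that has genuinely failed.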
Although the system continues to function without the failed server, there are instances where the failure may be temporary or recoverable. Because the failed server has been powered down, however, this cannot be detected. It would be more efficient if temporary or recoverable failures did not result in permanent removal of a server from active functioning.