Storage systems are expected to be highly available and highly reliable. Many storage systems use some form of erasure coding, including RAID (redundant array of independent disks), which tolerates the failure of one or two storage devices, depending on the implementation. Yet individual storage devices can fail intermittently or permanently, or even power up in a failed state. One or two failed storage devices can reduce a storage system to a critical level of redundancy, beyond which any further device failure results in loss of data. When storage devices are failing, it can be risky to perform diagnostics on yet another storage device, or to remove a storage device and replace it with an upgraded storage device. Many storage devices have a heartbeat function that can be monitored to detect failure. In a distributed storage system, however, two or more processes could conflict over a storage device, which makes diagnostics based on the heartbeat risky. In addition, diagnosing a storage device, servicing or replacing a storage device, and even performing routine maintenance on a storage device can disrupt storage operations in the storage system, which is undesirable. Therefore, there is a need in the art for a solution which overcomes the drawbacks described above.