Clustered storage systems can include a potentially large number of distributed nodes interconnected by a mesh network or other type of communication network. A given node of such a system typically runs processes that interact with one or more other nodes. If a process fails, whether due to a software error or another type of failure, it will generally need to be restarted. However, one or more commands that the failed process issued to other nodes prior to the failure may not yet have completed execution on those nodes. Such “in-flight” commands can interfere with the restarted process and thereby cause problems in the clustered storage system. Many conventional systems are therefore configured to wait until all in-flight commands of the failed process have completed before restarting the process. Unfortunately, these and other conventional approaches can result in excessively long wait times before restart, which can significantly undermine system performance. In some cases, the wait times can even extend beyond system-defined timeout limits, causing the node to be designated as a failed node, which further undermines system performance as well as system redundancy.
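
By way of illustration only, the conventional wait-before-restart behavior described above can be sketched as follows. This is a minimal model under stated assumptions, not an implementation of any particular system: the names InFlightCommand, remaining_seconds, conventional_restart_delay and node_timeout_seconds are hypothetical, and the sketch assumes the remaining execution time of each in-flight command is known in advance, which a real system generally could not assume.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InFlightCommand:
    """Hypothetical record of a command issued to another node before the failure."""
    target_node: str
    remaining_seconds: float  # assumed known here, purely for illustration


def conventional_restart_delay(
    commands: List[InFlightCommand],
    node_timeout_seconds: float,
) -> Tuple[float, bool]:
    """Model the conventional policy: wait for every in-flight command to finish.

    Returns (delay before restart, whether the node would be designated failed).
    """
    # The restart is held back by the slowest outstanding command.
    delay = max((c.remaining_seconds for c in commands), default=0.0)
    # If the wait exceeds the system-defined timeout, the node is marked failed.
    node_marked_failed = delay > node_timeout_seconds
    return delay, node_marked_failed


commands = [
    InFlightCommand("node-2", 1.5),
    InFlightCommand("node-3", 42.0),  # one slow command dominates the wait
    InFlightCommand("node-7", 0.2),
]
delay, failed = conventional_restart_delay(commands, node_timeout_seconds=30.0)
print(delay, failed)  # prints: 42.0 True
```

In this sketch a single slow command on one remote node delays the restart for the full 42 seconds, which exceeds the assumed 30-second timeout, so the node would be designated as failed even though the other in-flight commands complete almost immediately.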