Distributed computing systems are often used to provide high throughput or nearly continuous availability. A distributed computing system typically includes two or more computing devices that frequently operate somewhat autonomously and communicate with each other over a network or other communication path. A computing device of a distributed system that is capable of sharing resources is often referred to as a cluster. A cluster has two or more processor nodes, each node having a processor or at least a processor resource and, typically, a separate operating system.
Certain tasks are frequently correlated among the processor nodes of the distributed computing system. For example, a distributed computing system may be reconfigured from time to time as computing needs change, or as devices or software are added, upgraded or replaced. Such reconfiguration is often controlled at a processor node of the distributed computing system. As a particular processor node reconfigures the distributed computing system, the other processor nodes are kept informed over the communication links that connect the processor nodes to each other.
The processor node performing the reconfiguration, as well as each processor node that is informed of the reconfiguration, typically performs various housekeeping tasks automatically in response, including clean-up and database update tasks that reflect the changes in the system. Thus, a reconfiguration operation is an example of a multi-node correlated operation, in which various tasks related to the operation are correlated among the processor nodes of the distributed computing system. However, because a communication link may fail or a processor node may go offline, the reconfiguration may occur while one or more processor nodes are offline or otherwise out of communication with the processor node conducting the reconfiguration. As a result, the missing processor nodes may not have performed the housekeeping tasks associated with the reconfiguration. In such cases, when a missing processor node comes back online or communication is otherwise reestablished, the system operator typically must manually direct the returning processor node to perform the housekeeping tasks associated with the reconfiguration that occurred while that node was inactive or out of communication with the other nodes of the system.
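The situation described above can be illustrated with a minimal sketch. It is not drawn from any particular system: the `Node` class, the `reconfigure` function, and the `applied_version` counter are all hypothetical names introduced only for illustration, and the housekeeping tasks are reduced to recording a version number. The sketch assumes that a notification sent to an offline node is simply lost.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical processor node; every name here is illustrative."""
    name: str
    online: bool = True
    applied_version: int = 0  # last reconfiguration this node cleaned up after

    def housekeeping(self, version: int) -> None:
        # Stand-in for the clean-up and database update tasks.
        self.applied_version = version


def reconfigure(initiator: Node, peers: list[Node], version: int) -> list[str]:
    """Apply a reconfiguration, notify peers, and return the nodes that missed it."""
    initiator.housekeeping(version)
    missed = []
    for peer in peers:
        if peer.online:
            peer.housekeeping(version)  # reachable peer reacts automatically
        else:
            missed.append(peer.name)    # assumption: the notice is simply lost
    return missed


a, b, c = Node("A"), Node("B"), Node("C", online=False)
missed = reconfigure(a, [b, c], version=1)

c.online = True                         # C returns, but nothing triggers its tasks
stale_after_return = c.applied_version  # still reflects the old configuration

c.housekeeping(1)                       # the operator's manual intervention
```

In this sketch node C remains at its old version after returning, and only the explicit final call, standing in for the operator's manual step, brings it up to date.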