In known distributed software applications, software components are distributed among a plurality of executables (i.e., software capsules or software entities). Each of the executables contains one or more software components that perform some portion of the functionality of the distributed software application. The software application may comprise management software and, in one example, one or more backup copies of the management software. If the management software fails, then one of the backup copies becomes active and substitutes for the failed management software.
Prior to a failure of the management software, the management software performs frequent periodic checkpointing of its view of the state of all software components of the software application, and stores that state to stable storage. Upon a failure of the management software, the backup management software accesses the stored state information and assumes that the most recent copy of the state information is correct. The backup management software may thus use state information of the software application that was acquired prior to the failure. As one shortcoming, the state of the software components may have changed between the last checkpoint and the failure, in which case the stored state information no longer reflects the actual state of the software application.
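The stale-state shortcoming described above can be illustrated with a minimal sketch. It is not taken from any known system: the class `ManagementSoftware`, the functions `recover_view` and `demo`, the component names, and the use of a JSON file as stable storage are all assumptions made for illustration only.

```python
import json
import os
import tempfile


class ManagementSoftware:
    """Hypothetical manager that checkpoints its view of component states."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.view = {}  # component name -> last observed state

    def observe(self, component, state):
        # Update the in-memory view; nothing is persisted here.
        self.view[component] = state

    def checkpoint(self):
        # Write to a temporary file and rename it, so that a crash during
        # the write cannot leave a partially written checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.view, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)


def recover_view(checkpoint_path):
    # The backup assumes the most recent checkpoint is correct.
    with open(checkpoint_path) as f:
        return json.load(f)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "checkpoint.json")
        manager = ManagementSoftware(path)
        manager.observe("component_a", "running")
        manager.observe("component_b", "running")
        manager.checkpoint()

        # component_b changes state after the last checkpoint, and the
        # manager fails before the next periodic checkpoint is taken.
        manager.observe("component_b", "stopped")
        actual = dict(manager.view)

        recovered = recover_view(path)
        return recovered, actual
```

In this sketch, `demo()` returns a recovered view that still reports `component_b` as running, while the actual state at the time of failure was stopped. Shortening the checkpoint interval narrows this window but increases the number of interruptions.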
To save the state of the software components in case of a failure, the management software and/or the software components are repeatedly interrupted during normal operation. As another shortcoming, the frequent periodic checkpointing of the state of all software components consumes time and resources of the management software and/or the software components that could otherwise be used for operation of the software application.
Thus, a need exists for management software that obtains accurate state information of a software application for use during recovery from a failure. A further need exists for a reduction in the number of interruptions of the management software and/or the software components to obtain the state information for recovery.