Conventional fault tolerance for processes typically uses some type of heartbeat communication between two servers. In this manner, if a first process fails on a first server (or if the entire first server fails), a second server will recognize that the heartbeat from the first server has stopped. The second server will then start up another instance of the process. However, for recovery of event-driven services, which run within the server process, the heartbeat mechanism alone is not sufficient. For example, if a process is modeled as a shell, the real logic resides in the individual threads of execution, i.e., services, which run within the shell. Thus, it is important to ensure that when failover occurs, not only is the process recovered, but all services hosted in the failed process are also restarted from a last known good state. The conventional heartbeat mechanism is necessary to detect the failure of the process and to enable another eligible process on a different server to execute the service, but it is unable to automatically restart the services of the failed process on the new server from the last known good state.
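By way of illustration only, the limitation of the conventional heartbeat mechanism described above might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular system: the names `ProcessShell`, `HeartbeatMonitor`, and `conventional_failover`, the timeout value, and the example service are all hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessShell:
    """A process modeled as a shell that hosts services (hypothetical model)."""
    server: str
    # service name -> last state the service reached inside this process
    services: dict = field(default_factory=dict)


class HeartbeatMonitor:
    """Runs on the second server and watches heartbeats from the first."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # seconds of silence treated as a failure
        self.last_beat = {}     # server name -> time of the last heartbeat

    def record_beat(self, server: str, now: float) -> None:
        self.last_beat[server] = now

    def has_failed(self, server: str, now: float) -> bool:
        # A server that never sent a heartbeat is not reported as failed.
        last = self.last_beat.get(server)
        return last is not None and (now - last) > self.timeout


def conventional_failover(monitor, failed_server, standby_server, now):
    """Start a replacement process shell if the heartbeat has stopped.

    Only the shell is restarted. Nothing here knows which services the
    failed process hosted or what state they had reached.
    """
    if monitor.has_failed(failed_server, now):
        return ProcessShell(server=standby_server)
    return None


# Assumed example: one service running inside the process on the first server.
primary = ProcessShell("server-1", {"order-events": "processed event 41"})
monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_beat("server-1", now=0.0)

# Within the timeout, no failover takes place.
still_alive = conventional_failover(monitor, "server-1", "server-2", now=2.0)

# After the heartbeat stops, a new shell starts on the second server,
# but it hosts no services and carries none of their state.
replacement = conventional_failover(monitor, "server-1", "server-2", now=10.0)
```

In this sketch the replacement shell on the second server comes up with an empty set of services, which is the gap the passage above identifies: the failure is detected and the process is restarted, but the hosted services and their last known good state are not carried over.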
Therefore, there is a need to address not only the restart of the actual process shell on a different server, but also the restart of all of the services that were running in that process shell. This is also true of instances where a process is shut down in a controlled manner, but the services running in the process are still active and need to be restarted on a second server. The present invention provides solutions to these and other limitations in the prior art.