Software systems may be subject to partial failures, violations of an established service-level agreement (SLA), or unexpected responses to workload. Recovery from such failures, violations, or unexpected responses can include, for example, rebooting a system or, if rebooting is insufficient, further expert analysis. For example, to determine the cause of a failure, an expert may need to manually evaluate a series of events. Once the cause is identified, a recovery mechanism may be applied to correct the failure. These processes can be time-consuming and complex, depending, for example, on the complexity of the software, the cause of the failure, and the complexity of the recovery mechanism.