Recovery from computer failures due to software or hardware problems is a significant part of managing large data centers. Some failures can be fixed automatically through actions such as rebooting or re-imaging the computer.
In large environments, it is prohibitively expensive to have a technician decide on a repair action for each observed problem. As a result, data centers often employ recovery systems that use an automatic repair policy or controller to choose appropriate repair actions. Typically, the repair policy or controller is manually defined by a human expert. More particularly, the expert creates a policy that maps the state of the system to recovery actions by specifying a set of rules or conditions under which each action is to be taken.
However, such policies and controllers are almost never optimal: even though they often fix an error, the same error could frequently have been fixed in a faster (and thus correspondingly less expensive) way. As such, these policies may result in longer failure periods (system downtime) than needed.
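As an illustration only, a hand-written policy of the kind described above might be sketched as an ordered list of condition/action rules. Everything in this sketch is hypothetical and not taken from any particular recovery system: the `MachineState` fields, the rule ordering, and the action names `reboot`, `reimage`, and `call_technician` are assumptions chosen to mirror the repair actions mentioned in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class MachineState:
    """Hypothetical observed state of a failed machine."""
    symptom: str          # illustrative label, e.g. "unresponsive"
    failed_reboots: int   # reboots already attempted for this failure


# A rule pairs a condition on the state with the action to take.
Rule = Tuple[Callable[[MachineState], bool], str]

# Expert-authored rules, checked in order (assumed escalation scheme).
RULES: List[Rule] = [
    (lambda s: s.failed_reboots == 0, "reboot"),
    (lambda s: s.failed_reboots == 1, "reimage"),
]


def choose_action(state: MachineState,
                  rules: List[Rule] = RULES,
                  default: str = "call_technician") -> str:
    """Return the action of the first rule whose condition holds."""
    for condition, action in rules:
        if condition(state):
            return action
    return default
```

Note that this sketch always tries a reboot first, whatever the symptom. If a reboot rarely fixes a given symptom, the time spent on it is added downtime, which is one way a fixed, manually defined policy can be slower than necessary.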