Large-scale distributed systems are increasingly used in applications that require a high level of availability and/or processing capability. Popular uses for distributed systems include search engines and other online applications. Typically, such distributed systems include hundreds to hundreds of thousands of computing devices, each executing one or more processes. As with any computing device, each of the computing devices operating within the distributed system may fail. However, because of the large scale of the systems, correcting such failures in a rapid and economically feasible way may be difficult.
One solution to the failure of computing devices is a repair service. Typically, a repair service monitors the computing devices of the distributed system for failures and, according to a policy, takes one or more repair actions based on any detected failures. For example, if the repair service determines that a computing device is not responsive, then the policy may dictate that the computing device be rebooted. While such repair services are effective, it is complex and expensive to (a) measure the effectiveness of a particular policy or repair action in the distributed system and (b) determine and adjust the accuracy of the sensors used to detect the failures.
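By way of illustration only, the relationship between a detected failure, a policy, and a repair action may be sketched as follows. The names used here (`Failure`, `RepairAction`, `POLICY`, `detect_failure`, `plan_repairs`) and the device names are hypothetical and are not drawn from any particular repair service; the sketch assumes that a sensor reports only whether a device is responsive and that the policy is a simple lookup table.

```python
from enum import Enum
from typing import Dict, Optional


class Failure(Enum):
    UNRESPONSIVE = "unresponsive"


class RepairAction(Enum):
    REBOOT = "reboot"


# Hypothetical policy: maps each detected failure to the repair action to take.
POLICY: Dict[Failure, RepairAction] = {
    Failure.UNRESPONSIVE: RepairAction.REBOOT,
}


def detect_failure(responsive: bool) -> Optional[Failure]:
    """Stand-in for a sensor: reports a failure when a device does not respond."""
    return None if responsive else Failure.UNRESPONSIVE


def plan_repairs(devices: Dict[str, bool]) -> Dict[str, RepairAction]:
    """Return the repair action the policy dictates for each failed device."""
    plan: Dict[str, RepairAction] = {}
    for name, responsive in devices.items():
        failure = detect_failure(responsive)
        if failure is not None and failure in POLICY:
            plan[name] = POLICY[failure]
    return plan


if __name__ == "__main__":
    # Assumed example input: one responsive device and one unresponsive device.
    print(plan_repairs({"node-1": True, "node-2": False}))
```

In this sketch, only the unresponsive device receives a repair action. A table-driven policy of this kind is one possible design; it keeps the mapping from failures to actions separate from the monitoring logic, so the policy may be changed without altering how failures are detected.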