Servers within a distributed network, such as a managed information technology (IT) environment, perform transactions with other servers and use resources within the system. Because the servers depend on other servers and resources, the security, operability, and reliability of each server become more important. If a server fails or has its security breached, it may affect other servers and resources that were engaged in transactions with that server at the time of the failure. Whether a server has failed completely or its condition has merely degraded is important information to a distributed network. Thus, it is important to know the health status of each server in order to maintain its security and operability.
Typically, in a distributed network, every server is health checked for vulnerabilities on a regular basis. The health checking process conventionally includes a mechanism for polling each active server with a query or script on a periodic basis. The query or script returns results indicating such things as whether a server of the distributed network is operating, whether aspects of the server are operational, and the like. Based on the results of the query or script, malfunctioning or at-risk servers can be remediated and put back online once they regain operability or secure status. However, checking every server for health or vulnerabilities in this way requires significant effort and time, both to perform the periodic health checks and to analyze the results.
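The conventional polling process described above can be sketched as follows. This is only a minimal illustration, assuming a hypothetical `run_check` callable that stands in for the query or script and returns a pass/fail result per checked aspect; the function names and the result format are assumptions, not part of any described system.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical result format: aspect name -> True if that aspect is operational.
CheckResult = Dict[str, bool]


def poll_servers(
    servers: Iterable[str],
    run_check: Callable[[str], CheckResult],
) -> Dict[str, CheckResult]:
    """Run the query or script against every active server, one by one."""
    return {server: run_check(server) for server in servers}


def servers_to_remediate(results: Dict[str, CheckResult]) -> List[str]:
    """Return the servers for which at least one checked aspect failed."""
    return [name for name, result in results.items() if not all(result.values())]


def example_check(server: str) -> CheckResult:
    """Stand-in for a real query or script; the values are made up."""
    canned = {
        "web-01": {"operating": True, "patched": True},
        "db-01": {"operating": True, "patched": False},
    }
    return canned[server]


if __name__ == "__main__":
    results = poll_servers(["web-01", "db-01"], example_check)
    print(servers_to_remediate(results))
```

Note that every server is checked on every pass regardless of its condition, which is the source of the effort described above.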
One approach to overcome this labor-intensive process is an automated check system having a server manager that is configured to determine whether a health check is required for a particular server based on one or more predefined policies. For example, a health check may be triggered by a policy that requires that a health check be performed after a period of time has elapsed. In this case, the expiration of the period of time specified by an interval parameter will trigger a health check for a server. On the other hand, if a health check is not triggered, the automated check system continues in a standby state, waiting for a triggering event defined by a policy to occur regarding a particular server.
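A minimal sketch of such an interval-based trigger is shown below. The class names `IntervalPolicy` and `ServerManager`, their methods, and the use of plain numeric timestamps in seconds are all hypothetical choices made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IntervalPolicy:
    """Hypothetical policy: trigger once a fixed period has elapsed."""

    interval_seconds: float  # the interval parameter

    def is_triggered(self, last_check: float, now: float) -> bool:
        return now - last_check >= self.interval_seconds


class ServerManager:
    """Hypothetical server manager that applies predefined policies."""

    def __init__(self, policies: List[IntervalPolicy]) -> None:
        self.policies = policies
        self.last_check: Dict[str, float] = {}  # server name -> last check time

    def register(self, server: str, now: float) -> None:
        """Start tracking a server, treating `now` as its last check time."""
        self.last_check[server] = now

    def servers_due(self, now: float) -> List[str]:
        """Return servers for which any policy triggers a health check."""
        return [
            server
            for server, checked_at in self.last_check.items()
            if any(p.is_triggered(checked_at, now) for p in self.policies)
        ]

    def record_check(self, server: str, now: float) -> None:
        """Reset the interval after a health check has been performed."""
        self.last_check[server] = now


if __name__ == "__main__":
    manager = ServerManager([IntervalPolicy(interval_seconds=3600)])
    manager.register("web-01", now=0)
    manager.register("db-01", now=3000)
    # One hour in: only the server whose interval has expired is due;
    # the other remains in standby until its own interval elapses.
    print(manager.servers_due(now=3600))
```

In this sketch, a server for which no policy triggers is simply left out of the returned list, which corresponds to the standby state.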
However, such a policy-based approach does not consider the actual risk of a server being unhealthy or vulnerable. Instead, such an approach utilizes generic triggers, such as the expiration of a period of time or the capacity of a hard drive, in order to perform health checking in a routine manner without regard to the actual risk of a server being unhealthy or vulnerable. In such an approach, health checks are still performed on servers at “low risk” of being unhealthy or vulnerable. Execution of these health checks on “low risk” servers takes effort away from incident resolution, project implementation, and new business opportunities. Further, and potentially worse, a server at high risk of failure may not be checked at all because not enough time has elapsed to trigger a health check, leaving the system vulnerable to a potentially catastrophic failure or security breach.