The invention relates generally to information technology (IT), and relates more particularly to ensuring the dependability of IT environments.
IT environments are relied upon for business-critical functionality, and are thus designed to provide so-called dependable service. A dependable IT environment is one that tolerates and recovers from unexpected conditions in the environment, such as component failures, software bugs, unanticipated load patterns, malicious attacks, correlated failures, human operator errors and the like.
IT environments are typically designed for dependability under the assumption that the IT environment and its dependability-ensuring mechanisms are in good working order. In practice, however, an IT environment will frequently encounter circumstances in which this assumption does not hold, and dependability is compromised as a result. This may occur, for example, due to conditions that were not anticipated by the IT environment's architects or due to the accumulation of latent problems (e.g., failures, misconfigurations or corruptions) that do not on their own cause loss of dependability, but do reduce the readiness of the IT environment to handle future unexpected conditions. For example, the failure of a backup node in a clustered computing environment may not, on its own, affect functionality or performance (since the backup node is not used in normal operations). The failure will, however, make the computing environment vulnerable, because additional node failures cannot be tolerated.
Thus, there is a need for a method and an apparatus for detecting dependability vulnerabilities in production IT environments.