When performing system health monitoring, or similar operations that attempt to obtain a comprehensive status or “big-picture” view of a large-scale system, it is often difficult to move from a collection of individual measurements to a useful view of the system as a whole. One conventional approach to obtaining measurements for large systems is sometimes referred to as “black-box monitoring” (“BBM”). When utilizing BBM, no knowledge of the internal state of the overall system is used to help measure the availability of the various components of the system. Instead, the system is used as a customer (hereinafter referred to as a “user”) would use it, and the response is evaluated to generate a signal.
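By way of illustration only, a single black-box check of the kind described above may be sketched as follows. The function name `black_box_probe`, the `request` callable, and the `expected` value are hypothetical and are not drawn from any particular monitoring system; the sketch assumes that the user-facing operation can be wrapped in a callable that returns a response.

```python
from typing import Callable


def black_box_probe(request: Callable[[], str], expected: str) -> bool:
    """Exercise the system as a user would and reduce the response to a signal.

    `request` is a hypothetical stand-in for any user-facing operation.
    No internal state of the system is inspected; only the response is judged.
    """
    try:
        return request() == expected
    except Exception:
        # Any failure to obtain a response is treated as a failed check.
        return False


# Usage: a stand-in "service" that answers as a healthy system would.
signal = black_box_probe(lambda: "pong", expected="pong")
```

The check yields only a pass/fail signal, which is what makes the approach independent of the internal structure of the system.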
An example of this conventional approach for monitoring large-scale systems is shown in FIG. 1, and includes the use of many distributed sensors, each of which performs an individual test and produces a binary value (e.g., up/down, good/bad, OK/Not-OK, etc.) as output. Test output of this kind can be very noisy in complex systems, where the failure of an individual component does not necessarily mean that the service is actually down. The output generated by these many sensors is aggregated, and rules then compare the proportion of successes to failures against one or more thresholds.
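By way of illustration only, the aggregation and threshold comparison described above may be sketched as follows. The function names, the status labels, and the threshold values of 90 and 50 percent are assumptions chosen for the example and are not taken from FIG. 1 or from any particular system.

```python
from typing import Iterable


def success_percentage(results: Iterable[bool]) -> float:
    """Return the percentage of sensors that reported success."""
    results = list(results)
    if not results:
        raise ValueError("no sensor results to aggregate")
    return 100.0 * sum(results) / len(results)


def classify(results: Iterable[bool],
             warn_threshold: float = 90.0,
             down_threshold: float = 50.0) -> str:
    """Compare the aggregated success percentage against two thresholds.

    Both thresholds are hypothetical values chosen for illustration.
    """
    pct = success_percentage(results)
    if pct < down_threshold:
        return "DOWN"
    if pct < warn_threshold:
        return "DEGRADED"
    return "UP"


# Usage: 95 of 100 sensors report success, so isolated failures are tolerated.
status = classify([True] * 95 + [False] * 5)
```

Because the rule looks only at the aggregate percentage, a handful of failing sensors does not by itself change the reported status, while the identity of the sensors that failed is discarded.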