Many systems employ monitoring software that triggers alarms when the systems fail to operate within predetermined bounds. Typically, the software tracks technical metrics and triggers an alarm, such as a message, a warning, or another indication, when a threshold value is reached or exceeded. The technical metrics often measure hardware activity, such as processor load, memory usage, and network and resource access. However, other technical metrics, such as response time (latency), may also be used. These metrics are often applied generically, regardless of the prescribed operation of a system. For example, a system used to control power in a hospital may be monitored with technical metrics similar to those of another, less critical, system used to support casual gaming over a network. For some systems, such as the casual gaming system, an occasional failure may have minimal consequence beyond the possible inconvenience of some users. However, it may not be acceptable for the system that controls power in the hospital to fail, or to fail often, since the consequences may involve human lives.
Another problem with the use of technical metrics is that they often trigger many false positives, which falsely indicate that the system is not operating properly. In such cases, the system may in fact be operating properly or as expected, but may be experiencing a larger number of requests than usual, for example. Although false positives may be acceptable from the standpoint of system performance, they can be expensive because they often trigger downstream processes that may include increased human interaction, throttling of services, or reallocation of other computing resources.
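By way of illustration only, the threshold-based monitoring described above, and the way it can produce a false positive, may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular monitoring product: the `Threshold` type, the `check_metrics` function, the metric names `cpu_load` and `latency_ms`, and the limit values are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name, e.g. "cpu_load"
    limit: float  # an alarm fires when a sample reaches or exceeds this value


def check_metrics(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alarm message per metric that reaches or exceeds its limit."""
    alarms = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped rather than treated as failures.
        if value is not None and value >= t.limit:
            alarms.append(f"ALARM: {t.metric}={value} (limit {t.limit})")
    return alarms


# Illustrative limits only (assumed values, not taken from any real system).
THRESHOLDS = [Threshold("cpu_load", 0.90), Threshold("latency_ms", 500.0)]

# A burst of requests raises processor load while requests are still served promptly.
busy_but_healthy = {"cpu_load": 0.95, "latency_ms": 120.0}

alarms = check_metrics(busy_but_healthy, THRESHOLDS)
print(alarms)
```

In this sketch the processor-load alarm fires even though latency remains well within its limit, which corresponds to the false-positive case described above: the generic metric crosses its threshold while the system continues to operate as expected. The same thresholds would be applied whether the monitored system controlled hospital power or supported casual gaming, since nothing in the check reflects the prescribed operation of the system.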