Operations monitoring systems often apply monitoring rules against streams of events generated in the course of operations. The stream of events is used to evaluate or characterize operations, such as whether operations are proceeding normally or whether one or more problems are occurring. One key measure of the efficiency of such monitoring systems is the Mean Time to Mitigate (MTTM), where a shorter MTTM indicates a more efficient system. MTTM is the mean time from the moment a problem first appears to the moment the problem is mitigated. MTTM depends on a metric called Time to Detect (TTD), which is the time from when the problem first appears until the problem is detected. After all, a course of action for mitigating a problem cannot be initiated until the problem itself is identified.
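The relationship between the two metrics can be illustrated with a minimal sketch. The incident timestamps and the function names below are hypothetical and chosen only for illustration; they are not taken from any particular monitoring system.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (first_appeared, detected, mitigated).
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 35)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15), datetime(2024, 1, 2, 15, 0)),
]


def time_to_detect(appeared: datetime, detected: datetime) -> timedelta:
    """TTD: time from when the problem first appeared until it was detected."""
    return detected - appeared


def mean_time_to_mitigate(records) -> timedelta:
    """MTTM: mean time from first appearance to mitigation across incidents."""
    seconds = [(mitigated - appeared).total_seconds()
               for appeared, _detected, mitigated in records]
    return timedelta(seconds=mean(seconds))


ttds = [time_to_detect(a, d) for a, d, _m in incidents]
mttm = mean_time_to_mitigate(incidents)
```

Because each time to mitigate includes the corresponding TTD, any reduction in detection latency directly reduces the MTTM, all else being equal.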
Accordingly, low-latency problem detection solutions have been developed for such monitoring systems. One way to provide low latency is to offload local event processing to agent machines, while leaving cross-component processing, aggregation, and other higher-level processing to central management servers. This solution works well for applications deployed on a single machine, where a local agent can cover the monitoring needs of a given application.
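The division of labor between local agents and a central management server might be sketched as follows. This is a simplified illustration under stated assumptions: the class names `LocalAgent` and `CentralServer`, the `Event` fields, and the rule "flag every error-severity event" are all hypothetical and do not describe any specific product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    machine: str
    component: str
    severity: str


class LocalAgent:
    """Runs on one machine and applies a simple rule to that machine's events."""

    def __init__(self, machine: str):
        self.machine = machine

    def process(self, events):
        # Hypothetical local rule: flag error events produced on this machine.
        return [e for e in events
                if e.machine == self.machine and e.severity == "error"]


class CentralServer:
    """Receives only the flagged events and aggregates them across machines."""

    def __init__(self):
        self.errors_by_component = Counter()

    def ingest(self, alerts):
        for alert in alerts:
            self.errors_by_component[alert.component] += 1


events = [
    Event("m1", "db", "error"),
    Event("m1", "web", "info"),
    Event("m2", "db", "error"),
    Event("m2", "web", "error"),
]

server = CentralServer()
for name in ("m1", "m2"):
    server.ingest(LocalAgent(name).process(events))
```

In this sketch each agent sees only its own machine's events, so a problem confined to one machine is detected locally, whereas a pattern spanning machines (here, `db` errors on both `m1` and `m2`) becomes visible only after the central server aggregates the forwarded alerts.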