An event is an output of a failure detection system that conveys the occurrence of an error. Typically, a sensing agent periodically samples the system parameters of a monitored system. By way of example, these parameters can include the number of pages paged in or out since the last reboot, which is available in Linux via the “proc” file system; consecutive values read X seconds apart can then be used to compute the paging (in or out) rate.
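By way of illustration only, the rate computation described above can be sketched as follows. The sketch assumes that the counters are exposed as the `pgpgin` and `pgpgout` fields of `/proc/vmstat`; the unit of these counters depends on the kernel, so they are treated here as opaque counts. The function names `parse_counters` and `paging_rate` are hypothetical.

```python
def parse_counters(vmstat_text, names=("pgpgin", "pgpgout")):
    """Extract the named cumulative counters from "name value" lines.

    Assumption: the input has the layout of /proc/vmstat, one
    whitespace-separated "name value" pair per line.
    """
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            counters[parts[0]] = int(parts[1])
    return counters


def paging_rate(earlier, later, interval_seconds):
    """Rate of change per second between two samples taken X seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        name: (later[name] - earlier[name]) / interval_seconds
        for name in earlier
    }
```

A sensing agent could, for instance, read `/proc/vmstat` twice, X seconds apart, pass each reading through `parse_counters`, and hand both results to `paging_rate` together with X.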
A post-processing rule can be defined over the results of periodically querying the sensing agent. Such a rule can be, for example, a check of the sustained paging rate over a sequence of queries to see if it exceeds a threshold. A problem event is generated if the rule evaluates to TRUE. By way of example, the rule “≧400 pages/second for 5 consecutive queries,” if evaluated to TRUE, would result in a “SYSTEM_THRASHING” event. A resolved event, indicating that the problem has subsided, is generated if a previously TRUE rule evaluates to FALSE.
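By way of illustration only, such a rule can be sketched as a small state machine that emits a problem event when the rule first evaluates to TRUE and a resolved event when a previously TRUE rule evaluates to FALSE. The class name `SustainedThresholdRule` and the representation of events as tuples are assumptions made for this sketch.

```python
class SustainedThresholdRule:
    """Rule of the form ">= threshold for N consecutive queries"."""

    def __init__(self, event_name, threshold, required_consecutive):
        self.event_name = event_name
        self.threshold = threshold
        self.required_consecutive = required_consecutive
        self.streak = 0       # consecutive queries at or above the threshold
        self.active = False   # True while a problem event is outstanding

    def observe(self, value):
        """Feed one query result; return an event tuple or None."""
        if value >= self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        rule_is_true = self.streak >= self.required_consecutive
        if rule_is_true and not self.active:
            self.active = True
            return (self.event_name, "problem")
        if not rule_is_true and self.active:
            self.active = False
            return (self.event_name, "resolved")
        return None


# The example rule from the text: >= 400 pages/second for 5 consecutive queries.
rule = SustainedThresholdRule("SYSTEM_THRASHING", threshold=400, required_consecutive=5)
```

Each call to `rule.observe(rate)` corresponds to one query of the sensing agent. In this sketch, a single query below the threshold is enough to reset the streak and, if a problem event is outstanding, to generate the resolved event.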
An action is defined herein as the corrective steps taken by an autonomic system to resolve a failure reported on an IT element. Event storms are the manifestation of an important class of abnormal behaviors in distributed systems. They occur when a large number of nodes throughout the system generate a set of events within a small period of time. By way of example and illustration, let M={m1, m2, . . . , mn} be the set of monitoring component types in an environment, and let X={x1, x2, . . . , xm} be the set of IT elements being monitored by a subset of M.
One or more monitoring systems may be configured to alert for the same event or failure. Let F={f1, f2, . . . , fp} be the set of error or failure types. Let ejk be an event indicating a failure of type fj on IT element xk, where ejk=1 if the failure is reported and ejk=0 if the failure has been resolved. A monitoring component mt can generate event ejk upon detection of error or failure fj. Let A={a1, a2, . . . , ao} be the set of actions taken on the monitored end points to resolve an error.
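By way of illustration only, the notation above can be mirrored in a small data model. The sketch below keeps the latest value of ejk for each pair (fj, xk) and counts how many distinct IT elements reported a failure within a recent time window. The names `Event` and `EventTracker`, the window length, and the storm threshold are all assumptions made for this sketch; the text does not fix what counts as a "large number" of nodes or a "small period" of time. Events are assumed to arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    failure_type: str   # f_j
    element: str        # x_k
    status: int         # e_jk: 1 = failure reported, 0 = failure resolved
    timestamp: float    # seconds
    monitor: str        # monitoring component that generated the event


class EventTracker:
    def __init__(self, window_seconds, storm_threshold):
        self.window_seconds = window_seconds      # assumed "small period of time"
        self.storm_threshold = storm_threshold    # assumed "large number of nodes"
        self.state = {}        # (failure_type, element) -> latest e_jk
        self.recent = deque()  # (timestamp, element) of recent problem events

    def record(self, event):
        # Several monitors alerting on the same failure share one key.
        self.state[(event.failure_type, event.element)] = event.status
        if event.status == 1:
            self.recent.append((event.timestamp, event.element))
        cutoff = event.timestamp - self.window_seconds
        while self.recent and self.recent[0][0] < cutoff:
            self.recent.popleft()

    def open_failures(self):
        """Pairs (f_j, x_k) whose latest e_jk is 1."""
        return sorted(key for key, status in self.state.items() if status == 1)

    def is_storm(self):
        """True if enough distinct elements reported within the window."""
        distinct = {element for _, element in self.recent}
        return len(distinct) >= self.storm_threshold
```

Keying the state on (fj, xk) rather than on the monitoring component means that duplicate alerts from several monitors for the same failure update a single entry.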
However, when an event storm occurs in a distributed system, challenges exist in responding to the event storm in real time, at a level higher than that of individual failures, while keeping interference with the system to a minimum.