The present invention relates generally to automating operations in accordance with performance management systems and, more specifically, to facilitating the specification of exceptional conditions in such systems.
Approximately 80% of the cost of networked systems can be attributed to management and operations. These costs are related to activities such as distributing and installing software, providing help desk support, and detecting performance and availability problems.
Generally, existing practice for detecting performance and availability problems consists of the following steps: (i) determining a set of metrics to monitor that indicate the presence of problems (e.g., CPU utilization, error rates, transaction request rates); (ii) establishing thresholds on the values of these metrics based on past experience; (iii) using management system software to detect threshold violations; and (iv) responding to threshold violations by taking actions (e.g., adjusting user priorities, restricting the admission of traffic into the network).
So fundamental are these steps to existing practice that the information that drives them is typically externalized as management "policies." A policy consists of a metric (or function of multiple metrics), a relational operator that specifies the direction of change in the metric that is undesirable, a threshold value, and an action to take when the threshold is violated.
Typically, policies are expressed as if-then rules. The if-part (or left-hand side, LHS, of the policy) contains a predicate expressed as a bound on one or more metrics. The then-part (or right-hand side, RHS, of the policy) contains the action to take. An example is: "If CPU utilization is greater than 90%, then alarm." Here, "CPU utilization" is the metric, the relational operator is "greater than," the threshold value is "90%," and the action is "alarm" (e.g., send an urgent message to the operations console). The threshold value for alarms may be chosen so that it lies well beyond what is considered normal. We use the term "alarm threshold" for the metric value that, if exceeded, results in either the generation of an alarm or a management action (e.g., terminate a process). Existing approaches check for threshold violations and, when these violations occur, initiate the action specified in the right-hand side of the policy.
In practice, policies have another aspect as well. In order to eliminate transients, it is often the case that the right-hand side of a policy is executed only after the left-hand side of the policy has been satisfied for several successive time intervals. Thus, a common version of the foregoing example is: "If CPU utilization is greater than 90% for three successive time intervals, then alarm." Thus, embedded within the left-hand side of policies in existing art are higher level policies that determine when the right-hand side should be executed. An example of such a higher level policy is "for three successive time intervals," as in the foregoing example.
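By way of illustration only, the conventional policy just described, with its fixed threshold and its embedded "three successive time intervals" condition, may be sketched as follows. The function name conventional_policy, its parameters, and the sample data are hypothetical and are not drawn from any particular management product.

```python
def conventional_policy(samples, threshold=90.0, successive=3):
    """Return the sample indices at which the alarm action would fire.

    Hypothetical sketch of an existing-art rule:
    "If CPU utilization is greater than 90% for three successive
    time intervals, then alarm."
    """
    fired = []
    run = 0  # current count of consecutive threshold violations
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= successive:
            fired.append(index)
    return fired


# Illustrative CPU utilization readings, one per time interval.
cpu = [85, 95, 92, 88, 91, 93, 97, 99]
alarms = conventional_policy(cpu)
```

Note that both the threshold (90) and the higher level policy (three successive intervals) are fixed inside the rule, which is the coupling the remainder of this description seeks to avoid.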
Existing art provides for policy authoring and execution. That is, administrators typically have a graphical user interface through which they specify policy metrics, threshold values, relational operators, and actions. The management system acquires the data necessary to test the left-hand side of a policy and to execute the right-hand side of a policy.
In order to author policies, administrators must specify one or more values for alarm thresholds (e.g., 90% CPU utilization). Doing so can be quite burdensome since the appropriate choice for an alarm threshold depends on factors such as configuration and workload. To complicate matters, workloads are time varying and so the appropriate choice of threshold values is time varying as well.
Researchers have tried to address these difficulties by: (i) computing threshold values from historical data (e.g., J. Buzen and A. Shum, "MASF-Multivariate Adaptive Statistical Filtering," Proceedings of the Computer Measurement Group, pp. 1-10, 1995; and L. Ho et al., "Adaptive Network/Service Fault Detection in Transaction-Oriented Wide Area Networks," Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, 1999); (ii) developing multivariate tests for network-based problems (e.g., M. Thottan and C. Ji, "Fault Prediction at the Network Layer Using Intelligent Agents," Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, 1999); and (iii) automating the updates of threshold values (e.g., Ho et al. article).
Even so, existing art is deficient in two respects: (1) there is no mechanism for automated adaptation of alarm thresholds tested by agents on managed elements; and (2) higher level policies are embedded within the left-hand sides of the policies in existing art and hence changing these policies often requires extensive modifications to the management automation. Note that item (1) requires more than the distribution of new threshold values (e.g., as in Ho et al.). It also requires a means to determine when threshold values should be changed.
In addition to the foregoing, existing art is deficient in the manner in which "warning policies" are handled. Warning policies provide advance notice of alarm situations so that management staff can detect problems before they lead to widespread disruptions. In the existing art, warning policies are constructed manually by administrators. That is, administrators must specify a set of warning thresholds in addition to the alarm thresholds. Violating a warning threshold causes a message to be sent to the operations staff. Below is an example of a warning threshold for the previously introduced policy for CPU utilization: "If CPU utilization is greater than 80% for three successive time intervals, then warn."
In existing practice, warning thresholds are specified in the same manner as alarm thresholds. Thus, there is no insight as to when or if the alarm threshold will be violated once a warning threshold is violated. Further, administrators are burdened with specifying still more thresholds.
The present invention provides methods and apparatus that reduce the burden on administrators for performance management. The methods and apparatus use models of metric values to construct and enforce: (1) alarm policies that adjust automatically to changes, for example, in configuration, topology, and workload; and (2) warning policies based on the probability of violating an alarm policy within a time horizon.
It is to be appreciated that a performance management system of the present invention preferably utilizes forecasting models (e.g., analysis of variance and time series models) to capture non-stationarities (e.g., time-of-day variations) and time-serial dependencies. For example, as described in J. Hellerstein, F. Zhang, and P. Shahabuddin, "An Approach to Predictive Detection for Service Level Management," Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, May 1999, the disclosure of which is incorporated herein by reference, a model in which S(i,j,k,l) is the value of a metric at time of day (i), day of week (j), month (k), and instance (l) may be employed in accordance with the invention. The model is: S(i,j,k,l)=mean+mean_tod(i)+mean_day-of-week(j)+mean_month(k)+e(i,j,k,l). Here, the terms beginning with "mean" are constants that are estimated from the data, and e(i,j,k,l) are the residuals of S. These constants may be estimated using standard statistical techniques such as analysis of variance and least squares regression, which are well known in the art. The residuals are identically distributed (stationary), but time serial dependencies may remain. To remove time serial dependencies, a second model may be used: e(t)=a1*e(t-1)+a2*e(t-2)+y(t), where a1 and a2 are constants estimated from the data and the y(t) are independent and identically distributed normal random variables. The y(t) are the result of removing the time serial dependencies. Advantageously, with these two models, it is possible to forecast future values of metrics.
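As a minimal sketch of how the two models above might be estimated, the following Python fragment computes the additive constants as level means minus the grand mean (which coincides with the analysis of variance estimates only when the data are balanced, an assumption made here for brevity) and then fits a1 and a2 by ordinary least squares on the residuals. The function names fit_additive, residuals, and fit_ar2, and the sample data, are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def fit_additive(records):
    """records: list of (tod, dow, month, value) tuples.

    Returns (grand_mean, effects), where effects[f][level] estimates
    mean_tod, mean_day-of-week, or mean_month for f = 0, 1, 2.
    Assumes balanced data (every level combination equally represented).
    """
    grand = fmean([rec[3] for rec in records])
    effects = []
    for factor in range(3):
        groups = defaultdict(list)
        for rec in records:
            groups[rec[factor]].append(rec[3])
        effects.append({lvl: fmean(vals) - grand for lvl, vals in groups.items()})
    return grand, effects


def residuals(records, grand, effects):
    """e(i,j,k,l) = S(i,j,k,l) minus the fitted additive terms."""
    return [v - (grand + effects[0][i] + effects[1][j] + effects[2][k])
            for i, j, k, v in records]


def fit_ar2(e):
    """Least squares estimates of a1, a2 in e(t) = a1*e(t-1) + a2*e(t-2) + y(t)."""
    ts = range(2, len(e))
    s11 = sum(e[t - 1] * e[t - 1] for t in ts)
    s22 = sum(e[t - 2] * e[t - 2] for t in ts)
    s12 = sum(e[t - 1] * e[t - 2] for t in ts)
    b1 = sum(e[t] * e[t - 1] for t in ts)
    b2 = sum(e[t] * e[t - 2] for t in ts)
    det = s11 * s22 - s12 * s12  # solve the 2x2 normal equations directly
    return (b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det


# Illustrative balanced data: two times of day, two days of week, one month.
records = [(0, 0, 0, 7.0), (0, 1, 0, 9.0), (1, 0, 0, 11.0), (1, 1, 0, 13.0)]
grand, effects = fit_additive(records)
resid = residuals(records, grand, effects)

# Illustrative noise-free residual series generated with a1 = 0.5, a2 = -0.3.
e = [1.0, 2.0]
for _ in range(8):
    e.append(0.5 * e[-1] - 0.3 * e[-2])
a1, a2 = fit_ar2(e)
next_residual = a1 * e[-1] + a2 * e[-2]  # one-step-ahead forecast of e(t)
```

A forecast of the metric itself would add the one-step-ahead residual forecast to the additive terms for the time of day, day of week, and month of the interval being forecast.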
In addition to using models of metric behavior, the invention also utilizes the idea of separation of different kinds of information. Sometimes referred to as explicit knowledge representation, this technology has been used widely in the construction of expert systems. We use explicit knowledge representation to isolate the conditions for triggering right-hand sides from the other elements of a policy.
The present invention preferably deals with two kinds of policies. The first kind of policies, referred to as "control policies," address the essence of the left-hand side of existing policies: identifying the metrics to be tested and the directional change in these metrics that constitutes an exceptional situation. The second kind of policies, referred to as "meta policies," determine how control policies are interpreted. Meta policies address the following: (a) the acceptable level of false alarms; (b) the choice of alarm threshold (based on metric models and false alarm policies); (c) when models should be reconstructed; and (d) when warnings are generated.
We begin with control policies. These consist of three parts: (a) a simplified left-hand side; (b) a right-hand side having a similar form as in the existing art; and (c) a class (or label) that identifies the relevant meta policies. The left-hand side contains: (1) a metric to check; and (2) an indicator of directional change for the metric. An example of a control policy is:
CLASS_FILE.SERVER: If CPU utilization is too large, then alarm.
Here, CLASS_FILE.SERVER is a label that identifies the meta policies that apply to the control policy. The left-hand side of the policy specifies the metric to be tested, i.e., "CPU utilization," and the directional change that is considered excessive, i.e., "too large."
We consider four kinds of meta policies. For each, there is a class description that specifies the control policies for which the meta policy applies. This description can be expressed in many ways including, for instance, the use of scope hierarchies and wild cards. For example, the class description CLASS_*.SERVER matches the class CLASS_FILE.SERVER, where the symbol * denotes the wild card position.
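One possible way to match class descriptions against control policy classes, offered only as a sketch, is shell-style pattern matching from the Python standard library. The helper name matches is hypothetical, and it should be noted that fnmatchcase also gives special meaning to the characters ? and [ ], which the class descriptions here are assumed not to contain.

```python
from fnmatch import fnmatchcase


def matches(class_description, policy_class):
    """True if the meta policy's class description covers the control policy's class.

    The symbol * in class_description is treated as a wild card.
    """
    return fnmatchcase(policy_class, class_description)


applies = matches("CLASS_*.SERVER", "CLASS_FILE.SERVER")
does_not_apply = matches("CLASS_*.SERVER", "CLASS_FILE.ROUTER")
```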
Considered first are false alarm policies. These policies are used in combination with the metric models to determine threshold values to test. False alarm policies are expressed as an upper limit on the probability of a false alarm. For example:
False alarm policy for CLASS_*.SERVER: Probability(false alarm) less than 0.05
False alarm policies are employed as follows. For each control policy that matches the class description CLASS_*.SERVER, the control policy's metric is identified, and the associated metric model is located. This model is used to determine the metric's 95th percentile for the time interval in which the control policy is being evaluated, which is used as the alarm threshold. The observed value of CPU utilization is then compared to this threshold.
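The derivation of an alarm threshold from a false alarm policy may be sketched as below. The sketch assumes, for illustration, that the metric model yields a predicted value for the interval and that deviations from that value are normally distributed with a known standard deviation; the function name alarm_threshold and the numeric inputs are hypothetical.

```python
from statistics import NormalDist


def alarm_threshold(predicted, sigma, false_alarm_probability, too_large=True):
    """Percentile of the modeled metric distribution used as the alarm threshold.

    predicted: model's expected metric value for the interval (assumed input)
    sigma: standard deviation of the model residuals (assumed input)
    too_large: direction of change named in the control policy
    """
    dist = NormalDist(predicted, sigma)
    if too_large:
        return dist.inv_cdf(1.0 - false_alarm_probability)
    return dist.inv_cdf(false_alarm_probability)


# Probability(false alarm) less than 0.05 corresponds to the 95th percentile.
threshold = alarm_threshold(predicted=60.0, sigma=10.0, false_alarm_probability=0.05)
violated = 82.0 > threshold  # compare an observed CPU utilization to the threshold
```

Because the predicted value changes with time of day, day of week, and month, the threshold computed in this manner changes with it, without the administrator specifying any threshold value.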
Considered next are trigger policies. A trigger policy specifies the circumstances under which a satisfied left-hand side of a policy will cause the right-hand side of the policy to execute. According to an embodiment of the invention, we specify the number of times that the left-hand side must be satisfied within a specified number of evaluations. For example:
Trigger policy for CLASS_*.SERVER: Execute right-hand side if the left-hand side is satisfied three times in three evaluations.
This trigger policy applies to all control policies with classes that have SERVER as the last component. The right-hand side of the policy will be executed if the left-hand side of the policy is satisfied three times in succession. A control policy is said to be "enabled" if it satisfies its trigger policy.
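A trigger policy of the form "satisfied m times in n evaluations" may be tracked with a fixed-length window of recent evaluation results, as in the following sketch. The class name TriggerPolicy and its method record are hypothetical.

```python
from collections import deque


class TriggerPolicy:
    """Enable a control policy when its LHS held `required` times
    within the last `window` evaluations."""

    def __init__(self, required, window):
        self.required = required
        self.history = deque(maxlen=window)  # oldest result drops off automatically

    def record(self, lhs_satisfied):
        """Record one evaluation; return True if the control policy is enabled."""
        self.history.append(bool(lhs_satisfied))
        return sum(self.history) >= self.required


# "Execute right-hand side if the left-hand side is satisfied
# three times in three evaluations."
trigger = TriggerPolicy(required=3, window=3)
enabled = [trigger.record(x) for x in [True, True, False, True, True, True]]
```

Since the trigger policy is held apart from the control policy, changing "three times in three evaluations" to some other count requires changing only the two constructor arguments in this sketch.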
A third kind of meta policy addresses the conditions under which warnings are generated. There are two parts to this specification. The first indicates the time horizon in which a warning is desired. The second dictates how certain it must be that the alarm threshold will be violated. An example is:
Warning policy for CLASS_*.SERVER: Warn if the probability of violating the alarm threshold is greater than 0.7 in a 5 minute time horizon.
Note that there may be multiple warning policies for the same class of control policies, as long as the warning policies specify different time horizons. Further, observe that both warning and false alarm policies have a probability, although the semantics of this probability are different. Thus, from the point of view of information representation, a false alarm policy can be represented as a warning policy with a time horizon of zero.
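To illustrate how a warning policy might be evaluated, the sketch below forecasts the residual a given number of intervals ahead using an autoregressive model of the form e(t)=a1*e(t-1)+a2*e(t-2)+y(t), computes the forecast variance from the standard moving-average weights of that model, and returns the probability that the metric exceeds the alarm threshold. As a simplification, it evaluates the probability at the end of the horizon rather than at any time within it. The function name warning_probability, all parameters, and the numeric inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def warning_probability(e_prev, e_prev2, a1, a2, sigma_y,
                        seasonal_mean, threshold, steps):
    """P(metric > threshold) `steps` intervals ahead under the AR(2) residual model.

    e_prev, e_prev2: the two most recent residuals e(t-1), e(t-2)
    sigma_y: standard deviation of the i.i.d. normal terms y(t), must be > 0
    seasonal_mean: additive-model value for the target interval (assumed input)
    """
    f_prev2, f_prev = e_prev2, e_prev    # residual forecasts, updated recursively
    psi_prev2, psi_prev = 0.0, 1.0       # moving-average weights, psi_0 = 1
    variance_factor = 0.0
    for _ in range(steps):
        variance_factor += psi_prev ** 2
        f_prev2, f_prev = f_prev, a1 * f_prev + a2 * f_prev2
        psi_prev2, psi_prev = psi_prev, a1 * psi_prev + a2 * psi_prev2
    forecast = seasonal_mean + f_prev
    spread = sigma_y * sqrt(variance_factor)
    return 1.0 - NormalDist(forecast, spread).cdf(threshold)


# "Warn if the probability of violating the alarm threshold is greater than 0.7
# in a 5 minute time horizon," assuming one evaluation per 5 minutes (steps=1).
p = warning_probability(e_prev=10.0, e_prev2=10.0, a1=0.5, a2=0.2, sigma_y=5.0,
                        seasonal_mean=80.0, threshold=80.0, steps=1)
warn = p > 0.7
```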
The fourth meta policy dictates when the unknown constants in the metric model must be re-estimated based on the frequency of violation of alarm thresholds. This is specified in a manner similar to that used for trigger policies:
Model reconstruction policy for CLASS_*.SERVER: Reconstruct the model if the left-hand side is satisfied 8 times in 10 evaluations.
As will be illustrated in detail below, a performance management system of the invention may comprise a system manager that supports the policies described herein and is responsible for one or more model-based policy agents that respectively reside on one or more managed elements. The system manager provides an interface for the administrator through which policies are authored and messages are reported. The model-based policy agent may preferably have the following components: an agent side policy controller, a model constructor, a threshold constructor, a control policy evaluator, and an action executor. In such an embodiment, the policy controller provides overall control of the agent. The model constructor estimates the unknown constants in the metric models that provide a way to predict future values and a method of determining the distribution of metric values for a particular system, e.g., time-of-day, etc. The threshold constructor uses the metric models and the false alarm policies to determine the value of alarm and warning thresholds. The control policy evaluator provides a mechanism for determining if the LHS of a policy is satisfied at a specific time. The action executor provides a means for executing the right-hand side of policies.
Specifically, an agent side policy controller according to the invention may preferably perform the following operations: (i) invoke the model constructor for un-instantiated models; (ii) invoke the threshold constructor to determine metric thresholds in control policies based on policies for false alarms and warnings; (iii) invoke the control policy evaluator to determine which policies satisfy their left-hand sides; (iv) determine which control policies are enabled based on their trigger policies; (v) determine if model reconstruction is required based on model reconstruction policies; and (vi) invoke the action executor for enabled control policies.
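Operations (i) through (vi) may be pictured as one evaluation cycle of the agent, as in the following greatly simplified sketch for a single control policy of the "too large" kind. The class name Agent, its methods, and the stand-in metric model (a normal distribution fitted to the history, rather than the forecasting models discussed above) are hypothetical simplifications.

```python
from collections import deque
from statistics import NormalDist, fmean, stdev


class Agent:
    def __init__(self, history, false_alarm_p=0.05, trigger=(3, 3), rebuild=(8, 10)):
        self.history = list(history)
        self.false_alarm_p = false_alarm_p
        self.trigger_required, trigger_window = trigger
        self.rebuild_required, rebuild_window = rebuild
        self.trigger_hits = deque(maxlen=trigger_window)
        self.rebuild_hits = deque(maxlen=rebuild_window)
        self.model = None    # un-instantiated until the first cycle
        self.actions = []    # record of executed right-hand sides

    def construct_model(self):
        # Stand-in model constructor: sample mean and standard deviation.
        self.model = NormalDist(fmean(self.history), stdev(self.history))

    def step(self, observed):
        if self.model is None:                                    # (i)
            self.construct_model()
        threshold = self.model.inv_cdf(1 - self.false_alarm_p)    # (ii)
        violated = observed > threshold                           # (iii)
        self.trigger_hits.append(violated)
        self.rebuild_hits.append(violated)
        self.history.append(observed)
        enabled = sum(self.trigger_hits) >= self.trigger_required  # (iv)
        if sum(self.rebuild_hits) >= self.rebuild_required:       # (v)
            self.model = None          # re-estimate on the next cycle
            self.rebuild_hits.clear()
        if enabled:                                               # (vi)
            self.actions.append(("alarm", observed))
        return enabled


agent = Agent(history=[50, 52, 48, 51, 49, 50, 53, 47])
results = [agent.step(60) for _ in range(3)]
```

In this sketch the trigger policy (three of three) and the model reconstruction policy (eight of ten) are plain constructor arguments, so either may be changed without touching the control policy logic.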
The present invention provides many ways to reduce the burden of performance management. First, the invention eliminates the need for administrators to specify alarm thresholds. Instead, thresholds are computed dynamically based on historical data, metric models, and separately specified policies for false alarms and warnings.
Second, the present invention separates trigger policies from control policies. In existing practice, trigger policies are embedded within the left-hand side of policy rules. Thus, changing a trigger policy may require administrators to modify many policy rules. By separating trigger and control policies, we simplify the control policies and provide an easy way for administrators to adjust the manner in which transient problems are addressed.
Third, our invention provides a systematic way to provide warnings. In existing practice, warnings are provided in an ad hoc and cumbersome way in that additional policies must be constructed and the relationship between warnings and alarms is unclear. In contrast, the present invention provides a method for constructing warning policies automatically, without specifying metric thresholds. This is done by having warning policies that express the minimum probability of a threshold violation for which administrators want to be notified of a possible alarm at a future time. Through our invention, the burden imposed on administrators is greatly reduced, and the semantics of warnings are clearer.
Fourth, our invention employs model reconstruction policies that determine when metric models should be reconstructed automatically (e.g., due to changes in workload, configuration, or other factors). Such policies provide a means for automated adaptation of the model-based approach to performance management.