The present invention relates generally to the operations and management of networked systems. A particular version is related to acquiring measurements of computer and communications systems in distributed environments.
This invention relates to operations and management (OAM), such as considerations for security, performance, and availability. OAM accounts for 60% to 80% of the cost of owning network-connected information systems (according to leading analysts). These costs are expected to increase over the next several years due to the proliferation of networked applications and handheld devices, both of which make extensive use of services in distributed systems.
In an OAM system, the entities being controlled are called managed systems (also called agent systems). This control is typically exercised in part by software present on the managed systems. In addition, there are manager systems (also called managers) that are dedicated to OAM functions. Managers provide an environment for executing management applications (hereafter, applications) that provide functions such as detecting security intrusions, determining if managed systems are accessible, and responding to performance degradation.
A cornerstone of OAM is measurement. Measurements include: (a) information on network activities that are suggestive of security intrusions; (b) the response times for “ping” messages sent to remote systems to determine if they are accessible; and (c) indicators of resource consumption that are used to diagnose quality of service problems. The term measurement acquisition protocol (MAP) is used to refer to a method of delivering measurements of managed systems to manager systems. A major concern with the proliferation of low-cost computing devices is developing scalable MAPs. The present invention addresses this concern. It also addresses issues related to disconnected operation (which is increasingly common for low-powered devices) and synchronizing time stamps from multiple sources (which is problematic when systems have separate clocks, a situation that is common in practice).
A MAP allows one or more managers to access measurement variables collected on one or more managed systems. Examples of measurement variables are kernel CPU and user CPU as defined in the output of the UNIX™ vmstat command. The value of a measurement variable at a specific time is called a data item. Data items have a time stamp that identifies when they were obtained from the managed system.
Prior art for MAPs includes: polling, subscription, and trap-directed polling. In polling (e.g., SNMP-based measurement acquisition), the manager periodically requests data from the managed system. Thus, acquiring N data items requires 2N messages.
Subscription-based approaches can reduce the number of messages required for measurement acquisition. Here, the manager sends a subscription request to the managed system. This request specifies how often the managed system sends values of the measurement variable to the manager. Thus, acquiring N data items requires on the order of N messages. While this is a considerable reduction compared to polling, a large number of messages are still exchanged.
Still more efficiencies can be obtained by using trap-directed polling (e.g., Tannenbaum, 1996). As with the previous approach, a subscription is sent from manager to managed systems. However, the managed system does not send a data message unless the variable changes value. This works well for variables that are relatively static, such as configuration information. However, this is equivalent to the subscription approach if variables change values frequently. Unfortunately, the latter is the case for many performance and availability variables, such as TCP bytes sent in IP stacks and the length of the run queue in UNIX systems.
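The difference between the subscription approach and trap-directed polling can be illustrated by counting data messages for a static variable and for a frequently changing one. The following is a minimal sketch only; the function names and sample values are hypothetical and are not drawn from any particular MAP implementation.

```python
def subscription_messages(samples):
    """Subscription: one data message per sample, whatever its value."""
    return len(samples)


def trap_directed_messages(samples):
    """Trap-directed polling: a data message is sent only when the value changes."""
    count = 0
    last = object()  # sentinel that never compares equal to a real sample
    for value in samples:
        if value != last:
            count += 1
            last = value
    return count


# Hypothetical sample data (assumptions for illustration only).
static_config = [4, 4, 4, 4, 4, 4]        # a rarely changing setting
run_queue_length = [3, 5, 2, 7, 4, 6]     # a frequently changing metric

print(trap_directed_messages(static_config))     # few messages
print(trap_directed_messages(run_queue_length))  # as many as subscription
```

For the static variable only the initial value is sent, whereas for the changing variable the count equals that of the subscription approach.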
Several techniques can improve the scalability of existing MAPs. However, none of these techniques effectively circumvents the scalability deficiencies of existing MAPs. One approach is to batch requests for multiple measurement variables into a single message. Replies can be batched in a similar way. Doing this reduces the number of messages exchanged to approximately N/B, where N is the number of data items and B is the number of data items in a batch.
While batching has merits, it has significant limitations as well. First, its benefits are modest if only a few variables are needed at a high sampling rate; that is, B is small and N is large. Second, batching can be done only for variables that are obtained from the same managed system. Thus, if there are a large number of systems from which only a few variables are needed, the benefits of batching are limited.
A second way to improve scalability is to poll less frequently, which reduces N. However, a long polling interval means that errant situations may go undetected for an extended period of time. Thus, installations are faced with the unpleasant choice of carefully managing a few systems or poorly managing a large number of systems.
A third approach to improving scalability is to report information only when an exceptional situation arises (e.g., Maxion, 1990). This approach is widely used in practice. However, it has significant limitations. First, by its nature, exception checking requires that the managed system inform the manager when difficulties arise. This can be problematic if the managed system is so impaired that it cannot forward a message to the manager. A further issue with exception checking is that some exceptional situations involve interactions between multiple managed systems. Detecting these situations requires forwarding data to a manager on a regular basis.
In addition to poor scalability, existing MAPs have other shortcomings. First, existing MAPs do not support disconnected operation in which the manager cannot communicate with the managed system. Disconnected operation is common in low-end devices that operate in stand-alone mode (e.g., to provide personal calendaring services or note pad capabilities) so as to reduce power consumption. Unfortunately, existing MAPs require that managers be connected (possibly indirectly) to the managed system in order to obtain measurement data for that system.
A second issue in existing MAPs is their lack of support for integrating data from multiple systems and for combining data with different time granularities. Such capabilities are important in problem determination and isolation (e.g., Berry and Hellerstein, 1996). Unfortunately, integration is often impaired in practice since adjusting measurement data to account for the diverse interval durations used in the measurement collection requires a model of the time-series behavior of measurement variables. Such considerations are beyond the scope of current MAPs.
In summary, MAPs are a core technology in OAM. Existing art for MAPs is deficient in several respects. Current approaches scale poorly. They do not address disconnected operation. And, they do not help with integrating measurement data from multiple managed systems.
Predictive models have been applied in some management contexts. A commonly used class of predictive models is time series models (e.g., Box and Jenkins, 1976). Time series models have been applied directly to management problems, such as in Hood and Ji, 1997. An example of a time series model is
x(t) = a*x(t−1) + b*x(t−2),    Eq. (1)
where x(t) is the value of the variable at time t, and a and b are constants that are estimated using standard techniques. For example, x(t) might be the average response time of transactions during time interval t. A more complex model might take into account other factors, such as the number of requests, denoted by y(t), and their service times, denoted by z(t):
x(t) = a′*x(t−1) + b′*x(t−2) + c*y(t) + d*z(t).    Eq. (2)
Even more sophisticated predictive models consider non-linear terms, such as powers of x, y, and z. As detailed in Box and Jenkins, 1976, time series models can forecast values for an arbitrary number of time units into the future (although the variance of the forecasts increases with the forecast horizon).
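Multi-step forecasting with the first time series model above can be sketched by feeding each forecast back in as an input to the next step. The code below is a minimal illustration, assuming the two-term model x(t) = a*x(t−1) + b*x(t−2); the function name and the sample constants are hypothetical.

```python
def forecast_ar2(a, b, x_prev1, x_prev2, horizon):
    """Forecast `horizon` future values of x(t) = a*x(t-1) + b*x(t-2).

    x_prev1 is the most recent known value x(t-1); x_prev2 is x(t-2).
    Each forecast becomes an input to the next step.
    """
    forecasts = []
    for _ in range(horizon):
        x_next = a * x_prev1 + b * x_prev2
        forecasts.append(x_next)
        x_prev1, x_prev2 = x_next, x_prev1
    return forecasts


# Hypothetical constants and history, chosen only for illustration.
print(forecast_ar2(0.5, 0.25, 4.0, 8.0, 3))
```

Because later forecasts are built on earlier forecasts rather than on measurements, errors accumulate, which is consistent with the forecast variance growing with the horizon.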
Models are also known in various other contexts, such as:
Compression schemes (e.g., Cover and Thomas, 1991) which reduce the data volumes sent between communicating computers by employing predictive models for data values.
Feedback control systems (e.g., Freeley et al., 1995) which employ predictive algorithms that anticipate data values, such as in a caching system.
Timer protocols (e.g., Mills, 1989) which coordinate distributed models of clocks to provide clock synchronization.
Schedulers for distributed systems (e.g., Litzkow, 1988) which have a model of the systems being scheduled.
Schemes for providing approximate query results (e.g., Hachem and Taylor, 1996) which use statistical models to estimate these results.
None of the foregoing provide a method and a system whereby the managed system knows the values predicted by the manager for models that use historical data. None of the foregoing employs a method and a system for dynamically creating and deleting model specifications. Rather, existing art establishes model definitions when the system is designed. Further, in the existing art, updating models is restricted to changing their parameters. None of the foregoing provide for managing tentative updates (e.g., via heart-beat messages). The present invention addresses these needs.
Accordingly, the present invention is directed to an improved measurement acquisition system and method. In an application to distributed systems with properly enabled management applications, the present invention has features for: (1) reducing the volume of messages exchanged between managers and agents (also called managed systems); (2) addressing disconnected operation; and (3) synchronizing time stamps. These benefits are provided by using predictive models that run in a coordinated manner on manager and managed systems.
The present invention has features which reduce the volume of messages exchanged between manager and managed systems. This technique is referred to as model-based measurement (MBM). In one example, MBM is accomplished by a method and a system that creates, uses, updates, and deletes predictive models in a manner that is coordinated between manager and managed systems. The method can be embodied as software, e.g., using well known object oriented programming technology, and stored on a program storage device for execution on a data processing system. As in subscription-based measurement acquisition protocols, the manager can send a subscription message to the managed system. In another example, the subscription may also specify an accuracy bound (e.g., a percent deviation from the actual value) for the predicted values. Agent software on the managed system then constructs a predictive model based on variable values on the managed system. This model is returned to the manager. The manager uses the predictive model to satisfy requests by management applications for values of the subscribed-to measurement variable. The managed system uses the predictive model to detect excessive deviations of predicted values from measured values. When this occurs, the agent software sends an updated model to the manager. Periodically, the managed system sends a “heart-beat” message to the manager. This message indicates which variables are confirmed to have predicted values that lie within the accuracy bounds specified by the manager.
In one example, values of measurement variables in the manager are kept in a measurement repository. These values have an associated status code that indicates how they were obtained. A value is tentative if it has been predicted but the manager has not received a heart-beat message confirming that the prediction is within the accuracy bounds. A value is confirmed if such a message has been received. A value is actual if it was obtained from measurement facilities on the managed system. Here, management applications using data obtained with this version of the present invention must be adapted to handle these status codes. In particular, a tentative value may be changed if, through interactions between the manager and managed systems, it is subsequently determined that the data item is not within the range of accuracy desired by the manager. It is straightforward to provide a notification mechanism so that management applications are informed of such situations.
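The three status codes and their transitions can be sketched as a small in-memory repository. This is an illustrative sketch only: the class and method names, and the choice to key items by variable name and time stamp, are assumptions and are not prescribed by the description above.

```python
from enum import Enum


class Status(Enum):
    TENTATIVE = "tentative"   # predicted, not yet confirmed by a heart-beat
    CONFIRMED = "confirmed"   # predicted and confirmed by a heart-beat
    ACTUAL = "actual"         # measured on the managed system


class MeasurementRepository:
    """Hypothetical manager-side store of values, status codes, and time stamps."""

    def __init__(self):
        self._items = {}  # (variable, timestamp) -> [value, Status]

    def store_predicted(self, variable, timestamp, value):
        self._items[(variable, timestamp)] = [value, Status.TENTATIVE]

    def store_actual(self, variable, timestamp, value):
        self._items[(variable, timestamp)] = [value, Status.ACTUAL]

    def confirm(self, variable, up_to_timestamp):
        """On a heart-beat: tentative values up to the time stamp become confirmed."""
        for (var, ts), item in self._items.items():
            if var == variable and ts <= up_to_timestamp and item[1] is Status.TENTATIVE:
                item[1] = Status.CONFIRMED

    def get(self, variable, timestamp):
        value, status = self._items[(variable, timestamp)]
        return value, status
```

A management application reading from such a repository would receive the status alongside each value and could treat tentative values as provisional.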
One example of a system in accordance with the present invention includes components on both manager and managed systems. One example of the components on the manager includes:
A plurality of management applications adapted to use predicted values and to handle measurement values with the above-mentioned status codes;
A measurement repository that stores measurement values, their status codes, and their time stamps;
A manager model handler that creates, updates, deletes, and uses predictive models of measurement variables; and
A manager protocol handler that provides overall coordination of MBM on the manager and exchanges messages with managed systems.
One example of the components on the managed system includes:
An agent protocol handler that provides overall coordination of MBM on the managed system and exchanges messages with one or more managers;
An agent model handler that defines, updates, deletes, and uses predictive models on the managed system;
A plurality of agent data access facilities that provide actual values of measurement variables; and
An agent measurement repository that contains the measured values of subscribed-to variables that are known to the manager.
An example of a method having features of the present invention operates as follows. A management application interacts with the manager measurement repository to specify measurement variables for which a subscription is requested. The manager measurement repository notifies the manager protocol handler, which in turn sends a subscription message to the managed system. The subscription message specifies a desired accuracy. This message is received by the agent protocol handler. There is a period of time during which the managed system reports measured values to the manager. These values are recorded in the agent measurement repository to track the measured values known to the manager. Such tracking is necessary so that the agent model handler can produce the same estimates of measurement variables as those produced by the manager protocol handler. Once sufficient data have been obtained, the agent model handler constructs a predictive model, such as by using well-known techniques for model identification and parameter estimation for time series data. The agent protocol handler then transmits this model, its parameters, and data inputs to the manager protocol handler, which in turn invokes the manager model handler to create the model on the manager.
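As one illustration of parameter estimation for time series data, the constants of a two-term model x(t) = a*x(t−1) + b*x(t−2) can be estimated by ordinary least squares. The sketch below is an assumption about how an agent model handler might do this, not the method required by the description; the function name is hypothetical, and it solves the 2x2 normal equations directly without external libraries.

```python
def fit_ar2(series):
    """Estimate (a, b) in x(t) = a*x(t-1) + b*x(t-2) by ordinary least squares."""
    if len(series) < 4:
        raise ValueError("need at least 4 observations")
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(series)):
        x0, x1, x2 = series[t], series[t - 1], series[t - 2]
        s11 += x1 * x1   # sum of x(t-1)^2
        s12 += x1 * x2   # sum of x(t-1)*x(t-2)
        s22 += x2 * x2   # sum of x(t-2)^2
        r1 += x0 * x1    # sum of x(t)*x(t-1)
        r2 += x0 * x2    # sum of x(t)*x(t-2)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("observations do not determine a unique model")
    a = (r1 * s22 - r2 * s12) / det
    b = (r2 * s11 - r1 * s12) / det
    return a, b
```

Both constants would then be transmitted to the manager together with the model definition and the data inputs.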
Next, the manager and managed systems may operate independently, possibly without any communications for an extended period. The manager protocol handler periodically updates the manager measurement repository using estimates obtained from the predictive model. The agent protocol handler periodically checks the accuracy of the predictive model. The agent connects to the manager only to send model updates and heart-beat messages.
Models constructed in this manner can be used to periodically update the measurement repository with values of the measurement variable. Such values have a status code of “tentative”. Periodic “heart-beat” messages sent from the managed system to the manager indicate variables for which data items are confirmed to have the desired accuracy (as specified in the manager's subscription for the measurement variable). When such a confirmation is received for a value, its status code is changed from “tentative” to “confirmed”.
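The agent-side accuracy check that decides between a heart-beat and a model update can be sketched as follows. This is a minimal illustration assuming the accuracy bound is a percent deviation from the measured value; the function names and the message labels are hypothetical.

```python
def within_bound(predicted, measured, percent_bound):
    """True if `predicted` deviates from `measured` by at most `percent_bound` percent."""
    if measured == 0:
        # Percent deviation is undefined at zero; assume only an exact match passes.
        return predicted == 0
    return abs(predicted - measured) / abs(measured) * 100.0 <= percent_bound


def agent_check(predictions, measurements, percent_bound):
    """Return the kind of message the agent would send for one checking period."""
    for predicted, measured in zip(predictions, measurements):
        if not within_bound(predicted, measured, percent_bound):
            return "model-update"   # the manager's estimates are too far off
    return "heart-beat"             # all estimates confirmed within the bound


print(agent_check([10.0, 10.5], [10.0, 10.0], 10.0))
print(agent_check([10.0, 13.0], [10.0, 10.0], 10.0))
```

The check is only meaningful if the predictions used by the agent are the same ones the manager produced, which is why both sides must share the model and its inputs.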
For predictive models that accurately forecast the values of measurement variables, the foregoing can greatly reduce the volume of message traffic. However, there are some variables for which such models are unknown, at least in some operating circumstances. Further, it may be that certain changes in the components of a distributed system or their interconnections may cause the present invention to work poorly for a period of time. Thus, the present invention includes other features such that a measurement variable may have values obtained from a variety of MAPs that operate concurrently with MBM. Doing so requires having a third status code, actual, that indicates that the value was obtained from the measurement data access facilities on the managed system.
The present invention offers significant advantages over existing art. First, the invention provides improved scalability. In existing art, requests by management applications for non-static variables (e.g., counters such as the number of bytes sent on a TCP socket) require that a message be sent from the managed system to the manager. The overhead of these messages becomes increasingly burdensome as networks grow more complex. The present invention can greatly reduce network traffic and the associated performance issues (if the predictive models are sufficiently accurate). In particular, if the predictive model can forecast accurately values of measurement variables that are H time units in the future, then MBM only requires on the order of N/H messages to acquire N data items. (Heart-beat messages are considered to be a small fraction of the message exchange.) In contrast, existing MAPs require on the order of N messages (at least) if data values change frequently. Further, as with existing MAPs, MBM can employ batching of measurement variables. Doing so reduces the number of messages exchanged for MBM to N/(HB).
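The message counts discussed above can be compared with a short calculation. The sketch below treats the order-of-magnitude expressions (2N, N, and N/(HB)) as exact counts and ignores heart-beat messages, both of which are simplifying assumptions; the function names are hypothetical.

```python
import math


def count_polling(n):
    """Polling: one request and one reply per data item."""
    return 2 * n


def count_subscription(n):
    """Subscription: roughly one data message per data item."""
    return n


def count_mbm(n, horizon, batch=1):
    """MBM: roughly N/(H*B) messages, where H is the forecast horizon in time
    units and B is the number of data items per batch. Heart-beats are ignored."""
    return math.ceil(n / (horizon * batch))


n = 1000
print(count_polling(n), count_subscription(n), count_mbm(n, 10), count_mbm(n, 10, 5))
```

The comparison holds only to the extent that the predictive model remains within the accuracy bound for H time units.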
Second, the present invention offers a solution to managing palmtop and other low-end devices that often operate in disconnected mode. The challenge here is knowing about devices for which communication is possible only intermittently. Polling and subscription approaches are ineffective with disconnected devices. However, given sufficiently accurate predictive models, the present invention provides management applications with estimates of variable values. Doing so enables exception checking and health monitoring even if measurements of managed systems are not available.
Third, once a predictive model is available to the managed system, it can be used to aid in integrating data from multiple sources. For example, the present invention provides a way to synchronize data collected from multiple managed systems that have different collection frequencies and/or clocks that are not synchronized. Such considerations are particularly important in diagnostic situations and for operations consoles where a uniform perspective is essential. By using predictive models, the manager can adjust the granularity of the time stamp using techniques such as those in Priestly, 1981. In contrast, existing approaches to measurement acquisition provide little assistance with synchronizing data from multiple sources.
A specific and central problem is dealing with data that are collected at different frequencies. For example, resource data may be collected every fifteen minutes, but transaction data might be collected every minute. Commonly, such situations are addressed by aggregating data to the coarsest granularity. In this case, the transaction data are aggregated into fifteen minute intervals. However, with a predictive model, it is possible to interpolate values so that estimates of finer grain data can be obtained (e.g., using spectral techniques, as in Priestly, 1981).
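Estimating finer-grain values from coarse samples can be illustrated with simple linear interpolation. Note that this is a deliberately simpler stand-in for the model-based and spectral techniques mentioned above, shown only to make the idea of re-gridding concrete; the function name and sample data are hypothetical, and time stamps are assumed to be strictly increasing.

```python
def interpolate_linear(timestamps, values, step):
    """Linearly interpolate coarse samples onto a finer grid.

    Returns a list of (time, estimate) pairs. Within each pair of adjacent
    samples, estimates are produced every `step` time units.
    """
    result = []
    for i in range(len(timestamps) - 1):
        t0, t1 = timestamps[i], timestamps[i + 1]
        v0, v1 = values[i], values[i + 1]
        t = t0
        while t < t1:
            frac = (t - t0) / (t1 - t0)
            result.append((t, v0 + frac * (v1 - v0)))
            t += step
    result.append((timestamps[-1], values[-1]))
    return result


# Resource data every fifteen minutes, re-gridded to one-minute estimates.
fine = interpolate_linear([0, 15, 30], [10.0, 40.0, 25.0], 1)
print(len(fine))
```

Values produced this way would be estimates rather than measurements, so they would not carry a status code of actual.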
To summarize, various versions of the present invention include the following characteristics:
Its context is the acquisition of measurement data in distributed systems.
It provides for creating, updating, and deleting predictive models in a manner that is coordinated between manager and managed systems.
It employs status codes for data items, wherein values supplied to management applications have a status code of “tentative”, “confirmed”, or “actual”.
Other measurement acquisition protocols can be run concurrently with the present invention, and the same variable may use multiple measurement acquisition protocols in a manner that is transparent to the management application.
The present invention has still other features whereby the managed system knows the values predicted by the manager for models that use historical data. Such a capability is required in MBM so that the managed system knows the accuracy of the estimates produced by the manager. Providing this requires more than ensuring that both systems have the same predictive model. It also requires that both systems use the same input data to this model. One version of the present invention makes this possible by: (a) a system that incorporates an AgentMeasurementRepository component that stores variable values known to the manager and (b) a method that synchronizes data values in the ManagerMeasurementRepository with those in the AgentMeasurementRepository.
The present invention has yet other features for dynamically creating and deleting predictive model specifications, wherein such a specification includes a model definition (i.e., its algebraic form, such as Eq. (1) vs. Eq. (2)), its parameters (e.g., a and b in Eq. (1)), and its inputs (which measurement variables are used). In contrast, existing art establishes model definitions when the system is designed. Further, in the prior art, methods of updating models are restricted to changing their parameters.
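A model specification with its three parts (definition, parameters, and inputs) and a handler that creates, updates, and deletes specifications at run time can be sketched as follows. The class names, field names, and the string labels used for model definitions are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelSpec:
    definition: str    # algebraic form, e.g. "eq1" or "eq2" (labels are assumptions)
    parameters: dict   # e.g. {"a": 0.5, "b": 0.25}
    inputs: tuple      # names of the measurement variables used as inputs


class ModelHandler:
    """Hypothetical handler keeping one specification per subscribed variable."""

    def __init__(self):
        self._specs = {}

    def create(self, variable, spec):
        self._specs[variable] = spec

    def update(self, variable, **changes):
        # Any part may change, including the definition, not only the parameters.
        self._specs[variable] = replace(self._specs[variable], **changes)

    def delete(self, variable):
        self._specs.pop(variable, None)

    def get(self, variable):
        return self._specs.get(variable)
```

For example, a handler could replace a specification of the first algebraic form with one of the second form, adding new input variables, without any change to the system design.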