A Service Level Agreement (SLA) is an agreement between a user and a service provider that defines the nature of the service provided and establishes a set of metrics (measurements) by which the delivered level of service is measured against the agreed level. Such service levels might include provisioning (when the service is meant to be up and running), average availability, restoration times for outages, average and maximum periods of outage, average and maximum response times, latency, delivery rates (e.g., average and minimum throughput), and others. The SLA also typically establishes trouble-reporting procedures, escalation procedures, and penalties for failing to meet the demanded level of service, typically refunds to the user.
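As an illustration, an SLA's committed levels and penalty can be represented as a simple record and checked against measured values for a reporting period. This is a minimal sketch: the field names and the numbers used are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelAgreement:
    min_availability_pct: float   # committed average availability
    max_response_ms: float        # committed maximum response time
    refund_pct: float             # penalty refunded to the user on a breach

def check_compliance(sla, measured_availability_pct, measured_response_ms):
    """Return the list of SLA terms breached in one reporting period."""
    breaches = []
    if measured_availability_pct < sla.min_availability_pct:
        breaches.append("availability")
    if measured_response_ms > sla.max_response_ms:
        breaches.append("response time")
    return breaches

sla = ServiceLevelAgreement(min_availability_pct=99.9,
                            max_response_ms=500.0,
                            refund_pct=10.0)
print(check_compliance(sla, 99.95, 620.0))  # → ['response time']
```

In practice each term would carry its own measurement window and penalty, but the structure (committed level, measured level, breach test) is the same.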
Various root-cause analysis methods and event correlation technologies have been developed for the purpose of monitoring failures of SLAs. Service Level Management (SLM) is a suite of software tools that provides both the end-user organization and the service provider a means of managing the committed service levels defined in an SLA. SLM includes monitoring and gathering performance data, analyzing that data against committed performance levels, taking appropriate actions to resolve discrepancies between committed and actual performance levels, and trending and reporting. SLM is difficult, especially across a wide range of complex technologies (e.g., Frame Relay and ATM) in a multi-site enterprise.
SLM typically deals with at least the following five fundamental issues:
1. Service Metric Selection: Monitoring service level metrics requires both human and machine resources. Monitoring designers generally lack the means to choose a set of metrics that is both minimal and sufficiently effective. One way to select metrics is to remove redundant metrics whose information can be inferred from others. As with any data-driven methodology, inference or induction can only be made on entities that have previously been observed. Therefore, the selection of metrics to be monitored is actually a reduction of metrics that have already been monitored.
2. Service Breach Point Selection: An important part of an SLA is the set of thresholds that separate unacceptable service quality from acceptable service quality. Setting breach values is usually regarded as a subjective or even political matter. Nevertheless, historical data can provide invaluable insight into existing system capacity and help users make educated decisions.
3. Resource Metric Selection: A “resource” is any element of a computing system or operating system required by a job or task, including memory, input/output devices, processing units, data files, and control or processing programs. The number of resource metrics is usually at least an order of magnitude higher than the number of service metrics, so reducing the number of resource metrics to monitor can significantly lower cost. As information infrastructures become extremely complex, it is advantageous to discover the critical resources that support a particular service in terms of their performance dependency. Knowing this relationship enables system administrators to better interpret the implications of changes in resource utilization, and the number of metrics to be monitored and managed can be further reduced.
4. Monitoring Threshold Selection: In resource monitoring, alerts are usually generated when metric values exceed or fall below certain thresholds. For example, an alert is generated when free disk space falls below 15% of total disk space. However, there is no clear rule defining what the correct threshold values should be, and the consequence of non-optimal thresholds is either generating too many alerts or missing emerging service degradation. Unlike service breach points, resource monitoring thresholds can only be discovered objectively, from the data.
5. Bottleneck Resource Identification: Among all the IT resources that support a service, usually a few can be called “bottleneck” resources because their metrics show stronger relevance to the service level. For example, a critical server may be equipped with an inadequate amount of memory; in this situation, a memory upgrade may significantly improve the service level. It is therefore useful to identify the most likely bottleneck resources for both resource planning and monitoring purposes.
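The threshold-selection issue above can be illustrated with a small sketch: given historical samples of a metric, a candidate alert threshold can be derived objectively as a low percentile of the data rather than chosen by hand. The 5th-percentile choice and the free-disk-space readings are illustrative assumptions.

```python
def percentile(samples, p):
    """Linearly interpolated p-th percentile (0 <= p <= 100) of samples."""
    s = sorted(samples)
    k = (len(s) - 1) * p / 100.0      # fractional index into the sorted data
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

# Historical free-disk-space readings (percent of total); hypothetical data.
history = [42, 38, 35, 40, 33, 29, 31, 36, 34, 30]

# Alert when free space drops below the historical 5th percentile:
threshold = percentile(history, 5)
```

Raising or lowering the percentile trades alert volume against the risk of missing emerging degradation, which is exactly the tension described in issue 4.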
Time series metric analysis has been intensively studied in the past, especially in financial data analysis. This work can be regarded as an application of time-series data analysis. However, several intrinsic challenges have not been addressed adequately in the prior art. Examples of these are as follows.
1. Asynchronous data collection and irregular time series: In managing distributed systems and applications, data collection and monitoring are done in a distributed manner. That is, metrics collected from different devices may have very different sampling times and sampling durations. Classic algorithms cannot handle such asynchronous time series directly.
2. Relevance analysis: Classical correlation analysis of two time series typically assumes that the relationship between them is linear and global (e.g., the correlation at low values is the same as the correlation at high values). This is not true for the performance metrics of a computer device, which often exhibit non-linear relationships.
3. Large volume: Many types of measurements can be obtained from a large number of data sources. For example, using Tivoli's ITM product, over 500 different resource metrics can be collected from an application server, and a typical server farm commonly consists of thousands of servers. This requires algorithms that scale to a large volume of temporal data, in terms of both the number of sampling points and the number of measurement types.
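The first two challenges can be sketched together: two asynchronously sampled series are first aligned onto a common time grid (here by carrying the last observation forward, one simple choice among many), and relevance is then measured with Spearman rank correlation, which captures monotone but non-linear relationships that linear (Pearson) correlation assumes away. All timestamps and values below are hypothetical; the rank function breaks ties by position, which is adequate for a sketch but not a full treatment of tied data.

```python
from bisect import bisect_right

def resample(times, values, grid):
    """Last observed value at or before each grid point (times sorted)."""
    out = []
    for t in grid:
        i = bisect_right(times, t) - 1
        out.append(values[max(i, 0)])
    return out

def ranks(xs):
    """Rank of each element; ties broken by position (sketch only)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Two metrics sampled at different, irregular times (hypothetical data):
t1, v1 = [0, 3, 7, 12], [10, 20, 25, 40]       # e.g., request rate
t2, v2 = [1, 4, 9, 11], [1.0, 4.0, 6.2, 16.0]  # e.g., response time
grid = [4, 6, 8, 10, 12]
s = spearman(resample(t1, v1, grid), resample(t2, v2, grid))
```

Because ranks are invariant under any monotone transformation, this relevance score is unchanged whether response time grows linearly or, say, exponentially with load, which is the point of challenge 2.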
Currently there are many industrial products that handle business system monitoring and reporting, e.g., IBM Tivoli Business System Manager, IBM Tivoli Service Level Advisor, IBM Tivoli Monitor for Transaction Processing, BMC Patrol, etc. However, practitioners can obtain very little assistance or guidance for business system monitoring design. As a result, traditional resource monitoring and event correlation have proven insufficient for understanding the overall service level.
Therefore a need exists to overcome the problems with the prior art as discussed above.