For most corporations, services and applications are deployed by an internal IT organization on behalf of an internal customer. This relationship between the service owner and the operator of the service is typically formalized in a Service Level Agreement (SLA). The SLA defines the expected Quality of Service (QoS) that will be delivered by the service operator. The challenge for the service operator is to measure against the SLA and ensure that the service is consistently delivered at the appropriate level. Ultimately, the best QoS and cost efficiencies will be gained when the SLA life-cycle can be automated. The SLA life-cycle involves translating the SLA into individual Service Level Objectives (SLOs), which are individual metrics that depend on Key Performance Indicators (KPIs). KPIs are performance statistics that must constantly be measured to know whether a particular SLO is being met or violated. The full SLA life-cycle consists of monitoring the SLOs and making adjustments to the infrastructure when SLOs are violated or are in jeopardy of being violated.
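The relationship between an SLA, its SLOs, and the underlying KPIs described above can be illustrated with a minimal sketch. The class and field names (`SLO`, `Status`, `warn_ratio`), the KPI names, and the 90% jeopardy margin are all assumptions made for illustration; they are not taken from any particular SLM tool.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MET = "met"
    AT_RISK = "at_risk"      # "in jeopardy of being violated"
    VIOLATED = "violated"


@dataclass(frozen=True)
class SLO:
    """One objective derived from the SLA (hypothetical structure)."""
    name: str
    kpi: str                  # name of the KPI this objective depends on
    limit: float              # the KPI must stay at or below this value
    warn_ratio: float = 0.9   # assumed fraction of the limit that counts as jeopardy


def evaluate(slo, kpis):
    """Classify a single SLO against the latest measured KPI values."""
    value = kpis[slo.kpi]
    if value > slo.limit:
        return Status.VIOLATED
    if value >= slo.warn_ratio * slo.limit:
        return Status.AT_RISK
    return Status.MET


def evaluate_sla(slos, kpis):
    """Evaluate every SLO that the SLA was translated into."""
    return {slo.name: evaluate(slo, kpis) for slo in slos}


# Hypothetical SLA with two objectives and one round of KPI measurements.
slos = [
    SLO(name="login latency", kpi="login_ms", limit=500.0),
    SLO(name="error rate", kpi="error_pct", limit=1.0),
]
measured = {"login_ms": 470.0, "error_pct": 1.5}
report = evaluate_sla(slos, measured)
```

In this sketch, an automated life-cycle would act on any entry in `report` that is not `Status.MET`, for example by adjusting the infrastructure before an at-risk objective is actually violated.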
A Service Level Management (SLM) tool measures KPIs to determine SLO violations. Many SLM tools use a reactive approach in which performance problems are identified only after an SLO violation has occurred. Some SLM tools use a more predictive approach by using self-learning techniques. These tools learn the typical behavior of the system by capturing daily, weekly, and monthly activities. They then compare the current performance metrics to the historical ones and trigger alarms when pre-set thresholds are violated.
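The self-learning comparison described above might look roughly like the following sketch. It assumes a weekly baseline keyed by hour of the week and an alarm threshold of three standard deviations above the historical mean; the function names, the slot scheme, and the threshold are illustrative assumptions, not the behavior of any specific SLM product.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize historical samples per time slot.

    history maps a slot (here, an assumed hour-of-week index 0..167)
    to a list of past measurements, e.g. response times in milliseconds.
    Slots with fewer than two samples are skipped because a standard
    deviation cannot be computed for them.
    """
    return {
        slot: (mean(samples), stdev(samples))
        for slot, samples in history.items()
        if len(samples) >= 2
    }


def exceeds_baseline(baseline, slot, value, tolerance=3.0):
    """Return True when value is more than `tolerance` standard
    deviations above the historical mean for that slot."""
    if slot not in baseline:
        return False  # no learned behavior for this slot, so no alarm
    mu, sigma = baseline[slot]
    return value > mu + tolerance * sigma


# Hypothetical history: response times (ms) seen at hour-of-week 9.
history = {9: [100.0, 110.0, 90.0, 105.0, 95.0]}
baseline = build_baseline(history)

alarm_high = exceeds_baseline(baseline, 9, 130.0)
alarm_normal = exceeds_baseline(baseline, 9, 120.0)
```

With the sample history, the mean is 100 ms and the sample standard deviation is about 7.9 ms, so the alarm threshold sits near 123.7 ms: a reading of 130 ms raises an alarm and a reading of 120 ms does not.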
Most SLM tools, however, do not have any specific knowledge of the inner workings of elements such as application servers. Therefore, such tools provide only limited performance monitoring. For example, if a J2EE application makes requests to multiple back-end nodes such as directory servers, message queues, or legacy systems, there is no easy mechanism to break down the response time across these components. Thus, when an SLO is violated, it is quite difficult to track down the actual cause of the violation. The problem is exacerbated with web services, as a particular request may span not only multiple nodes within the datacenter but also the Internet.
Accordingly, there is a need for systems and methods that allow automatic discovery of problems pertinent to SLOs associated with SLAs.
The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.