Today's large-scale Information Technology (IT) systems span multiple data centers, geographical locations, and diverse hardware and software platforms. Services are no longer confined to racks within a single data center; they are often deployed and served from multiple locations. The management of large-scale IT infrastructure is becoming a focus of data center optimization and innovation. Within service management, incident management is a primary target for optimization because it often accounts for a major portion of the work performed by the System Administrators (SAs) who manage the system components. Other service management tasks include problem, change, and patch management.
Efficient management of IT operations and facilities is a major competitive advantage for service providers, given the massive scale and cost of today's IT service delivery infrastructures. In these environments, massive physical infrastructures (networking, power, cooling, security) exist to deploy and manage data centers and to run applications for different clients. System and application incidents and failures can occur at almost any time, 24 hours a day, 7 days a week. Ideally, therefore, an incident management framework should be in place to respond to them in a timely manner and in accordance with customer Service Level Agreements (SLAs) and service delivery Service Level Objectives (SLOs). Proactive prevention of failures and timely response to them, at minimal operational cost, are major goals for service providers. Proactive actions are typically enabled by monitoring tools that allow SAs to observe, in real time, the performance and status of the managed components by sampling Key Performance Indicators (KPIs). When KPI variation indicates that a managed component is in, or approaching, a state that would lead to an SLA or SLO violation, notification messages, called alerts, are automatically generated and sent to an SA. Alerts may be delivered, for example, as electronic mail messages, as incident tickets in incident management tools, or by other means. The generation of monitoring alerts is typically determined by the monitoring policy deployed on the managed server. The policy may consist of one or more monitoring rules that describe conditions involving KPIs, processes, and other system operation components. Generally, alerts are generated when the conditions in the monitoring rules hold true. Sometimes, false-positive alerts are generated.
A false positive is an alert that has been generated (for example, because the conditions in the monitoring rules hold true) even though the monitored system is performing properly and no SLA or SLO failure exists. An SA's time is not spent efficiently when it goes to handling false-positive alerts. To reduce the number of alerts in general, and of false-positive alerts in particular, monitoring rules typically need to be customized to the fine details of the workload running on the managed systems.
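The relationship between a monitoring policy, its rules, and false-positive alerts can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `MonitoringRule` and `evaluate_policy` names, the KPI names `cpu_pct` and `run_queue`, and the threshold values are assumptions, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A KPI sample maps KPI names to the values observed at one sampling instant.
KpiSample = Dict[str, float]


@dataclass
class MonitoringRule:
    """One rule of a monitoring policy: a named condition over KPIs."""
    name: str
    condition: Callable[[KpiSample], bool]


def evaluate_policy(policy: List[MonitoringRule], sample: KpiSample) -> List[str]:
    """Return one alert message for every rule whose condition holds true."""
    return [f"ALERT: {rule.name}" for rule in policy if rule.condition(sample)]


# Generic rule (assumed threshold): alert whenever CPU utilization exceeds 80%.
generic_policy = [
    MonitoringRule("cpu_high", lambda s: s["cpu_pct"] > 80.0),
]

# Rule customized for a batch workload that is expected to saturate the CPU:
# alert only if work is also queuing up (assumed run-queue threshold).
tuned_policy = [
    MonitoringRule(
        "cpu_high",
        lambda s: s["cpu_pct"] > 80.0 and s["run_queue"] > 8.0,
    ),
]

# A healthy batch server: CPU is busy by design, but nothing is waiting.
batch_sample = {"cpu_pct": 95.0, "run_queue": 2.0}

print(evaluate_policy(generic_policy, batch_sample))  # generic rule fires
print(evaluate_policy(tuned_policy, batch_sample))    # tuned rule stays quiet
```

In this sketch the generic rule raises an alert for a server that is behaving as intended, which is a false positive, while the workload-specific rule does not. Real policies would involve many more KPIs and conditions than this two-KPI example.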
In large-scale IT systems, substantial resources are typically required to manage the monitoring systems and to service the monitoring alerts they generate. Depending on the size and complexity of the infrastructure, managing IT operations can cost companies billions of dollars. Incident management systems such as IBM® Tivoli® and HP ServiceCenter® are examples of conventional approaches to logging monitoring alerts and incidents, dispatching them to the appropriate system operators, and tracking their resolution. Furthermore, timely resolution of these issues is critical, because IT service providers and clients have SLAs and SLOs that specify the maximum time-to-resolve for issues of different severity levels. Failure to meet SLAs often results in financial penalties and damage to the relationship with the clients.
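As a rough illustration of how a severity-dependent maximum time-to-resolve might be checked, consider the following Python sketch. The severity levels, the time limits in `MAX_TIME_TO_RESOLVE`, and the function names are all assumed for the example; actual values are set by the SLA agreed between a provider and its client.

```python
from datetime import datetime, timedelta

# Hypothetical SLA terms: maximum time-to-resolve per severity level
# (1 = most severe). These limits are illustrative assumptions only.
MAX_TIME_TO_RESOLVE = {
    1: timedelta(hours=4),
    2: timedelta(hours=8),
    3: timedelta(hours=24),
}


def sla_deadline(opened_at: datetime, severity: int) -> datetime:
    """Latest time by which an incident of this severity must be resolved."""
    return opened_at + MAX_TIME_TO_RESOLVE[severity]


def is_sla_breached(opened_at: datetime, severity: int,
                    resolved_at: datetime) -> bool:
    """True if the incident was resolved after its SLA deadline."""
    return resolved_at > sla_deadline(opened_at, severity)


opened = datetime(2024, 1, 15, 9, 0)
resolved = datetime(2024, 1, 15, 14, 30)  # 5.5 hours after opening

print(is_sla_breached(opened, 1, resolved))  # severity 1: 4-hour limit exceeded
print(is_sla_breached(opened, 2, resolved))  # severity 2: within the 8-hour limit
```

The same resolution time breaches the assumed limit for one severity level and meets it for another, which is why dispatching and tracking incidents by severity matters for avoiding penalties.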