The Information Technology Infrastructure Library (ITIL) may be viewed as a set of best practices for managing information technology infrastructure, development, and operations. One aspect of ITIL is IT incident management. In ITIL terminology, an incident is defined as: “An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident”. In general, incident management is the process that deals with incidents.
Incident management is typically supported by an incident ticket system (ITS), a software system that runs in an organization and records a malfunction and/or an affected service as a ticket. A ticket is a record that contains information about the failure or malfunction, as well as information concerning support interventions made by technical support staff or third parties on behalf of an end user who has reported an incident. Tickets can also be issued automatically by monitoring systems when they recognize a degradation of the IT system.
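As a rough illustration of the ticket record described above, the following Python sketch models one ticket raised by an end user and one raised by a monitoring system. The class name, field names, and sample values are all hypothetical and are not taken from any particular ITS.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Ticket:
    """One ITS record; every field name here is a hypothetical choice."""
    ticket_id: str
    source: str                               # "end_user" or "monitoring"
    description: str                          # free-text account of the failure
    affected_service: Optional[str] = None    # usually what an end user can name
    failed_resource: Optional[str] = None     # usually what a monitor can name
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    interventions: List[str] = field(default_factory=list)  # support actions


# An end user reports a service problem; a monitor reports a resource problem.
user_ticket = Ticket("T-1", "end_user", "Cannot send e-mail",
                     affected_service="mail")
monitor_ticket = Ticket("T-2", "monitoring", "Disk latency high on host db01",
                        failed_resource="db01")

# Support staff log what they tried on the user's behalf.
user_ticket.interventions.append("Restarted mail client; problem persists")
```

In this sketch the user ticket names only a service and the monitoring ticket names only a resource, so no field in either record points to the other.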
In such an ITS, different categories of tickets (e.g., from the end user or from the monitoring system) can coexist without any explicit relationship to each other. While information about failed or disrupted services and/or resources is present, it can be scattered over the system. One detrimental result is that the relationship between a failed resource and a malfunctioning service cannot be recognized automatically. While the relationship may possibly be recognized manually, in a system of any appreciable size the manual approach can be time-consuming, expensive, and inherently unreliable.
In general, Incident Management and Problem Management are two of the Service Operation processes in the ITIL. These two processes aim to recognize, log, isolate and correct errors which occur in the environment and disrupt the delivery of services. Incident Management and Problem Management form the basis of the tooling provided by the ITS.
There has been considerable research on correlating trouble tickets, symptoms, and events for Incident and Problem Management and for fault diagnosis.
In G. Dreo, A Framework for Supporting Fault Diagnosis in Integrated Network and Systems Management: Methodologies for the Correlation of Trouble Tickets and Access to Problem-Solving Expertise. Dissertation, Ludwig-Maximilians-Universität München, 1995, there is a proposal to use trouble-ticket correlation for discovery of tickets and access to problem-solving expertise. Dreo argues that good models for the functional and topological (i.e., resource mapping) aspects of a service are key elements for high-quality correlation.
A. Hanemann, Automated IT Service Fault Management Based on Event Correlation Techniques. PhD thesis, University of Munich, Department of Computer Science, Munich, Germany, 2007, proposes an algorithm for event correlation, which was extended in A. Hanemann and P. Marcu, Algorithm Design and Application of Service-Oriented Event Correlation. Proceedings of the 3rd IFIP/IEEE International Workshop on Business-Driven IT Management (BDIM 2008), Salvador Bahia, Brazil, 2008. The algorithm is based on the same service model as in B. Gruschke, Integrated Event Management: Event Correlation Using Dependency Graphs. Proceedings of the 9th IFIP/IEEE International Workshop on Distributed Systems: Operations & Management (DSOM 98), pages 130-141, Newark, Del., USA, 1998. Events are correlated for root-cause analysis using Rule-Based Reasoning (RBR) and active probing.
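The general idea of correlating events over a dependency graph can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of the cited works: it omits rules and active probing, and the graph contents and function names are hypothetical. Each symptomatic service is expanded to everything it depends on, and the resources common to all symptoms are kept as root-cause candidates.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: each node maps to what it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "mail": ["smtp01", "db01"],
    "web_shop": ["web01", "db01"],
    "smtp01": ["switch_a"],
    "web01": ["switch_b"],
    "db01": ["san01"],
}


def reachable(node: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every node that `node` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def common_cause_candidates(symptoms: List[str],
                            graph: Dict[str, List[str]]) -> Set[str]:
    """Intersect the dependency sets of all symptomatic services."""
    if not symptoms:
        return set()
    dependency_sets = [reachable(s, graph) for s in symptoms]
    return set.intersection(*dependency_sets)


# Two services report problems; only db01 and san01 sit beneath both.
candidates = common_cause_candidates(["mail", "web_shop"], DEPENDS_ON)
```

The intersection narrows the search but does not by itself decide between the remaining candidates, which is where rules or further probing would come in.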
K. Chang, H. Carlisle, J. Cross, and P. Raman, A self-improvement helpdesk service system using case-based reasoning techniques. Computers in Industry, pages 113-125, New York, 1996, proposes a self-improving help desk service system that uses Case-Based Reasoning (CBR). This technique emphasizes the importance of searching through the descriptions of tickets. E. Liddy, S. Rowe, and S. Symonenko, Illuminating Trouble Tickets with Sublanguage Theory. Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 165-172, New York, 2006, describes a similar approach using RBR techniques for discovering the historical and predictive value of trouble ticket data. Both of these approaches use keyword search. However, the likelihood of incorrect correlation results is relatively high because the highly relevant keywords are often difficult to determine.
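The weakness of keyword matching can be illustrated with a generic keyword-overlap score. The sketch below is not the method of either cited work; the stop-word list, the Jaccard measure, and the ticket texts are assumptions chosen only to show how wording, rather than cause, drives the result.

```python
import re
from typing import Set

# Hypothetical stop-word list; a real system would need a far better one.
STOPWORDS = {"the", "a", "an", "on", "is", "to", "of", "in", "and"}


def keywords(text: str) -> Set[str]:
    """Lower-case the text and keep alphanumeric tokens that are not stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def jaccard(a: str, b: str) -> float:
    """Share of keywords two descriptions have in common (0.0 to 1.0)."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


new_ticket = "Mail server slow"
related = "High latency on host smtp01"   # assumed same fault, different words
unrelated = "Print server slow"           # assumed different fault, same words

score_related = jaccard(new_ticket, related)      # 0.0
score_unrelated = jaccard(new_ticket, unrelated)  # 0.5
```

Under these assumptions the unrelated ticket outranks the related one, because "server" and "slow" are frequent but not discriminating, while the discriminating term "smtp01" never appears in the user's wording.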
R. Gupta, K. Prasad, and M. Mohania, Automating ITSM Incident Management Process. Proceedings of the 5th IEEE International Conference on Autonomic Computing, pages 141-150, Chicago, 2008, proposes an automated algorithm for correlating an incoming incident with configuration items of the configuration management database (CMDB) based on a keyword search of the CMDB.
Adaptive probing techniques (see I. Rish, M. Brodie, S. Ma, N. Odintsova, A. Beygelzimer, G. Grabarnik, and K. Hernandez, Adaptive Diagnosis in Distributed Systems. IEEE Transactions on Neural Networks (special issue on Adaptive Learning Systems in Communication Networks), 16(5):1088-1109, 2005, and I. Rish, M. Brodie, N. Odintsova, S. Ma, and G. Grabarnik, Real-time Problem Determination in Distributed Systems Using Active Probing. Proceedings of the 9th IFIP/IEEE International Network Management and Operations Symposium (NOMS 2004), pages 133-146, Seoul, Korea, 2004) use a measurement technique that allows fast on-line inference about the current system state via active selection of only a small number of the most informative probes.
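The notion of selecting only the most informative probes can be sketched with a greedy loop. This is a minimal illustration under strong assumptions that the cited works do not necessarily make: exactly one component is faulty, probe outcomes are deterministic, and "most informative" is approximated by the probe that splits the remaining candidates most evenly. The probe table and function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical probes: each passes through a set of components and fails
# if the single faulty component lies on its path.
PROBES: Dict[str, Set[str]] = {
    "p1": {"a", "b"},
    "p2": {"b", "c"},
    "p3": {"c", "d"},
    "p4": {"a"},
}


def most_informative(probes: Dict[str, Set[str]],
                     candidates: Set[str]) -> Optional[str]:
    """Pick the probe that splits the candidates most evenly, or None."""
    best, best_score = None, 0
    for name in sorted(probes):          # sorted for a deterministic tie-break
        hit = len(probes[name] & candidates)
        score = min(hit, len(candidates) - hit)
        if score > best_score:
            best, best_score = name, score
    return best


def diagnose(probes: Dict[str, Set[str]], components: Set[str],
             run_probe: Callable[[str], bool]) -> Tuple[Set[str], List[str]]:
    """Run probes one at a time until one candidate (or no useful probe) is left."""
    candidates = set(components)
    remaining = dict(probes)
    used: List[str] = []
    while len(candidates) > 1:
        name = most_informative(remaining, candidates)
        if name is None:
            break                        # no probe can separate the candidates
        used.append(name)
        path = remaining.pop(name)
        if run_probe(name):              # probe succeeded: its path is healthy
            candidates -= path
        else:                            # probe failed: the fault is on its path
            candidates &= path
    return candidates, used


# Simulate a fault in component "c"; a probe succeeds if it avoids "c".
faulty = "c"
found, used = diagnose(PROBES, {"a", "b", "c", "d"},
                       lambda name: faulty not in PROBES[name])
```

With this probe table the fault is isolated after two of the four probes, since each selected probe halves the candidate set.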
J. E. Stanley, R. F. Mills, R. A. Raines, and R. O. Baldwin, Correlating network services with operational mission impact. Proceedings of the IEEE Military Communications Conference (MILCOM), pages 162-168, Chicago, 2005, exploits the relationships captured in the CMDB regarding services, components, and users to determine the impact of network outages on services and users. Specifically, metadata in the network packets blocked by an outage identify the services and users immediately affected, and CMDB relationships help determine the further impact.
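The second step, following CMDB relationships outward from a failed item, can be sketched as a breadth-first traversal. The sketch covers only that step and not the packet-metadata analysis; the relationship table, the "user:" naming prefix, and the function name are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical CMDB relationships: each item maps to the items that use it.
USED_BY: Dict[str, List[str]] = {
    "router1": ["smtp01", "web01"],
    "smtp01": ["mail"],
    "web01": ["portal"],
    "mail": ["user:alice", "user:bob"],
    "portal": ["user:bob"],
}


def impacted(failed_item: str, used_by: Dict[str, List[str]]) -> Set[str]:
    """Return every item that depends, directly or indirectly, on failed_item."""
    seen: Set[str] = set()
    queue = deque([failed_item])
    while queue:
        item = queue.popleft()
        for dependant in used_by.get(item, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return seen


# A failure of smtp01 reaches the mail service and, through it, two users.
hit = impacted("smtp01", USED_BY)
users = sorted(i for i in hit if i.startswith("user:"))
```

The result is only as good as the recorded relationships: an item missing from the table simply contributes no downstream impact.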
Reference can also be made to “Algorithm Design and Application of Service-Oriented Event Correlation”, Andreas Hanemann, Patricia Marcu, BDIM, NOMS 2008.