Nowadays, as information systems become ubiquitous and companies and organizations in all sectors become more and more dependent on their computing resources, the requirement for the availability of the hardware and software components of an IT infrastructure (basically a computer or telecommunication network, including applications and services based on it) is increasing, while the complexity of IT infrastructures is growing. An IT infrastructure often comprises a diversity of systems, such as cables, repeaters, switches, routers, access points, work stations, servers and storage systems, some of which host operating systems, middleware and applications. The management of all these systems is becoming important, not only for large organizations, but also for medium-sized and small ones. The management of networked systems includes all the measures necessary to ensure the effective and efficient operation of a system and its resources in accordance with an organization's goals (see, for example, H. Hegering et al.: “Integrated Management of Networked Systems”, Morgan Kaufmann Publishers, 1998, pp. 83-93).
Networks and services related to them can generally be approached from a technical perspective or a business-oriented perspective which are, however, closely related. The technical perspective pertains to the technical aspects of network management, whereas the business-oriented perspective nowadays mainly deals with service level agreements (SLAs) between network providers and customers (see, for example, J. Lee et al.: “Integrating Service Level Agreements”, Wiley Publishing, Inc., 2002, pp. 3-25). The interrelation between them is that the quality of a service (QoS) delivered by the provider typically depends on the technical design, functioning and performance of the underlying IT infrastructure, and that measures of the QoS which reflect commitments made in an SLA are normally based on technical network management tools. Such QoS measures allow continuous tracking of the service being delivered and gauging whether service delivery conforms to the agreed-upon SLA.
From the technical point of view, the field of network management may be divided into five areas which together are sometimes called “FCAPS”: fault management, configuration management, accounting management, performance management and security management. The objective of fault management is to detect and correct faults quickly to ensure that a high level of availability of a networked system and of the services it provides is maintained, a fault being defined as a deviation from the prescribed operating goals, system functions, or services. The tasks that have evolved from this objective include: monitoring network and system states, responding and reacting to fault notifications, such as alarms, diagnosing fault causes, establishing error propagation, etc. Configuration management deals with the adaptation of IT systems to operating environments and includes setting parameters, installing new software, expanding old software, attaching devices and making changes to a network topology. Accounting management includes tasks such as name and address administration, granting of authorizations and accounting services. Performance management can be regarded as a systematic continuation of fault management, since it aims to ensure not only that a communication network or a distributed system operates at all, but also that it performs well. It deals with data like network throughput, network latency, error rates, storage resources, CPU workload, memory usage, server response time, connection establishment time, etc. Security management refers to the management of security in distributed systems, which contain the resources of a company that are worth protecting.
In order to accomplish these different management tasks, individual management tools (such as response-time testers, trouble-ticket systems, etc.) are available on the one hand and management platforms, such as OpenView by Hewlett-Packard, on the other hand (see, for example, N. Muller: “Focus on OpenView”, CBM Books, 1995, pp. 1-20; J. Blommers: “OpenView Network Node Manager”, Prentice Hall, 2001, pp. vi-xiii). A management platform integrates several management tools and management-related databases, and can typically be customized according to a user's specific installations and needs. In the framework of handling SLAs, fault and performance-related information produced by fault and performance management applications in the form of individual management tools and/or within a management platform (see, e.g., Blommers, pp. 119-174) may be used as QoS measures to track the QoS achieved and to verify the compliance of the QoS with commitments made in an SLA.
From the business perspective, the quality of an IT service provided is what counts, particularly for businesses which rely entirely on IT services, such as e-commerce companies. A service in this context is a more abstract entity which is typically based on a combination of many elements of an IT infrastructure, such as cables, network interconnecting devices, gateways, servers, applications, etc. Services are, for example, transport services through a network, name services, e-mail services, Web request-response services, database services, and more specific application services tailored to a company's individual business needs. More and more such IT functions are outsourced to service providers. The service provider and the customer typically agree upon a contract in which the service provider commits itself to certain quality standards of the delivered service. The quality of a service is often quantified by one or more “service levels”, and a contract in which the two parties agree upon a certain (minimum) service quality is therefore called a “service level agreement”. Typically, SLAs define certain service level objectives (SLOs) which have to be met by the service provider. Typical SLOs are a certain availability of the IT infrastructure, a certain response time or network latency, a certain packet delivery guarantee, or an (often very complex) combination of such entities. One known type of SLO requires that a certain incident (e.g. a failure) does not occur at all; each single occurrence will then be considered an SLO violation and may result in a penalty to be paid by the service provider. Another type of SLO pertains to entities cumulated over a period of time, hereinafter called a “compliance period”, and requires that the cumulated entities are kept below or above certain thresholds; for example, it requires that the availability of a service is above 99.95% within a calendar month.
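The cumulative type of SLO described above can be illustrated by a minimal Python sketch. It is an illustration only, not part of any management tool mentioned here: the function names `availability` and `meets_slo`, the representation of outages as (start, end) pairs, and the assumption that outages do not overlap are all hypothetical choices made for the example.

```python
from datetime import datetime


def availability(period_start, period_end, outages):
    """Fraction of the compliance period during which the service was up.

    `outages` is a list of (start, end) datetime pairs. Parts of an outage
    lying outside the compliance period are ignored. Assumption: the
    outages do not overlap one another.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage to the boundaries of the compliance period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 1.0 - down / total


def meets_slo(measured_availability, threshold=0.9995):
    """True if the measured availability reaches the agreed threshold."""
    return measured_availability >= threshold


# Example: a 30-day calendar month with a single 20-minute outage.
month_start = datetime(2024, 4, 1)
month_end = datetime(2024, 5, 1)
outages = [(datetime(2024, 4, 10, 12, 0), datetime(2024, 4, 10, 12, 20))]
print(meets_slo(availability(month_start, month_end, outages)))
```

In a 30-day month, a 99.95% availability objective tolerates 21.6 minutes of downtime, so the 20-minute outage in the example remains compliant, whereas a 25-minute outage would not.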
As mentioned above, the QoS actually delivered is determined quantitatively, e.g. from measured fault and performance-related information produced by fault and performance management applications; by comparing the measured values with the SLOs, it is decided whether the service delivered is compliant with the agreed-upon SLA.
A service provider committing itself to a service with a higher service level, for instance a higher availability or a shorter response time, can usually charge higher service fees than in the case of a lower service level, e.g. a lower availability or a longer response time. On the other hand, a provider guaranteeing a higher service level will either be more likely to fail to comply with the SLA, and will thus have a higher risk of paying a penalty or damages, or will have higher operating and investment expenses to achieve the agreed-upon service level without such a higher failure risk. Knowledge of the QoS actually delivered by the service provider is normally desired by both the service provider and the customer, during the running compliance periods as well as at the end of these periods.
To this end, SLA management typically includes visibility and reporting requirements that can be divided into two categories: real-time and historical. Real-time reporting, for example, enables an IT operator to assess whether a fault is relevant for compliance with an SLA and, consequently, to prioritize countermeasures. A historical report, at the end of a compliance period, is typically used for reconciliation purposes, i.e. to assess whether the SLA was complied with, to calculate penalties or damages, etc. (see, for example, Lee, pp. 49-52).
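How a real-time report might help an operator judge the relevance of a fault can be sketched as follows, assuming an availability SLO over a compliance period. The function names `downtime_budget_seconds` and `realtime_status`, the returned dictionary layout, and the idea of expressing the remaining tolerance as a "downtime budget" are assumptions made for illustration and are not taken from the cited literature.

```python
from fractions import Fraction


def downtime_budget_seconds(period_seconds, availability_slo="0.9995"):
    """Downtime tolerated in one compliance period, in seconds.

    The SLO is passed as a decimal string so that the subtraction
    1 - SLO is carried out exactly rather than in floating point.
    """
    return float((1 - Fraction(availability_slo)) * period_seconds)


def realtime_status(period_seconds, downtime_so_far, availability_slo="0.9995"):
    """Snapshot of SLO compliance during a running compliance period."""
    budget = downtime_budget_seconds(period_seconds, availability_slo)
    remaining = budget - downtime_so_far
    return {
        "budget_s": budget,         # total downtime the SLO tolerates
        "remaining_s": remaining,   # downtime still tolerable in this period
        "violated": remaining < 0,  # True once the budget is exhausted
    }


# Example: a 30-day period in which 20 minutes of downtime have accumulated.
thirty_days = 30 * 24 * 3600
print(realtime_status(thirty_days, downtime_so_far=20 * 60))
```

Evaluated during the period, the remaining budget indicates how urgently a current fault must be handled; evaluated with the final downtime figure at the end of the period, the same calculation yields the input for a historical report.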