Distributed data processing may pose a unique set of management challenges. Because functionality is, typically, distributed and may interact across a wide variety of communications media including, but not limited to, local area networks (LANS), wide area networks (WANS), satellite communications, cellular networks, packet radio networks, and so forth, it may be difficult to manage service quality in such systems, in locating the components causing service quality problems, and/or in allocating resources to improve service quality. Because a data processing system may be composed of a number of physical and logical systems, and these systems may in turn host a great number of software components, which in turn, may host more dependent software components, the problem may not be just one of distribution, but of complexity as well.
Many of these discrete hardware and software components may be instrumented to provide visibility into the status and/or state of the specific component and of the data processing system comprised, in whole or in part, of these components. A distributed data processing system may include hundreds, or even thousands of these components. Each component may have tens or hundreds of instrumented measures and attribute data. The volume of data available for inspection may make it difficult, if not impossible, to ascertain the causes of service quality problems in complex distributed data processing systems.
Conventionally, ascertaining the causes of service quality problems has, typically, been provided by component status evaluation and/or service status evaluation. Component status is, typically, evaluated through the use of component monitors. Service status is, typically, estimated by correlating and/or aggregating component status to a service through the use of a component-to-service mapping. Typically, no direct measure of service status is used.
Service status may be measured directly through the use of service monitors. Service monitors may be active testing monitors, passive monitors, or a combination of the two. Component status may be evaluated through the use of component monitors. Component status may be time-correlated with service status. Time correlation may occur with or without a service-to-component mapping.
When determining component status one existing approach has been to generate a “trap” by an instrumented component when the component instrumentation detects a problem with the component. In the event that a large number of components experience a problem, a large number of these traps may be generated as well. The problems experienced and traps generated may be independent, or they may be causally linked; where a problem with one component causes a problem to be detected in one or more subsequent components. Conventional systems attempt to provide for the reduction, correlation, analysis, and display of these component traps to reduce the number of traps presented to an operator to a manageable number, and to help operators find the root cause of a system problem.
Service quality could be affected when one or more components involved in the delivery of the service experience a problem or problems. Some method of mapping component traps to services is typically applied to determine that a component trap may be associated with the delivery of a service. This mapping may be as simple as listing the components involved in the service, or more advanced techniques for service-to-component mapping may be applied.
To more directly detect deficiencies in performance and/or availability another approach has been developed; the monitoring of service transactions to ascertain compliance to one or more standards of performance and/or availability. Two well-known methods used for ascertaining performance and/or availability are active testing and passive monitoring.
An active testing approach may, typically, use simulated transactions to exercise a service. These simulated transactions are typically designed to represent the types of transactions actual users of the service would execute. Users of the service may include people interacting directly with the service via a human/computer interface, or intermediate computers acting under programmatic control on behalf of users. These simulated transaction generators may be located completely within a management domain, or they may be located in multiple management domains, as in the case where robotic transaction generators located at diverse points in the Internet exercise services delivered via the Internet.
In the passive monitoring approach, typically, an “agent” monitors actual users of the service with little or no perturbation of the service. These passive monitoring systems may be implemented as an agent on a client computer, as an agent running on a dedicated monitoring system, as an agent on a host system, and/or as a combination of two or more of these implementations.
Service monitoring approaches combining active testing and passive monitoring may be implemented as well.
In both the active testing approach and the passive monitoring approach performance and/or availability may be measured and compared to a standard or standards on an ongoing basis. Such standards are often referred to as “service level agreements.” In the case of the active testing approach periodic tests may be run. In the case of the passive monitoring approach the execution of actual transactions may activate the monitoring function.
When service level agreements are not met service traps may be generated which indicate non-conformance. These service traps may be reduced, correlated, analyzed, and reported on just as component traps may.
Attempts have been made to correlate performance and/or availability monitoring with component health monitoring. This is typically accomplished through a common user interface for viewing measurement data and through a common trap management and correlation interface for managing and handling traps. The performance and/or availability monitoring approach and the component health monitoring approach frequently operate independently; they are decoupled, or loosely coupled through trap correlation methods and data display methods. In both cases this approach often relies upon after-the-fact time correlation between performance and/or availability issues and component traps. By examining performance and/or availability problems and component health problems that occur near each other in time, operators may deduce some degree of causality between component health problems and performance and/or availability problems.
Another approach that may also be utilized is the use of a performance and/or availability monitor in conjunction with log file inspection. Instead of time correlating a service trap with component traps, component logs may be subsequently examined for anomalies occurring near the time of the performance and/or availability problem. This is time-correlation of decoupled data, where the collection of performance and/or availability status is not linked to the collection of log status until well after the component trap has occurred.