This invention relates generally to quality of service monitoring in a computer system and more particularly to a system and method for efficiently monitoring quality of service in a distributed processing environment.
Distributed processing environments offer distinct advantages over dedicated computer systems in terms of increased performance, portability, availability and resource sharing. However, monitoring the quality of service received by applications running in a distributed processing environment, including client-server and distributed applications, is a complex task. The mixture of operating systems and hardware platforms, the geographical distances between application components and the lack of programs to collect and correlate data from diverse sources must be considered when attempting to monitor the quality of service, particularly in terms of performance.
Prior art techniques for multiuser systems are poorly suited to addressing the problems of quality of service monitoring for client-server and distributed applications. Generally, these techniques were designed, deployed and managed in a centralized mainframe or minicomputer environment and consequently ignore communication costs incurred by delays between distributed application components and data interaction between independent software services.
Similarly, prior art techniques for network quality of service measurement are unsuitable. One such technique, which employs software instrumentation to collect and report on a large number of metrics even if those metrics have not changed or are zero, is described in H. Rotithor, "Embedded Instrumentation for Evaluating Task Sharing Performance in a Distributed Computing System," IEEE Trans. on Instr. and Meas., Vol. 41, No. 2, pp. 316-321 (April 1992). This type of quality-of-service monitor substantially perturbs the system under measure because it floods the distributed components with packets containing quality-of-service information, and it becomes difficult for a user to comprehend the implications of the collected data. The measurement tools themselves significantly degrade network performance, rather than helping to analyze or increase the quality of service and performance. Moreover, due to the perturbation effect, this type of technique cannot scale to monitor large networks with, for instance, thousands of individual computer nodes.
Prior art techniques that minimize the perturbation effect on the system under measure, but fail to measure internal software state, are described in U.S. Pat. No. 5,067,107 to Wade entitled "Continuous Computer Performance Measurement Tool Reduces Operating System Produced Performance Data for Logging into Global, Process and Workload Files;" M. Abrams, "Design of a Measurement Instrument for Distributed Systems," IBM Research Report RZ 1639 (October 1987); D. Haban et al., "A Hybrid Monitor for Behavior and Performance Analysis of Distributed Systems," IEEE Trans. on Software Engr., Vol. 16, No. 2, pp. 197-211 (February 1990) and F. Lange et al., "JEWEL: Design and Implementation of a Distributed Measurement System," IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 6, pp. 657-671 (November 1992). Moreover, as these systems employ, at least in part, hardware-based solutions, the incremental cost for implementation is prohibitively high in a large distributed environment.
A prior art technique for defining an efficient measurement infrastructure for a performance measurement system for heterogeneous distributed environments is disclosed in R. Friedrich et al., "Integration of Performance Measurement and Modeling for Open Distributed Processing," Int'l Fed. of Info. Proc., Brisbane, Australia (February 1995). The system as disclosed teaches using asynchronous sensor access and control and data collection interfaces, but fails to define uniform application programming interfaces operable on a plurality of heterogeneous network nodes. Moreover, the system as disclosed provides an overview but fails to describe its theory of operation in detail.
Prior art techniques that concentrate only on individual component subsystems and fail to capture the entire picture of the distributed processing environment include the OpenView Network Manager and GlancePlus System Manager, both products by the Hewlett-Packard Company, Palo Alto, Calif. These techniques collect a restricted set of data for network resource consumption, such as links, routers and similar components, and for operating system resource consumption, such as central processing unit, memory, disk and related hardware components, respectively. These techniques collect data based on individual system events, which results in excessive resource usage and overhead for supporting operational control of distributed application environments. Furthermore, neither provides a correlated, end-to-end measurement of the quality of service for client-server or distributed applications.
Therefore, there is a need for a quality-of-service monitoring method and system which provides a pervasive measurement infrastructure suited to monitoring the quality of service and performance in a distributed processing environment. There is a further need for a quality-of-service monitoring system and method able to correlate application resource usage across network nodes. Also, there is a need for a quality-of-service monitoring system and method capable of judiciously reporting data to minimize collection and processing overhead on networks, thereby incurring a substantially minimal perturbation on the system under measure.
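By way of illustration only, the difference between reporting every metric on every interval, including metrics that have not changed or are zero, and judiciously reporting data can be sketched as follows. This is a minimal sketch under stated assumptions and not the method disclosed herein: the names report_all and ChangeOnlyReporter, the metric names, and the rule of reporting a metric only when its value differs from the last reported value are all hypothetical.

```python
from typing import Dict


def report_all(metrics: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical baseline: report every metric on every interval."""
    return dict(metrics)


class ChangeOnlyReporter:
    """Hypothetical reporter that sends a metric only when its value
    differs from the value last reported for that metric."""

    def __init__(self) -> None:
        self._last_reported: Dict[str, float] = {}

    def report(self, metrics: Dict[str, float]) -> Dict[str, float]:
        changed: Dict[str, float] = {}
        for name, value in metrics.items():
            # Assumption: a metric never reported before is treated as zero,
            # so a metric that stays at zero is never sent.
            if self._last_reported.get(name, 0.0) != value:
                changed[name] = value
        self._last_reported.update(changed)
        return changed


# Three collection intervals on one node (illustrative values only).
samples = [
    {"rpc_count": 0, "queue_len": 0, "errors": 0},
    {"rpc_count": 5, "queue_len": 0, "errors": 0},
    {"rpc_count": 5, "queue_len": 0, "errors": 0},
]

reporter = ChangeOnlyReporter()
sent_all = sum(len(report_all(s)) for s in samples)
sent_changed = sum(len(reporter.report(s)) for s in samples)
print(sent_all, sent_changed)
```

In this sketch the baseline transmits nine metric values over the three intervals, while the change-only reporter transmits one. The gap widens as the number of nodes and of idle metrics grows, which is the scaling concern raised above.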