As computer systems become more complex, they become more difficult to manage efficiently, and the problems that occur within them become more difficult to isolate. One form of complex computer system on which society increasingly relies is the database system. Conventional database systems consist of one or more clients (“database applications”) and a server (a “database server”). When a client requires data, the client submits a query to the server, where the query includes criteria for selecting the data. The server retrieves the data that satisfies the specified criteria from a database and returns copies of the selected data to the client that submitted the query. More complex database systems may include numerous servers that share access to one or more databases, where each such server may be serving thousands of clients.
Another type of complex computer system is known as an application server system. An application server system typically consists of one or more clients (“browsers”), a server (“application server”), and applications (“cartridges”). Users of the browsers execute the cartridges by causing the browsers to send and receive messages through the application server to the cartridges. FIG. 1 is a block diagram of an exemplary application server system 100. The system 100 includes a plurality of browsers 102, 104 and 106 that communicate with a plurality of listeners 110, 116 and 122 over the Internet 108 according to HTTP. In response to requests from the browsers, the listeners cause a web application server 180 to invoke software modules, referred to herein as cartridges. In the illustrated embodiment, web application server 180 has initiated the execution of three cartridges 130, 134 and 138.
The web application server 180 is composed of numerous components, including transport adapters 112, 118 and 124, dispatchers 114, 120 and 126, an authentication server 152, a virtual path manager 150, a resource manager 154, a configuration provider 156 and a plurality of cartridge execution engines 128, 132 and 136. A typical operation within system 100 generally includes the following stages:
(1) A browser transmits a request over the Internet 108.
(2) A listener receives the request and passes it through a transport adapter to a dispatcher.
(3) The dispatcher communicates with the virtual path manager 150 to determine the appropriate cartridge to handle the request.
(4) At this point the dispatcher does one of two things. If the dispatcher knows about an unused instance of that cartridge, the dispatcher sends the request to that instance. If there are no unused instances of that cartridge, the dispatcher asks the resource manager 154 to create a new cartridge instance. After the instance starts up successfully, the cartridge notifies the resource manager 154 of its existence. The resource manager 154 then notifies the dispatcher of the new instance. The dispatcher creates a revised request based on the browser request and sends the revised request to the new instance.
(5) The cartridge instance handles the revised request and sends a response to the dispatcher.
(6) The dispatcher passes the response back through the listener to the client.
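The stages above can be illustrated with a minimal Python sketch. All class, method, and path names here (`Dispatcher`, `VirtualPathManager`, `ResourceManager`, `CartridgeInstance`, `/orders`) are hypothetical and chosen only for illustration; they are not taken from any actual application server. The sketch also simplifies the flow by sending the same revised request whether the instance is reused or newly created, and it omits the listener, the transport adapter, and the startup notifications.

```python
from dataclasses import dataclass, field


@dataclass
class CartridgeInstance:
    """Hypothetical stand-in for a running cartridge instance."""
    cartridge: str
    busy: bool = False

    def handle(self, revised_request: dict) -> str:
        return f"{self.cartridge} handled {revised_request['path']}"


class VirtualPathManager:
    """Maps a request path to a cartridge name (illustrative mapping only)."""

    def __init__(self, routes: dict):
        self.routes = routes

    def cartridge_for(self, path: str) -> str:
        return self.routes[path]


class ResourceManager:
    """Creates cartridge instances on request and counts how many it made."""

    def __init__(self):
        self.created = 0

    def create_instance(self, cartridge: str) -> CartridgeInstance:
        self.created += 1
        return CartridgeInstance(cartridge)


@dataclass
class Dispatcher:
    vpm: VirtualPathManager
    rm: ResourceManager
    instances: list = field(default_factory=list)

    def dispatch(self, request: dict) -> str:
        # Ask the virtual path manager which cartridge handles the request.
        cartridge = self.vpm.cartridge_for(request["path"])
        # Prefer an unused instance the dispatcher already knows about.
        instance = next(
            (i for i in self.instances
             if i.cartridge == cartridge and not i.busy),
            None,
        )
        if instance is None:
            # No unused instance: ask the resource manager for a new one.
            instance = self.rm.create_instance(cartridge)
            self.instances.append(instance)
        # Build a revised request based on the browser request (assumed shape).
        revised = {"path": request["path"], "cartridge": cartridge}
        instance.busy = True
        try:
            return instance.handle(revised)
        finally:
            instance.busy = False


dispatcher = Dispatcher(
    VirtualPathManager({"/orders": "orders_cartridge"}), ResourceManager()
)
first = dispatcher.dispatch({"path": "/orders"})   # creates a new instance
second = dispatcher.dispatch({"path": "/orders"})  # reuses the idle instance
```

In this sketch the second request reuses the instance created for the first, so the resource manager is consulted only once.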
Application server systems and database systems can be combined. For example, some or all of the cartridges that are operated through the browsers in an application server system may in fact be database applications that, in turn, issue queries to one or more database servers in response to messages from the browsers. Due to the complexity of such combined systems, it is exceedingly difficult to identify the cause of performance problems. For example, assume that a browser receives an extremely slow response to a message that is sent to an application server and dispatched to a cartridge. The message causes the cartridge to issue a query to a database server, which executes the query to retrieve data for the response. Under these conditions, the slow response time may be due to problems with any of the entities involved, or to communication problems between the entities.
The identification of unacceptably slow response times may be of interest to users as well as to the administrators responsible for managing the computer system. For example, the subscription agreement of a user may guarantee a particular level of performance (e.g. that 98% of all orders be processed in less than one minute). Users with such subscriptions would typically be interested to know when, and how often, the system is not meeting the specified level of performance.
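As a rough illustration of checking such a guarantee, the following Python sketch computes the fraction of orders processed in less than a threshold and compares it with the required fraction. The function names, the default values (60 seconds, 98%), and the sample durations are assumptions made for this example only.

```python
def fraction_within(durations_s, threshold_s):
    """Return the fraction of durations strictly below threshold_s."""
    if not durations_s:
        raise ValueError("no measurements supplied")
    within = sum(1 for d in durations_s if d < threshold_s)
    return within / len(durations_s)


def meets_objective(durations_s, threshold_s=60.0, required_fraction=0.98):
    """True when at least required_fraction of orders beat the threshold."""
    return fraction_within(durations_s, threshold_s) >= required_fraction


# Hypothetical samples: processing time of each order, in seconds.
good_period = [10.0] * 98 + [90.0] * 2   # 98 of 100 orders under one minute
bad_period = [10.0] * 97 + [90.0] * 3    # 97 of 100 orders under one minute

good_ok = meets_objective(good_period)
bad_ok = meets_objective(bad_period)
```

Counting how often `meets_objective` returns false over successive reporting periods would give a user the "when, and how often" view described above.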
The process of ensuring that the system is able to meet the performance requirements of users is generally referred to as capacity planning. Typically, there are six general phases in the capacity planning process:
(1) setting up the service level objectives;
(2) estimating the demand for the resources of the system;
(3) identifying resources that satisfy the estimated demand;
(4) implementing the system with the identified resources;
(5) determining whether the system actually satisfies the demand; and
(6) repeating steps (2) to (5) when the system fails to satisfy the demand.
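The iterative nature of phases (2) through (6) can be sketched as a simple loop. This is a toy model under stated assumptions, not a description of any actual planning tool: demand is a single number (requests per second), the only resource is a count of identical servers with a fixed per-server capacity, and the service level objective of phase (1) is reduced to "capacity covers the observed demand". The names `plan_capacity`, `estimate_demand`, and `measure_demand` are hypothetical.

```python
import math


def plan_capacity(estimate_demand, measure_demand, per_server_capacity,
                  max_rounds=5):
    """Repeat phases (2)-(5) until the observed demand is satisfied."""
    history = []
    demand = estimate_demand()                              # phase (2)
    for _ in range(max_rounds):
        servers = math.ceil(demand / per_server_capacity)   # phase (3)
        capacity = servers * per_server_capacity            # phase (4)
        observed = measure_demand()                         # phase (5)
        history.append((servers, observed))
        if capacity >= observed:
            return servers, history
        demand = observed                                   # phase (6)
    raise RuntimeError("demand not satisfied within max_rounds")


# Hypothetical numbers: the initial estimate is too low.
servers, history = plan_capacity(
    estimate_demand=lambda: 250,
    measure_demand=lambda: 420,
    per_server_capacity=100,
)
```

With these numbers the first round provisions three servers, the measurement shows they are insufficient, and the second round settles on five.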
The step of determining whether the system satisfies the demand may be accomplished, for example, by periodically analyzing statistical information relating to the system. However, the amount of processing that such analysis may require can be so enormous that, if performed at a reasonable frequency, the analysis overhead itself may result in a violation of the service level commitments made to users.
The better the tools that are made available to the capacity planner, the higher the likelihood that the implemented system will satisfy the anticipated demands. Further, when the implemented system is not satisfying the anticipated demands, better tools make it easier to determine and fix the problems that are preventing the achievement of the desired performance levels.
A number of systems have been developed for problem identification and planning. For example, systems for problem determination in performance management are described in B. Arinze, M. Igbaria, and L. F. Young: “A Knowledge Based Decision Support System for Computer Performance Management,” Decision Support Systems 8, 501-515, 1992 and Bernard Domanski: “A PROLOG-based Expert System for Tuning MVS/XA,” Proceedings of the Computer Measurement Group, 160-166, 1987. A system for process control is described in D. R. Irwin: “Monitoring the Performance of Commercial T1-rate Transmission Service,” IBM Journal of Research and Development, 805-814, 1991. A system for planning cooking recipes is described in Janet Kolodner: “Case-Based Reasoning,” Morgan Kaufmann Publishers, Inc., 1993. Systems for problem identification of electrical circuits and analysis of financial statements are described in Robert Milne: “Using AI in the Testing of Printed Circuit Boards,” National Aerospace & Electronics Conference, Dayton, Ohio, May 1980, and Donald W. Kosy and Ben P. Wise: “Self-Explanatory Financial Planning Models,” Proceedings of the National Conference on Artificial Intelligence, 176-181, 1984. However, none of these systems addresses the domain of systems management, nor do they consider problem discovery and capacity planning.
An attempt to apply multidimensional database technology to systems management, which focuses on performance management for data from a single source, is described in Robert F. Berry and Joseph L. Hellerstein: “A Flexible and Scalable Approach to Navigating Measurement Data in Performance Management Applications,” Proceedings of the Second IEEE International Conference on Systems Management, June 1996. Another attempt to use multidimensional navigation for sales/subscription handling is described in Business Objects: A. M. Burgeat and F. Prabel, “Data Warehousing: Delivering Decision Support to the Many,” Business Objects Corporation, 1996.
Based on the foregoing, it is clearly desirable to provide techniques that allow problems within complex computer systems to be isolated, that assist in planning such systems to comply with user requirements, and that help maximize system capacity and avoid bottlenecks.