This invention relates to distributed computer systems, and more particularly to performance monitoring and control of server computers and applications.
Multi-tasking computer systems have existed for several decades, allowing a computer resource to be shared among many users. A computing resource, such as use of the central processing unit (CPU), is shared among different programs or processes running for different users.
Management of these concurrent processes was provided by simple UNIX commands. A UNIX process could be commanded to be nice, allowing other processes to have a higher priority and use a greater percentage of the CPU time. However, since a high-level user application could spawn many processes, determining which processes belonged to each user application was difficult.
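As an illustration of the per-process priority control described above, the following minimal sketch shows a process lowering its own scheduling priority through the UNIX nice facility, here reached through Python's standard `os.nice` call. The helper name `lower_own_priority` and the default increment of 5 are assumptions chosen for illustration and are not taken from any particular tool; the call is available on UNIX-like systems only.

```python
import os


def lower_own_priority(increment: int = 5) -> int:
    """Raise this process's niceness by `increment` and return the new value.

    A higher niceness means a lower scheduling priority, so other processes
    receive a larger share of CPU time. (Hypothetical helper for illustration.)
    """
    return os.nice(increment)


if __name__ == "__main__":
    before = os.nice(0)  # an increment of 0 only reports the current niceness
    after = lower_own_priority()
    print(f"niceness changed from {before} to {after}")
```

Note that the adjustment applies to a single process. Nothing in it identifies which higher-level user application the process belongs to, which is the limitation noted above.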
Other performance-monitoring tools were developed, such as Hewlett-Packard's PerfView monitor. Monitoring data for the different processes belonging to an application could be aggregated, allowing performance of a higher-level application to be monitored rather than the separate processes it spawned.
More advanced resource-manager tools such as Hewlett-Packard's Process Resource Manager (PRM) have become available. Computing resources such as the target percentage of the CPU, main memory, or I/O channels could be allocated among applications.
While these resource-based measurements are still commonly used, the end user is more concerned with other metrics. The user cares more about when his job will be finished, or how long a web site takes to respond, than about the exact percentage of a remote server that he is allocated. Indeed, Internet users may not be upset if allocated only 1% of a server's CPU, but may complain when a server's response takes 2 minutes.
Metrics such as response time, job time, or availability are known as service-level measurements. Targets such as a database-application response time of less than 5%, or a server availability of greater than 99.95%, are known as service-level objectives (SLOs). These objectives are defined in terms of the end-user service experience, rather than resource usage.
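The distinction between a service-level measurement and a service-level objective can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing monitoring product: the class `ServiceLevelObjective`, the helper `availability_percent`, and the 2-second response-time target are hypothetical. Only the 99.95% availability target is taken from the text above.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ServiceLevelObjective:
    """A target for one service-level measurement (hypothetical type)."""
    name: str
    target: float
    # Comparison that must hold between the measured value and the target.
    satisfied_by: Callable[[float, float], bool]

    def is_met(self, measured: float) -> bool:
        return self.satisfied_by(measured, self.target)


def availability_percent(uptime_seconds: float, total_seconds: float) -> float:
    """Service-level measurement: share of the period the server was up."""
    return 100.0 * uptime_seconds / total_seconds


# Availability target from the text: greater than 99.95%.
availability_slo = ServiceLevelObjective(
    "server availability (%)", 99.95, operator.gt)

# Assumed target for illustration only: response in under 2 seconds.
response_slo = ServiceLevelObjective(
    "response time (s)", 2.0, operator.lt)

if __name__ == "__main__":
    one_day = 24 * 60 * 60
    measured = availability_percent(one_day - 10, one_day)  # 10 s of downtime
    print(availability_slo.name, "met:", availability_slo.is_met(measured))
    print(response_slo.name, "met:", response_slo.is_met(120.0))  # 2 minutes
```

Both objectives are expressed purely in terms of what the end user experiences. Neither refers to a CPU percentage or any other resource allocation.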
Monitoring products that measure against such service-level objectives are being developed, such as Hewlett-Packard's Web Transaction Observer. However, when such SLOs are not met, the burden is on the network administrator to determine what changes to make to meet them. The administrator may have to reduce the CPU usage of other, lower-priority applications to improve the SLO of a failing application. This, in turn, may cause the SLOs of those other applications to fall below their targets.
Additionally, the SLOs may not be met due to other factors, such as load balancing among a cluster of servers at a server farm. An SLO may depend on several layers of applications, any of which could be causing the SLO to be missed. Complex multi-level e-commerce applications may include database back-ends and front-end server applications, as well as middleware layers of software. These software components may be distributed across several machines, or may reside on shared machines. The many interconnected components that together provide a service to an end user may even share the same CPUs, requiring that CPU usage be intelligently allocated among them all to maximize the service-level objective.
Determining system-management policies to maximize SLOs is quite difficult. A system that can monitor overall SLOs and adjust the lower-level resource allocations is desirable.