1. Field of the Invention
This is a continuation application of U.S. patent application Ser. No. 10/418,565, filed on Apr. 17, 2003. This invention relates to technologies for controlling and automating corrective actions for computer systems, business application programs, and their resources.
2. Background of the Invention
Currently, most computer system management products are designed to handle a single customer's requirements under the assumption that only one customer is using all of the resources employed by the application program(s). In such a traditional arrangement, a computer processor unit or “CPU”, the memory it employs, and the persistent storage it uses (e.g. hard disk drives), are all dedicated to a single customer's usage.
As such, that customer's corrective action requirements may be fairly simply automated. Turning to FIG. 1, the general process (10) of existing computer system management products is shown, in which an analysis process is started (11) each time a fault (12) or other out-of-limit condition is detected. Such a condition may be, for example, a hard drive unit reaching a limit of 90% full, at which point a corrective action is to be taken.
An event management system usually queries (13) a single set of rules (16) for that customer to determine how to handle the event or condition. For example, in a particular customer's system, a nearly full hard drive unit may be a critical situation for a data intensive application, and as such, the appropriate action may be to send multiple alerts by pager, email, and printed report to support staff so that additional hard disk resources may be allocated or installed. In a different customer's application and system which is less dependent on hard drive storage, the condition may be less critical, and the rule may indicate to send a low-priority status or warning message by email to a service engineer.
So, based on these single-customer rules, the event management system (14) takes appropriate actions, thereby completing (15) the processing of the fault or condition. This creates a “one-size-fits-all” fault and out-of-limit condition handling process for the entire system and its resources, assuming that a single customer or client is using all of those resources.
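The single-customer flow of FIG. 1 described above can be sketched roughly as follows. This is a minimal illustration only, not the implementation of any particular product: the names `Event`, `RULES`, and `handle_event`, the rule layout, and the action labels are all assumptions introduced for the example. Only the 90% disk limit and the alert types (pager, email, printed report) come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A detected fault or out-of-limit condition (12). Hypothetical type."""
    resource: str    # e.g. "hard_drive_1"
    condition: str   # e.g. "disk_usage_pct"
    value: float     # measured value for the condition


# Hypothetical single rule set (16): one limit and one list of actions per
# condition, shared by everything that runs on the system.
RULES = {
    "disk_usage_pct": {
        "limit": 90.0,
        "actions": ["page", "email", "printed_report"],
    },
}


def handle_event(event, rules=RULES):
    """Query the single rule set (13) and return the actions to take (14)."""
    rule = rules.get(event.condition)
    if rule is None or event.value < rule["limit"]:
        return []  # no matching rule, or the value is still within limits
    return list(rule["actions"])
```

Because the lookup is keyed only by the condition, every event of a given type receives the same response, which is the “one-size-fits-all” behavior noted above.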
However, as business conditions have evolved recently, it has been found that deploying multiple infrastructures (e.g. multiple sets of resources) to support small or medium businesses is not cost effective. It has become desirable for some systems and service providers such as International Business Machines (“IBM”) to “host” multiple application programs for multiple customers on a set of shared resources. For example, a system may have a single processor unit, a single bank of RAM, and two hard drive units. Three customers' applications may be run simultaneously on this set of resources, with a first customer application using the first hard drive, and the second and third customers' applications using (i.e. sharing) the second hard drive. In this manner, a system or resource provider can share an infrastructure amongst multiple customers, thereby minimizing cost associated with unused (e.g. spare) resources, maintenance expenses due to duplicate hardware installations, etc.
System management professionals, however, are only provided with the traditional “one-size-fits-all” tools (e.g. a single-customer action response rule set) for taking corrective action on a server, and for alerting a customer if a server has reached a certain percentage utilized or other actionable condition. As such, currently available system management tools and technologies are not capable of taking different actions for different customers if the customers share a single resource, and thus do not support the newer business requirements to host multiple customer applications on multiple shared resources, especially in situations wherein the thresholds, limits, and response actions for such multiple customers vary from customer to customer.
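The limitation can be illustrated with the shared-resource example given earlier (three customers, two hard drive units). The sketch below is hypothetical: `RESOURCE_USERS`, `RESOURCE_RULES`, `actions_for_fault`, and the action labels are assumed names chosen for illustration, and the rule set is assumed to be keyed by resource alone, as a traditional tool would be.

```python
# Hypothetical layout from the example: which customers use which drive.
RESOURCE_USERS = {
    "hard_drive_1": ["customer_1"],
    "hard_drive_2": ["customer_2", "customer_3"],  # shared drive
}

# Assumed traditional rule set: exactly one response per resource.
RESOURCE_RULES = {
    "hard_drive_1": "page_support_staff",
    "hard_drive_2": "email_service_engineer",
}


def actions_for_fault(resource):
    """Return (customer, action) pairs for a fault on the given resource."""
    action = RESOURCE_RULES[resource]  # a single action, whoever is affected
    return [(customer, action) for customer in RESOURCE_USERS[resource]]
```

Calling `actions_for_fault("hard_drive_2")` pairs both the second and third customers with the same action, since nothing in a resource-keyed rule set can express a different threshold or response for each customer sharing the drive.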
This further limits the ability of the service provider or hosting company to offer different levels of service, presumably for different fee or cost structures, when the applications are to be implemented or “run” on shared resources. For example, one client could not be offered a less expensive support plan which does not include any weekend or evening escalation responses, while another client is offered a support plan which provides immediate responses even during “premium” hours.
Therefore, there is a need in the art for a system and method which readily supports taking corrective action for conditions and faults detected in computing system infrastructures hosting multiple customer applications and sharing multiple resources, in which the corrective action rules are configurable and adjustable for each customer's requirements and are decoupled from a universal response scheme associated solely with each shared resource.