1. Field of the Invention
The present invention relates to information processing technology. More particularly, the present invention relates to a system and method for dynamically generating and cleaning up event correlations resulting from various causes and effects being monitored in a computer system.
2. Description of the Related Art
One of the highest priorities of information technology (IT) organizations responsible for managing mission-critical computing environments is to ensure that problems, as well as conditions that could lead to problems, are handled in a timely and efficient manner. Event correlation managers are software systems that are designed to collect and respond to events that occur in the computer system. Events may come from a variety of sources. Examples include events that occur: (1) when a link to another computer system goes down, (2) when a router used for routing information goes down, (3) when a database is down, (4) when the system processor is maximized, or “pegged,” for an extended period, (5) when a disk is full, (6) when one or more applications that make up a critical business function (e.g., order entry) go down, (7) when a critical application program's performance degrades beyond an acceptable level, and (8) when a host computer is going down.
As used herein, a “business system” serves the needs of the organization's critical functions, such as order entry, marketing, accounts receivable, and the like. A business system may span several dissimilar types of computers and be distributed throughout many geographical locations. A business system, in turn, is typically based upon several application programs. An application program may also span several dissimilar types of computers and be distributed throughout a network of computer systems.
An application typically serves a particular function that is needed by the business system. An individual application program may, or may not, be critical to the business system depending upon the role the application program plays within the overall business system. Using networked computers, an application may span several computer systems. In an Internet commerce system, for example, an application program that is part of the company's order processing business system may be responsible for serving web pages to users browsing the company's online catalog. This application may use several computer systems in various locations to better serve the customers and provide faster response to customer inquiries.
The application may use some computers running one type of operating system, for example a UNIX-based operating system such as IBM's AIX® operating system, while other computer systems may run another type of server operating system such as Microsoft's Windows NT® Server operating system. Individual computer systems work together to provide the processing power needed to run the business systems and application programs. These computer systems may be mainframes, mid-range systems, workstations, personal computers, or any other type of computer that includes at least one processor and can be programmed to provide processing power to the business systems and applications.
Computer systems, in turn, include individual resources that provide various functionality to the computer systems. For example, a modem is an individual resource that allows a computer system to link to another computer system through a communication network. A router is another individual resource that routes electronic messages between computer systems. Indeed, even an operating system is an individual resource to the computer system, providing instructions to the computer system's one or more processors and facilitating communication between the various other individual resources that make up the computer system. Events, as described herein, may affect an entire business system, an application program, a computer system, or an individual resource depending upon the type of event that occurs.
The number and types of events that may occur vary widely from system to system based upon the system characteristics, load, and desired use of the system. A business system providing content from an Internet site may experience different events than a business system used to process the company's payroll. However, many events between dissimilar systems overlap. For example, many computer systems experience problems when the disk space is full and many computer systems experience problems when the system's processor is pegged. The types of problems these events cause, however, will vary depending upon the types of work that the business system is expected to perform.
In the Internet site example, a pegged processor is likely to cause applications interfacing with Internet users to become stalled or unusable and transaction throughput to stall or become exceedingly slow. In the corporate payroll system, the same pegged processor may result in critical software applications that make up the payroll application stalling or becoming exceedingly slow. The causes of the pegged processor may also be different depending upon the usage of the computer. An Internet server's processor may become pegged due to receiving more requests from Internet users than can be handled. The corporate payroll system's processor may have become pegged due to multiple processor-intensive business applications running simultaneously on the system.
Traditional event correlation managers are usually designed as hierarchical rule-based systems. After an event monitor detects a certain event, the correlation manager processes the event using the rules that have been predefined in the system in order to determine the likely cause of the event. Software vendors providing event correlation managers often provide a rule editor that allows customers to edit the rules that apply to the customer's system.
Event correlation managers typically receive signals, or messages, from event monitors that monitor business systems, applications, computer systems, and individual resources (collectively, “business system and components”). These event monitors are often programmed to filter information from the business system and components being monitored. The filtering criteria are often preset so that certain conditions are filtered out as non-problems while other conditions are trapped and sent to event correlation managers for processing. Traditional event monitors are challenged by the fact that the filtering criteria are preset or coded into the event monitor itself, making it difficult or impossible to dynamically alter the monitoring criteria used for a particular device or piece of software. Traditional event correlation managers, like their traditional event monitor counterparts, also face challenges in dealing with the complexities of today's modern business system and components.
One challenge with traditional event correlation managers is that the creation, modification, and maintenance of the rule base is a centralized activity resulting in a centralized set of rules. An area or individual within the IT organization may be responsible for updating the rules. However, with the complexity of modern business system and components, it is unlikely that one person or even one area will be the most knowledgeable about all of the event-producing hardware and software in the business system and components, nor will such person or area likely be the most knowledgeable concerning the possible effects that occur when a certain event occurs. It is also unlikely that centralized IT individuals or groups will be the most knowledgeable about what corrective action should be implemented when a certain event occurs. The IT group may have sufficient knowledge to allocate additional disk space if a disk full condition arises; however, the same group may not have expertise with a certain database management system (DBMS) that may crash or perform below a minimal acceptable threshold, nor may that group have sufficient knowledge regarding the business system and components. In the database example, a database administrator (DBA) with particular expertise would likely be a better source of knowledge regarding the actions to take when certain database conditions occur.
Involving employees who have expertise in particular fields is made more difficult by a centralized rule-based hierarchical event correlation manager because one area, typically the IT organization, controls the maintenance of the correlation manager. Receiving input from other people in various parts of an organization presents logistical and managerial challenges that traditional systems have difficulty handling.
Another challenge faced by traditional event correlation managers is the complexity of the rules and the complexity of the hierarchy structure of the rule base. As business systems and components become more complex, the events that may occur, both the causes and the effects, become ever more complex. A rule base is often organized in a hierarchy of nested “if-then-else” types of conditions. As an example, consider the following pseudo-code that might exist in an event correlation manager's rule base pertaining to one particular critical application:
IF critical_application_down THEN DO
    IF link_down THEN DO
        CALL NOTIFY_NETWORK_ADMIN
        . . .
    END
    IF database_down THEN DO
        IF disk_space_full THEN DO
            . . .
        END
    END
    IF processor_pegged THEN DO
        IF large_application_running THEN DO
            MESSAGE TO large_application to halt
            CALL LARGE_APPL_SUPPORT
        END
        ELSE
            CALL NOTIFY_ADMINISTRATOR
        IF . . .
    END
    . . .
    EXIT
END //end critical_application_down section
As illustrated by the above example, the rules-based approach often results in a large nested set of rules that becomes increasingly complex as the computer system changes or evolves. Changes made to the business system and components may not be reflected in the rule base until certain errors have occurred, been diagnosed, and entered in the rule base. The resulting rule base becomes exceedingly complex, and therefore exceedingly difficult to manage, as the computer system evolves and increases in complexity. When changes made to the computer system are not reflected in the rule base, the event correlation manager cannot manage the events and take the corrective action necessary. In addition, making changes to the business system and components without making corresponding changes to the rule base may result in phantom errors, with the system trying to act upon computer system hardware and software that may no longer exist in the business system. For example, one of the events corresponding to the critical_application_down error in the above example is a database being down. If the database is replaced or moved to a different system, the database_down condition may exist because of the system change, not because the database is actually down. Such phantom errors may trigger unnecessary, and potentially harmful, corrective actions and cause further confusion among the IT personnel as to which events have caused the current system outages and failures.
The complex hierarchical structure of traditional event correlation managers, coupled with the centralized maintenance of such systems, creates a formidable task for IT personnel to manage. This task is especially difficult in the face of increasing complexity of business systems and an ever-widening array of components and applications that comprise today's modern business system.
It has been discovered that a dynamic object-oriented approach to correlating events in a computer system has certain advantages over traditional static hierarchical rule-based event correlation managers found in the prior art. A dynamic object-oriented approach allows for a simpler structure for correlating events, including causes of events and effects of such events, found in a computer system. The dynamic object-oriented approach allows expertise of various areas or individuals to be brought together as needed without one individual or area being responsible for understanding all aspects of the computer system.
In the dynamic object-oriented approach, a subject matter expert creates an object template for each type of event that the system needs to handle. Individual subject matter experts, organizational areas, or third-party vendors provide expertise related to efficient and precise handling of given events. These object templates can be created with little knowledge of other events that may occur in the computer system. Object templates include logic for responding to the particular event. For example, a database_down event object may include a program call to a database management system utility to attempt to fix the database problem, send an email message to an administrator for intervention, and send a message to a pager carried by the database administrator responsible for the database. In fact, the database administrator, likely having greater knowledge of the database than general IT personnel, may be the subject matter expert responsible for maintaining the database_down object template without need of a central hierarchical rule base. The object templates are repeatedly refined and fine-tuned in an ongoing process that accounts for changes in the business system and components, as well as for a better understanding of the component being monitored, of the conditions that cause the component to fail, and of the downstream effects that a failure in the component, such as the database, causes.
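As a rough illustration of an object template that carries its own response logic, the following Python sketch models the database_down example. The class name `EventTemplate`, the method `instantiate`, the three placeholder action functions, and the source name `orders_db` are all hypothetical; the text does not specify how templates are implemented, and real actions would call a DBMS utility, a mail server, or a pager gateway rather than return strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EventTemplate:
    """Hypothetical template: an event type plus the response steps its expert defined."""
    event_type: str
    actions: List[Callable[[str], str]] = field(default_factory=list)

    def instantiate(self, source: str) -> List[str]:
        """Create an event for `source` and run each response step in order."""
        return [action(source) for action in self.actions]


# Placeholder actions (assumptions): each just reports what it would have done.
def run_repair_utility(source: str) -> str:
    return f"repair utility invoked for {source}"


def email_administrator(source: str) -> str:
    return f"email sent about {source}"


def page_dba(source: str) -> str:
    return f"page sent to DBA for {source}"


# The DBA, as subject matter expert, maintains this template independently.
database_down = EventTemplate(
    "database_down",
    [run_repair_utility, email_administrator, page_dba],
)

log = database_down.instantiate("orders_db")
```

Because the template owns its actions, the database administrator could change the response steps without touching any template maintained by another expert.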
After object templates are created, correlations are created between cause event objects and effect event objects by users that need not have the same expertise as the subject matter expert that created the object templates. The correlations created between objects enable the objects to logically find one another after an event monitor causes an object to be created. Cause event objects can be directly correlated to effect objects in a one-to-one relationship or multiple cause event objects can be correlated to effect event objects through logical constructs. Multiple cause event objects are correlated to effect event objects using logical constructs such as “OR,” “exclusive or” (XOR), “not OR” (NOR), “AND,” and “not AND” (NAND). In this manner, a predicted effect object can be correlated to multiple causes. By correlating predicted causes with effects and predicted effects with causes, the user can create cause-effect correlations.
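The logical constructs named above can be sketched as a small table of gate functions in Python. This is only an illustration under stated assumptions: the names `GATES` and `effect_predicted` are hypothetical, and XOR over more than two causes is treated here as parity (an odd number of active causes), which is one possible reading that the text does not settle.

```python
from functools import reduce
from operator import xor
from typing import Callable, Dict, List, Set

# Each gate maps a list of booleans (one per cause: is it active?) to a boolean.
GATES: Dict[str, Callable[[List[bool]], bool]] = {
    "OR": lambda v: any(v),
    "AND": lambda v: all(v),
    "NOR": lambda v: not any(v),
    "NAND": lambda v: not all(v),
    # Assumption: multi-input XOR is parity (true when an odd number are active).
    "XOR": lambda v: reduce(xor, v, False),
}


def effect_predicted(gate: str, causes: List[str], active: Set[str]) -> bool:
    """Return True if the gate over the listed causes is satisfied by the active set."""
    inputs = [cause in active for cause in causes]
    return GATES[gate](inputs)


causes = ["link_down", "database_down"]
only_link = {"link_down"}
print(effect_predicted("OR", causes, only_link))   # True: at least one cause is active
print(effect_predicted("AND", causes, only_link))  # False: database_down is not active
```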
Cause events are dynamically correlated to effect events, and effect events are in turn dynamically correlated to cause events, through a subscription mechanism used with the various cause and effect events. In this manner, when created, a given cause event object will send a message to all other objects (i.e., effect event objects) that are subscribed to the given cause event object. Likewise, a given effect event object, when created, will send a message to all other objects (i.e., cause event objects) that are subscribed to the given effect event object. Through the subscription mechanism, cause and effect objects will locate each other, dynamically creating a correlation circuit, whenever a cause or effect object is created in the system, regardless of which object was created first by the dynamic object-oriented event correlation system.
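The order-independence of the subscription mechanism can be illustrated with a minimal Python sketch. The class `CorrelationBus` and its methods `correlate` and `create` are hypothetical names, and the bus tracks only event types rather than full objects; it is meant to show the idea that a circuit forms as soon as both ends exist, whichever was created first, and is not a description of the actual messaging design.

```python
from collections import defaultdict
from typing import DefaultDict, FrozenSet, Set


class CorrelationBus:
    """Hypothetical registry that links cause and effect objects via subscriptions."""

    def __init__(self) -> None:
        # event type -> event types subscribed to it
        self.subscribers: DefaultDict[str, Set[str]] = defaultdict(set)
        # event types that currently have a live object
        self.live: Set[str] = set()
        # links formed so far, stored as unordered pairs
        self.circuits: Set[FrozenSet[str]] = set()

    def correlate(self, cause: str, effect: str) -> None:
        """Subscribe the cause and the effect to each other."""
        self.subscribers[cause].add(effect)
        self.subscribers[effect].add(cause)

    def create(self, event_type: str) -> None:
        """Create an object and notify its subscribers; link to any that already exist."""
        self.live.add(event_type)
        for other in self.subscribers[event_type]:
            if other in self.live:
                self.circuits.add(frozenset((event_type, other)))


# Cause first, then effect.
bus_a = CorrelationBus()
bus_a.correlate("disk_space_full", "database_down")
bus_a.create("disk_space_full")
bus_a.create("database_down")

# Effect first, then cause: the same circuit results.
bus_b = CorrelationBus()
bus_b.correlate("disk_space_full", "database_down")
bus_b.create("database_down")
bus_b.create("disk_space_full")
```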
After an event correlation circuit has been dynamically generated by the system, it is dynamically cleaned up once the events causing the event correlation circuit have been handled. One way to clean up an event correlation circuit is to create a new clean up event that is created by the event monitor once the condition causing the event no longer exists. The cause or effect object is subscribed to the clean up object and responds to the existence of the clean up object by terminating its processing. In addition, objects can be self-monitoring so that they determine when to terminate without needing to receive external commands. In this manner, an object can monitor system characteristics and determine when the event no longer exists and terminate automatically.
Another way an event correlation circuit can be cleaned up is by providing a clean up object that is sent by an administrator to clean up one or more event correlations. In this manner, global or semi-global clean up commands can be issued by the administrator, or by an automated process, cleaning up one or multiple event correlation circuits. A further way of cleaning up an event correlation circuit is provided by a time-based constraint on the individual object or on the event correlation circuit. After a prescribed length of time, the event correlation circuit would terminate. However, if the condition for dynamically creating the event correlation circuit still exists, the event monitors would once again create the appropriate cause and effect objects, recreating the event correlation circuit. On the other hand, if the conditions for the event correlation circuit no longer exist (indicating that the event correlation circuit is no longer applicable to the current state of the system), the event monitors would not re-create the cause and effect event objects and the correlation circuit would simply cease to exist.
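The clean up paths described above (a clean up event for one object, an administrator-issued clean up of a whole circuit, and a time-based constraint) might be sketched in Python as follows. The names `EventObject`, `Circuit`, `clean_up`, and `expire`, as well as the `ttl` field measured in seconds, are assumptions made for illustration; the current time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EventObject:
    event_type: str
    created_at: float            # creation time, in seconds
    ttl: Optional[float] = None  # lifetime in seconds; None means no time limit


class Circuit:
    """Hypothetical event correlation circuit holding live cause and effect objects."""

    def __init__(self) -> None:
        self.objects: Dict[str, EventObject] = {}

    def add(self, obj: EventObject) -> None:
        self.objects[obj.event_type] = obj

    def clean_up(self, event_type: Optional[str] = None) -> None:
        """Remove one object (clean up event) or, with no argument, the whole circuit."""
        if event_type is None:
            self.objects.clear()
        else:
            self.objects.pop(event_type, None)

    def expire(self, now: float) -> List[str]:
        """Remove objects whose time limit has passed; return the removed event types."""
        expired = [
            name for name, obj in self.objects.items()
            if obj.ttl is not None and now - obj.created_at >= obj.ttl
        ]
        for name in expired:
            del self.objects[name]
        return expired


circuit = Circuit()
circuit.add(EventObject("database_down", created_at=0.0, ttl=300.0))
circuit.add(EventObject("disk_space_full", created_at=0.0))
removed = circuit.expire(now=300.0)  # only the object with a time limit is removed
```

If the underlying condition still held after expiry, an event monitor would simply add the object again, which matches the re-creation behavior described in the text.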
Event monitors that monitor for a given event are also object-oriented to allow for dynamic changes to the criteria used by the monitor in monitoring specific events. Data collected by the event monitor are, in turn, used to create a self-examining event monitor object based on dynamic criteria set by the event monitor itself or based upon another system analysis software component. Dynamically adjusting the event monitoring criteria allows for fine tuning the event monitor to the latest needs and capabilities of the computing environment. Event monitoring criteria are supplied by system administrators issuing change requests to event monitoring criteria in addition to dynamic criteria changes performed by the event monitor and other system analysis software. Event criteria changes are alternatively time-based whereupon such changes exist for a certain time interval before reverting back to the original preset criteria.
Event criteria changes are also alternatively based upon other events so that when another event object is either created or destroyed, the event criteria revert back to the original preset values. For example, a disk_full event may normally be triggered when the disk drive event monitor senses that the disk drive is 90% full. Based on a self-assessment, the disk drive event monitor may reset its own threshold to 95% if it determines the system processor is operating below a certain utilization level (i.e., little swap space is being utilized). Alternatively, a system administrator or external process may raise the threshold temporarily to 95% when running an application that will create large temp files that will be erased within a certain time period.
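The disk_full example, with a temporary threshold change that reverts to the preset value, can be sketched in Python as below. The class `DiskMonitor` and its methods are hypothetical, the 60-second duration is an arbitrary illustrative value, and time is passed in as a plain number of seconds so the behavior is easy to follow.

```python
from typing import Optional, Tuple


class DiskMonitor:
    """Hypothetical disk event monitor with a preset threshold and a timed override."""

    def __init__(self, preset: float = 90.0) -> None:
        self.preset = preset
        # (threshold, expires_at) while a temporary change is in force, else None
        self.override: Optional[Tuple[float, float]] = None

    def raise_threshold(self, threshold: float, now: float, duration: float) -> None:
        """Apply a temporary threshold that lapses `duration` seconds after `now`."""
        self.override = (threshold, now + duration)

    def threshold(self, now: float) -> float:
        """Return the threshold in force at `now`, reverting to the preset on expiry."""
        if self.override is not None and now < self.override[1]:
            return self.override[0]
        self.override = None
        return self.preset

    def disk_full(self, percent_used: float, now: float) -> bool:
        """Report whether a disk_full event should be raised."""
        return percent_used >= self.threshold(now)


monitor = DiskMonitor()
before = monitor.disk_full(92.0, now=0.0)    # preset 90% applies
monitor.raise_threshold(95.0, now=0.0, duration=60.0)
during = monitor.disk_full(92.0, now=10.0)   # temporary 95% applies
after = monitor.disk_full(92.0, now=61.0)    # override lapsed, preset applies again
```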
The dynamic object-oriented cause/effect correlations and dynamic monitoring criteria provide distinct advantages over the hierarchical, rule-based event correlation managers found in traditional systems. The dynamic object-oriented cause/effect correlations are better able to manage complex business systems than static, hierarchical rule-based correlation managers. Dynamic event monitors also provide a more flexible and adaptable filtering mechanism for monitoring events in complex business systems than traditional event monitors.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.