1. Field of the Invention
The present invention relates generally to expert systems and knowledge management and, more particularly, to systems and methods for assisting an operator where real-time support and automatic decision-making may be required.
2. Background and Related Art
It is known in the art that an expert system is a computer program intended to embody the knowledge and the ability of a human expert in a certain domain.
The objective of an expert system is to resolve a problem or to give advice on how to resolve it. It can be, for instance, a system that answers a question from a non-expert user or that reacts to an event. Generally, an expert system requires knowledge and data. Knowledge comprises a set of rules that act upon data to accomplish the objectives of the system. Data represents facts and information concerning the specific domain in which the expert system runs. When reacting to an event, an expert system must respond reliably and quickly to treat the ongoing situation. This is particularly true when situations are continually changing. The expert system detects the event and determines the applicable actions in accordance with the class of the event and/or the circumstances in which the event appears. Then, the expert system evaluates the effects of its selected action and quickly initiates the event response mechanisms accordingly. Current expert systems provide a solution for a specific circumstance if the corresponding scenario exists. The computer normally applies heuristics and rules in a specific knowledge domain to render advice or make recommendations, much like a human expert would. Expert systems have managed to achieve a fairly high level of performance in task areas that require a good deal of specialized knowledge and training. Often they perform tasks that are complex, tedious, or expensive to have a non-expert human perform.
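By way of illustration only, the relationship between knowledge (rules) and data (facts) described above can be sketched as a minimal forward-chaining loop. The names Rule and run_rules, and the sample disk-usage facts, are hypothetical and are not taken from any particular expert system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: Rule, run_rules, and the sample facts are
# illustrative assumptions, not part of any specific expert system.


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, object]], bool]  # tests the current facts
    action: Callable[[Dict[str, object]], None]     # updates the facts


def run_rules(rules: List[Rule], facts: Dict[str, object]) -> List[str]:
    """Fire each rule at most once, repeating until no further rule applies."""
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name not in fired and rule.condition(facts):
                rule.action(facts)
                fired.append(rule.name)
                progress = True
    return fired


rules = [
    # Gives advice once another rule has classified the situation.
    Rule(
        name="recommend_cleanup",
        condition=lambda f: f.get("disk_state") == "critical",
        action=lambda f: f.update(advice="free disk space"),
    ),
    # Derives a new fact from raw data.
    Rule(
        name="classify_disk",
        condition=lambda f: f.get("disk_used_pct", 0) > 90,
        action=lambda f: f.update(disk_state="critical"),
    ),
]

facts = {"disk_used_pct": 95}
fired = run_rules(rules, facts)
```

In this sketch the rules act only on the facts dictionary, so the order in which the rules are listed does not affect the advice finally produced.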
Event Management Systems used to monitor and manage data centers work like “event expert systems”, specialized in the management of data center events. They receive events that they must analyze and to which they must react according to rules. To work effectively, they have their own representation of the environments they must monitor, through a data model like the one provided by the standard Common Information Model (CIM), describing the detailed information needed to monitor systems, networks and applications.
The Common Information Model (CIM) is an open standard that defines how managed elements in an Information Technology (IT) environment are represented as a common set of objects and relationships between them.
However, present event expert systems only manage events for which proven solutions exist and do not permit convenient management of an unexpected or unknown event (i.e., one that occurs for the first time) and/or recurrent events reappearing after event screening. The detection of such unresolved events triggers alerts to the operator console. In that sense, an alert is an event that could not be resolved by the event expert system.
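The distinction between an event and an alert can be sketched as follows, under stated assumptions. The event classes, the resolution handlers, and the operator_console list are hypothetical placeholders and do not describe any real Event Management System.

```python
from typing import Callable, Dict, List

# Hypothetical sketch: the event classes, handlers, and console list below
# are illustrative assumptions, not part of a real Event Management System.

Event = Dict[str, str]

# Known event classes mapped to their resolution handlers. A handler
# returns True when the event was resolved at the system level.
KNOWN_RESOLUTIONS: Dict[str, Callable[[Event], bool]] = {
    "service_down": lambda e: True,   # e.g., an automatic restart succeeds
    "disk_full": lambda e: False,     # e.g., cleanup is attempted but fails
}

# Alerts awaiting human intervention.
operator_console: List[Event] = []


def handle_event(event: Event) -> str:
    """Resolve a known event, or escalate it to the operator as an alert."""
    handler = KNOWN_RESOLUTIONS.get(event["class"])
    if handler is not None and handler(event):
        return "resolved"
    # Unknown class, or a known class that could not be resolved.
    operator_console.append(event)
    return "alert"


results = [
    handle_event({"class": "service_down", "host": "h1"}),
    handle_event({"class": "fan_failure", "host": "h2"}),  # unknown class
    handle_event({"class": "disk_full", "host": "h3"}),    # unresolved
]
```

Here the first event is handled without operator involvement, while the other two reach the console because no resolution succeeded.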
Managing alerts differs from managing events. Data models, such as the one provided by CIM for IT environments, are useless to IT operators in managing alerts. IT operators have, indeed, their own representation of the environment being monitored, built on other concepts. For instance, IT operators usually do not handle detailed technical information like IP addresses, but rather use the name of the application and the customer to identify the resolution action to be taken. IT operators, therefore, need a certain degree of common sense to interpret the information carried by the alert, to identify it unambiguously and, finally, to make the correct decision. Failure to monitor and address alerts can jeopardize system performance and management of the environment, particularly when monitoring data centers. The purpose of a data center is to host and run applications that handle the business (be it a core or a secondary business) and data of the organization, like operational data and/or decisional data and/or transient and/or audit data and so on.
Generally, a data center contains a set of servers, storage, firewalls, routers and switches that transport traffic between the servers and to/from the outside world. Some of the applications are composed of multiple components (like file servers, application servers, database servers and the like) running on multiple hosts. Some applications also make use of several infrastructure servers (e.g., LDAP, mail relays, load balancers). A complex modern data center hosts infrastructures made of shared, clustered and/or virtualized systems running multiple applications (such as ERP packages) and subsystems (such as database instances or transaction managers) for multiple customers, geographically dispersed, supported by multiple teams of systems engineers. In such an environment, subsystems do not always run on the same dedicated host, and every subsystem could serve several applications for several customers. Operators have to deal with this challenging complexity when analyzing and handling alerts issued from data centers.
Normally, when a recognized event occurs, the event expert system (e.g., an Event Management System monitoring the data center) manages it at the system level only, without interaction with the rest of the environment.
Unlike events, alerts require human intervention. When the event expert system triggers an alert to the operator console, the alert is interpreted outside the system level by the operator, and is handled by appropriate recovery actions.
The recovery actions consist of editing any of the alert messages before implementing a solution and, if necessary, cancelling them all. To achieve this, the operator uses console procedures.
The operator starts certain tasks to recover from the alert error by applying recovery concepts provided in an operator step-by-step guide, or the operator relies on a predefined set of decisions described in the guide that emphasize the actions to be run. Those skilled in the art will recognize numerous forms of action support for assisting the operator throughout the recovery process.
Depending on the complexity of the data center, various alert errors can appear simultaneously. Some of them can be unknown to the operator and/or not clearly indexed when searching for an adequate solution in the operating manual. In addition, the alert message and the solution provided by the operating manual may be subject to interpretation, which introduces a risk in the solution assessment. Moreover, the adopted solutions may be subject to uncertainty about the underlying alert error that the operator tries to examine, since some of them may be obsolete because of new technology systems. Thus, it may be impossible to respond rapidly to the alert error, and the action attempted may no longer be relevant.
To summarize, the aforementioned methods present several drawbacks. For example:
The information carried by alerts issued from an Event Management System does not match the concepts used in the operator's reasoning.
Existing alert recovery makes it difficult to find out what the problem is and what to do about it.
Existing operating manuals present a risk in the identification of an alert and in the solution assessment when used in a complex data center.
The solution presented by the operating manual may be obsolete when an unknown alert error is generated. The operating manual contains documentation to help identify an alert (such as, for example, the DB instance ‘xx’ on the IP address ‘zz’ runs for the customer ‘cc’). When a system arrangement moves from one configuration to another, the operating manual may quickly become obsolete and thus no longer reflect the organizational changes. Thus, a well-known alert may be transformed into an unknown alert that needs to be handled.
The action attempted is often no longer relevant in a case of paramount necessity and emergency.
There is no adequate assistance given to the operator when multiple alert errors appear simultaneously, which slows down the process of identifying the alert, finding the solution procedure, and applying the solution.
These drawbacks are made worse in a virtualized environment, where the resources are shared among several applications and customers, thereby making the data center more complex to manage and monitor.
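The obsolescence drawback noted above, in which the operating manual documents which customer a given DB instance and IP address serve, can be illustrated with a minimal sketch. The manual table and the identify_customer function are hypothetical, and the values reuse the placeholders ‘xx’, ‘zz’ and ‘cc’ from that example; ‘yy’ is an assumed new address introduced here for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical data: instance names, addresses, and customers are
# placeholders in the spirit of the 'xx' / 'zz' / 'cc' example.

# Operating-manual entries: (DB instance, IP address) -> customer.
manual: Dict[Tuple[str, str], str] = {
    ("xx", "zz"): "cc",
}


def identify_customer(instance: str, ip: str) -> Optional[str]:
    """Return the customer documented for this instance and address, if any."""
    return manual.get((instance, ip))


# Before reconfiguration, the alert is identified from the manual.
before_move = identify_customer("xx", "zz")

# The instance moves to an assumed new address 'yy', but the manual is not
# updated, so the same alert can no longer be identified.
after_move = identify_customer("xx", "yy")
```

In this sketch the lookup succeeds before the move and returns no customer afterwards, which corresponds to a well-known alert becoming an unknown one.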
The present invention offers solutions to the aforementioned problems. Such solutions will become more apparent from the following description.