The present invention relates to distributed event management in telecommunication and data networks, and more particularly to the use of knowledge-based and distributed systems technologies for performing event correlation and notification for network fault, performance and test management.
Since the first computer network came "online" there have been network problems, disorders and anomalies that periodically occur in the network hardware, software, or both. They are sometimes spurious, transient, redundant, time correlated, or too numerous to be handled at the same time. Given the size and dynamic nature of modern telecommunication and data networks, it is no wonder that the task of identifying network problems continues to baffle software engineers the world over. Exacerbating the problem is the reality that a single fault may sometimes result from a hardware problem and other times from a software problem. With the explosive growth in the size and complexity of networks, it is also not uncommon for a burst of alarms during a major network failure to reach 100, 200 and more alarms per second. Under these conditions, systems personnel of all experience levels confront an inability to follow the stream of incoming events, often leading to alarms being noticed too late, or not at all. When the alarms are eventually noticed, all too often corrective measures are determined based on a single alarm or on an incomplete subset of the active alarms, potentially complicating the already onerous situation.
Such delays can be costly in large networks, which are heavily relied upon to quickly move vast amounts of data in short periods of time to carry out the normal course of business. For example, large financial institutions rely upon such systems to reflect the transfer of large sums of money electronically. Loss of that ability even for a relatively short period of time may be very costly to the institution and its clients. Similarly, airlines rely upon such systems to track passenger reservations. Loss of that ability can result in flight delays or cancellations and loss of customers.
In an effort to assist network management personnel in resolving these problems, a variety of network management systems to monitor network operations have been developed. These systems were generally capable of performing network surveillance and monitoring functions, and in some cases they were able to diagnose simple network faults.
As the size and complexity of networks grew, it became clear that the traditional network management systems could no longer simply report problems, and instead required intelligent analysis and diagnostic capabilities in order to be effective. Such a system must monitor network events, associate related events with each other, infer possible root causes of events, determine the impact of events on network traffic, present the current state of the network, and recommend appropriate actions. In other words, the network management systems must exhibit some level of intelligence in analyzing the incoming events, understanding the surrounding management context, testing connectivity between network elements, identifying patterns in the stream of events, and suggesting corrective actions. The systems should be able to explain their actions, learn from their past behavior, and present the results in a form easily comprehensible by the network management personnel. To a very large extent, many of the functions listed above are based on a fundamental capability of real-time event correlation. Formally, event correlation is a conceptual interpretation procedure that assigns new meaning to a set of events. Algorithmically, event correlation is a dynamic pattern matching process over a stream of events. These events may include: raw events, status and clear messages from network elements (NEs); events from mediation devices, subnetwork management systems, test systems, environmental sensors and other equipment; user action messages from network operator terminals; and system interrupts. In addition to the real-time events, the correlation patterns may include network topology information (e.g., network connectivity), diagnostic test data, data from external databases, and other ancillary information.
Event correlation enables several event management tasks, including: (1) reducing information load by dynamic focus monitoring and context-sensitive event suppression and filtering; (2) increasing the semantic content of information through generalization of events; (3) fusion of information from multiple sources; (4) real-time fault detection, causal fault diagnosis, and suggestion of corrective actions; (5) ramification analysis of events and prediction of system behavior; and (6) long-term trending of historic events.
Real-time event correlation has been used for well over a decade with applications in various fields, not the least of which is network management. Today, event correlation has become one of the most critical functions for managing the high volume of event messages. Practically speaking, no network management system can effectively conduct network surveillance and control procedures without some form of event correlation. In fact, event correlation has become so instrumental in identifying obscure network problems that network management software developers have begun to broaden the utility of event correlation to other aspects of network management, such as performance, configuration, testing, security, and service quality management.
An event, in the context of event correlation, reflects a change in the state of an object, system or process. System internal events, e.g. failures, may be manifested by associated external events, namely alarms. However, in very many cases internal failures are not signaled by any alarms at all. The opposite phenomenon arises when too many alarms are generated by cascaded network element failures caused by a single root failure. In this situation appropriate alarm correlation and filtering methods should be applied in order to detect the root cause of the "alarm storm". Event correlation is the process of observing a series of events that occur over a period of time and then interpreting the events. The act of interpreting the events ranges from a simple task of event compression to a complex pattern-matching operation.
A more detailed discussion of the specific classes of event correlation will now be provided with reference to FIG. 1. As shown in FIG. 1, the classes of event correlation include: compression, filtering, suppression, count, escalation, generalization, specialization, temporal relation, and clustering. Event compression is the task of reducing multiple occurrences of identical events into a single representation of the events. The number of occurrences of the event is not taken into account. The meaning of the compression correlation is almost identical to the single event "a," except that additional contextual information is assigned to the event to indicate that this event happened more than once.
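By way of illustration only, event compression may be sketched in Python as follows. The (source, kind) tuple layout and the function name compress are assumptions made for this example and are not part of the described system.

```python
def compress(events):
    """Collapse identical events into a single representative each.

    Each event is a hashable (source, kind) tuple. The result pairs every
    distinct event with a flag telling whether it occurred more than once;
    the number of occurrences is deliberately not kept.
    """
    repeated = {}  # dicts preserve insertion order, so first arrival wins
    for event in events:
        repeated[event] = event in repeated
    return list(repeated.items())


alarms = [("sw1", "LINK_DOWN"), ("sw2", "LOS"), ("sw1", "LINK_DOWN")]
compressed = compress(alarms)
```

Here the repeated LINK_DOWN alarm from sw1 is reported once, flagged as having happened more than once.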
Event filtering provides that if a parameter p(a) (e.g., priority, type, etc.) of alarm "a" does not fall into the set of predefined values H, then alarm a is discarded or sent to a log file. In more sophisticated cases, the value of H could be dynamic and depend on user-specified criteria or on criteria calculated by the system.
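By way of illustration only, the filtering test on p(a) against the set H may be sketched as follows. The dictionary-based alarm layout and the names filter_alarms, parameter and allowed are assumptions made for this example.

```python
def filter_alarms(alarms, parameter, allowed, log):
    """Keep alarms whose parameter value lies in `allowed` (the set H).

    Alarms that fall outside H are appended to `log` instead of being
    passed on, mirroring the "discarded or sent to a log file" branch.
    """
    passed = []
    for alarm in alarms:
        if parameter(alarm) in allowed:
            passed.append(alarm)
        else:
            log.append(alarm)
    return passed


alarms = [{"id": 1, "priority": "critical"}, {"id": 2, "priority": "info"}]
log = []
kept = filter_alarms(alarms, lambda a: a["priority"], {"critical", "major"}, log)
```

Because H is passed in as an argument, a caller could recompute it between invocations to obtain the dynamic behavior mentioned above.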
Event suppression is a context-sensitive process in which event "a" is temporarily inhibited depending on the dynamic operational context C of the network. The context C is determined by the presence of other event(s), network management resources, management priorities, or other external requirements. A change in C could later lead to the future reporting of the suppressed event. Temporary suppression of multiple events and the control of the order of their exhibition are two techniques for dynamic focus monitoring of the network management process.
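By way of illustration only, context-sensitive suppression may be sketched as follows. The class name Suppressor, the event kinds, and the simplification of the context C to a set of active event kinds are assumptions made for this example.

```python
class Suppressor:
    """Hold back events while a suppressing condition is part of the context C."""

    def __init__(self, inhibits):
        self.inhibits = inhibits  # event kind -> context kind that suppresses it
        self.context = set()      # currently active context kinds (C)
        self.held = []            # suppressed events awaiting a change in C

    def offer(self, kind):
        """Return the events to report now; an inhibited event is held instead."""
        if self.inhibits.get(kind) in self.context:
            self.held.append(kind)
            return []
        return [kind]

    def set_context(self, kind, active):
        """Update C and return any held events that are no longer inhibited."""
        if active:
            self.context.add(kind)
        else:
            self.context.discard(kind)
        released = [k for k in self.held if self.inhibits.get(k) not in self.context]
        self.held = [k for k in self.held if self.inhibits.get(k) in self.context]
        return released


suppressor = Suppressor({"PORT_DOWN": "CARD_FAILURE"})
suppressor.set_context("CARD_FAILURE", True)
reported_now = suppressor.offer("PORT_DOWN")          # held while the card failure is active
reported_later = suppressor.set_context("CARD_FAILURE", False)
```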
Count is the process of counting and thresholding the number of repeated arrivals of identical events. Event escalation assigns a higher value to a parameter p′(a) (usually the priority) of event a, depending on the operational context, e.g., the number of occurrences of event a in a given period of time or the number of occurrences of event a while event b is not also occurring.
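By way of illustration only, counting with a threshold and the resulting escalation of priority may be sketched as follows. The class name Escalator, the sliding time window, and the choice to raise the priority by one step are assumptions made for this example.

```python
from collections import deque


class Escalator:
    """Count arrivals of one event in a sliding window and escalate its priority."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # arrivals needed within the window
        self.window = window        # window length, in the same units as timestamps
        self.arrivals = deque()

    def observe(self, timestamp, priority):
        """Record one arrival and return the (possibly escalated) priority."""
        self.arrivals.append(timestamp)
        # Forget arrivals that have fallen out of the window.
        while timestamp - self.arrivals[0] > self.window:
            self.arrivals.popleft()
        if len(self.arrivals) >= self.threshold:
            return priority + 1
        return priority


escalator = Escalator(threshold=3, window=10.0)
priorities = [escalator.observe(t, priority=1) for t in (0.0, 5.0, 9.0, 30.0)]
```

The third arrival falls inside the window and is escalated; the fourth arrives after the window has emptied and keeps its original priority.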
Event generalization is a correlation in which event a is replaced by its super class b. Event generalization has a potentially high utility for network management because it allows a system manager to change from a low-level perspective of network events and view situations from a higher level.
Event specialization is the opposite of event generalization. It substitutes an event with a more specific subclass of the event.
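By way of illustration only, generalization and specialization over an event class hierarchy may be sketched as follows. The class names in SUPERCLASS and the refinement tests are hypothetical and are not drawn from any particular network.

```python
SUPERCLASS = {  # hypothetical event class hierarchy: subclass -> superclass
    "DS1_LOS": "LINK_FAILURE",
    "DS3_LOS": "LINK_FAILURE",
    "LINK_FAILURE": "NETWORK_FAULT",
}


def generalize(kind):
    """Replace an event class with its superclass, if one is known."""
    return SUPERCLASS.get(kind, kind)


def specialize(kind, event, refinements):
    """Replace an event class with the first subclass whose test accepts the event."""
    for subclass, test in refinements.get(kind, []):
        if test(event):
            return subclass
    return kind


refinements = {"LINK_FAILURE": [("DS3_LOS", lambda ev: ev.get("rate") == "DS3")]}
general = generalize("DS1_LOS")
specific = specialize("LINK_FAILURE", {"rate": "DS3"}, refinements)
```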
Temporal relations (T) between events a and b allow them to be correlated depending on the order and time of their arrival.
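By way of illustration only, one possible temporal relation, "b follows a within a maximum gap", may be sketched as follows. The function name followed_within and the (timestamp, kind) tuple layout are assumptions made for this example; other relations on order and arrival time could be expressed similarly.

```python
def followed_within(events, a, b, max_gap):
    """Return (t_a, t_b) pairs where event b arrives after event a within max_gap.

    `events` is an iterable of (timestamp, kind) tuples in any order. Each
    occurrence of a is matched with at most one later occurrence of b.
    """
    pairs = []
    last_a = None
    for timestamp, kind in sorted(events):
        if kind == a:
            last_a = timestamp
        elif kind == b and last_a is not None and timestamp - last_a <= max_gap:
            pairs.append((last_a, timestamp))
            last_a = None
    return pairs


stream = [(0, "a"), (2, "b"), (10, "a"), (20, "b")]
matches = followed_within(stream, "a", "b", max_gap=5)
```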
Finally, event clustering allows the creation of complex correlation patterns using the logical operators AND, OR, and NOT.
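By way of illustration only, clustering patterns built from logical operators may be sketched as composable predicates over the set of currently active events. The helper names present, all_of, any_of and none_of, and the event kinds, are assumptions made for this example.

```python
def present(kind):
    """Pattern that matches when `kind` is among the active events."""
    return lambda active: kind in active


def all_of(*patterns):  # logical AND
    return lambda active: all(p(active) for p in patterns)


def any_of(*patterns):  # logical OR
    return lambda active: any(p(active) for p in patterns)


def none_of(pattern):   # logical NOT
    return lambda active: not pattern(active)


# Hypothetical pattern: a loss of signal on either side with no maintenance flag.
fiber_cut = all_of(
    any_of(present("LOS_EAST"), present("LOS_WEST")),
    none_of(present("MAINTENANCE")),
)
matched = fiber_cut({"LOS_EAST"})
masked = fiber_cut({"LOS_EAST", "MAINTENANCE"})
```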
One approach for correlating events in complex systems is to implement a rule-based expert system to monitor event flow. Rule-based expert systems generally contain two components: (1) a working memory which represents knowledge of the current state of the system being monitored; and (2) a rule base which contains expert knowledge in the form of "condition-action" rules, also known as "if-then" rules. The condition part of each rule determines whether the rule can be applied based on the current state of the working memory. It contains relations that are applied to objects or groups of object slots or tests. Within the object slots we can apply math expressions and use arithmetic relations (greater than '>', less than '<', equal to '=', greater than or equal to '>=', less than or equal to '<=', and not equal to '!='). The action part of a rule contains executable commands, such as: (1) assert, which creates a new correlation; (2) support, which adds support for an existing correlation; (3) clear, which kills the correlation and removes it from consideration; (4) load, which requests data from a source; and (5) modify, which changes state or other slot values. In other words, the condition part of each rule determines whether (or "if") the rule can be applied based on the current state of the working memory; and the action part of a rule contains a conclusion ("then") which can be drawn from the rule when the condition is satisfied. A rule either recognizes some event or combination of events, or performs some correlation management function. Thus, a rule may assert, resolve, or close some other correlations. It may load a portion of the network or modify the state of a working memory network element.
Creating a correlation may invoke some defined function or script or send a notification to external systems.
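By way of illustration only, the working memory and condition-action rules described above may be sketched as follows. The class name RuleEngine, the single-pass evaluation strategy, and the slot and correlation names are assumptions made for this example; only the assert and clear actions are shown.

```python
class RuleEngine:
    """Minimal working memory plus a rule base of (condition, action) pairs."""

    def __init__(self):
        self.memory = {}           # working memory: NE name -> slot values
        self.correlations = set()  # currently asserted correlations
        self.rules = []            # the rule base

    def add_rule(self, condition, action):
        self.rules.append((condition, action))

    def run(self):
        """Fire every rule whose condition holds; return how many fired."""
        fired = 0
        for condition, action in self.rules:
            if condition(self):
                action(self)
                fired += 1
        return fired


engine = RuleEngine()
# "if" part: an arithmetic relation on an object slot; "then" part: assert.
engine.add_rule(
    lambda e: e.memory.get("sw1", {}).get("errors", 0) >= 5,
    lambda e: e.correlations.add("sw1_degraded"),
)
# A second rule clears the correlation once the slot returns to zero.
engine.add_rule(
    lambda e: e.memory.get("sw1", {}).get("errors", 0) == 0
    and "sw1_degraded" in e.correlations,
    lambda e: e.correlations.discard("sw1_degraded"),
)
engine.memory["sw1"] = {"errors": 7}
fired = engine.run()
```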
Event correlation systems accordingly require a sophisticated event notification method that provides an adaptable, smooth flowing reporting mechanism. These systems must also enable network management personnel to quickly analyze problems and then determine the optimal solution for restoring data flow.
One drawback of conventional event correlation systems relates to the heterogeneous nature of the networks on which they operate. Multiple protocols, data formats and transmission mediums make the identification, correlation and notification of events to geographically dispersed network elements extremely troublesome even for the most robust systems.
Another drawback of the current event correlation systems is the fact that most event correlation capabilities exist as "post factum" solutions. That is, they exist either as built-in extensions to existing management systems, or as stand-alone external systems with weak integration, cooperation and resource sharing with other components of the network management software.
Overcoming these drawbacks requires a network management system to perform several functions including: monitor network events, associate related events with each other, infer possible root causes of faults, determine the impact of events in terms of customer traffic, present the current state of the network to various network entities, and recommend appropriate actions in a minimum time. Overcoming the current drawbacks also requires that the developed event correlation systems operate as an integral part of next-generation network management systems, as opposed to afterthought add-ons.
Systems and methods consistent with this invention create a global real-time advanced correlation environment (GRACE) that provides real-time event correlation, explanation and notification capabilities in a network management environment. GRACE is a knowledge-based event correlation system for efficiently correlating a plurality of network events and then transmitting correlated (derived) messages to various network management entities in response to an occurrence of a particular network event. The GRACE system comprises multiple distributed services that communicate via a uniform CORBA interface. The services are divided into real-time event management services and interactive knowledge/data management services. This division in the GRACE system architecture supports the need to provide fast channels for real-time event processing, while making interactive services available on an on-call basis, to provide required knowledge, models, procedures and data in support of the real-time processes.
In a preferred embodiment, the real-time services include: Network Mediation, Message Parsing, Event Correlation, and Event Notification Services. The interactive services include: Network Topology and Database Services.
The Mediation Service provides connectivity to the elements of the managed networks, such as switches, digital cross-connects, routers, etc. The incoming raw events (messages) are parsed by the Parsing Service. The Correlation Service performs the functions of real-time event pattern matching, processes event objects, topology and other data, and executes predetermined actions as described by the correlation rules.
The Event Notification Service plays a special role in the architecture by facilitating communication between the real-time components of the architecture. It enables sophisticated event passing interfaces between distributed objects, namely the producers and consumers of events. The interfaces are mediated via event channels that allow decoupling of producers and consumers in the sense that they possess no knowledge about each other. The CORBA standard for the Notification Service, the OMG's COSNotification Service, defines several important features of the Notification Service, including asynchrony, event subscription, multicast event routing, event filtering, quality of service, and structured events. The output of one channel can be chained to the inputs of another channel to create a notification chain. Each of the nodes in a notification chain may cache events, take actions, perform some transformation on the events, and forward them along the chain. Services may, in turn, select relevant events via filters. It becomes easier to replace these chained services with newer or alternate versions because the interaction is decoupled. It is easy to add supporting functions such as validation by creating a service and having it subscribe to a pre-existing channel.
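By way of illustration only, the decoupling, filtering and chaining of event channels may be sketched in-process as follows. This sketch does not use CORBA or the COSNotification interfaces; the class name Channel and its methods subscribe, push and chain are assumptions made for this example.

```python
class Channel:
    """In-process stand-in for an event channel with per-subscriber filters."""

    def __init__(self, transform=None):
        self.transform = transform or (lambda event: event)
        self.subscribers = []  # (filter, callback) pairs

    def subscribe(self, callback, accepts=lambda event: True):
        """Register a consumer; it only receives events its filter accepts."""
        self.subscribers.append((accepts, callback))

    def push(self, event):
        """Transform the event, then deliver it to every accepting subscriber."""
        event = self.transform(event)
        for accepts, callback in self.subscribers:
            if accepts(event):
                callback(event)

    def chain(self, downstream, accepts=lambda event: True):
        """Feed this channel's output into another channel's input."""
        self.subscribe(downstream.push, accepts)


raw = Channel()
parsed = Channel(transform=lambda event: {**event, "parsed": True})
raw.chain(parsed)

received = []
parsed.subscribe(received.append, accepts=lambda event: event["severity"] == "major")
raw.push({"severity": "major"})
raw.push({"severity": "minor"})
```

The producer pushing into raw has no reference to the consumer appending to received, so either side could be replaced without changing the other.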
One of the most fundamental changes in the architecture of telecommunication and data network management systems is the move from embedded, monolithic, and loosely coupled architectures toward distributed, open, component-based architectures. The use of standard services (components) with well-defined functionality and standard inter-component communication protocols allows the building of open, scalable, and customizable systems. The encapsulation of the idiosyncrasies of components and the easy addition, replication, and replacement of components provide an effective environment for developing multi-paradigm, fault-tolerant, and high-performance systems. Various middleware technologies can be used for building the infrastructure of distributed network event management systems, including CORBA, DCOM, and Java RMI. While this specification describes the system as implementing the CORBA technology, it is important to note that the principles of component-based services proposed herein will be true for other middleware implementations.
The basic framework for component-based service is envisioned as a multilevel hierarchy of services, where services at a higher level are built from component services. As shown in FIG. 2, the present invention utilizes five levels of these systems: system, domain, application, customer and integrated services.
The System Services include the set of services that define basic functions to identify objects, to store and retrieve them, and to define relations and processes between them. Examples of these services are CORBA system services, such as COSNaming, COSEvent, COSNotification, COSProperty, COSLog and others. In addition, the System Level Services might include scripting services, e.g., Tcl, Perl, and Java scripting services. The System Services form the core set of distributed services that are used for building the next level of Domain Services.
The Domain Services layer contains services, whose functionality and implementation are oriented toward specific domain tasks. Some of the most frequently used Domain Services include Event Interpretation, Event Correlation, Configuration (Topology), OLAP (On-Line Analytical Processing), Data Visualization, and Data Mediation Services.
Application Level Services are significant operational components built from the Domain Level Services. They perform (system, network and service) surveillance, alarm and fault management, quality of service (QoS) management, billing, and other application oriented functions.
Customer Level Services include a functionally complete set of services, which have value from a customer perspective. Integrated Services are packages combined from the Customer Level Services.
The general event correlation/management system architecture is built upon the distributed services (components) discussed above. In the preferred embodiment of the subject invention, the following generic features of the architecture have been implemented: (1) encapsulation of implementation idiosyncrasies of the different components; (2) the use of standard event specifications and event passing protocols; and (3) adoption of a common knowledge/data transportation format (XML).
These features permit one to build customized management systems of different functionality, scale, and complexity. Different instances of the domain level services can be used, as long as they all satisfy overall functional and data semantic constraints. For performance or functional reasons, multiple processes of the same service could be launched. For example, a hierarchy of event correlation processes could be created. This hierarchy could be used to implement a multilevel system management paradigm, e.g., to implement local and global correlation functions.
In accordance with one aspect of the present invention, users are permitted to define correlation rules graphically as finite state machines (FSM). This is particularly useful in situations where the entire problem set naturally lends itself to a finite state representation. Each FSM has a finite number of states and changes from one state to another when an input or stimulus is applied to the machine. A state is defined as a stable condition in which the entity or FSM rests until the next stimulus or input is applied. Each input also causes the FSM to generate an observable output. In this case, FSMs are manifested by a set of state values associated with a given NE slot and a set of transitions and associated patterns for moving between these states. FSMs may be implemented as multiple rules but managed as a single object. Rule condition patterns will be associated with transitions between states. These rules may not be directly visible to the user, and they will typically indicate the current state as the first condition and the desired goal state as the action. New messages, timeouts, or other asynchronous events may drive the state machine to other states. The states of a specific state machine will be stored as a slot value of an NE. Different state machines may exist for a given NE but will use different state slots. Because the values of these state slots will be visible outside of the state machine, it is possible to implement nested machines or a machine driven by the states of multiple NEs or state machines. It is also possible with the existing network model to define an NE which only contains global states. The NE class describes the NE types that exist in the domain, and is used to describe the actual NEs.
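By way of illustration only, a finite state machine whose state is stored in a slot of an NE may be sketched as follows. The class name StateMachine, the dictionary representation of an NE, and the states, stimuli and outputs shown are assumptions made for this example.

```python
class StateMachine:
    """FSM whose current state lives in a named slot of a network element (NE)."""

    def __init__(self, slot, initial, transitions):
        self.slot = slot                # NE slot that stores this machine's state
        self.initial = initial
        self.transitions = transitions  # (state, stimulus) -> (next_state, output)

    def apply(self, ne, stimulus):
        """Apply a stimulus to the NE, update its state slot, return the output."""
        state = ne.setdefault(self.slot, self.initial)
        # An unmatched stimulus leaves the state unchanged and yields no output.
        next_state, output = self.transitions.get((state, stimulus), (state, None))
        ne[self.slot] = next_state
        return output


link_fsm = StateMachine(
    slot="link_state",
    initial="up",
    transitions={
        ("up", "LOS"): ("suspect", "start_timer"),
        ("suspect", "CLEAR"): ("up", "cancel_timer"),
        ("suspect", "TIMEOUT"): ("down", "raise_alarm"),
        ("down", "CLEAR"): ("up", "clear_alarm"),
    },
)
ne = {"name": "sw1"}  # the NE modeled as a plain dictionary of slots
outputs = [link_fsm.apply(ne, s) for s in ("LOS", "TIMEOUT", "LOS", "CLEAR")]
```

Because the state is an ordinary slot value on the NE, a second machine using a different slot name could read link_state as part of its own transition conditions.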
In accordance with another aspect of the present invention, users are permitted to establish rule sets which are collections of rules, FSMs and other rule sets. This allows named subsets of the global knowledge base to be created. Rule sets may be assigned priorities that may be used to prefer rules in a specialized set over those in a default set. Rule sets may also contain other meta information such as creator, modification date, textual description, etc. Consistency checks may be performed for a rule set to ensure compatibility between selected rules.
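By way of illustration only, nested rule sets with priorities may be sketched as follows. The class name RuleSet, the convention that a larger number means a higher priority, and the rule names are assumptions made for this example.

```python
class RuleSet:
    """Named collection of rule names and nested rule sets, with a priority."""

    def __init__(self, name, priority=0, members=(), description=""):
        self.name = name
        self.priority = priority      # assumed: larger value is preferred
        self.members = list(members)  # rule names (str) or nested RuleSet objects
        self.description = description

    def rules(self):
        """Return (priority, rule_name) for every rule, nested sets included."""
        found = []
        for member in self.members:
            if isinstance(member, RuleSet):
                found.extend(member.rules())
            else:
                found.append((self.priority, member))
        return found


def ordered(rule_set):
    """List rule names with those from higher-priority sets first."""
    return [name for _, name in sorted(rule_set.rules(), key=lambda pr: -pr[0])]


default = RuleSet("default", priority=0, members=["log_all"])
optical = RuleSet("optical", priority=5, members=["fiber_cut"])
knowledge_base = RuleSet("global", priority=1, members=[default, "dedupe", optical])
preference = ordered(knowledge_base)
```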
In accordance with yet another aspect of the present invention, event correlation methodologies are applied to the task of information management on the Internet. Provided as an Internet-based service to any client, information correlation procedures will perform a variety of functions, e.g., stock market information correlation, home security information correlation, and health care information correlation. More specifically, real-time correlation of different stock market information sources over the Internet could potentially transform novice stock market enthusiasts into experienced Wall Street analysts. Any client or day trader using an Internet browser could specify sources, select (customize) correlation methods and define the mode of correlation delivery (Internet, pager, phone, etc.). The stock market information correlation system would then take care of the rest. As another example, data could be collected from emergency care patients or outpatients using attached data sensors. The data would be correlated into more meaningful indicators and warning signs for delivery to doctors or to other health care professionals.
Additional objectives, features and advantages of the invention are set forth in the following description, will be apparent from the description, or may be learned by practicing the invention. Both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.