The present invention relates to methods of processing data from communications networks, systems for processing data from communications networks, and methods of diagnosing causes of events in complex systems.
In complex systems such as communication networks, events which can affect the performance of the network need to be monitored. Such events may involve faults occurring in the hardware or software of the system, or excessive demand causing the quality of service to drop. For the example of communication networks, management centres are provided to monitor events in the network. As such networks increase in complexity, automated event handling systems have become necessary. Existing communication networks can produce 25,000 alarms a day, and at any time there may be hundreds of thousands of alarms which have not been resolved.
With complex communication systems, there are too many devices for them to be individually monitored by any central monitoring system. Accordingly, the monitoring system, or operator, normally only receives a stream of relatively high level events. Furthermore, it is not possible to provide diagnostic equipment at every level, to enable the cause of each event to be determined locally.
Accordingly, alarm correlator systems are known, as shown in FIG. 1, for receiving a stream of events from a network and deducing a cause of each event, so that the operator sees a stream of problems, in the sense of the originating causes of the events output by the network.
The alarm correlator shown in FIG. 1 uses network data in the form of a virtual network model to enable it to deduce the causes of the events output by the network. Before the operation of known alarm correlator systems is discussed, some details of how alarms are handled within the network will be given, with reference to FIG. 2. Several layers of alarm filtering or masking can occur in between a device raising an event, and news of this event reaching a central system manager. At the hardware element (HE) level, the system would be overwhelmed, and performance destroyed if every signal raised by hardware elements were to be forwarded unaltered to higher layers. Masking is used to reduce this flood of data. Some of the signals are always suppressed, others delayed for a time to see if a higher criticality signal arises, and suppressed if such a signal has already been sent.
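By way of illustration only, the masking behaviour described above may be sketched as follows. The `Signal` record, the `mask_signals` function, the numeric criticality scale and the fixed hold window are all assumptions made for this sketch and are not taken from any particular hardware element.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signal:
    time: float        # seconds since some reference point
    name: str
    criticality: int   # assumption: larger number = more critical

def mask_signals(signals, always_suppressed, hold):
    """Return the signals that would be forwarded to the next layer."""
    ordered = sorted(signals, key=lambda s: s.time)
    forwarded = []
    for i, sig in enumerate(ordered):
        # Some signals are always suppressed.
        if sig.name in always_suppressed:
            continue
        # Suppressed if a higher criticality signal has already been sent.
        if any(f.criticality > sig.criticality for f in forwarded):
            continue
        # Delayed for `hold` seconds; dropped if a higher criticality
        # signal arises within that window.
        if any(
            later.criticality > sig.criticality
            and later.time - sig.time <= hold
            and later.name not in always_suppressed
            for later in ordered[i + 1:]
        ):
            continue
        forwarded.append(sig)
    return forwarded
```

In this simplified form a forwarded high criticality signal masks all later lower criticality signals indefinitely; a practical filter would presumably clear that state when the condition is resolved.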
Some control functions may be too time critical to be handled by standard management processes. Accordingly, either at the hardware element level, or a higher level, some real time control may be provided, to respond to alarms. Such real time control (RTC) has a side effect of performing alarm filtering. For example, a group of alarms indicating card failure, may cause the real time controller to switch from a main card to a spare card, triggering further state change modifications at the hardware element level. All this information may be signalled to higher levels in a single message from the RTC indicating that a failure and a handover has occurred. Such information can reach the operator in a form indicating that the main card needs to be replaced, an operation which normally involves maintenance staff input.
A node system manager may be provided as shown in FIG. 2, to give some alarm filtering and alarm correlation functions. Advanced correlation and restoration functions may be located here, or at the network system management level.
In one known alarm correlation system, shown in U.S. Pat. No. 5,309,448 (Bouloutas et al), the problem of many alarms being generated from the same basic problem is described. This is because many devices rely on other devices for their operation, and because alarm messages will usually describe the symptom of the fault rather than whether it exists within a device or as a result of an interface with another device.
FIG. 3 shows how this known system addresses this problem. A fault location is assigned relative to a device, for each alarm. A set of possible fault locations for each alarm is identified, with reference to a stored network topology.
Then the different sets of possible fault locations are correlated with each other to create a minimum number of possible incidents consistent with the alarms. Each incident is individually managed, to keep it updated, and the results are presented to an operator.
Each of the relative fault locations is internal, upstream, downstream, or external. The method does not go beyond illustrating the minimum number of faults which relate to the alarms, and therefore its effectiveness falls away if multiple faults arise in the selected set, which is more likely to happen in more complex systems.
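The correlation of sets of possible fault locations into a minimum number of incidents can be viewed as finding a smallest set of locations that intersects every alarm's candidate set. The following is a minimal brute-force sketch of that idea; it is not the algorithm of the cited patent, and the function name `minimal_incidents` and the location labels are hypothetical.

```python
from itertools import combinations

def minimal_incidents(candidate_sets):
    """Smallest set of fault locations consistent with every alarm.

    candidate_sets holds one set of possible fault locations per alarm.
    Exhaustive search is used, so this suits small examples only.
    """
    locations = sorted(set().union(*candidate_sets))
    for size in range(1, len(locations) + 1):
        for combo in combinations(locations, size):
            chosen = set(combo)
            # Every alarm must be explained by at least one chosen location.
            if all(chosen & candidates for candidates in candidate_sets):
                return chosen
    return set()
```

The sketch also illustrates the limitation noted above: it reports the smallest explanation, so two independent faults that happen to share a candidate location would be reported as one incident.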
Another expert system is shown in U.S. Pat. No. 5,159,685 (Kung). This will be described with reference to FIG. 4. Alarms from a network manager 41 are received and queued by an event manager 42. After filtering by an alarm filter 43, alarms which are ready for processing are posted to a queue referred to as a bulletin board 44, and the alarms are referred to as goals. A controller 45 determines which of the goals has the highest priority. An inference engine 46 uses information from an expert knowledge base 47 to solve the goal and find the cause of the alarm by a process of instantiation. This involves instantiating a goal tree for each goal by following rules in the form of hypothesis trees stored in the expert knowledge base. Reference may also be made to network structure knowledge in a network structure knowledge base 48. This contains information about the interconnection of network components.
The inference process will be described with reference to FIG. 5. First a knowledge source is selected according to alarm type. The knowledge source is the particular hypothesis tree. Hypothesis trees, otherwise known as goal trees, are stored for each type of alarm.
At step 51 the goal tree for the alarm is instantiated, by replacing variables with facts, and by executing procedures/rules in the goal tree as shown in step 52. If the problem diagnosis is confirmed, the operator is informed. Otherwise other branches of the goal tree may be tried, further events awaited, and the operator kept informed as shown in steps 53 to 56.
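A goal tree of this general kind may be illustrated by the following sketch, in which each node carries a test that is evaluated against the known facts and the most specific confirmed hypothesis is reported. The `Hypothesis` class, the `diagnose` function and the example facts are assumptions made for illustration and do not reproduce the cited system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Hypothesis:
    diagnosis: str
    test: Callable[[dict], bool]          # rule evaluated against the facts
    children: List["Hypothesis"] = field(default_factory=list)

def diagnose(node: Hypothesis, facts: dict) -> Optional[str]:
    """Depth-first search for the most specific confirmed hypothesis."""
    if not node.test(facts):
        return None                       # this branch is not confirmed
    for child in node.children:           # otherwise try more specific branches
        result = diagnose(child, facts)
        if result is not None:
            return result
    return node.diagnosis

# Hypothetical goal tree for a "loss_of_signal" alarm type.
tree = Hypothesis(
    "link problem",
    lambda f: f.get("alarm") == "loss_of_signal",
    [
        Hypothesis("remote transmitter failed",
                   lambda f: f.get("remote_tx_ok") is False),
        Hypothesis("fibre cut",
                   lambda f: f.get("remote_tx_ok") is True),
    ],
)
```

If no child hypothesis can be confirmed from the facts available, the sketch falls back to the parent diagnosis, which loosely corresponds to keeping the operator informed while further events are awaited.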
This inference process relies on specific knowledge having been accumulated in the expert knowledge base. The document describes a knowledge acquisition mode of operation. This can of course be an extremely labour intensive operation and there may be great difficulties in keeping a large expert knowledge base up to date.
A further known system will be described with reference to FIG. 6. U.S. Pat. No. 5,261,044 (Dev et al) and two related patents by the same inventor, U.S. Pat. Nos. 5,295,244, and 5,504,921, show a network management system which contains a model of the real network. This model, or virtual network includes models of devices, higher level entities such as rooms, and relationships between such entities.
As shown in FIG. 6, a room model 61 may include attribute objects 62, and inference handler objects 63. Device models 64, 65, may also include attribute objects 66, 67 and inference handler objects 68, 69. Objects representing relationships between entities are also illustrated. The device models are linked by an “is connected to” relationship object 70, and the device models are linked to the room model by “contains” relationship objects 71, 72.
The network management system regularly polls all its devices to obtain their device-determined state. The resulting data arrives at the device object in the virtual model, which passes the event to an inference handler attached to it. An inference handler may change an attribute of the device object, which can raise an event which fires another inference handler in the same or an adjacent model.
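The chaining of inference handlers through attribute changes may be sketched as follows. The `Model` class, the attribute names and the handler `propagate_link_down` are hypothetical and are intended only to show how one handler can fire another in an adjacent model.

```python
class Model:
    """Simplified device model with attributes and inference handlers."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.handlers = []     # callables taking (model, attribute, value)
        self.neighbours = []   # adjacent models, e.g. "is connected to"

    def set_attribute(self, attribute, value):
        if self.attributes.get(attribute) == value:
            return             # no change, so no event is raised
        self.attributes[attribute] = value
        for handler in self.handlers:
            handler(self, attribute, value)

def propagate_link_down(model, attribute, value):
    """Handler: when a device goes down, mark it unreachable from its peers."""
    if attribute == "status" and value == "down":
        for peer in model.neighbours:
            peer.set_attribute("peer_status", "unreachable")
```

Note that the handler on one device has to know which attribute of its neighbour to change, which is the kind of coupling between device models discussed below.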
The use of object orientated techniques enables new device models to be added, and new relationships to be incorporated, and therefore eases the burden of developing and maintaining the system.
However, to develop alarm correlation rules for each device, it is necessary to know both what other devices are linked to the first device, and also how the other devices work. Accordingly, developing and maintaining the virtual network model can become a complex task, as further new devices, new connections, or new alarm correlation rules are added.
The invention addresses such problems.
According to a first aspect of the invention, there is provided a method of processing data from a communications network, the network comprising entities which offer and receive services to and from each other, the method comprising the steps of:
adapting a virtual model of the network according to events in the network, the model comprising a plurality of managed units corresponding to the network entities, each of said units containing information about the services offered and received by its corresponding entity to and from other entities, and having associated knowledge based reasoning capacity for adapting the model by adapting said information;
notifying one of the managed units of an event raised by its corresponding entity; and
determining the cause of the event using the virtual model.
Using service import/export for configuration of the network model, and communicating service import/export state between managed units enables a much greater degree of encapsulation to be achieved. This encapsulation enables alarm correlation rules to be developed for each managed unit without the need to understand or adapt the behaviour of all the other managed units. Adding further devices or connections to an existing model can be achieved with less disruption to other managed units and sets of alarm correlation rules.
If the managed unit concept is used at other stages in the life cycle of a system, then accurate fault behaviour can be specified at an early stage of designing a device or a network.
Other network management functions can use the knowledge developed in alarm correlation rules developed for the managed unit virtual model.
A further advantage is that diverse types of networks can be supported. The mapping of diverse managed object concepts into a single managed unit concept allows the correlator to model and correlate alarms from heterogeneous networks.
Preferably, the information about the services comprises degradation status of the services.
Advantageously the reasoning capacity comprises a set of rules representing the behaviour of the corresponding entity.
Advantageously the rules represent the behaviour of the corresponding entity under fault conditions.
Advantageously, the rules further represent behaviour of the corresponding entity under conditions of the fault in another entity which is supplying services to it.
Advantageously, the information concerning services between a given pair of the units is held in an interactor object shared by the two units. The interactor object has a type representing a type of service, and an associated state representing degradation states of its service type. The pair of units may communicate with each other using a limited set of messages relating to a state of the interactor, to the event, or to a fault state of the originating unit.
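One possible shape for such an interactor object is sketched below. The class names, the three degradation states and the three message kinds are assumptions chosen to mirror the description above and are not a definitive data model.

```python
from dataclasses import dataclass
from enum import Enum

class Degradation(Enum):
    OK = 0
    DEGRADED = 1
    FAILED = 2

class MessageKind(Enum):
    # Assumed encoding of the limited set of messages between a pair of units.
    INTERACTOR_STATE = "interactor_state"
    EVENT = "event"
    FAULT_STATE = "fault_state"

@dataclass
class Interactor:
    service_type: str                     # type of service represented
    provider: str                         # name of the unit offering it
    consumer: str                         # name of the unit receiving it
    state: Degradation = Degradation.OK   # degradation state of the service

class ManagedUnit:
    def __init__(self, name):
        self.name = name
        self.exports = []   # interactors for services this unit offers
        self.imports = []   # interactors for services this unit receives

def connect(provider, consumer, service_type):
    """Create one interactor and share it between the two units."""
    link = Interactor(service_type, provider.name, consumer.name)
    provider.exports.append(link)
    consumer.imports.append(link)
    return link
```

Because both units hold a reference to the same interactor, a state change recorded by one unit is visible to the other without either unit inspecting the other's internals.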
Advantageously, the step of determining the cause of the event comprises the steps of:
selecting one or more rules associated with the unit which correspond to the type of event notified,
applying the rule or rules to determine whether the cause is internal to the corresponding entity, or is a result of a degradation of services received by the corresponding entity.
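The two steps above may be sketched as follows, assuming that each rule names either an imported service whose degradation would explain the event, or no service, in which case it denotes an internal fault. The `Rule` record and the `determine_cause` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    event_type: str
    blamed_service: Optional[str]   # None means "internal fault"

def determine_cause(event_type, rules, degraded_imports):
    """Classify an event as internal or as caused by a degraded import."""
    # Step 1: select the rules corresponding to the notified event type.
    selected = [r for r in rules if r.event_type == event_type]
    # Step 2: apply them, preferring an explanation by a degraded import.
    for rule in selected:
        if rule.blamed_service is not None and rule.blamed_service in degraded_imports:
            return ("imported", rule.blamed_service)
    if any(r.blamed_service is None for r in selected):
        return ("internal", None)
    return ("unknown", None)
```

Preferring the imported explanation over the internal one is a design choice of this sketch; the rules of a real managed unit could weigh the two differently.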
Advantageously information concerning services between a given pair of units is held in an interactor object, one of said given pair being the notified unit, the method further comprising the steps of:
communicating a degradation in services to the other unit of the pair, using the interactor object,
and applying rules associated with the other unit of the pair, to determine whether the cause is internal to its corresponding entity.
Advantageously a truth value taken from a multivalued logic associated with the degradation is determined by the rules associated with the notified unit and is communicated to the other of the units. This enables both certain degradations and possible or likely degradations to be calculated and communicated, pending confirmation or contradiction from other sources, or at a later time.
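The communication of a graded truth value through an interactor may be illustrated as follows. The four-valued scale, the `self_test_failed` flag and the rule applied by the supplying unit are assumptions made for the sake of the sketch; the description above requires only that the logic be multivalued.

```python
from enum import IntEnum

class Truth(IntEnum):
    # Assumed four-valued scale; only "multivalued" is required.
    FALSE = 0
    POSSIBLE = 1
    LIKELY = 2
    TRUE = 3

class Interactor:
    def __init__(self, service_type):
        self.service_type = service_type
        self.degraded = Truth.FALSE   # truth value of "service is degraded"

def report_degradation(interactor, truth):
    """Notified (consuming) unit records its belief on the shared interactor."""
    interactor.degraded = truth

class SupplierUnit:
    def __init__(self, name, self_test_failed=False):
        self.name = name
        self.self_test_failed = self_test_failed

    def internal_fault(self, interactor):
        """Rule of the supplying unit: is the cause internal to it?"""
        if interactor.degraded == Truth.FALSE:
            return Truth.FALSE
        if self.self_test_failed:
            return Truth.TRUE          # confirmed by local evidence
        # Unconfirmed: keep the hypothesis open, but no stronger than POSSIBLE.
        return min(interactor.degraded, Truth.POSSIBLE)
```

A value below `TRUE` can later be raised or withdrawn when confirmation or contradiction arrives from another source.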
Advantageously, a problem object is created, comprising a knowledge based reasoning capacity for determining whether one possible cause of the event is true, the method comprising the step of exercising the problem object reasoning capacity. The combination of treating problems as objects and modelling the network in such a way that managed units contain information about services offered and received gives rise to particular advantages. It allows the system to map a particular state of a unit more precisely to its causes and consequences. It is more efficient to express these in terms of services because a service captures precisely information about how the managed unit operations are interdependent. Object orientation restricts communication to that which is relevant, one of the benefits of encapsulation. Object orientation also enables inheritance, as will be discussed.
Advantageously the problem object is associated with the notified unit and the reasoning capacity comprises rules representing the behaviour of the unit under fault conditions. Advantageously the rules comprise rules for mapping a fault in the unit to degradation of services it offers. The rules may comprise rules for mapping degradation of services received to services offered, or vice versa. Also, the rules may represent behaviour of the unit under conditions of faults in a limited number of other units whose corresponding entities are functionally linked in a chain of service connections. Limiting the reasoning to local or semi local reasoning greatly facilitates the task of writing and maintaining the rules. Furthermore, fault knowledge can be separated from the specific topology of a network, thereby allowing a single knowledge base to support a variety of customer specific network configurations.
Advantageously, if an event cannot be translated it may be broadcast to other units for translation. It may only be broadcast to a limited number of other units, whose corresponding entities are functionally linked in a chain of service connections.
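Restricting a broadcast to units linked in a chain of service connections can be sketched as a breadth-first search with a hop limit. The function `units_within`, the representation of service connections as pairs of unit names, and the hop limit itself are assumptions of this sketch.

```python
from collections import deque

def units_within(service_links, start, max_hops):
    """Units reachable from `start` over at most `max_hops` service links."""
    neighbours = {}
    for a, b in service_links:
        # A service connection links both ends, whichever side offers it.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        unit, hops = queue.popleft()
        if hops == max_hops:
            continue                      # do not broadcast any further
        for nxt in sorted(neighbours.get(unit, ())):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                queue.append((nxt, hops + 1))
    return reached
```

An untranslated event would then be offered only to the returned units rather than to every unit in the model.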
Advantageously, where a plurality of problem objects are created, corresponding to different possible causes of an event, they are able to pass messages to each other. This hybrid rule and message passing system can enable faster alarm correlation compared to standard knowledge based communication between rules in a large rule base applying to many possible faults. Scalability is improved as correlation processing can be distributed.
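A minimal sketch of problem objects exchanging messages is given below. It assumes, purely for illustration, that the possible causes of one event are mutually exclusive, so that confirmation of one problem causes its rivals to retract; the `Problem` class, its status strings and the `hypothesise` helper are hypothetical.

```python
class Problem:
    """One possible cause of an event, held as an object."""

    def __init__(self, cause):
        self.cause = cause
        self.status = "open"
        self.rivals = []     # other problems raised for the same event

    def receive(self, message, sender):
        # Assumption: a confirmed rival fully explains the event.
        if message == "confirmed" and self.status == "open":
            self.status = "retracted"

    def confirm(self):
        self.status = "confirmed"
        for rival in self.rivals:
            rival.receive("confirmed", self)

def hypothesise(causes):
    """Create one problem object per possible cause and link them as rivals."""
    problems = [Problem(c) for c in causes]
    for p in problems:
        p.rivals = [q for q in problems if q is not p]
    return problems
```

Each problem object only needs to process the messages addressed to it, rather than every rule in a large shared rule base being re-evaluated.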
According to another aspect of the invention a system is provided comprising processing means arranged to process data from a communications network.
According to another aspect of the invention there is provided a method of processing data from a communications network, the network comprising entities which offer and receive services to and from each other, the method comprising the steps of:
adapting a virtual model of the network according to events in the network, the model comprising a plurality of managed units corresponding to the network entities, each of said units containing information about the services offered and received by its corresponding entity to and from other entities, and having associated knowledge based reasoning capacity for adapting the model by adapting said information;
notifying one of the managed units of an event raised by its corresponding entity; and
determining consequences of the event using the virtual model.
Determining consequences of some events can assist in determining causes of other events. Another application is in service impact analysis.
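Determining consequences can be illustrated by following exported services downstream from a faulty unit, which is one simple form of service impact analysis. The `impacted_units` function and the mapping from each unit to the units that receive its services are assumptions of this sketch.

```python
from collections import deque

def impacted_units(exports, faulty):
    """Units that directly or indirectly receive services from `faulty`.

    exports maps a unit name to the names of the units it offers services to.
    """
    seen = {faulty}
    impacted = []
    queue = deque([faulty])
    while queue:
        unit = queue.popleft()
        for consumer in exports.get(unit, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted
```

The sketch treats every downstream unit as affected; in practice the rules of each unit would decide whether a degraded import actually degrades the services it offers in turn.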
According to another aspect of the invention, there is provided a method of processing data from a communications network, the network comprising entities which offer and receive services to and from each other, the method comprising the steps of:
adapting a virtual model of the network according to events in the network, the model comprising a plurality of managed units corresponding to the network entities, each of said units containing information about the services offered and received by its corresponding entity to and from other entities, and having associated knowledge based reasoning capacity for adapting the model by adapting said information;
notifying one of the managed units of an event raised by its corresponding entity; and
wherein the information about the services comprises degradation status of the service.
This enables the causes and consequences of events to be determined precisely and efficiently.
Preferred features may be combined, and combined with any of the aspects of the invention as appropriate, as would be apparent to a skilled person.