1. Field of the Invention
The present invention relates generally to communications networks, and in particular to a method and apparatus for resolving faults in such networks. Within this disclosure, the term "communications network" is used to refer to any type of digital communications system, of which a computer-based local area network and a computer-based wide area network are examples.
2. Discussion of the Related Art
All communications networks experience faults during network operation. Faults, as used in this disclosure, may include a failure of hardware portions of the communications network, such as workstations or peripheral devices, and failure of software portions of the network, such as software application programs and data management programs. In small, stable, homogeneous communications networks (i.e., those in which all of the equipment is provided by the same vendor and the network configuration does not change), management and repair of network faults is relatively straightforward. However, as a communications network becomes increasingly large and heterogeneous (i.e., one in which different types of equipment are connected together over large areas, such as an entire country), fault management becomes more difficult.
One of the ways to improve fault management in large communications networks is to use a so-called "trouble-ticketing" system. This system provides a number of tools that can be used by network users, administrators, and repair and maintenance personnel. The basic data structure, a "trouble ticket," has a number of fields in which a user can enter data describing the parameters of an observed network fault. A trouble ticket filled out by a user may then be transmitted by, for example, an electronic mail system to maintenance and repair personnel. A trouble ticket describing a current network fault that needs to be acted on is called an "outstanding trouble ticket." When the network fault has been corrected, the solution to the problem, typically called a "resolution," is entered into an appropriate data field in the trouble ticket. When a network fault has been resolved, the trouble ticket is said to be completed. The system provides for storage of completed trouble tickets in a memory and thus a library of such tickets is created, allowing users, administrators, and maintenance and repair personnel to refer to these stored completed trouble tickets for assistance in determining solutions to new network faults.
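The trouble ticket and the library of completed tickets described above can be illustrated with a minimal sketch. This is not the data layout of any particular trouble-ticketing product; the field names (ticket_id, reporter, device, symptom, resolution) and the class and method names are hypothetical, chosen only to mirror the terms "outstanding," "resolution," and "completed" as used in this discussion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TroubleTicket:
    # Hypothetical fields: the text only states that a ticket has fields
    # describing the observed fault plus a field for the resolution.
    ticket_id: int
    reporter: str
    device: str
    symptom: str
    resolution: Optional[str] = None

    @property
    def is_outstanding(self) -> bool:
        # A ticket is outstanding until a resolution has been entered.
        return self.resolution is None

    def complete(self, resolution: str) -> None:
        # Entering the resolution is what makes the ticket "completed".
        self.resolution = resolution


class TicketLibrary:
    """Stores completed tickets so they can be consulted for new faults."""

    def __init__(self) -> None:
        self._completed: List[TroubleTicket] = []

    def store(self, ticket: TroubleTicket) -> None:
        if ticket.is_outstanding:
            raise ValueError("only completed tickets can be stored")
        self._completed.append(ticket)

    def find_by_symptom(self, keyword: str) -> List[TroubleTicket]:
        # Simple case-insensitive substring match on the symptom field.
        kw = keyword.lower()
        return [t for t in self._completed if kw in t.symptom.lower()]


ticket = TroubleTicket(1, "jsmith", "printer-3", "Printer offline after reboot")
ticket.complete("Reassigned the printer's network address")

library = TicketLibrary()
library.store(ticket)
matches = library.find_by_symptom("offline")
```

In this sketch the lookup is a plain keyword match, which corresponds to a person manually browsing the stored tickets; it performs no reasoning of its own.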
The trouble-ticketing system thus provides a convenient, structured way of managing fault resolution and of storing solutions to network faults in a manner that allows this stored body of knowledge to be accessed and applied to outstanding communications network faults. An example of a trouble-ticketing system is the ACTION REQUEST SYSTEM, developed by Remedy Corporation, Mountain View, Calif., and sold by Cabletron Systems, Inc., Rochester, N.H.
A structured trouble-ticketing system, however, does not provide a complete solution to the fault management problem. For time-critical network services, the downtime that elapses from the observation of a network fault, through the submission of a trouble ticket, to the completion of the trouble ticket can be expensive. Downtime can be reduced by providing a communication link between a network fault detection system and a trouble-ticketing system. The communication link allows fault information collected by the fault detection system to be transmitted to the trouble-ticketing system in the form of an automatically generated and filled-out trouble ticket. The trouble-ticketing system then manages communication and workflow among the network administrator, support staff, and end-users in the normal manner to resolve the outstanding trouble ticket.
Although this solution allows trouble tickets to reach the fault management system and appropriate maintenance and repair personnel more quickly, it does not reduce the time necessary to resolve an outstanding fault. A maintenance and repair person is still required to research and resolve the outstanding fault. This is not only time-consuming, but expensive as well.
To reduce the time in which faults are resolved, artificial intelligence systems may be used to assist in resolving the outstanding trouble ticket. In existing systems that make use of artificial intelligence in this manner, fault resolution expertise is represented using a rule-based reasoning (hereinafter "RBR") method.
A typical RBR system includes a working memory, a rule-base, and a control procedure. The working memory typically contains a representation of characteristics of the network, including topological and state information. The rule-base represents knowledge about what operations should be performed when the network malfunctions. If the network enters an undesirable state, the control procedure selects those rules that are applicable to the current situation. Of the rules that are applicable, a predetermined control strategy selects a rule to be executed. A rule can perform tests on the network, query a database, provide commands to a network configuration management system, or invoke another expert system. Using results obtained after executing a rule, the system updates the working memory by asserting, modifying, or removing working memory elements. The RBR system continues in this cycle until the working memory reaches a state that represents a desirable state of the network. Examples of RBR systems for network management may be seen in Expert Systems Applications in Integrated Network Management, edited by E. Ericson, L. Ericson, and D. Minoli and published by Artech House, Inc., 1989.
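The cycle described above (match the applicable rules, select one by a control strategy, execute it, update the working memory, repeat) can be sketched as follows. This is an illustrative sketch only, not the design of any system cited here. The names Rule, run_rbr, and link_state are hypothetical, the "highest priority wins" control strategy is an assumption, and the rule actions merely edit a dictionary where a real system would test or command the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Working memory: a simple key/value view of network state (an assumption).
WorkingMemory = Dict[str, object]


@dataclass
class Rule:
    name: str
    priority: int                                 # used by the control strategy
    condition: Callable[[WorkingMemory], bool]    # when the rule is applicable
    action: Callable[[WorkingMemory], None]       # updates working memory


def run_rbr(memory: WorkingMemory,
            rules: List[Rule],
            goal: Callable[[WorkingMemory], bool],
            max_cycles: int = 20) -> List[str]:
    """Run the match/select/execute cycle; return names of the rules fired."""
    fired: List[str] = []
    for _ in range(max_cycles):
        if goal(memory):
            break                       # desirable state reached
        applicable = [r for r in rules if r.condition(memory)]
        if not applicable:
            break                       # no rule matches the current state
        # Assumed control strategy: pick the highest-priority applicable rule.
        chosen = max(applicable, key=lambda r: r.priority)
        chosen.action(memory)           # executing the rule updates the memory
        fired.append(chosen.name)
    return fired


rules = [
    # Stands in for a real network test that discovers the link is down.
    Rule("test-link", 2,
         lambda m: m["link_state"] == "unknown",
         lambda m: m.update(link_state="down")),
    # Stands in for a command sent to a configuration management system.
    Rule("reset-port", 1,
         lambda m: m["link_state"] == "down",
         lambda m: m.update(link_state="up")),
]

memory: WorkingMemory = {"link_state": "unknown"}
trace = run_rbr(memory, rules, goal=lambda m: m["link_state"] == "up")
```

With this rule-base, a working memory whose link_state holds a value that no rule condition anticipates causes the loop to stop without firing any rule, because the set of applicable rules is empty.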
Constructing an RBR fault resolution system requires defining a description language that appropriately and completely represents networking conditions (the "domain"), extracting expertise from persons with expertise in the network ("domain experts") and/or trouble-shooting documents, and representing the expertise in the RBR format. This procedure requires several iterations of a so-called "consult/implement/test" cycle in order to achieve a correct system. In the consult/implement/test cycle, an expert is interviewed to determine his or her fault resolution methodology, the methodology is implemented in a rule or rules that the system can process, and the rules are tested. If the conditions or domain in which the RBR system operates remain relatively stable, minimal maintenance is required once a correct system is achieved. However, if the system is used to resolve faults in unpredictable or rapidly changing domains, two problems typically occur. First, the RBR system suffers from the problem of "brittleness." Brittleness means that the system fails when it is presented with a novel problem for which it has no applicable rules. A cause of system brittleness is that the system cannot adapt existing knowledge to a novel situation or cannot gain new information from novel experiences to apply in the future. The second problem is commonly known as a "knowledge acquisition bottleneck." The knowledge acquisition bottleneck occurs when a knowledge engineer tries to manually modify the rule-base by devising special rules and control procedures in order to deal with changes, new parameters, or other unforeseen situations. As a result of these modifications, the RBR system typically becomes unwieldy, unpredictable, and unmaintainable. Furthermore, if the domain in which the RBR system operates is a rapidly changing one, the system can become obsolete in a relatively short period of time.
Therefore, an object of the present invention is to provide a method and apparatus for resolving faults in communications networks that learns from prior fault resolution scenarios and offers solutions to novel network faults based on past resolution scenarios.
Another object of the present invention is to provide a method and apparatus that applies so-called case-based reasoning (hereinafter "CBR") to fault management and resolution in communications networks.
Still another object of the present invention is to provide a method and apparatus for automatically resolving faults in communications networks.