A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Technical Field of the Invention
This invention relates to software fault management, and, more particularly, to an intelligent multi-agent system for software fault management in a radio telecommunications network.
2. Description of Related Art
Expert systems are computer programs employing programming techniques found in the field of Artificial Intelligence known as knowledge-based systems. These computer programs are designed to apply formal representations of domain knowledge or expertise to solve problems. Symbolic descriptions (e.g., in the form of rules, frames, predicate logic, etc.) of this expertise characterize the definitional and empirical relationships in a domain and the procedures for manipulating these descriptions. This approach to computational models has proven extremely useful in automating complex tasks normally accomplished by human experts.
Compared to conventional programming methods, the emphasis in developing expert systems is placed on processing information at the knowledge-level rather than at the data-level. Knowledge is distinguished from data because of its inferential capacity which allows an information processing agent--the inference engine--to navigate from one set of data to another, for example: from a set of observations to the identification of problem symptoms; from a set of symptoms to a diagnosis; or from a set of diagnostics to a recovery plan of action. In each of these examples, numerous and intricate reasoning steps or inference procedures may be required to arrive at final conclusions. These procedures are generated dynamically as the inference engine of a knowledge-based system matches the current inputs to relevant elements in the knowledge base. This feature provides the means to re-assess the state of a situation during each cycle of a reasoning mechanism. As a result, a system can react to a dynamic situation more readily than conventional programs.
Today's cellular telecommunications networks are becoming increasingly complex in nature with many interworking nodes. Suppliers of telecommunications switching equipment may have several significantly different types of systems based on a variety of technologies, with several versions of each spread over hundreds of interworking nodes throughout the world. In addition, the need to constantly add new features leads to a rapid increase in system size and complexity. Adding even more complexity is the need to develop new trouble shooting tools. Taking this into account, and the fact that the maintenance of existing products is rapidly growing in volume and cost, it is imperative to drastically reduce the number of trouble reports and to improve response time.
The real-time nature of today's mobile telecommunication networks adds to the difficulty of the fault management task. For example, a diagnostic system must be able to handle the flow of alarm notifications as quickly as the average rate at which they are generated. The maintenance of an accurate model of the mobile network configuration is critical for the fault management task. A good knowledge of the faults to be processed, as well as their dynamic features, is also of importance. For example, the severity of a fault can depend on the current state of the traffic load or a particular time or day of the week, and the fault's assigned priority depends on its severity. Filtering and correlation are two major aspects to be considered in order to make it easier to separate the principal fault from its side effects. Indeed, the physical and "air" interconnections of network components and the logical dependencies between the distributed software modules lead to multiple manifestations of the same fault. Efficient tests must be performed automatically and their results consistently interpreted to help the diagnosis and decision making processes.
Finally, current telecommunication systems contain a large number of software modules which can be one of the sources of the faults occurring within the network. Testing of such large software systems is an example of a resource and time consuming activity. Applying equal testing and verification efforts to all parts of a software system is obviously cost prohibitive and a source of operational delay. Therefore, one needs to be able to identify fault-prone modules so that testing/verification efforts can be concentrated on these modules. This will optimize the reliability of a software system with minimum cost and, above all, optimize the fault identification process. Quantitative models can be used to predict which components are likely to contain the highest concentration of faults based on adequate software metrics, and the log of faults found by testers and clients of a software system. To develop such systems, a complete understanding of network management principles is required.
Network management means deploying and coordinating resources in order to plan, operate, administer, analyze, evaluate, design and expand communication networks to meet service level objectives at all times, at reasonable cost and with optimum capacity. Network management developments for mobile networks have almost the same objectives as for wired networks, the main objectives being to ensure good operation and service provisioning. Several standards have been developed for the management of networked systems in the scope of ISO/OSI network management activities. For telecommunication networks, the ITU (International Telecommunication Union) provides a guideline for the definition of the Telecommunication Management Network (TMN). A de-facto standard for the management of TCP/IP networks is the SNMP management protocol which is very widely used. In conformance with these standards or in a proprietary way, several developments have been achieved by both the industry and the research community in the area of wired network management. However, very little work addresses the management of mobile networks. The current challenge in this subject domain is the provision of an intelligent and automated management support system to improve availability, quality, and commercial success. This is needed for both wireless and wireline networks. The following sections review generic network management, the network management functionality specific to mobile networks which results from the wireless nature of these networks, and recent developments in automated fault management systems.
Generic Network Management
Five standard management functions are defined by ISO/OSI management: configuration, fault, security, accounting, and performance management. In the context of mobile networks, these functions apply together with some additional functions that are more specific to the wireless nature of these networks.
One of the most important requirements to be addressed by general purpose fault management systems is the ability to quickly identify the root cause of faults in the network and fix them. This is valid for mobile radio networks where an efficient fault management system should reduce the outage time on radio and other communication and commuting resources. This can be achieved by means of an automated analysis of the alarms generated by different components of the mobile system, and by an automated diagnosis process enabling the fault management system to quickly detect, locate and correct the source fault. The overall process involves filtering and correlation of alarms, and performing diagnostic tests and performance measures.
Basically, fault management deals with the identification of faults and their side effects in the network, their isolation, correction, and the restoration of the network to a desired state. The ultimate aim is to increase the network reliability and availability. Such a system must have enough capabilities to rapidly identify the cause of a fault, isolate the source of the fault, repair the faulty component and restore the network to its normal operational state. More globally, fault management is a collection of activities that are necessary to maintain a desired level of network services. In order to satisfy this requirement, these activities must, as completely as possible, guarantee the detection of all problems in the network and recognize the degradation of performance.
Fault management can be divided into four phases: monitoring, alarm analysis, fault localization, and fault recovery. Monitoring is needed for all management activities, including performance management, configuration management, and fault management. It is an essential means for obtaining the information required about network and system components. During monitoring, the behavior of the system is observed (event detection) and monitoring information is gathered and disseminated (notifications). Monitoring information is processed and utilized to make management decisions and to perform the appropriate control actions on the system.
In the scope of fault management, monitoring information comprises alarms generated by the managed resources and/or sent by the monitoring agent to notify the occurrence of faults. The processing of these alarms consists of discarding superfluous and non-relevant event notifications. Alarm analysis can be divided into two main activities: filtering and correlation. Alarm filtering discards lower priority alarms or stores them in a log file. Alarm correlation recognizes commonalities between alarms and discards non-significant ones and side effects.
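By way of illustration only, the filtering activity described above can be sketched as follows. The names Alarm and filter_alarms, the numeric priority scale (a larger number is assumed to mean a more urgent alarm), and the alarm values are hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str    # component that raised the alarm (hypothetical field)
    kind: str      # alarm type (hypothetical values below)
    priority: int  # assumption: larger number = more urgent


def filter_alarms(alarms, min_priority):
    """Split alarms into those forwarded for correlation and those only logged."""
    forwarded, logged = [], []
    for alarm in alarms:
        if alarm.priority >= min_priority:
            forwarded.append(alarm)
        else:
            logged.append(alarm)  # lower priority: kept in a log, not analyzed
    return forwarded, logged


alarms = [
    Alarm("BS-1", "LINK_DOWN", 3),
    Alarm("BS-1", "FAN_SPEED", 1),
    Alarm("MSC-1", "HIGH_LOAD", 2),
]
forwarded, logged = filter_alarms(alarms, min_priority=2)
```

In this sketch the threshold is fixed; a real system might adjust it according to traffic load or time of day.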
Fault diagnosis (and localization) consists of performing appropriate test sequences in order to locate the fault origin by reducing the number of suspicious components to a limited set containing, optimally, a single faulty component. Fault recovery consists of restoring the system to its normal operation either by isolating the faulty component or by repairing it. Alarm analysis and fault diagnosis are particularly important activities.
Alarm correlation consists of detecting commonalities between alarms, determining the principal alarms, and discarding their side effects (e.g., redundant alarms). This can vary from simple message filtering and redundant alarm suppression to more sophisticated alarm compression and generalization/specialization. The correlation process also reduces the number of suspicious components. The fault localization process can then be based on the remaining non-redundant alarms. The correlation process is iteratively executed by updating a list of potential faults and a list of suspicious components according to the newly received alarms and received information about the components' states. A component is declared potentially faulty (highly suspicious) when a fault pattern involving this component is recognized.
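The iterative correlation process described above may be sketched, under simplifying assumptions, as follows. The function correlate, the representation of a fault pattern as a set of (source, alarm kind) pairs, and the component and alarm names are all hypothetical. Only redundant-alarm suppression and pattern recognition are shown; alarm compression and generalization/specialization are omitted.

```python
def correlate(alarm_stream, fault_patterns):
    """Suppress redundant alarms and flag components whose pattern is recognized.

    alarm_stream:   iterable of (source, kind) tuples in arrival order
    fault_patterns: dict mapping a component to the set of alarms that,
                    taken together, indicate a fault in that component
    """
    seen = set()       # distinct alarms received so far
    principal = []     # non-redundant alarms, in arrival order
    suspects = []      # components declared highly suspicious
    for alarm in alarm_stream:
        if alarm in seen:
            continue   # redundant alarm: discard
        seen.add(alarm)
        principal.append(alarm)
        # Re-evaluate every pattern against the updated set of alarms.
        for component, pattern in fault_patterns.items():
            if component not in suspects and pattern <= seen:
                suspects.append(component)
    return principal, suspects


patterns = {
    "BS-1/transceiver": {("BS-1", "CARRIER_LOSS"), ("MSC-1", "CELL_UNREACHABLE")},
    "MSC-1/trunk": {("MSC-1", "TRUNK_FAIL")},
}
stream = [
    ("BS-1", "CARRIER_LOSS"),
    ("BS-1", "CARRIER_LOSS"),       # duplicate of the first alarm
    ("MSC-1", "CELL_UNREACHABLE"),
]
principal, suspects = correlate(stream, patterns)
```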
Based on results of the alarm correlation process, a fault diagnosis is made. If the faulty component is not accurately identified, appropriate test sequences are repeatedly selected and performed on the remaining highly suspicious components. Test results are analyzed so as to locate the exact set of faulty components. Then, the operational attributes of the faulty components are set to appropriate values (e.g., "Abnormal", 0.0, 0%, etc.). In the case of progressive degradation, these attributes are incrementally updated (e.g., "Warning", 0.35, 35%, etc.). When many levels of the overall hierarchy are concerned with the detected fault, the diagnosis process may involve all these levels.
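A single pass of the diagnosis step described above might look like the following sketch. The function diagnose, the component names, and the ("Normal", 1.0) starting value are assumptions introduced for illustration; only the "Abnormal", 0.0 value comes from the text. The test outcomes are canned stand-ins for real test sequences.

```python
def diagnose(suspects, run_test, state):
    """Test each highly suspicious component; mark and return those that fail."""
    faulty = []
    for component in suspects:
        if not run_test(component):               # test failed
            state[component] = ("Abnormal", 0.0)  # operational attributes
            faulty.append(component)
    return faulty


# Hypothetical test outcomes: True means the component passed its test.
test_results = {"BS-1/transceiver": False, "BS-1/power": True}
state = {name: ("Normal", 1.0) for name in test_results}  # assumed initial value
faulty = diagnose(["BS-1/transceiver", "BS-1/power"], test_results.get, state)
```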
A top down approach is usually used to refine the diagnosis within a given domain by delegating the fault localization responsibility to lower level domains which are more likely to contain the faulty component. This downward delegation can be applied recursively through many levels of the aggregation hierarchy with fewer suspicious components at each level and by executing more specialized test sequences. Each domain reports to its superiors the results of its diagnosis. The top down approach is often suitable when the fault is detected at the level of a given domain. A bottom up approach is used to notify concerned higher level domains of the fault and, possibly, of the diagnosis result corresponding to this fault. This can be useful to prevent fault propagation and to set up the isolation/repair procedures. In addition, a peer-to-peer cooperation between managers of the same hierarchical level may be necessary to provide a consistent diagnosis. This is more likely the case when the potential faulty component is managed within two or more domains.
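The recursive downward delegation described above can be illustrated with the following sketch. The function localize, the domain names, and the is_suspicious predicate (standing in for a domain-level test sequence) are hypothetical. Bottom up notification and peer-to-peer cooperation are not modeled.

```python
def localize(domain, children, is_suspicious):
    """Delegate fault localization downward until leaf components are reached.

    children:      dict mapping a domain to its lower-level domains
    is_suspicious: callable returning True when a domain's tests point to it
    """
    subdomains = children.get(domain, [])
    if not subdomains:
        return [domain]  # leaf of the aggregation hierarchy: a component
    faulty = []
    for sub in subdomains:
        if is_suspicious(sub):
            # Each lower-level domain reports its result to its superior.
            faulty.extend(localize(sub, children, is_suspicious))
    return faulty


children = {
    "network": ["region-A", "region-B"],
    "region-A": ["BS-1", "BS-2"],
    "region-B": ["BS-3"],
}
failing = {"region-A", "BS-2"}  # hypothetical test outcomes
result = localize("network", children, lambda d: d in failing)
```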
The configuration management function mainly handles initial setting of system data, their management (e.g., data update, inventory, etc.) and system configuration (e.g., the system topology). The ultimate aim is to provide consistent system data for each network element in order to guarantee a high network quality and thus customer satisfaction. More precisely, configuration management involves the availability of configuration maintenance data, version control, examination of relevant system data in network elements, analysis of regularly occurring problems, and cooperation with fault management processes.
For these configuration management activities, a uniform database and/or unique interfaces for the exchange of data are necessary. The use of such a common database, often called the technical operational network system database, optimizes data access procedures and simplifies the exchange of relevant and consistent data between the various involved departments (network planning, system design, services operation, etc.).
Software management includes a wide range of tasks and can be viewed, to a certain extent, as part of configuration management. Software management includes the management of existing software versions in operation, the installation of new hardware with the latest software versions, and controlling software improvements. Finally, the resolution of software problems is a major task in the software management process which includes the problem analysis over a certain period of time and over regional borders while maintaining the consistency of the technical operational database.
Mobile Network quality management deals with the recognition and tracing of the main failure reasons, the definition of these failure reasons and their effects on the network, and the optimization of procedures to avoid and eliminate sources of failure as much as possible. Network quality measurement consists of measuring the quality of services, comparing them with competitors, realizing random or scheduled measurements, examining customer complaints, and describing measurement results and usage. Based on these quality measurements and performance/statistics reports, network optimization can be performed (e.g., regular replanning of the cells, fields, regions and the complete network).
The help desk is the interface between the customer service center and the outage system. It is mainly responsible for filtering and processing of network problem data, receiving and analyzing customer problem reports and complaints, initiating appropriate actions to resolve the problem, and the global coordination of the problem resolution process. In addition to service maintenance, the help desk provides support for existing and new services installation and network configuration.
Operational network control consists of maximizing network availability and traffic throughput on an hour-by-hour basis across the whole network. It performs a large number of tasks mainly in an advisory capacity or acting as an agent for other departments, e.g., certain regional problems outside normal working hours. Some of its other activities are: the allocation of priorities to major problems; the evaluation of the impact of major faults on network service; the sorting and handling of major problems; the dynamic monitoring of the mobile system; the provisioning of a management interface for operators; the technical management support and advice to customer interfaces outside normal hours; and the provision of daily reports of major problems.
System maintenance involves dynamic network analysis, network technical support, and central preventive maintenance.
Mobile Network Management
Many of the management functions described previously apply to all types of networks (i.e., wired, wireless network, and their interconnections). Some management functions are specific to mobile networks due to the wireless nature of these networks. These are mainly: radio resources management; mobility management; and radio communication management. In a mobile network, radio transmission constitutes the lowest functional layer. In any telecommunication system, signaling is required to coordinate the necessarily distributed functional entities of the network. The transfer of signaling information in GSM, for example, follows the layered OSI model. On top of the physical layer is the data link layer providing error-free transmission between adjacent entities, based on the ISDN's LAPD protocol for the Um and Abis interfaces, and on SS7's Message Transfer Part (MTP) for the other interfaces. It is the functional layer, above the data link layer, that is responsible for Radio Resource (RR) management, Mobility Management (MM) and Call Management (CM).
The RR management functionality is responsible for providing a reliable radio link between mobile stations and the network infrastructure. The main functional components involved are the mobile station (MS), and the Base Station (BS) subsystem, as well as the Mobile Switching Center (MSC). The RR management function establishes and allocates radio channels on the Um interface between the MS and BS, and establishes A-interface links between the BS and the MSC. Handover (handoff) procedures, an essential element of cellular systems, are managed at this layer. Several protocols are utilized between the different network elements to provide RR functionality. An RR-session is always initiated by a mobile station through the access procedure, either for an outgoing call, or in response to a paging message. The details of the access and paging procedures, such as when a dedicated channel is actually assigned to the mobile, and the paging sub-channel structure, are handled by the RR management. Also handled here is the management of radio features such as power control, discontinuous transmission and reception, and timing advance.
Mobile network management standards adopted the concept of Telecommunication Management Network (TMN) defined in ITU Recommendation M.3010. TMN has been successfully applied for the management of GSM networks, for example. Models for the management of a GSM network also exist in standards. In particular, the application of TMN principles has consisted of the definition of Q3 interfaces between operating systems (OSs) and network elements (NEs) in mobile networks. The various functional blocks (MSC, BS, etc.) are combined in a NE (e.g., MSC Function and Visitor Location Register (VLR) Function in a single NE-MSC/VLR).
Automated Fault Management
There are several existing knowledge-based and artificial intelligence (AI) techniques that can be used for fault diagnosis. Five categories relevant to fault diagnosis are identified: fault-based techniques, model-based techniques, case-based reasoning techniques, machine learning for knowledge acquisition, and integrated diagnostic techniques. A description of the techniques and how they apply to diagnosis follows.
Fault-Based Diagnostic Techniques
Fault-Based Reasoning (FBR) is used in many diagnostic systems and reasons on the basis of common faults and troubleshooting to isolate a problem and suggest a subsequent repair. The knowledge in these systems is primarily based on repair manuals and heuristics (rules of thumb) of experienced technicians. The knowledge is often represented as rules or frames in diagnostic networks or troubleshooting hierarchies.
At the top level of the hierarchy is the general knowledge representing a problem with the device. This general problem is refined systematically until the terminal nodes of the hierarchy, which represent physical repairs or adjustments to the device components, are reached. After these repairs are achieved by a human technician, some systems retest to confirm that the fault or faults diagnosed by the system are resolved by backtracking through tests in the hierarchy.
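As a purely illustrative sketch of such a troubleshooting hierarchy, the following code refines a general problem through rules until a terminal repair node is reached. The function troubleshoot and every problem, symptom, and repair name are hypothetical; the retest and backtracking step is not shown.

```python
def troubleshoot(problem, hierarchy, repairs, observed):
    """Refine a general problem until a terminal repair node is reached.

    hierarchy: dict mapping a problem node to (symptom, child node) rules
    repairs:   set of terminal nodes (physical repairs or adjustments)
    observed:  set of symptoms reported for the device
    """
    node = problem
    while node not in repairs:
        for symptom, child in hierarchy[node]:
            if symptom in observed:
                node = child  # refine the problem one level down
                break
        else:
            return None       # no rule matches: the hierarchy cannot refine further
    return node


hierarchy = {
    "no service in cell": [
        ("power alarm present", "power problem"),
        ("link alarm present", "link problem"),
    ],
    "power problem": [("fuse blown", "replace fuse")],
    "link problem": [
        ("connector loose", "reseat connector"),
        ("link test fails", "replace link board"),
    ],
}
repairs = {"replace fuse", "reseat connector", "replace link board"}

repair = troubleshoot(
    "no service in cell", hierarchy, repairs,
    {"link alarm present", "link test fails"},
)
unknown = troubleshoot("no service in cell", hierarchy, repairs, {"antenna damaged"})
```

The None result for an unlisted symptom reflects that a fixed rule hierarchy can only diagnose faults it already encodes.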
Two major problems with FBR are acquiring the knowledge base and dealing with new faults. Fault-based reasoning systems do not learn new knowledge as they are used and thus are inadequate at detecting novel faults. Also, once encoded the knowledge is difficult to update and maintain. As a result, the case-based and model-based reasoning approaches were developed. Despite its shortcomings, FBR has remained an attractive way of developing diagnostic tools. There have been many successful systems based on FBR.
Model-Based Diagnostic Techniques
Model-based diagnostic techniques describe reasoning on the basis of quantitative or qualitative device models to diagnose failures. Quantitative models include simulations and numerical models. Qualitative models include structural, behavioral, and functional black box models.
Model-Based Reasoning (MBR) for diagnosis concentrates on reasoning about the expected and correct functioning of a device. Models in MBR range from quantitative to qualitative ones and all attempt to accurately approximate device behavior. Once a device model is stabilized, the observed behavior of the device can be predicted. If a discrepancy in behavior is detected, possible candidates, based on assumed component faults, can be generated using assumptions that describe correct model behavior. Sequential diagnosis is used to choose observations, augment a prediction for the candidate faults, and update the list of candidates until a dominant candidate is found.
Although model-based reasoning is less mature than FBR, recent applications developed using MBR techniques have proven that it is a viable technique for diagnosis. However, MBR is applicable only where a sufficiently good model can be built. Also, MBR systems are computationally expensive and have an exponential increase in search complexity as they attempt to detect a fault for a complex device. Also, models are approximations of an artifact and as a result may not accurately illustrate its faults.
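The model-based approach described above can be illustrated, under strong simplifying assumptions, with a device modeled as a linear chain of stages. The functions predict and candidates, the stage names, and the single-fault assumption are all introduced here for illustration; real device models are far richer. Each additional measurement narrows the list of candidate components, in the spirit of sequential diagnosis.

```python
def predict(pipeline, x):
    """Return the expected output of every stage for input x."""
    outputs = []
    for _, fn in pipeline:
        x = fn(x)
        outputs.append(x)
    return outputs


def candidates(pipeline, x, measured):
    """List stages that could explain the measurements (single fault assumed).

    measured: dict mapping a stage index to the output observed at that stage
    """
    expected = predict(pipeline, x)
    lo, hi = 0, len(pipeline) - 1
    for i in sorted(measured):
        if measured[i] == expected[i]:
            lo = max(lo, i + 1)  # stages 0..i behave as the model predicts
        else:
            hi = min(hi, i)      # the fault lies at or before stage i
    return [name for name, _ in pipeline[lo:hi + 1]]


# Hypothetical three-stage device model using integer arithmetic.
pipeline = [
    ("amplifier", lambda v: v * 2),
    ("offset", lambda v: v + 3),
    ("limiter", lambda v: min(v, 10)),
]
# For input 2 the model predicts stage outputs 4, 7, 7.
after_one = candidates(pipeline, 2, {2: 4})                 # final output is wrong
after_two = candidates(pipeline, 2, {2: 4, 0: 4})           # stage 0 is correct
after_three = candidates(pipeline, 2, {2: 4, 0: 4, 1: 7})   # stage 1 is correct
```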
Case-Based Reasoning Techniques
Case-Based Reasoning (CBR) techniques examine past cases and use the results of past case solutions to make recommendations to the user. Although not widely applied to diagnostic applications, this technique is quite relevant to diagnosis.
CBR is the ability to reason on the basis of past problem solutions. CBR allows a system to learn from experience and build up an episodic memory, much like a human. Key issues in achieving this include indexing cases, representing features, adapting cases to new problems, repairing a case that has failed in providing a solution, and generalizing cases for learning in CBR. Recent implementations have included CBR shells. CBR has been applied successfully to many problems, including negotiation, planning, design, and cooking.
Case-based reasoning has been combined with other techniques in AI such as FBR, MBR, simulators, explanation-based learners, and genetic algorithms in an attempt to make CBR more flexible. CBR has had limited application in diagnosis because FBR can be viewed as a form of organized CBR. Diagnostic systems may be able to reason more quickly if they have a case-based component, since CBR speeds up repetitive diagnoses. However, case-based reasoning systems are case-specific and their cases are not easy to generalize; their utility becomes a function of indexing and searching the case base.
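The dependence of CBR on indexing and searching the case base can be illustrated with the following minimal retrieval sketch. The function retrieve, the choice of set overlap (Jaccard similarity) as the matching measure, and the example cases and solutions are assumptions made for illustration. Case adaptation and repair are not shown, and symptom sets are assumed to be non-empty.

```python
def retrieve(case_base, symptoms):
    """Return the stored case whose symptoms best overlap the new problem."""
    def similarity(case):
        stored = case["symptoms"]
        # Jaccard similarity: shared symptoms divided by all symptoms.
        return len(stored & symptoms) / len(stored | symptoms)
    return max(case_base, key=similarity)


# Hypothetical past cases.
case_base = [
    {"symptoms": {"dropped calls", "handover failure"},
     "solution": "update neighbour cell list"},
    {"symptoms": {"no paging response", "link alarm"},
     "solution": "restart signalling link"},
]

new_problem = {"dropped calls", "handover failure", "high load"}
best = retrieve(case_base, new_problem)

# Retaining the solved problem as a new case builds up the episodic memory.
case_base.append({"symptoms": new_problem, "solution": best["solution"]})
```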
Machine Learning for Knowledge Acquisition
Machine learning, which includes empirical and analytic learning, is a key approach in knowledge acquisition. Empirical learning focuses on learning for classification (including learning rules from data for diagnosis). Analytic learning addresses learning for problem-solving tasks. Such tasks include planning, design, natural language understanding, control, and execution. There has been an explosion of work in machine learning in recent years. It is viewed as one of the key approaches of reducing the knowledge acquisition bottleneck.
Learning using classification is one of the more mature machine-learning techniques. Classification algorithms take positive and negative instances and build classification trees that can be pruned to provide rules that represent the examples. Explanation-based learning (EBL) is a form of analytic learning that takes positive and negative examples and uses background knowledge (domain theory) to generate and generalize an explanation for the example. This is a form of speed-up learning that is used to derive generalized knowledge from specific knowledge. It is also useful in making a knowledge base more compact so that reasoning paths may be shortened.
In classification learning, rules are extracted from positive and negative examples. Classification learning has been applied to problems in diagnosis, planning and design. Explanation-based learning is speed-up learning, which implies that it is intended to learn knowledge that could help perform a task faster. Explanation-based learning has been applied to the problem of generating and refining rules for diagnosis.
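As a deliberately small illustration of extracting rules from positive and negative examples, the following sketch learns a single-level classification tree (a decision stump) rather than a full, pruned tree. The function learn_stump, the attribute names, and the training examples are hypothetical.

```python
from collections import Counter


def learn_stump(examples, attributes):
    """Pick the attribute whose values best separate positive from negative examples.

    examples: list of (features dict, label) pairs, label True = faulty
    Returns (attribute, rule) where rule maps each value to its majority label.
    """
    best = None
    for attr in attributes:
        counts_by_value = {}
        for features, label in examples:
            counts_by_value.setdefault(features[attr], Counter())[label] += 1
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in counts_by_value.items()}
        correct = sum(rule[features[attr]] == label for features, label in examples)
        if best is None or correct > best[0]:
            best = (correct, attr, rule)
    return best[1], best[2]


# Hypothetical labelled observations: True marks a positive (faulty) example.
examples = [
    ({"alarm": "link", "load": "high"}, True),
    ({"alarm": "link", "load": "low"}, True),
    ({"alarm": "fan", "load": "high"}, False),
    ({"alarm": "fan", "load": "low"}, False),
]
attribute, rule = learn_stump(examples, ["alarm", "load"])
```

Each entry of the returned rule reads as an if-then diagnostic rule, for example "if alarm is link, then faulty".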
Machine learning, however, remains in its infancy in addressing complex real-world learning. Machine learning for data interpretation requires the compilation of libraries of healthy and fault patterns for the performance of a device. These libraries do not provide knowledge-rich structures or justifications for device behavior or failure.
Integrated Diagnostic Techniques
Integrated diagnostic techniques are a combination of knowledge-based techniques for diagnosis. The following techniques are often combined:
Data analysis and interpretation, including the use of machine learning for diagnosing faults;
Reasoning based on common faults and troubleshooting to isolate the problem;
Reasoning on the basis of numerical or behavioral models to diagnose failures; and
Examining past case solutions and using the results to diagnose new faults.
Many researchers are developing hybrid (integrated) systems. Some systems are using model-based reasoning (MBR) to support a fault-based reasoning (FBR) system. Model-based reasoning is used to detect novel faults while FBR is used to quickly diagnose common faults. Some systems are using machine learning to extract symptoms from sensor data using data interpretation so that an FBR system can be used for diagnosis in an on-line mode. Such an approach simplifies the device monitoring since sensor data is interpreted and then relayed to a failure driven reasoner for a fast diagnosis. Other systems combine sensor data interpretation with MBR to eliminate healthy components from consideration in a diagnosis and to zero in more quickly on components whose behavior deviates from the expected behavior. Cases of previous failures are being indexed and used to speed up diagnosis while combining case-based with fault-based reasoning. Cases of previous failures are also being used to speed up model-based diagnosis.
A single strategy for diagnosis does not seem to be suitable, especially for complex problems. An integrated approach is superior because complex systems inevitably require real-world hybrid solutions.
Today's telecommunication networks are highly advanced, rapidly evolving and made of complex, interdependent technologies. As telecommunication networks fuse with computer networks, and as the underlying technologies continue their rapid evolution, these networks will become increasingly difficult to manage. AI techniques are needed in telecommunications, especially mobile telecommunications, for supporting the decision making process and thus allowing a high level of automation. The main advantages are to reduce the complexity of the management task and to free human operators.
The aspects of fault management covered by existing automated management systems for mobile telecommunications networks are essentially limited to fault monitoring and alarm handling. There is no complete application developed for the management of faults for the whole mobile network since emphasis has been given to the management of problems at the level of single equipment, mainly base stations.
Some of the existing fault management tools based on AI techniques are:
(1) An expert system for restoring services by automating problem diagnosis, recommending repairs, and dispatching technicians.
(2) Several AI-based tools for alarms analysis and fault diagnostics including an expert system shell to build assistants for real-time network alarm correlation in wireline and cellular networks.
(3) An expert system which allows the reception of customer trouble reports, uses a database to determine appropriate circuit tests, conducts the tests, diagnoses problems, and makes dispatch decisions.
(4) An expert system dedicated to network traffic management. It receives network performance data from groups of switches, recognizes and interprets anomalies, plans solutions, and, with user approval, installs appropriate controls and monitors.
(5) An expert system used for fault diagnosis and tuning of cellular networks.
(6) A knowledge-based system which is an internal help desk application to help maintenance administrators use the software that predicts and reports phone-line problems.
(7) A multi-agent, event-driven system which allows on-line monitoring and control for cellular networks. The system minimizes signal interference and increases equipment use in real-time.
Like wireline telecommunications networks, mobile networks face the challenge of guaranteeing a high level of network availability and a good quality of service for customers. For that purpose, efficient, intelligent and automated management systems must be provided for the supervision and control of mobile networks. An advantage of using AI techniques for this purpose is to keep in-house the experience and knowledge acquired by human operators when these operators leave or retire. In general, it also leads to less training activities and lower personnel costs. Another advantage is that the system can evolve more efficiently as new knowledge is added and stored in the light of operational experience.
The state of the art reveals the limited coverage of automated fault management systems in mobile networks.
A number of problem areas have been identified with the current trouble shooting process and tools. In a typical scenario, more than one person is trouble shooting, and one team member (lead troubleshooter) is in charge of guiding the team. The lead troubleshooter reasons with the rest of the team on the possible root cause. Once the possible locations are identified, a diagram is drawn by hand to obtain a better visual understanding of the problem at hand. An iterative process follows in which the team decides on the best signal to trace given the circumstances; trouble shooting tools are utilized to manually place a trace on the signal(s) in the switch; the switch is activated to perform certain functions that activate the trace; and the trace is downloaded and analyzed by the team members for a solution. If no solution is found, the process is repeated with different signals being traced.
The current trouble shooting process requires a great deal of human intervention, which can lead to misinterpretation and error. The current process is of a reactive nature; trouble shooting takes place only after a fault has caused an error or a failure in the system. This means that the customer is experiencing problems, and there is pressure to find a solution as quickly as possible.
In addition to requiring a great deal of human intervention, the process is knowledge-intensive. Given the complexity and size of the software, understanding and reasoning about the system requires considerable effort. Good trouble shooting expertise can only be mastered after years of front-line trouble shooting. Filtering the large volumes of data and choosing the correct tool from the large set of tools available also cause problems. Due to the vast number of possible scenarios, there is no explicit, global trouble shooting methodology that can be utilized by troubleshooting team members. Clearly, there is a definite need for more effective handling of both hardware and software faults.
Although there are no known prior art teachings of a solution to the aforementioned deficiency and shortcoming such as that disclosed herein, U.S. Pat. No. 5,408,218 to Svedberg et al. (Svedberg) and U.S. Pat. No. 5,297,193 to Bouix et al. (Bouix) discuss subject matter that bears some relation to matters discussed herein. Svedberg discloses a model-based alarm coordination system which coordinates primary and secondary alarm notifications in order to ascertain whether they are caused by a single fault or multiple faults in a complex electronic system. The alarm coordination function is part of a larger overall Fault Management Support (FMS) system. The procedure disclosed in Svedberg, therefore, may be utilized within the software fault management (SFM) system of the present invention to perform the fault localization process, but Svedberg does not disclose an overall SFM system providing for proactive monitoring of the cellular network, and trouble shooting expertise and assistance.
Bouix discloses a wireless telephone network which includes a centralized service management system linked to fixed stations by Integrated Services Digital Network (ISDN) links. The fixed stations detect faults and transmit maintenance messages over the ISDN links to the centralized service management system. However, Bouix does not disclose an overall SFM system providing for proactive monitoring of the cellular network, and trouble shooting expertise and assistance.
Review of each of the foregoing references reveals no disclosure or suggestion of a system or method such as that described and claimed herein.
In order to overcome the disadvantages of existing solutions, it would be advantageous to have an SFM system which increases the level of automation of system operation and maintenance activities, thus reducing the turnaround time and the associated cost, and freeing human operators and trouble shooting experts as much as possible. Such an SFM system provides for proactive monitoring of the cellular network, and trouble shooting expertise and assistance, thereby anticipating and preventing catastrophic impact of faults on cellular network services. The present invention provides such a system, enabling cellular system operators to face the challenge of increasing complexity of software management in current and future cellular switching systems.