1. Technical Field
The present invention relates in general to improved telecommunications networks, and in particular to a method and system for fault prevention in a telecommunications network. Still more particularly, the present invention relates to a method and system for fault prediction and proactive maintenance in a telecommunications network.
2. Description of the Related Art
In a highly competitive market, there is a demand for networks that are highly reliable and easy to monitor. This includes the detection of faults in real time or near real time with minimal manual intervention.
Modern telecommunications networks are growing fast in both size and complexity. A network alarm correlation system improves network reliability through network surveillance and fault management. Traditionally, alarms (also referred to as logs and utilized interchangeably throughout this document) report status and abnormalities in the network to the Network Operations Centers (NOC) manned by network domain experts. These alarms are generated by the Network Elements (NE). NEs produce thousands of alarms a day, where a single failure often generates multiple alarms and the same alarm may be raised by different failures. Currently, a burst of alarms during a major network failure may exhibit 40-50 alarms per second. These alarms are provided to protect the network, but their sheer volume may cause network operators to overlook alarms, notice them too late, or incorrectly interpret groups of alarms, which results in frequent and undetected network failures. Thus, the task of network surveillance and fault management is very difficult. Added to this is the ever-increasing number of alarms introduced to the system by new software loads.
Previously, when network maintenance was entirely dependent on network domain experts, all network logs flowed directly to the NOC. As networks grew exponentially, so did the number of logs flooding the NOCs. Because failures frequently cannot be foreseen, NOC staff operate in a reactive mode, responding to failures that have already occurred, rather than in a proactive mode that contains failures in their initial stages. Such frequent network failures affect the revenues of service providers and result in low customer satisfaction. Thus, identifying faults and correcting them before it is too late is a critical task of network management.
The currently available tools for log detection are nodal management tools. These tools often cannot perform root cause analysis or fault prediction, require manual monitoring of the network, and are reactive in nature.
The International Telecommunication Union (ITU) has put forth a five-layered model, known as the Telecommunications Management Network (TMN), to address this problem. TMN includes (1) the business management layer, (2) the service management layer, (3) the network management (NM) layer, (4) the element management (EM) layer, and (5) the network element (NE) layer.
Vendor-specific fault prediction applications exist for the EM layer and the NE layer. The NM layer is the domain of the equipment manufacturer, and because of this it is difficult to integrate multi-vendor products.
Faults in the NM layer are due to the impact of external factors such as busy hour traffic, a cable cut during road construction, or a microwave link failure due to bad weather. A failure results in network downtime, loss of revenue to the service provider, and reduced customer satisfaction. Thus, identifying these and other faults and correcting them before it is too late is a critical task of network management.
With this backdrop of a chaotic network management structure, calls for equipment vendors to be more proactive in resolving these issues with efficient network management systems have been growing. The answer from many equipment manufacturers has been to build and deploy the alarm correlation systems described above. These alarm correlation systems are placed between the network and the NOC.
The older (first generation) alarm correlation systems are more domain expert intensive. Here the fault patterns observed by the experts are implemented as rules in an expert system. The information required for building an expert system is readily available (in the form of experts' knowledge).
However, as discussed above, such first generation systems are incomplete due to the nature of telecommunications today. Because of the complexity of the code and the number of logs that may be generated in a typical fault scenario (around 5,000 a second), it is almost impossible for a group of domain experts to catch any significant number of the faults and take the necessary proactive measures to prevent a network failure. The problems appear continuously, but the expert never gets the opportunity to completely analyze the scenario prior to the next occurrence. This leads to an incomplete knowledge base of failures and proactive actions by the domain expert.
The newer (next generation) alarm correlation systems have incorporated many software methodologies and concepts to systematically search alarm databases, problem ticket databases, etc. and extract patterns not seen by the domain experts. The next generation alarm correlation solutions employ data mining (DM) techniques to identify and learn patterns (commonly referred to as episode rules in data mining terminology). Analogous to the phase where the fault patterns are extracted from domain experts in traditional systems, DM techniques extract fault patterns from alarm databases. These rules are then reviewed by system experts to determine whether each rule is significant or redundant. The resulting patterns are then fed in as rules to an expert system.
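The pattern extraction step described above can be illustrated with a minimal sketch. It is not the method of any of the systems named in this document; it assumes a simplified form of episode rule, "alarm A is followed by alarm B within a time window", and every name in it (mine_pair_rules, the alarm types, the support and confidence thresholds) is hypothetical and chosen only for illustration.

```python
from collections import Counter


def mine_pair_rules(alarms, window, min_support=2, min_confidence=0.6):
    """Find simple 'A is followed by B within `window` seconds' rules.

    `alarms` is a list of (timestamp_seconds, alarm_type) tuples sorted by
    timestamp. This is an illustrative simplification of episode rule
    mining, not a full implementation of any published algorithm.
    """
    antecedent_count = Counter()  # how often each alarm type occurs
    pair_count = Counter()        # how often B follows A inside the window

    for i, (t_a, a) in enumerate(alarms):
        antecedent_count[a] += 1
        seen = set()  # count each follower type once per occurrence of A
        for t_b, b in alarms[i + 1:]:
            if t_b - t_a > window:
                break  # alarms are sorted, so later ones are also outside
            if b != a and b not in seen:
                seen.add(b)
                pair_count[(a, b)] += 1

    rules = []
    for (a, b), n in pair_count.items():
        confidence = n / antecedent_count[a]
        if n >= min_support and confidence >= min_confidence:
            rules.append((a, b, n, confidence))
    return sorted(rules)


# Hypothetical alarm sequence, invented for this example only.
example_alarms = [
    (0, "LINK_DOWN"), (2, "HIGH_BER"),
    (10, "LINK_DOWN"), (11, "HIGH_BER"),
    (30, "LINK_DOWN"), (50, "CARD_FAIL"),
]
rules = mine_pair_rules(example_alarms, window=5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}  support={support}  confidence={confidence:.2f}")
```

In this sketch, the support and confidence thresholds play the role of a first filter; the candidate rules that survive would still be reviewed by system experts before being loaded into an expert system, as described above.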
Several such solutions have been advanced, including sophisticated systems that consist of both traditional and non-traditional alarm correlation systems. These include: (i) the Telecommunications Alarm Sequence Analyzer (TASA) by NOKIA; (ii) ANSWER and ECXpert, two tools created by AT&T; and (iii) IMPACT by GTE.
Many of the current approaches for alarm correlation depend on the expertise of a domain expert to provide the observed network fault patterns. However, as previously discussed, it is not sufficient to depend on domain expertise alone. The rapidly evolving networks continuously alter the existing network topology with the addition of new network elements, new software loads, and network connections. These scenarios pose a serious threat to the expert's knowledge, which to a great extent relies on seeing a pattern over and over again. This opens the field for the utilization of systems that are capable of assisting the domain experts in identifying fault patterns.
It is therefore desirable to have a system for dynamically handling faults in a telecommunications network that is capable of discovering, learning and predicting the recurrent patterns of faults of a network as well as being capable of providing precautionary action. It would be further desirable to have a network alarm correlation system that dynamically and systematically discovers alarm correlation rules that enable root cause analysis, fault prediction and proactive maintenance.
It is therefore one object of the present invention to provide an improved telecommunications network.
It is another object of the present invention to provide a method and system for identifying fault patterns as they occur in a telecommunications network.
It is yet another object of the present invention to provide a method and system for fault prediction and proactive maintenance in a telecommunications network.
The foregoing objects are achieved as is now described. A system for proactive maintenance of a telecommunications network is disclosed. A database is created containing characteristics (parameters) of a plurality of valid logs. These valid logs represent alarms within a network that report status and abnormalities in the network and which have been specifically selected by a network domain expert or administrator from a larger group of logs. The characteristics correspond to a pattern of network fault parameters. The network is monitored for occurrences of a valid log within the telecommunications network. Upon occurrence of a valid log, a fault occurrence is predicted based on an analysis of the valid log and the characteristics found in the database.
In the preferred embodiment, the creation of the database of valid logs is a static function completed by a backtracking algorithm and the network administrators utilizing known operational measurements (OM) and other information. Prediction occurs dynamically once a valid log is detected. Corrective steps are taken and the network administrator is alerted to those steps which may not be sufficient to correct the pending fault.
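The division between the static database of valid logs and the dynamic prediction step can be sketched as follows. This is only an illustration under stated assumptions: the specification does not give the database schema or the prediction criterion at this point, so the record fields, the rule "predict a fault when a valid log recurs a threshold number of times within a time window", and all names (LogProfile, FaultPredictor, LOG_A) are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogProfile:
    """Characteristics stored for one valid log (hypothetical fields)."""
    predicted_fault: str
    threshold: int         # occurrences needed before predicting a fault
    window_s: float        # sliding window length in seconds
    corrective_action: str


# Static step: the database of valid logs, built ahead of time.
# The entry below is invented for illustration.
VALID_LOGS = {
    "LOG_A": LogProfile("trunk group outage", 3, 60.0, "reroute traffic"),
}


class FaultPredictor:
    """Dynamic step: watch the log stream and predict pending faults."""

    def __init__(self, valid_logs):
        self.valid_logs = valid_logs
        self.recent = defaultdict(deque)  # log id -> recent timestamps

    def observe(self, timestamp, log_id):
        profile = self.valid_logs.get(log_id)
        if profile is None:
            return None  # not a valid log, so it is ignored
        times = self.recent[log_id]
        times.append(timestamp)
        # Drop occurrences that have fallen out of the sliding window.
        while times and timestamp - times[0] > profile.window_s:
            times.popleft()
        if len(times) >= profile.threshold:
            return (profile.predicted_fault, profile.corrective_action)
        return None


predictor = FaultPredictor(VALID_LOGS)
events = [(0, "LOG_A"), (5, "NOISE"), (20, "LOG_A"), (45, "LOG_A"), (200, "LOG_A")]
predictions = [predictor.observe(t, log_id) for t, log_id in events]
```

In this sketch, a non-None return value stands for the point at which corrective steps would be taken and the network administrator alerted.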
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.