1. Technical Field of the Invention
The present invention relates in general to the communications field and, in particular, to a system and method for distributing control of the fault management functions throughout a communications network.
2. Description of Related Art
The tasks of managing and controlling the performance of distributed communications networks (e.g., distributed data networks or distributed telecommunications networks) are becoming increasingly complex due to a number of crucial factors, such as, for example, the increased complexity, dynamism and diversity of the network technologies, the spread of advanced services with very distinct requirements (e.g., live video, file transfers, etc.), and the heightened expectations of the users being served. Other crucial factors that impact network complexity are the progressive deregulation of the telecommunications industry, and the highly competitive market that has emerged as a result.
In order to survive in such an environment, a distributed communications network operator must manage the network so that its utilization is maximized (i.e., ensure a maximum return on the investment), while ensuring that all offered services perform within expected bounds. In order to perform such tasks, the operator's personnel need certain support tools that help them to manage the tasks with their complexities. In particular, certain distributed, dynamically changing networks, such as, for example, the next generation Internet and so-called third generation mobile communication networks, will require a level of operational support that is not provided by today's support systems.
Operation and Support Systems (OSS) typically function to perform routine support tasks in data communications and telecommunications systems, such as, for example, traffic measurements, network supervision and performance management, analyses, fault diagnoses, administrative tasks, etc. The current approach used for network performance and fault management in the OSS industry typically involves a number of applications residing on a software platform. The software platform usually supports separate applications for monitoring network performance information, managing alarm conditions, and handling of common functions in order to initiate management operations for network resources. Normally, these applications are not integrated to a great extent, other than that they share the same platform facilities. Consequently, it is the operator who has to correlate the performance and alarm information, and where necessary, decide what actions are appropriate to take with regard to improving network performance.
As such, most of the support systems involved are centralized in a single, monolithic management center, or in some cases, distributed or spread across a relatively small number of geographically distinct management centers. In some of the distributed system cases, the main reason for the distribution is the distributed nature of the responsibilities in the corporate organizations involved.
Currently, in a typical telecommunication system, the network element of the system gathers statistics about the traffic it is handling over a five or fifteen minute interval. The network element then makes this information available to the system as an output file, or stores it locally for later retrieval. As such, two of the original motives for structuring the telecommunication system performance measurement activities in this way were to minimize the sheer volume of information generated, and reduce the network element's processor load. Typically, the performance information is retrieved by a network element's management system, and stored in a database from which performance reports can be generated, either periodically or on demand. The result, however, is that network performance information is not available in real time.
Detailed fault information (e.g., alarms), for both hardware and software faults, is also gathered by the various network elements and is sent up to a centralized fault management node, which is responsible for alarm filtering and alarm correlation. The central fault management node is also used to suggest actions to correct or to otherwise reduce the effect of the faults in response to an alarm or to a combination of alarms. In some cases, more or less intricate knowledge-based systems are designed to aid the operator with fault diagnosis. Existing fault management systems, however, generally rely upon operator input and are incapable of automatically correcting the faults or of reconfiguring the managed system, if needed. Moreover, because fault information is not available in real time and because the fault management process relies upon operator input, fault management systems are generally unable to react to and handle faults in real time.
Data and telecommunication networks are becoming increasingly complex to manage in terms of their scale, the diversity of the networks and services they provide, and the resulting voluminous amount of information that must be handled by the fault management system. In order to address these complexities, certain semi-automated and automated fault management solutions will be needed to support a network operator's staff. Such support capabilities actually do not exist (to any significant extent) in the fault management solutions provided today.
Specifically, today's fault management systems effectively introduce an inherent latency or delay in the availability of alarms and other fault information. Consequently, these delays effectively limit the ability of network managers to respond to faults within their networks. Clearly, in operating dynamic telecommunication networks such as cellular networks, Internets, and broadband multi-media networks, these delays in identifying and resolving network faults are unacceptable. Furthermore, as the network fault management systems become increasingly automated, such delays in the delivery of fault information will become increasingly unacceptable. Instead, the fault detection intervals used should be dictated by the timing requirements of the problem domain, rather than by the solutions the network elements provide today.
In addition, today's telecommunication network management systems are deployed in a relatively small number of locations in the network. In other words, the fault management functions are centralized in a small number of network nodes. Although it might theoretically be possible to build real-time capabilities into a centralized management system, there are some problems that would exist in such a system. First, unacceptably large amounts of bandwidth would be consumed by the alarm information that must be sent to the highest level of the fault management system. The large volume of alarm data that can be generated as the size of the communications system increases will also tend to cause the central processing of such data to become slow. Another problem with maintaining all of the fault management functions at a fully centralized operation and management (O&M) system is that the system lacks robustness; if the centralized O&M system breaks down, handling of fault management tasks will be suspended.
The present invention comprises a system and method for performing distributed fault management functions in a communications network. The communications network includes a plurality of nodes. In a cellular telecommunications network, for example, the nodes are usually arranged in a hierarchy and can comprise physical devices, such as base stations, radio network controllers, radio network managers, and the like. In accordance with the present invention, each node generally includes a fault agent and an associated configuration agent. In addition, each node can also be used to supervise one or more network resources, which can comprise logical resources (e.g., a cell border, a location area, etc.) and/or physical devices.
As faults are detected in the communications network, alarms are sent from one node to the next by the fault agents that reside in the various nodes. In particular, when a fault agent receives alarm data, either from a subordinate fault agent or from a network resource, the fault agent analyzes the received alarm data to identify a cause of the alarm and to determine if the underlying fault that caused the alarm can be handled at the current node. If not, then the fault agent produces a new alarm, which summarizes the received alarm data, and passes the new alarm to an interconnected fault agent.
Once the alarm data reaches a fault agent at which the underlying fault can be handled, the fault agent forwards the fault information to the associated configuration agent. The configuration agent processes the fault information to generate reconfiguration data for correcting or otherwise reducing the effect of the fault and sends the reconfiguration data to a network resource (ordering it to perform some action), or to at least one subordinate configuration agent, depending on which nodes are affected by the proposed reconfiguration. Each subordinate configuration agent that receives the reconfiguration data generally performs further processing to generate more detailed reconfiguration data, which is then passed on to even lower level configuration agents or to underlying network resources. This process repeats until the reconfiguration is fully implemented.
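The downward flow of reconfiguration data can be sketched in the same spirit. The following Python fragment is only an illustration under stated assumptions: `ConfigAgent`, `Resource`, and the string-based "orders" are hypothetical, "more detailed" data is modeled simply by appending the node name to the order, and every subordinate is treated as affected, whereas the text above selects recipients according to which nodes the reconfiguration actually touches.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """A network resource that carries out the orders it receives."""
    name: str
    applied: List[str] = field(default_factory=list)

    def apply(self, order: str) -> None:
        self.applied.append(order)


@dataclass
class ConfigAgent:
    name: str
    children: List["ConfigAgent"] = field(default_factory=list)  # subordinate agents
    resources: List[Resource] = field(default_factory=list)      # supervised resources

    def reconfigure(self, order: str) -> None:
        # Stand-in for "further processing": refine the order at this level.
        detailed = f"{order}/{self.name}"
        for resource in self.resources:
            resource.apply(detailed)       # order the resource to act
        for child in self.children:
            child.reconfigure(detailed)    # pass refined data to lower levels


# Usage: one high-level order fans out through two base stations to their cells.
cell_a, cell_b = Resource("cell_a"), Resource("cell_b")
rnc = ConfigAgent("rnc", children=[
    ConfigAgent("bts1", resources=[cell_a]),
    ConfigAgent("bts2", resources=[cell_b]),
])
rnc.reconfigure("raise_power")
print(cell_a.applied)  # prints ['raise_power/rnc/bts1']
```

The recursion ends when an agent has no subordinates left, which corresponds to the point at which the reconfiguration is fully implemented.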
Preferably, the fault agents include an event receiver for receiving alarm data from subordinate fault agents or from underlying network resources, an event generator for performing alarm correlation and alarm filtering functions to identify the cause of the alarm (i.e., the underlying fault), and an event dispatcher. When the event generator identifies the cause of a particular alarm or set of alarms, the event generator updates the fault information in an event database, and, as a result, the event dispatcher sends the fault information to the associated configuration agent (if it is determined that the current node can handle the fault) or to a supervising fault agent, as higher level alarm data (if it is determined that the current node cannot handle the fault).
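One possible way to picture the internal structure of a fault agent is as a small pipeline in which a database update triggers the dispatcher. The Python sketch below assumes a callback-based event database, dictionary-shaped alarms with `source` and `type` keys, and toy filtering and correlation rules; none of these details come from the specification.

```python
from collections import Counter
from typing import Callable, Dict, List


class EventDatabase:
    """Stores fault records and notifies one subscriber on every update."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}
        self.on_update: Callable[[dict], None] = lambda record: None

    def update(self, key: str, record: dict) -> None:
        self.records[key] = record
        self.on_update(record)  # the update itself triggers the dispatcher


class FaultAgentPipeline:
    def __init__(self, can_handle: Callable[[str], bool],
                 to_config_agent: Callable[[dict], None],
                 to_supervisor: Callable[[dict], None]) -> None:
        self.queue: List[dict] = []   # alarms buffered by the event receiver
        self.can_handle = can_handle
        self.to_config_agent = to_config_agent
        self.to_supervisor = to_supervisor
        self.db = EventDatabase()
        self.db.on_update = self.dispatch

    def receive(self, alarm: dict) -> None:
        """Event receiver: accept alarm data from below."""
        self.queue.append(alarm)

    def generate(self) -> None:
        """Event generator: filter, correlate, and record the identified cause."""
        if not self.queue:
            return
        # Filtering (toy rule): drop duplicate (source, type) pairs.
        unique = sorted({(a["source"], a["type"]) for a in self.queue})
        # Correlation (toy rule): the most common alarm type is the cause.
        cause, _ = Counter(t for _, t in unique).most_common(1)[0]
        record = {"cause": cause, "alarm_count": len(self.queue)}
        self.queue.clear()
        self.db.update(cause, record)

    def dispatch(self, record: dict) -> None:
        """Event dispatcher: route fault information locally or upward."""
        if self.can_handle(record["cause"]):
            self.to_config_agent(record)
        else:
            self.to_supervisor(record)


# Usage: this node can handle "link_down" but must escalate anything else.
local, escalated = [], []
agent = FaultAgentPipeline(lambda cause: cause == "link_down",
                           local.append, escalated.append)
agent.receive({"source": "port1", "type": "link_down"})
agent.receive({"source": "port1", "type": "link_down"})  # duplicate alarm
agent.generate()
print(local)  # prints [{'cause': 'link_down', 'alarm_count': 2}]
```

Wiring the dispatcher to the database rather than calling it directly from the generator mirrors the "as a result" relationship in the paragraph above, though other designs would satisfy the description equally well.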
Similarly, the configuration agents include an event receiver for receiving both fault information from associated fault agents and high level reconfiguration data from higher level configuration agents. The configuration agents further include an event generator for generating the reconfiguration data necessary to reduce the effect of a detected fault and for generating more detailed reconfiguration data from the high level data that is received from supervising configuration agents, as well as an event dispatcher. Once the reconfiguration data is generated, the event generator updates the configuration information that is stored in an event database, and, as a result, the event dispatcher sends the updated configuration information to the subordinate configuration agents and/or to underlying network resources for implementation.
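A configuration agent that accepts both kinds of input might be sketched as follows. The rule tables `FAULT_RULES` and `REFINEMENTS`, the event dictionaries with a `kind` key, and the `send_down` callback are hypothetical; the specification does not state how reconfiguration data is derived, so the lookups here merely stand in for that processing.

```python
from typing import Callable, Dict, List

# Hypothetical rule tables (assumptions made for illustration only).
FAULT_RULES = {"cell_outage": "extend_neighbor_coverage"}
REFINEMENTS = {
    "extend_neighbor_coverage": ["increase_tx_power", "adjust_antenna_tilt"],
}


class ConfigAgentPipeline:
    def __init__(self, send_down: Callable[[str], None]) -> None:
        self.inbox: List[dict] = []               # filled by the event receiver
        self.config_db: Dict[str, List[str]] = {}  # stored configuration information
        self.send_down = send_down                # path to subordinates or resources

    def receive(self, event: dict) -> None:
        """Event receiver: accepts fault information or a high level order."""
        self.inbox.append(event)

    def generate(self) -> None:
        """Event generator: derive a plan, then refine it into detailed steps."""
        while self.inbox:
            event = self.inbox.pop(0)
            if event["kind"] == "fault":
                # Fault information from the associated fault agent.
                plan = FAULT_RULES[event["cause"]]
            else:
                # High level reconfiguration data from a supervising agent.
                plan = event["order"]
            steps = REFINEMENTS.get(plan, [plan])  # unknown plans pass through as-is
            self._update(plan, steps)

    def _update(self, plan: str, steps: List[str]) -> None:
        """Database update, followed by the event dispatcher's delivery."""
        self.config_db[plan] = steps
        for step in steps:
            self.send_down(step)


# Usage: a fault report and a supervisor's order go through the same pipeline.
sent: List[str] = []
agent = ConfigAgentPipeline(sent.append)
agent.receive({"kind": "fault", "cause": "cell_outage"})
agent.receive({"kind": "order", "order": "restart_unit"})
agent.generate()
print(sent)  # prints ['increase_tx_power', 'adjust_antenna_tilt', 'restart_unit']
```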