This invention relates generally to communication network management and, more particularly, to a system and method of scaling the management functions in an expanding communications network by replicating functionally complete subsystems of a fixed maximum size. The simple replication process permits expansion of the network without changing the scope of subsystem responsibilities.
Modern communication networks can be composed of millions of functional elements, which can be hardware such as switches or multiplexers, geographically dispersed across thousands of miles of service territory. Managing such a network means providing for redundant call routing and responding to local emergencies. It is well known for a communications network to tightly monitor the individual phones, switch elements, relays, base stations, and the like. Monitoring the communication network elements yields information concerning the health, maintenance, current activity, performance, and security of these elements. Such information is collected at the local levels in the network, then processed and analyzed at higher levels of management.
Additionally, the monitoring and diagnostic functions of communication network elements can be organized along specialized areas of focus, or network management tasks. For optimum performance, the information should efficiently summarize activity occurring at local levels in the network for use by administrators who manage the communications network from a regional or national perspective. It can be difficult to coordinate all the areas of narrowed focus into a comprehensive picture of network problems at the higher levels. The administrator has the difficult task of analyzing problems occurring in network elements (NEs) through whatever filtering or processing functions the network imposes between the administrator and the NEs.
The International Telecommunications Union-Telecommunications Standardization Sector (ITU-T) Telecommunications Management Network (TMN) suggests a five-layer management structure. The lowest level is the Network Element Layer (NEL), including switches and transmission distribution equipment. Above the NEL is the Element Management Layer (EML), which manages the lower level elements, dealing with issues such as capacity and congestion. The Network Management Layer (NML) is concerned with managing the communication network systems associated with the NEL and EML. The Service Management Layer (SML) manages the services that are offered to the customers of the network, while the Business Management Layer (BML) on top manages the business and sets goals with respect to the customer and government agencies.
Networks are typically composed of NEs from a large variety of different vendors. Therefore, there are a variety of Element Management Systems (EMS) to support communications with the NE types. The Network Management System (NMS) must interface with divergent EMS level equipment and protocols. It is the NMS systems that are responsible for controlling the communications network and keeping it functioning on a day-to-day basis. Network management can be briefly described as the task of command, control and monitoring of the network.
The ITU-T also divides management into five Operations Support Systems (OSS) areas of interest. They are: Fault Management; Configuration Management; Account Management; Performance Management; and Security Management, which are collectively referred to as FCAPS. As is well understood in the art, Fault Management is concerned with detecting network equipment problems, responding to detected problems, fixing the problems, and putting the network back into working order. Fault monitoring is usually done by receiving events from lower levels in the network indicating a fault and processing these events. This task can be very complex for large networks due to the relationships between the network elements, such as remote telephones, and the very high rate of events that must be handled. Software systems must be designed and built to handle these large data streams and provide effective fault management features.
Configuration Management is concerned with databases, backup systems, and provisioning and enablement of new network resources. That is, Configuration Management is the task of configuring the network to provide services between the various network elements. Configuring the network involves sending messages to the network elements. These messages set parameter values that permit signal paths to be established between elements, and they control the behavior of these elements. The nature of modern networks makes this a complex task best handled by software.
Account Management bills the network customer for services rendered. Account Management is the task of collecting the record of services used by network elements. Usage information generates billing data that makes up the revenue stream for the service provider.
Performance Management is concerned with collecting and analyzing data that indicates how well the system is working. Performance Management involves collecting information from the network elements, which act as a measure of network performance. This “quality” measurement is critical for service providers as it defines how well they are providing service to their customers. This task is typically achieved by directly polling network elements, or otherwise receiving events from elements which convey such data.
Security Management controls and enables NE functions. Security Management is the task of managing security, including authentication and encryption, in the services provided to the end customer. Portions of each FCAPS function are performed at every layer of the TMN architecture.
The Fault Management System is one of the most critical systems in the network to control. Intelligent NEs, able to perform self-diagnosis, may provide a precise error message to the NMS. However, many NEs merely send an alarm when a problem occurs. These problems include switch failures, loss of power, line failure, and loss of RF coverage (for wireless systems). The NMS system collects the alarm data for analysis. For example, an analysis could be performed to determine a common failure mode among NEs in close physical proximity. The NMS could then issue a repair directive in response to the analysis. Intruder detection and interlock switch detection are examples of some security management issues that could be reported to the NMS by NEs.
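The proximity analysis mentioned above can be illustrated with a minimal sketch. It assumes each alarm carries an NE identifier, a site label standing in for physical proximity, and an alarm kind. The names `Alarm`, `common_failures`, and `min_count` are hypothetical and are not taken from any particular NMS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    ne_id: str  # hypothetical NE identifier
    site: str   # hypothetical location label, used as a proxy for proximity
    kind: str   # e.g. "loss_of_power", "line_failure"


def common_failures(alarms, min_count=2):
    """Map each site to the alarm kind reported by the most distinct NEs there.

    A site appears in the result only if at least min_count distinct NEs
    at that site reported the same kind of alarm.
    """
    # Collect the distinct NEs that raised each (site, kind) combination.
    nes_by_site_kind = defaultdict(set)
    for alarm in alarms:
        nes_by_site_kind[(alarm.site, alarm.kind)].add(alarm.ne_id)

    result = {}
    for (site, kind), nes in nes_by_site_kind.items():
        if len(nes) < min_count:
            continue
        current = result.get(site)
        # Keep the kind shared by the largest number of NEs at this site.
        if current is None or len(nes) > len(nes_by_site_kind[(site, current)]):
            result[site] = kind
    return result
```

An NMS could use a result of this kind as the basis for a single repair directive for a site, rather than one directive per alarming NE.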
Modern networks are both large and complex, and require the use of software for their management. An NMS is the conglomeration of hardware and software functions required to manage and control large voice and data communication networks. NMS systems are also used for the control and provisioning of heterogeneous networks. The design of the NMS software typically follows the functional areas outlined above. Today's NMSs are typically distributed systems using multiple software processes running on multiple workstations to handle the various areas of management.
FIG. 1 shows the block diagram of a typical NMS (prior art). As the figure indicates, the NMS components typically send messages to each other to accomplish the management task. They also receive events from the network over an event channel. This channel itself is a software entity like any of the other functional pieces.
The NMS is a very critical piece of the entire communications network. It is the main tool for the service provider to ensure that the network is performing optimally, and that the customers are happy with the service they receive. The system must also permit rapid configuration of the network when new customers are added. All these tasks must be performed at the highest levels of performance and quality, even as the network grows in size. Service providers spend large amounts of money to come up with solutions that meet their needs. However, the task of designing and building a highly scalable NMS is a very challenging one.
Designing and building a good, highly scalable NMS is not an exact science. There are two main reasons for this. First, the traffic patterns of very large and complex networks cannot be easily modeled. Second, the traffic patterns of large and complex networks cannot be accurately simulated in a lab. Therefore, NMS designers must provide solutions for problems that are not well defined or easily modeled. Gross assumptions must be made on how the network will scale in size, and what effect this scale has on the network management tasks. A design strategy must be adopted based on these assumptions. When these systems are deployed in the field, many of the assumptions turn out to be erroneous, resulting in poor performance of the NMS.
As a result of a poorly performing NMS, the service provider is hurt in two ways. First, the customer experiences the dissatisfaction of interfacing with a poorly performing system. Potentially, customers can be lost if service is inadequate. Second, the service provider receives a poor return on their substantial investment in the NMS.
Apart from building the NMS on flawed assumptions, NMS designers can make design choices which exacerbate the problem. In some network designs, the cost of hardware can be cheaper than software, when the development and maintenance costs of the software are factored in. Regardless of the design philosophy, network expenditures are rarely viable if the underlying characterizations of the problems are inaccurate.
When analyzing the NMS design to meet the issue of scalability, the key issue is how well the network will perform as the number of system elements increases. Designers must make decisions on which component pieces of the system will be the least scalable. These potentially unscalable pieces are typically replicated, and multiple copies of that process are prepared.
FIG. 2 illustrates an example of a system function that is replicated to address the issue of scalability (prior art). For example, if the Fault Management (FM) process is considered to be the least scalable piece of the system, a decision may be made to divide the managed network along some logical boundary and run multiple instances of the FM, with each FM being assigned to a different division of the network. However, all the other processes that need to interact with an FM must now be designed to be aware of the fact that there are multiple copies of the FM. A complicated policy of routing requests to different FM modules in the network is required. Further, a framework must be put in place to inform these processes when additional instances of the FM are started to handle network load. This makes the overall design of the system more complex. This complexity also makes the testing of the design more difficult and error prone.
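The extra machinery this prior-art approach demands can be sketched as follows. The sketch assumes the network is divided by a simple division label. `FaultManager`, `FmRouter`, and their methods are hypothetical names introduced only for illustration, not part of any described system.

```python
class FaultManager:
    """Stand-in for one FM instance responsible for one network division."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def handle(self, ne_id, event):
        self.events.append((ne_id, event))


class FmRouter:
    """Routing policy that every process talking to an FM would have to share."""

    def __init__(self):
        self._fm_by_division = {}
        self._listeners = []

    def subscribe(self, callback):
        # Other processes register here to learn when a new FM instance starts.
        self._listeners.append(callback)

    def register(self, division, fm):
        # Assign an FM instance to a division and notify every subscriber.
        self._fm_by_division[division] = fm
        for callback in self._listeners:
            callback(division, fm)

    def route(self, division, ne_id, event):
        # Deliver the event to the FM instance that owns the division.
        fm = self._fm_by_division.get(division)
        if fm is None:
            raise LookupError(f"no FM instance assigned to division {division!r}")
        fm.handle(ne_id, event)
        return fm
```

Even in this reduced form, the routing table and the notification list are additional shared state that must be designed, kept consistent, and tested alongside the FM itself.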
In the above example, an assumption was made to make the FM the unit of replication, in response to the increased system size. If the assumption is wrong, then the original problem of scalability remains unaddressed, causing a very poor return on investment for NMS system expenditures.
In the example presented above, the FM may potentially be multi-threaded to increase its performance. As is well known, multi-threading permits an operating system to simultaneously execute different parts (threads) of a program. Software multi-threading is another common technique employed to increase load handling capacity. However, it is difficult to run threads simultaneously without interference, and multi-threading is not always practical if incorrect assumptions are made in the analysis phase.
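The interference problem can be made concrete with a minimal sketch in which several threads update a shared alarm tally. The names `AlarmCounter` and `run_workers` are hypothetical. The lock shown is one common way to serialize the update, not a technique taken from the text.

```python
import threading


class AlarmCounter:
    """Shared tally of received alarms (hypothetical, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def record(self):
        # The lock makes the read-modify-write of count atomic across threads.
        with self._lock:
            self.count += 1


def run_workers(counter, workers=4, events_per_worker=1000):
    """Start several threads that each record a fixed number of alarms."""

    def work():
        for _ in range(events_per_worker):
            counter.record()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    # Wait for every worker to finish before reading the final tally.
    for thread in threads:
        thread.join()
    return counter.count
```

Every piece of state shared between threads needs protection of this kind. Omitting it in even one place can produce lost updates that appear only under load, which is part of why such software is hard to test completely.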
Multi-threading is a powerful technique but comes at a large cost. Designing and developing multi-threaded software is acknowledged by the industry and academia to be a very complex task. The resulting software is very hard to test completely. Further, the number of software developers that have the skill set to write multi-threaded software is very limited. Such designers are typically senior, at the high end of the pay scale. In many cases, multi-threading is not a safe option, as when the software has been developed by a third party.
It would be advantageous if a method could be developed of scaling a communications network to a larger size without having to redesign or otherwise modify the network management functions.
It would be advantageous if an NMS could be grown to a larger size using the same functional subsystems that were developed for the original NMS.
It would be advantageous if network management functions could be updated or tested in small manageable sections, so that the entire NMS did not have to be shut down or modified.