The present invention relates to networks, and more particularly, to a system and method for troubleshooting a network.
Recent years have seen an explosion in the demand for a variety of network applications and services. For example, as more and more users connect their personal computer systems to the internet, there is an ever-increasing demand placed on the various networks that are used to support the evolving functionality of the internet. As another example, there is also an ever-increasing demand placed on networks used in the telecommunications industry as the industry expands functionality to include carrying both voice and data across telecommunications networks.
Networks typically comprise a number of data processing elements connected together in a variety of configurations to allow communication of information between the elements of the network and across different network groups. The data processing elements in a network may include client computer systems, server computer systems, routers, bridges, telecommunication equipment, or optical communication equipment, to name just a few. Furthermore, advanced data processing elements may comprise both hardware and software subsystems such as, for example, power supply subsystems and hardware and/or software data communication subsystems. The data processing elements, which may also be referred to as network elements, may be connected together in a network in a variety of configurations to receive, transmit, and/or process various types of information, such as voice information, video information, or generalized data, for example.
To meet the ever-increasing demands for performance and functionality, network architectures, network elements, and network element subsystems have grown in complexity. However, as the complexity of networks has increased, the complexity and burden of managing, troubleshooting, and correcting software and hardware faults across the network have also increased. For example, when a subsystem in a network element fails (hardware or software), the impact on the network may not be immediately evident, but the failure may eventually lead to a critical error under certain later-encountered network conditions. Such errors may include the loss of data, a complete network failure, or even possible damage to the equipment. On the other hand, some faults may be less critical, and may only result in the loss of certain functionality or a reduction in performance of the network.
Furthermore, as the complexity of the systems has increased, the quantity and nature of the potential faults have also increased to a level that can be unmanageable. In modern network systems, the number of potential faults that can occur in a system can make it extremely difficult to distinguish critical faults from noncritical faults. Moreover, increased complexity also makes fault correction (i.e., maintenance and/or repair) extremely burdensome. For example, in an optical network, the network elements may be hundreds of kilometers apart. Thus, the inability to quickly and accurately identify and diagnose a network fault can force maintenance technicians or engineers to make many repeated trips across large distances in order to address and eliminate the fault, thereby leading to financial losses from costly network downtime or increased maintenance expenses, or both.
Thus, it is important for network administrators to be able to quickly diagnose problems in the network as they arise. Accordingly, what is needed is an improved system and method for troubleshooting a network.
Embodiments of the present invention provide an improved system and method for troubleshooting a network. In one embodiment, a system for troubleshooting a network comprises a plurality of network elements coupled together to communicate information across the network, a management station coupled to a first network element in the plurality of network elements, a plurality of network element subsystems in each network element, at least one of the plurality of network element subsystems in each network element generating a subsystem alarm in response to a subsystem fault condition, and a distributed network operating system including a plurality of subsystem applications on each network element, at least one of the subsystem applications on each network element, executable on a corresponding subsystem, generating an application alarm in response to receiving the subsystem alarm. The management station signals the distributed network operating system to transmit application alarms and subsystem alarms from each network element, across the network to the first network element, and to the management station for display to a user.
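By way of illustration only, the collection path of this embodiment may be sketched in Python as follows. The sketch is a simplified model under stated assumptions, not the implementation of the described system: the names `Alarm`, `NetworkElement`, `ManagementStation`, `report_fault`, `gather`, and `collect_alarms` are hypothetical, and transmission across the network is replaced by direct in-process calls.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    element: str    # name of the network element that raised the alarm
    subsystem: str  # subsystem within that element
    kind: str       # "subsystem" or "application"


@dataclass
class NetworkElement:
    name: str
    alarms: list = field(default_factory=list)
    peers: list = field(default_factory=list)  # other elements reachable over the network

    def report_fault(self, subsystem):
        # The faulty subsystem raises a subsystem alarm, and the subsystem
        # application responds by raising an application alarm.
        self.alarms.append(Alarm(self.name, subsystem, "subsystem"))
        self.alarms.append(Alarm(self.name, subsystem, "application"))

    def gather(self):
        # Acting as the first network element: return local alarms
        # followed by the alarms held by every peer element.
        collected = list(self.alarms)
        for peer in self.peers:
            collected.extend(peer.alarms)
        return collected


class ManagementStation:
    def __init__(self, first_element):
        # The station is coupled to a single (first) network element.
        self.first_element = first_element

    def collect_alarms(self):
        # Signal the first element to gather alarms from the whole network.
        return self.first_element.gather()


# Hypothetical two-element network with a fault on the remote element.
ne1 = NetworkElement("NE1")
ne2 = NetworkElement("NE2")
ne1.peers.append(ne2)
ne2.report_fault("power_supply")

station = ManagementStation(ne1)
collected = station.collect_alarms()
for alarm in collected:
    print(alarm.element, alarm.subsystem, alarm.kind)
```

In this sketch the management station never contacts the remote element directly; it reaches all alarms through the one element to which it is coupled, which mirrors the arrangement described above.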
In another embodiment, a method of troubleshooting a plurality of network elements in a network under control of a distributed network operating system is provided. Each network element includes a plurality of network element subsystems. The method comprises generating a subsystem alarm for at least one of the plurality of network element subsystems in response to a corresponding subsystem fault, generating an application alarm in a subsystem application, executable on a corresponding network element subsystem, in response to the subsystem alarm, associating text with the application alarm, the associated text describing the subsystem fault, and transmitting the subsystem alarm, the application alarm, and the associated text across the network to a management station coupled to one of the network elements for display to a user.
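By way of illustration only, the steps of this method may be sketched in Python as follows. The sketch rests on assumptions that are not part of the described method: the fault codes, the `FAULT_TEXT` lookup table, the function names, and the use of JSON as the transmission format are all hypothetical and are chosen only to make the sequence of steps concrete.

```python
import json

# Hypothetical table associating fault codes with descriptive text.
FAULT_TEXT = {
    "PSU_UNDERVOLT": "Power supply output voltage below threshold",
    "LINK_LOS": "Loss of signal on optical link",
}


def generate_subsystem_alarm(element, subsystem, fault_code):
    """Generate a subsystem alarm in response to a subsystem fault."""
    return {
        "type": "subsystem",
        "element": element,
        "subsystem": subsystem,
        "fault_code": fault_code,
    }


def generate_application_alarm(subsystem_alarm):
    """Generate an application alarm in response to the subsystem alarm
    and associate text describing the fault with it."""
    text = FAULT_TEXT.get(subsystem_alarm["fault_code"], "Unknown fault")
    return {
        "type": "application",
        "element": subsystem_alarm["element"],
        "subsystem": subsystem_alarm["subsystem"],
        "text": text,
    }


def transmit(subsystem_alarm, application_alarm):
    """Serialize both alarms and the associated text for transmission
    to the management station (JSON is an assumed format)."""
    return json.dumps({
        "subsystem_alarm": subsystem_alarm,
        "application_alarm": application_alarm,
    })


def display(message):
    """Management-station side: decode the message and format it for a user."""
    payload = json.loads(message)
    app = payload["application_alarm"]
    return f"[{app['element']}/{app['subsystem']}] {app['text']}"


sub_alarm = generate_subsystem_alarm("NE2", "power_supply", "PSU_UNDERVOLT")
app_alarm = generate_application_alarm(sub_alarm)
message = transmit(sub_alarm, app_alarm)
line = display(message)
print(line)
```

Keeping the descriptive text with the application alarm, rather than looking it up at the management station, means the station can display a fault description without knowing the fault codes of every subsystem type in the network.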
The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of the present invention.