1. Field of the Invention
This invention pertains to the arts of computer network management, and especially to the management of network bandwidth consumed by network management, status, and maintenance messages. More particularly, this invention relates to the arts of intelligent processing and diagnosis of network failures and problems based on fault analysis logic to more accurately detect and isolate computer network problems, to minimize the network bandwidth consumed by maintenance messages, and to effectively notify maintenance personnel of the most likely point of failure.
2. Description of the Related Art
Computer networks, such as local area networks (“LAN”), wide-area networks (“WAN”), intranets and the Internet, typically include substantial maintenance and monitoring capabilities. Modern telephone networks, such as Signalling System 7 (“SS7”), Integrated Services Digital Network (“ISDN”), and many digital cellular networks including GSM, also include substantial equipment and software which are dedicated to the provisioning, monitoring and maintenance of the network and its equipment. All of the above-named networks are packet-based networks, and are well-known within their respective arts.
Key to the architecture and operation of these networks are packet routers, which interconnect multiple physical networks and provide routing and forwarding of packets, or “messages”, from one network to another based upon addressing schemes defined by well-known protocols such as the Internet Protocol (“IP”) or LAPD for SS7 and ISDN. These addressing schemes can be generalized as schemes which define each data packet or message as having a header, payload, and tail. The destination address, origination address, packet sequence number, and payload size are typically included in the header section of the message. The payload section contains the actual computer data which is being transferred from one computer to another via the computer network, which may represent a portion of a computer file, a formatted message, or a section of a digitized signal such as voice, other audio, or video. The various message formats are defined by well-known standards promulgated by InterNIC, the International Telecommunications Union, Bellcore, and ANSI.
In order to manage these networks, including monitoring of network operation status, configuring and re-configuring network elements (routers, terminals and switches), and provisioning of new network sections, a number of well-known software and hardware products have been developed and placed on the market. Most of these products integrate specialized software onto network server platforms. The software uses the network connectivity and bandwidth provided by the network server platform to perform maintenance testing, messaging, status checking, and alert messaging. Many times, the actual network being used for “real” traffic, such as computer file transmission or telephone call transmission, is used for the maintenance communications as well. In this case, the maintenance messages “mix in” with the bandwidth of the “real” traffic. As such, if maintenance messages accumulate to significant bandwidth consumption, network performance may be adversely affected. In other cases, separate networks dedicated to maintenance may be configured to avoid this problem. But, even so, if maintenance messages exceed an expected bandwidth level, the dedicated maintenance network may fail.
When network management software, such as NetView/6000, Hewlett-Packard's OpenView, and others, detects that a network device such as a router has gone off-line, it sends “node down” events or messages for all the workstations connected downstream from the off-line router to a network problem management server. The network problem management server provides correlation and processing for opening trouble tickets, and eventually it sends alerts to the appropriate maintenance personnel through pagers, e-mail, and/or telephone calls.
FIG. 1 shows the topology of prior art maintenance systems. A router (1) may have multiple ports to multiple networks. Each port is serviced by a network interface card (“NIC”), such as an Ethernet LAN interface card. FIG. 1 shows an example of a router serving three networks, A, B, and C, each of which is a group of networked computer workstations or personal computers. For example, network A (5) has several “drops” to computers, and one drop or connection (6) to the router. Likewise, network B (4) is connected (3) to the router, and network C (2) is connected (7) to the router. Packets or messages received by the router are forwarded to other networks based on the addressing scheme of the network, such as IP in the case of many computer networks.
Also shown in FIG. 1 is a connection (8) to a maintenance server (9) such as a NetView 6000 server. In this example, this connection (8) connects to the router (1) using the router's NIC for network D. The maintenance server (9) typically contains a connectivity database which contains the network addresses of all the elements on the other networks connected to the router, such as all the computers connected to networks A, B, and C. Using this database, the maintenance server (9) periodically sends status query messages, or “pings”, to each of the computers. If each computer is on-line, the router is functioning properly, and the network physical media (cable, RF links, etc.) is intact, a reply will be received from each computer nearly immediately in response to the “ping”. If a reply or response is not received within a certain time from transmission of the “ping”, the maintenance server (9) may assume that a problem exists with the computer, router, or network(s).
For example, if all computers and the router are functioning correctly except for one computer, then only one response will not be received, and all other responses will be received. However, if the router fails, no responses will be received from any of the computers. In the most basic of maintenance system configurations, such as the basic NetView 6000 product, this scenario can result in a storm of events being sent to the problem management server which correlates events and opens trouble tickets, leading to many useless and/or redundant e-mails and pager messages.
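The polling behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NetView 6000 implementation: the function name `poll_once`, the `probe` callback, the addresses, and the two simulated probes are all hypothetical, and no real network traffic is generated.

```python
from typing import Callable, Dict, List


def poll_once(connectivity_db: Dict[str, List[str]],
              probe: Callable[[str, float], bool],
              timeout_s: float) -> List[str]:
    """Probe every address in the database once and return those that
    did not reply within timeout_s (the time limit for a "ping")."""
    unresponsive = []
    for addresses in connectivity_db.values():
        for address in addresses:
            if not probe(address, timeout_s):
                unresponsive.append(address)
    return unresponsive


# Hypothetical connectivity database for networks A, B, and C.
connectivity_db = {
    "A": ["10.0.1.10", "10.0.1.11"],
    "B": ["10.0.2.10"],
    "C": ["10.0.3.10", "10.0.3.11"],
}


def probe_one_pc_down(address: str, timeout_s: float) -> bool:
    # Simulated probe: a single workstation fails to answer.
    return address != "10.0.2.10"


def probe_router_down(address: str, timeout_s: float) -> bool:
    # Simulated probe: the router has failed, so nothing answers.
    return False


# One failed workstation produces one "down" event.
print(poll_once(connectivity_db, probe_one_pc_down, 1.0))
# A failed router produces one "down" event per workstation (the event storm).
print(len(poll_once(connectivity_db, probe_router_down, 1.0)))
```

In the second case the maintenance server cannot tell, from the replies alone, whether five workstations or one router has failed, which is why every workstation is reported individually.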
FIG. 2 illustrates this scenario. A normal “ping” (20) is forwarded from the NetView 6000 to the router, which forwards (21) it to the appropriate PC. The PC, if functioning properly, replies (22) via the router to the NetView 6000 (23) within a predetermined time limit t1. If the router has failed, the “ping” (24) will not be answered by any of the computers within time t1, which will result in the NetView 6000 sending multiple “computer down” messages (25) to the problem management server. The problem management server is configured to wait a period of time t3 before escalating the event to notification of the maintenance personnel, in order to reduce the number of alerts made for temporary problems such as power glitches, computer reboots, etc. But, if no “computer up” messages are received within time limit t3, the problem management server will send multiple pager messages and telephone calls, and may open multiple trouble tickets (26), as many as one per computer on the network. This results in the alerting of the maintenance personnel, but leaves the personnel confused as to which element has actually failed. Additionally, the network link between the NetView 6000 server and the problem management server has suffered unnecessary bandwidth consumption by all of the “computer down” messages.
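The escalation hold-off described for FIG. 2 can be illustrated with a short Python sketch. It assumes a simplified event model in which each event is a (timestamp, element, state) tuple; the function name `escalate`, the element names, and the 60-second value chosen for t3 are hypothetical and are not taken from any product.

```python
from typing import Dict, List, Tuple

# (timestamp in seconds, element name, "down" or "up")
Event = Tuple[float, str, str]


def escalate(events: List[Event], t3_s: float, now_s: float) -> List[str]:
    """Return the elements whose "down" state has persisted for at least
    t3_s without being cancelled by a later "up" event."""
    down_since: Dict[str, float] = {}
    for timestamp, element, state in sorted(events):
        if state == "down":
            # Remember the first time the element was reported down.
            down_since.setdefault(element, timestamp)
        elif state == "up":
            # An "up" event cancels any pending escalation.
            down_since.pop(element, None)
    return sorted(name for name, since in down_since.items()
                  if now_s - since >= t3_s)


events = [
    (0.0, "pc-1", "down"),
    (0.0, "pc-2", "down"),
    (0.0, "pc-3", "down"),
    (20.0, "pc-2", "up"),   # e.g. a reboot that completed within t3
]

print(escalate(events, t3_s=60.0, now_s=30.0))  # nothing escalated yet
print(escalate(events, t3_s=60.0, now_s=60.0))  # pc-1 and pc-3 escalated
```

The hold-off suppresses alerts for transient outages, but when a router fails none of the workstations recover, so every one of them is escalated separately.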
In an enhancement of the prior art network management technology, a product called Tivoli for Network Connectivity module (TFNC) by International Business Machines (“IBM”) employs a similar concept, but it adds some intelligent processing to the maintenance server. With TFNC, all of the original “computer down” messages will be sent to the problem management server, but, as shown in FIG. 3, the Tivoli processing (30) will examine the network topology and determine that all of these failures are likely due to a single point failure, namely a router failure. So, within the escalation time period t3, TFNC will send multiple “computer up” messages (31) to the problem management server, which results in a net status of only the “router down” message being escalated by the problem management server. While this enhancement to the network maintenance technology produces a desirable reduction in the number of alerts (pager messages, trouble tickets, etc.) (32) issued to maintenance personnel, it does not reduce the bandwidth consumed by the messages on the network between the maintenance server (TFNC and NetView 6000) and the problem management server. Rather, it nearly doubles the bandwidth consumption.
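The topology-based reasoning and its bandwidth cost can be made concrete with a small Python sketch. It is not IBM's algorithm: the rule used here (attribute the failure to the router only when every workstation behind it is down), the function names, and the message-count model (one “computer down” and one “computer up” message per workstation, plus a single “router down” message) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def likely_root_causes(topology: Dict[str, List[str]],
                       down: Set[str]) -> List[str]:
    """Collapse per-workstation failures into a single router failure when
    every workstation behind that router is down; otherwise report the
    failed workstations individually."""
    causes: List[str] = []
    for router, workstations in topology.items():
        failed = [w for w in workstations if w in down]
        if workstations and len(failed) == len(workstations):
            causes.append(router)
        else:
            causes.extend(failed)
    return causes


def messages_basic(n_workstations: int) -> int:
    # One "computer down" message per workstation.
    return n_workstations


def messages_with_retraction(n_workstations: int) -> int:
    # "computer down" per workstation, one "router down",
    # then one "computer up" per workstation to retract the earlier events.
    return n_workstations + 1 + n_workstations


topology = {"router-1": ["pc-1", "pc-2", "pc-3"]}

print(likely_root_causes(topology, {"pc-1", "pc-2", "pc-3"}))  # router only
print(likely_root_causes(topology, {"pc-2"}))                  # one workstation
print(messages_basic(3), messages_with_retraction(3))
```

Under this model the number of alerts falls to one, while the number of messages crossing the link to the problem management server grows from N to 2N + 1, which is the roughly twofold increase noted above.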
Therefore, there is a need in the art for a system and method which intelligently processes the “ping” response pattern in a timely manner, and which issues a minimal number of “network element down” messages which precisely isolate the most likely point of failure, in order to minimize network bandwidth consumption and to minimize redundant and incorrect maintenance alerts.