The present invention pertains generally to the field of high-availability networks. More specifically, the present invention pertains to methods and systems for measuring and improving availability of networks.
The Internet is a worldwide collection of computer networks and gateways that generally use the TCP/IP suite of protocols to communicate with one another. The Internet allows easy access to media and data from a variety of sources, and is capable of delivering this information to the users wherever they may be. Some of the myriad functions possible over the Internet include sending and receiving electronic mail (e-mail) messages, logging into and participating in live discussions, playing games in real-time, viewing pictures, watching streaming video, listening to music, going shopping on-line, browsing different web sites, downloading and/or uploading files, etc.
Communication networks, such as the Internet, typically consist of routers and switches that route data and messages from one node of the network to another. Multiple paths usually exist between any two nodes of the network. Thus, even when some nodes of the network malfunction (e.g., go down), communications across the network generally remain unaffected. Unfortunately, however, even for the most perfectly designed networks, communications across parts of the network may break down when critical nodes malfunction.
In order to improve network reliability and stability, network engineers must be able to quantify and measure network availability. After the availability of the network is quantified and measured, network engineers can then make recommendations as to how to improve it. Measuring true network availability, however, requires constant monitoring of successful transmission from each point in the network to every other. Thus, it is impractical and infeasible to measure true network availability for large-scale networks that have hundreds or thousands of nodes.
Numerous methods of quantifying and measuring availability of a large network without requiring the monitoring of true network availability have been devised. One method is to measure the number of user minutes impacted by network outages over a certain period of time. In that method, availability is defined as the ratio of total non-impacted user minutes to total user minutes for a certain period of time (equivalently, non-availability is the ratio of total impacted user minutes to total user minutes). One advantage of that method is that it can be used to measure availability of any network. However, that method requires impacted user minutes to be tabulated manually or programmatically by reviewing all help-desk trouble tickets. For a large network with a large number of users, the task of reviewing all help-desk trouble tickets can be very time-consuming and costly. Further, the measurement of impacted user minutes by reviewing trouble tickets is at least partially subjective.
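By way of illustration only, the user-minute ratio described above may be sketched as follows. The function name, parameter names and the example figures are assumptions introduced for this sketch and do not appear in the method itself.

```python
def user_minute_availability(impacted_user_minutes, total_user_minutes):
    """Return availability as non-impacted user minutes over total user minutes.

    Both arguments are hypothetical inputs, e.g. tallied from trouble tickets.
    """
    if total_user_minutes <= 0:
        raise ValueError("total_user_minutes must be positive")
    if not 0 <= impacted_user_minutes <= total_user_minutes:
        raise ValueError("impacted_user_minutes must lie in [0, total]")
    return (total_user_minutes - impacted_user_minutes) / total_user_minutes


# Assumed example: 1,000 users over a 30-day period (1,440 minutes per day),
# with 4,320 user minutes attributed to outages.
total = 1000 * 30 * 1440
availability = user_minute_availability(4320, total)
print(f"{availability:.4%}")
```

The arithmetic itself is trivial; the cost of the method lies in obtaining a trustworthy figure for the impacted user minutes.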
Another method of measuring availability of a network is to track the availability of all devices of the network. FIG. 1 illustrates the prior art method of measuring network availability based on device availability. As illustrated, the device availability method entails the use of a network monitoring device 110 to send pings to all devices 120 (e.g., routers, switches) within the network 100 to determine availability. The pings are sent at defined intervals and overall availability is determined by the ratio of pings returned to pings sent. The method, however, does not account for any redundancy in network design. Therefore, periods of non-availability may be counted when they are not user impacting. In fact, estimated availability may be slightly higher in non-redundant networks than in high-availability redundant networks. Another disadvantage is that, because every single device/interface in the network requires monitoring, high traffic load within the network may result. In addition, device availability does not reflect user perception of network availability because users generally perceive outages as loss of connectivity to network resources.
Yet another method of measuring availability of a network is to calculate, on average, how long links have been up in the network using SNMP (Simple Network Management Protocol). Network availability may then be calculated for all trunk and backbone links by averaging the amount of link uptime for the trunks. Server availability may be calculated by link availability on all server ports. That method can also be used on all types of networks. However, a disadvantage of that method is that network availability is determined based on averaging, and is therefore less accurate. More importantly, link status does not reflect routing problems, which are some of the most common causes of network outages.
Yet another method of measuring availability of a network involves measuring application availability. Application availability measurement is done by making an OSI (Open Systems Interconnection) level 7 application call to an application server and waiting for an appropriate response. Availability is determined by the ratio of successful receipt of information for the periodic calls. An advantage of gauging network availability with application availability is that the method can be used for any type of network. However, a significant drawback is that it may not measure the availability of networks alone. For instance, application errors may be falsely attributed to errors in the network. Furthermore, application availability measurement is not scalable.
Therefore, what is needed is a novel method and system for measuring availability that does not have the limitations of the above-mentioned techniques. What is further needed is a novel standard for defining network availability that can be used as the basis of service level agreements (SLAs). What is yet further needed is a method and system for assisting network engineers in identifying root causes of network problems in a quick and accurate manner such that availability of a network can be improved.
Accordingly, the present invention provides a method and system for measuring availability of a high-availability network that is scalable, accurate and objective. The present invention provides a standard measurement method that can be used as an industry standard for comparing stability of networks. The present invention also provides an auto-annotation mechanism that creates failure records correlating network availability data and other device activity data such that the root cause of network problems can be quickly identified and resolved.
One embodiment of the present invention provides a network availability monitoring device for coupling to a core segment of a network. In the present embodiment, a leaf node detection (or, edge node detection) process is first carried out by the network availability monitoring device to determine the leaf nodes (also known as "edge devices") of the network. Then, the network availability monitoring device sends test packets (e.g., ICMP "Internet Control Message Protocol" pings for IP networks) at regular intervals to the leaf nodes to determine their availability. Test packets, however, are not targeted at non-leaf nodes (also known as "non-edge devices"). Network availability for the network as a whole is then determined based on the total number of test packets sent to the leaf nodes and the total number of returned responses. It should be noted that, in the present embodiment, availability of non-leaf nodes does not directly affect the calculation of network availability. For instance, the network availability monitoring device may not report a malfunctioning intermediary device as non-availability if the leaf nodes are not affected.
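As a minimal sketch of the calculation described above, network availability may be computed from the counts of test packets sent to, and responses returned by, the leaf nodes only. The `PingStats` type, the function name and the node names are hypothetical and are used solely for illustration; the sketch assumes the counts have already been collected by the monitoring device.

```python
from dataclasses import dataclass


@dataclass
class PingStats:
    """Hypothetical per-node counters for one data collection period."""
    sent: int = 0
    returned: int = 0


def network_availability(stats_by_node, leaf_nodes):
    """Ratio of responses returned to test packets sent, over leaf nodes only.

    Entries for nodes outside leaf_nodes, if any are present, are ignored.
    """
    sent = sum(stats_by_node[n].sent for n in leaf_nodes)
    returned = sum(stats_by_node[n].returned for n in leaf_nodes)
    if sent == 0:
        raise ValueError("no test packets were sent to any leaf node")
    return returned / sent


# Assumed example: two leaf nodes, 100 test packets each, one lost response.
stats = {
    "edge-a": PingStats(sent=100, returned=100),
    "edge-b": PingStats(sent=100, returned=99),
}
print(network_availability(stats, {"edge-a", "edge-b"}))
```

Because only leaf-node counters enter the ratio, a failed intermediary device lowers the result only to the extent that responses from the leaf nodes are actually lost.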
One embodiment of the present invention employs an automatic rule-based leaf node detection (or, edge node detection) process that detects leaf nodes in the network. Automatic edge-detection allows administrators to easily create availability groups and measure network availability. Without automatic edge-detection, administrators would face a major labor-intensive task of identifying edge IP addresses.
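The particular detection rules are not set out in this passage. Purely as an illustrative assumption, the sketch below applies a single rule to a hypothetical adjacency map of the network: a node having exactly one neighbor is treated as a leaf node, and the monitoring device itself is excluded. An actual rule set may differ and may rely on other information.

```python
def detect_leaf_nodes(adjacency, monitor=None):
    """Return the nodes having exactly one neighbor (assumed leaf rule).

    adjacency maps each node name to the set of its directly connected
    neighbors. The optional monitor node is never reported as a leaf.
    """
    return {
        node
        for node, neighbors in adjacency.items()
        if len(neighbors) == 1 and node != monitor
    }


# Assumed topology: a monitor attached to core-1, two core devices,
# and one edge device hanging off each core device.
topology = {
    "monitor": {"core-1"},
    "core-1": {"monitor", "core-2", "edge-a"},
    "core-2": {"core-1", "edge-b"},
    "edge-a": {"core-1"},
    "edge-b": {"core-2"},
}
print(sorted(detect_leaf_nodes(topology, monitor="monitor")))
```

Under this assumed rule, an edge device with redundant uplinks would have more than one neighbor and would not be detected, which is one reason a practical rule set would likely need more than node degree.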
In accordance with another embodiment of the present invention, leaf nodes that share similar geographic, topological or service characteristics may be placed in a common availability group. Availability groups are valuable because they allow the organization to measure different specific areas of the network that typically have different availability needs or different support requirements. For instance, all LAN leaf nodes may be placed in a LAN availability group, and all WAN leaf nodes may be placed in a WAN availability group. Network availability is determined for the availability group based on the number of test packets sent to the availability group and the number of test packet responses received. In one embodiment, network availability for an availability group is calculated by averaging the availability of all leaf nodes within the group during a certain data collection period.
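The averaging embodiment described above may be sketched as follows, assuming that a per-leaf availability figure for the data collection period is already available. The function name, the group labels and the example values are hypothetical.

```python
from collections import defaultdict


def group_availability(per_leaf_availability, group_of):
    """Average the availability of the leaf nodes in each availability group.

    per_leaf_availability maps a leaf name to its availability (0.0 to 1.0)
    for one data collection period; group_of maps a leaf name to its group.
    """
    members = defaultdict(list)
    for leaf, availability in per_leaf_availability.items():
        members[group_of[leaf]].append(availability)
    return {group: sum(vals) / len(vals) for group, vals in members.items()}


# Assumed example: two LAN leaf nodes and one WAN leaf node.
per_leaf = {"lan-1": 1.0, "lan-2": 0.98, "wan-1": 0.95}
groups = {"lan-1": "LAN", "lan-2": "LAN", "wan-1": "WAN"}
print(group_availability(per_leaf, groups))
```

Averaging per-leaf figures weights every leaf node equally, whereas a ratio of pooled packet counts would weight leaf nodes by the number of test packets sent to each; the two coincide when every leaf node receives the same number of test packets.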
The present invention also provides mechanisms that allow network managers or network service providers to perform quality improvements within a network. Particularly, the present invention provides an auto-annotation process that helps identify the root cause of a network problem based on a current set of network availability information. In one embodiment, auto-annotation includes the steps of creating a failure record for each period of non-availability for a leaf node device, and "annotating" the record with relevant network management information that is useful for root cause analysis. Identifying the root cause of a network problem facilitates the debugging and trouble-shooting of the network problem, and provides an excellent resource for network engineers in preventing network outages.
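The two auto-annotation steps described above may be sketched as follows. The record layout, the time window used to decide which management events are relevant, and the example event text are all assumptions made for illustration; the source does not specify how relevance is determined.

```python
from dataclasses import dataclass, field


@dataclass
class FailureRecord:
    """Hypothetical record of one period of non-availability for a leaf."""
    leaf: str
    start: int
    end: int
    annotations: list = field(default_factory=list)


def build_failure_records(leaf, samples):
    """Create one record per run of consecutive failed test packets.

    samples is a time-ordered list of (timestamp, responded) pairs.
    """
    records, current = [], None
    for timestamp, responded in samples:
        if responded:
            current = None  # the outage, if any, has ended
        elif current is None:
            current = FailureRecord(leaf, timestamp, timestamp)
            records.append(current)
        else:
            current.end = timestamp  # extend the ongoing outage
    return records


def annotate(records, events, window=30):
    """Attach events whose timestamp falls within window of an outage.

    events is a list of (timestamp, text) pairs; window is an assumed
    relevance criterion expressed in the same time units as the samples.
    """
    for record in records:
        for timestamp, text in events:
            if record.start - window <= timestamp <= record.end + window:
                record.annotations.append(text)
    return records


# Assumed example: one leaf polled every 60 seconds, with two missed replies.
samples = [(0, True), (60, False), (120, False), (180, True), (240, True)]
events = [(55, "core-1: link down (hypothetical event)"),
          (1000, "unrelated event")]
for rec in annotate(build_failure_records("edge-a", samples), events):
    print(rec)
```

A record annotated in this way places the outage interval and the nearby device activity side by side, which is the correlation relied upon for root cause analysis.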
Embodiments of the present invention include the above and further include a computer-readable medium having contained therein computer-readable codes for causing a computer system to perform a method of monitoring availability of a network. The method includes the steps of: (a) determining the leaf nodes and non-leaf nodes of the network; (b) monitoring availability of the leaf nodes; and (c) generating network availability data for the network as a whole based on availability of the leaf nodes without monitoring availability of the non-leaf nodes.